PoC: Partial sort

Started by Alexander Korotkov about 12 years ago, 91 messages
#1 Alexander Korotkov
aekorotkov@gmail.com
2 attachment(s)

Hackers!

Currently, when we need an ordered result from a table, we have to choose
one of two approaches: get the results from an index in exactly the order we
need, or sort the tuples. However, it could be useful to mix both methods:
get results from an index in an order that partially meets our requirements,
and do the rest of the work on the heap tuples.

The two attached patches are a proof of concept for this approach.

*partial-sort-1.patch*

This patch allows an index to be used for ORDER BY when the ORDER BY clause
and the index have a non-empty common prefix. That is, the index provides
the right ordering for the first n ORDER BY columns. To provide the right
order for the remaining m columns, a Sort node is inserted. This Sort node
sorts groups of tuples in which the values of the first n ORDER BY columns
are equal.
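
To illustrate the idea of sorting within groups, here is a minimal
standalone sketch in C. It is not code from the patch: the Row type, cmp_v2
and partial_sort are hypothetical names, and plain qsort() stands in for
tuplesort. It assumes the input already arrives ordered by v1 (as an index
on v1 would deliver it) and only sorts each run of equal v1 values on v2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row: v1 is the column the input is already ordered by,
 * v2 is the column that still needs sorting. */
typedef struct Row
{
    int    v1;
    double v2;
} Row;

/* qsort comparator on v2 only; v1 is equal within a group. */
static int
cmp_v2(const void *a, const void *b)
{
    double x = ((const Row *) a)->v2;
    double y = ((const Row *) b)->v2;

    return (x > y) - (x < y);
}

/*
 * Sort rows by (v1, v2), assuming they are already ordered by v1.
 * Each run of equal v1 values is sorted on v2 independently.
 * Returns the number of groups that were sorted.
 */
static size_t
partial_sort(Row *rows, size_t n)
{
    size_t start = 0;
    size_t groups = 0;

    assert(rows != NULL || n == 0);

    while (start < n)
    {
        size_t end = start + 1;

        /* Extend the group while the presorted column is unchanged. */
        while (end < n && rows[end].v1 == rows[start].v1)
            end++;

        qsort(rows + start, end - start, sizeof(Row), cmp_v2);
        groups++;
        start = end;
    }
    return groups;
}
```

Calling partial_sort() on an array ordered by v1 leaves it ordered by
(v1, v2); no comparison ever crosses a group boundary.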

See an example.

create table test as (select id, (random()*10000)::int as v1, random() as
v2 from generate_series(1,1000000) id);
create index test_v1_idx on test (v1);

We have an index on the v1 column only, but we can get results ordered by
v1, v2.

postgres=# select * from test order by v1, v2 limit 10;
   id   | v1 |         v2
--------+----+--------------------
 390371 |  0 | 0.0284479795955122
 674617 |  0 | 0.0322008323855698
 881905 |  0 |  0.042586590629071
 972877 |  0 | 0.0531588457524776
 364903 |  0 | 0.0594307743012905
  82333 |  0 | 0.0666455538012087
 266488 |  0 |  0.072808934841305
 892215 |  0 | 0.0744258034974337
  13805 |  0 | 0.0794667331501842
 338435 |  0 |  0.171817752998322
(10 rows)

And it's fast, using the following plan.

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=69214.06..69214.08 rows=10 width=16) (actual time=0.097..0.099 rows=10 loops=1)
   -> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual time=0.096..0.097 rows=10 loops=1)
        Sort Key: v1, v2
        Sort Method: top-N heapsort Memory: 25kB
        -> Index Scan using test_v1_idx on test (cost=0.42..47604.42 rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
 Total runtime: 0.125 ms
(6 rows)
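
The index scan in the plan above returned only 56 rows out of 1000000.
The following standalone C sketch shows one way such early termination can
arise: once the groups sorted so far cover the requested limit, no further
input needs to be read. This is an illustration under stated assumptions,
not the patch's executor code; Row, cmp_v2 and partial_sort_limit are
hypothetical names, and qsort() stands in for tuplesort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row: input is assumed to be ordered by v1 already. */
typedef struct Row
{
    int    v1;
    double v2;
} Row;

/* qsort comparator on v2 only; v1 is equal within a group. */
static int
cmp_v2(const void *a, const void *b)
{
    double x = ((const Row *) a)->v2;
    double y = ((const Row *) b)->v2;

    return (x > y) - (x < y);
}

/*
 * Put the first `limit` rows of (v1, v2) order at the front of `rows`.
 * Groups of equal v1 are sorted one at a time, and the loop stops at the
 * first group boundary at or beyond `limit`.
 * Returns how many input rows were consumed; rows past that point are
 * left untouched.
 */
static size_t
partial_sort_limit(Row *rows, size_t n, size_t limit)
{
    size_t start = 0;

    assert(rows != NULL || n == 0);

    while (start < n && start < limit)
    {
        size_t end = start + 1;

        while (end < n && rows[end].v1 == rows[start].v1)
            end++;

        qsort(rows + start, end - start, sizeof(Row), cmp_v2);
        start = end;
    }
    return start;
}
```

With small groups, the return value stays close to the limit, which is why
the number of rows read can be tiny compared to the table size.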

Of course, this approach is effective only when the first n ORDER BY columns
have enough distinct values, so that the sorted groups are small. The patch
is only a PoC because it makes no attempt to estimate the cost of a partial
sort correctly.
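
To get a feel for how much group size matters, here is a deliberately crude
model in C. It is an assumption made for illustration and is not a cost
estimate from the patch (which has none): it charges
rows * ceil(log2(rows per group)) comparisons, assumes equally sized groups,
and ignores LIMIT, memory and I/O. The names ceil_log2 and est_comparisons
are hypothetical.

```c
#include <assert.h>

/* Smallest b such that 2^b >= n (0 for n <= 1). */
static unsigned
ceil_log2(unsigned long n)
{
    unsigned bits = 0;

    while (n > 1)
    {
        n = n - n / 2;          /* ceiling of n / 2, without overflow */
        bits++;
    }
    return bits;
}

/*
 * Rough comparison count for sorting `rows` tuples split into `groups`
 * equally sized groups. groups == 1 models an ordinary full sort.
 */
static unsigned long
est_comparisons(unsigned long rows, unsigned long groups)
{
    unsigned long per_group;

    assert(groups > 0);
    per_group = (rows + groups - 1) / groups;   /* ceiling division */
    return rows * ceil_log2(per_group);
}
```

Under this model, 1000000 rows in one group cost about 20 million
comparisons, while the same rows in 10000 groups of 100 (roughly the shape
of the test table above) cost about 7 million. With only two groups the
figure is 19 million, so there is almost no gain.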

*partial-knn-1.patch*

KNN-GiST provides the ability to get ordered results from an index, but this
order is based only on information in the index. For instance, a GiST index
contains bounding rectangles for polygons, so we can't get the exact
distance to a polygon from the index (the situation is similar in PostGIS).
In the attached patch, the GiST distance method can set a recheck flag
(similar to the consistent method). This flag means that the distance method
returned a lower bound on the distance and we should recheck it from the
heap.
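
The recheck logic can be sketched outside of GiST as follows. This is a
minimal illustration, not the patch's code: it uses a small binary heap
where the patch uses the existing RB-tree queue, and Item, Queue,
queue_push, queue_pop, knn_next and demo_exact_dist are all hypothetical
names. The exact_dist callback stands in for fetching the heap tuple and
recomputing the true distance. The sketch relies on the index distance
being a lower bound: an item popped with an exact distance cannot be beaten
by anything still in the queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 64            /* fixed capacity keeps the sketch short */

typedef struct Item
{
    int    id;
    double dist;                /* lower bound if recheck is set, else exact */
    bool   recheck;
} Item;

typedef struct Queue
{
    Item   items[QUEUE_CAP];
    size_t len;
} Queue;

/* Insert into a binary min-heap ordered by dist. */
static void
queue_push(Queue *q, Item it)
{
    size_t i;

    assert(q->len < QUEUE_CAP);
    i = q->len++;
    /* Sift up: move parents down while they are farther than the new item. */
    while (i > 0 && q->items[(i - 1) / 2].dist > it.dist)
    {
        q->items[i] = q->items[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    q->items[i] = it;
}

/* Remove and return the item with the smallest dist. */
static Item
queue_pop(Queue *q)
{
    Item   top, last;
    size_t i = 0;

    assert(q->len > 0);
    top = q->items[0];
    last = q->items[--q->len];
    /* Sift down: fill the hole at the root with the nearer child. */
    for (;;)
    {
        size_t child = 2 * i + 1;

        if (child >= q->len)
            break;
        if (child + 1 < q->len &&
            q->items[child + 1].dist < q->items[child].dist)
            child++;
        if (q->items[child].dist >= last.dist)
            break;
        q->items[i] = q->items[child];
        i = child;
    }
    q->items[i] = last;
    return top;
}

/*
 * Return the id of the next nearest item, or -1 when the queue is empty.
 * Items flagged for recheck get their exact distance and are reinserted.
 */
static int
knn_next(Queue *q, double (*exact_dist) (int id))
{
    while (q->len > 0)
    {
        Item it = queue_pop(q);

        if (!it.recheck)
            return it.id;

        it.dist = exact_dist(it.id);
        it.recheck = false;
        queue_push(q, it);
    }
    return -1;
}

/* Made-up exact distances for ids 0..3, for demonstration only. */
static const double demo_exact[] = {0.50, 0.20, 0.35, 0.90};

static double
demo_exact_dist(int id)
{
    return demo_exact[id];
}
```

For example, an item with lower bound 0.10 but exact distance 0.50 is
popped first, rechecked, and pushed back behind items whose exact distances
turn out to be smaller.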

See an example.

create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,1000000) id);
create index test_idx on test using gist (p);

We can get results ordered by distance from polygon to point.

postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
point(0.5,0.5) limit 10;
   id   |       ?column?
--------+----------------------
 755611 | 0.000405855808916853
 807562 | 0.000464123777564343
 437778 | 0.000738524708741959
 947860 |  0.00076250998760724
 389843 | 0.000886362723569568
  17586 | 0.000981960100555216
 411329 |  0.00145338112316853
 894191 |  0.00149399559703506
 391907 |   0.0016647896049741
 235381 |  0.00167554614889509
(10 rows)

It's fast, using just an index scan.

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230 rows=10 loops=1)
   -> Index Scan using test_idx on test (cost=0.29..157672.29 rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
        Order By: (p <-> '(0.5,0.5)'::point)
 Total runtime: 0.305 ms
(4 rows)

This patch is also only a PoC, for the following reasons:
1) It's probably wrong for the index scan node to fetch heap tuples at all.
This work should be done by another node.
2) The assumption that the ORDER BY operator returns a float8 comparable with
the result of the GiST distance method is wrong in the general case.

------
With best regards,
Alexander Korotkov.

Attachments:

partial-sort-1.patch (application/octet-stream)
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 09b2eb0..65bf9fd
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,52 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpTuples(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 	SortSupport sortKeys = tuplesort_get_sortkeys(node->tuplesortstate);
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = sortKeys[i].ssup_attno;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		if (ApplySortComparator(datumA, isnullA,
+                                   datumB, isnullB,
+                                   &sortKeys[i]))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 54,131 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
  
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
! 		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols,
! 											  plannode->sortColIdx,
! 											  plannode->sortOperators,
! 											  plannode->collations,
! 											  plannode->nullsFirst,
! 											  work_mem,
! 											  node->randomAccess);
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
! 		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
  		{
! 			slot = ExecProcNode(outerNode);
  
  			if (TupIsNull(slot))
  				break;
! 
! 			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
! 	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
--- 81,194 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 	PlanState  *outerNode;
! 	TupleDesc	tupDesc;
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
! 	tuplesortstate = tuplesort_begin_heap(tupDesc,
! 										  plannode->numCols,
! 										  plannode->sortColIdx,
! 										  plannode->sortOperators,
! 										  plannode->collations,
! 										  plannode->nullsFirst,
! 										  work_mem,
! 										  node->randomAccess);
! 	if (node->bounded)
! 		tuplesort_set_bound(tuplesortstate, node->bound);
! 	node->tuplesortstate = (void *) tuplesortstate;
  
! 	/*
! 	 * Put the next group of tuples, in which the first "skipCols" sort values
! 	 * are equal, into tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
! 		if (node->prev)
  		{
! 			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
! 			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
  
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
! 			else
! 			{
! 				bool cmp;
! 				cmp = cmpTuples(node, tupDesc, node->prev, slot);
! 				node->prev = ExecCopySlotTuple(slot);
! 				if (!cmp)
! 					break;
! 			}
  		}
+ 		else
+ 		{
+ 			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
+ 				break;
+ 			}
+ 			else
+ 			{
+ 				node->prev = ExecCopySlotTuple(slot);
+ 			}
+ 		}
+ 	}
  
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
  
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
  
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 	node->bound_Done = node->bound;
! 	SO1_printf("ExecSort: %s\n", "sorting done");
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 237,245 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
  
  	/*
  	 * Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index e3edcf6..d698559
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 735,740 ****
--- 735,741 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 9c8ede6..067730f
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
*************** compare_pathkeys(List *keys1, List *keys
*** 312,317 ****
--- 312,343 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_fractional_path_for_pathkey
*** 403,409 ****
  			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
  			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
--- 429,435 ----
  			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
  			continue;
  
! 		if (pathkeys_common(pathkeys, path->pathkeys) != 0 &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
*************** right_merge_direction(PlannerInfo *root,
*** 1457,1469 ****
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
  	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
  		/* It's useful ... or at least the first N keys are */
  		return list_length(root->query_pathkeys);
--- 1483,1499 ----
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
+ 	int n;
+ 
  	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n = pathkeys_common(root->query_pathkeys, pathkeys);
! 
! 	if (n != 0)
  	{
  		/* It's useful ... or at least the first N keys are */
  		return list_length(root->query_pathkeys);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index f2c122d..87dd985
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 148,154 ****
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples);
--- 148,154 ----
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples);
*************** create_merge_append_plan(PlannerInfo *ro
*** 808,814 ****
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
  		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
! 			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
--- 808,814 ----
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
  		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
! 			subplan = (Plan *) make_sort(root, subplan, numsortkeys, 0,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2186,2192 ****
  			make_sort_from_pathkeys(root,
  									outer_plan,
  									best_path->outersortkeys,
! 									-1.0);
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
--- 2186,2192 ----
  			make_sort_from_pathkeys(root,
  									outer_plan,
  									best_path->outersortkeys,
! 									-1.0, 0);
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2199,2205 ****
  			make_sort_from_pathkeys(root,
  									inner_plan,
  									best_path->innersortkeys,
! 									-1.0);
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
--- 2199,2205 ----
  			make_sort_from_pathkeys(root,
  									inner_plan,
  									best_path->innersortkeys,
! 									-1.0, 0);
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
*************** make_mergejoin(List *tlist,
*** 3738,3744 ****
   * limit_tuples is as for cost_sort (in particular, pass -1 if no limit)
   */
  static Sort *
! make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples)
--- 3738,3744 ----
   * limit_tuples is as for cost_sort (in particular, pass -1 if no limit)
   */
  static Sort *
! make_sort(PlannerInfo *root, Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples)
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 3762,3767 ****
--- 3762,3768 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 4090,4096 ****
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 4091,4097 ----
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(PlannerInfo *roo
*** 4110,4116 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
--- 4111,4117 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
*************** make_sort_from_sortclauses(PlannerInfo *
*** 4153,4159 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4154,4160 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
*************** make_sort_from_groupcols(PlannerInfo *ro
*** 4208,4214 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4209,4215 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 6670794..94cb114
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** grouping_planner(PlannerInfo *root, doub
*** 1360,1367 ****
  			Path		sort_path;		/* dummy for result of cost_sort */
  
  			if (root->query_pathkeys == NIL ||
! 				pathkeys_contained_in(root->query_pathkeys,
! 									  cheapest_path->pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
--- 1360,1367 ----
  			Path		sort_path;		/* dummy for result of cost_sort */
  
  			if (root->query_pathkeys == NIL ||
! 					pathkeys_common(root->query_pathkeys,
! 									  cheapest_path->pathkeys) != 0)
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
*************** grouping_planner(PlannerInfo *root, doub
*** 1721,1727 ****
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0);
  					if (!pathkeys_contained_in(window_pathkeys,
  											   current_pathkeys))
  					{
--- 1721,1727 ----
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0, 0);
  					if (!pathkeys_contained_in(window_pathkeys,
  											   current_pathkeys))
  					{
*************** grouping_planner(PlannerInfo *root, doub
*** 1881,1887 ****
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
  															current_pathkeys,
! 															   -1.0);
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
--- 1881,1887 ----
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
  															current_pathkeys,
! 															   -1.0, 0);
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
*************** grouping_planner(PlannerInfo *root, doub
*** 1897,1908 ****
  	 */
  	if (parse->sortClause)
  	{
! 		if (!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
--- 1897,1912 ----
  	 */
  	if (parse->sortClause)
  	{
! 		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
! 		int sortLength = list_length(root->sort_pathkeys);
! 		
! 		if (common < sortLength)
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples,
! 														   common);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index ea8af9f..29b90f2
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** free_sort_tuple(Tuplesortstate *state, S
*** 3455,3457 ****
--- 3455,3464 ----
  	FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
  	pfree(stup->tuple);
  }
+ 
+ SortSupport
+ tuplesort_get_sortkeys(Tuplesortstate *state)
+ {
+ 	return state->sortKeys;
+ }
+ 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 5a40347..3723a18
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct SortState
*** 1663,1670 ****
--- 1663,1672 ----
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	bool		finished;
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	HeapTuple	prev;
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 101e22c..28b871e
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 582,587 ****
--- 582,588 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 999adaa..7c09301
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 157,162 ****
--- 157,163 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
new file mode 100644
index ba7ae7c..b46d71c
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*************** extern RecursiveUnion *make_recursive_un
*** 50,56 ****
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
--- 50,56 ----
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples, int skipCols);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 25fa6de..267a988
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
***************
*** 24,29 ****
--- 24,30 ----
  #include "executor/tuptable.h"
  #include "fmgr.h"
  #include "utils/relcache.h"
+ #include "utils/sortsupport.h"
  
  
  /* Tuplesortstate is an opaque type whose details are not known outside
*************** extern void tuplesort_get_stats(Tuplesor
*** 108,113 ****
--- 109,116 ----
  
  extern int	tuplesort_merge_order(int64 allowedMem);
  
+ extern SortSupport tuplesort_get_sortkeys(Tuplesortstate *state);
+ 
  /*
   * These routines may only be called if randomAccess was specified 'true'.
   * Likewise, backwards scan in gettuple/getdatum is only allowed if
partial-knn-1.patch (application/octet-stream)
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
new file mode 100644
index e97ab8f..6ad5677
*** a/src/backend/access/gist/gistget.c
--- b/src/backend/access/gist/gistget.c
***************
*** 16,21 ****
--- 16,22 ----
  
  #include "access/gist_private.h"
  #include "access/relscan.h"
+ #include "catalog/index.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "utils/builtins.h"
*************** gistindex_keytest(IndexScanDesc scan,
*** 55,61 ****
  	GISTSTATE  *giststate = so->giststate;
  	ScanKey		key = scan->keyData;
  	int			keySize = scan->numberOfKeys;
! 	double	   *distance_p;
  	Relation	r = scan->indexRelation;
  
  	*recheck_p = false;
--- 56,62 ----
  	GISTSTATE  *giststate = so->giststate;
  	ScanKey		key = scan->keyData;
  	int			keySize = scan->numberOfKeys;
! 	GISTSearchTreeItemDistance *distance_p;
  	Relation	r = scan->indexRelation;
  
  	*recheck_p = false;
*************** gistindex_keytest(IndexScanDesc scan,
*** 72,78 ****
  		if (GistPageIsLeaf(page))		/* shouldn't happen */
  			elog(ERROR, "invalid GiST tuple found on leaf page");
  		for (i = 0; i < scan->numberOfOrderBys; i++)
! 			so->distances[i] = -get_float8_infinity();
  		return true;
  	}
  
--- 73,82 ----
  		if (GistPageIsLeaf(page))		/* shouldn't happen */
  			elog(ERROR, "invalid GiST tuple found on leaf page");
  		for (i = 0; i < scan->numberOfOrderBys; i++)
! 		{
! 			so->distances[i].value = -get_float8_infinity();
! 			so->distances[i].recheck = false;
! 		}
  		return true;
  	}
  
*************** gistindex_keytest(IndexScanDesc scan,
*** 170,176 ****
  		if ((key->sk_flags & SK_ISNULL) || isNull)
  		{
  			/* Assume distance computes as null and sorts to the end */
! 			*distance_p = get_float8_infinity();
  		}
  		else
  		{
--- 174,181 ----
  		if ((key->sk_flags & SK_ISNULL) || isNull)
  		{
  			/* Assume distance computes as null and sorts to the end */
! 			distance_p->value = get_float8_infinity();
! 			distance_p->recheck = false;
  		}
  		else
  		{
*************** gistindex_keytest(IndexScanDesc scan,
*** 195,208 ****
  			 * can't tolerate lossy distance calculations on leaf tuples;
  			 * there is no opportunity to re-sort the tuples afterwards.
  			 */
! 			dist = FunctionCall4Coll(&key->sk_func,
  									 key->sk_collation,
  									 PointerGetDatum(&de),
  									 key->sk_argument,
  									 Int32GetDatum(key->sk_strategy),
! 									 ObjectIdGetDatum(key->sk_subtype));
  
! 			*distance_p = DatumGetFloat8(dist);
  		}
  
  		key++;
--- 200,215 ----
  			 * can't tolerate lossy distance calculations on leaf tuples;
  			 * there is no opportunity to re-sort the tuples afterwards.
  			 */
! 			distance_p->recheck = false;
! 			dist = FunctionCall5Coll(&key->sk_func,
  									 key->sk_collation,
  									 PointerGetDatum(&de),
  									 key->sk_argument,
  									 Int32GetDatum(key->sk_strategy),
! 									 ObjectIdGetDatum(key->sk_subtype),
! 									 PointerGetDatum(&distance_p->recheck));
  
! 			distance_p->value = DatumGetFloat8(dist);
  		}
  
  		key++;
*************** gistindex_keytest(IndexScanDesc scan,
*** 234,240 ****
   * sibling will be processed next.
   */
  static void
! gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
  			 TIDBitmap *tbm, int64 *ntids)
  {
  	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
--- 241,247 ----
   * sibling will be processed next.
   */
  static void
! gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, GISTSearchTreeItemDistance *myDistances,
  			 TIDBitmap *tbm, int64 *ntids)
  {
  	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
*************** gistScanPage(IndexScanDesc scan, GISTSea
*** 284,290 ****
  		tmpItem->head = item;
  		tmpItem->lastHeap = NULL;
  		memcpy(tmpItem->distances, myDistances,
! 			   sizeof(double) * scan->numberOfOrderBys);
  
  		(void) rb_insert(so->queue, (RBNode *) tmpItem, &isNew);
  
--- 291,297 ----
  		tmpItem->head = item;
  		tmpItem->lastHeap = NULL;
  		memcpy(tmpItem->distances, myDistances,
! 			   sizeof(GISTSearchTreeItemDistance) * scan->numberOfOrderBys);
  
  		(void) rb_insert(so->queue, (RBNode *) tmpItem, &isNew);
  
*************** gistScanPage(IndexScanDesc scan, GISTSea
*** 375,381 ****
  			tmpItem->head = item;
  			tmpItem->lastHeap = GISTSearchItemIsHeap(*item) ? item : NULL;
  			memcpy(tmpItem->distances, so->distances,
! 				   sizeof(double) * scan->numberOfOrderBys);
  
  			(void) rb_insert(so->queue, (RBNode *) tmpItem, &isNew);
  
--- 382,388 ----
  			tmpItem->head = item;
  			tmpItem->lastHeap = GISTSearchItemIsHeap(*item) ? item : NULL;
  			memcpy(tmpItem->distances, so->distances,
! 				   sizeof(GISTSearchTreeItemDistance) * scan->numberOfOrderBys);
  
  			(void) rb_insert(so->queue, (RBNode *) tmpItem, &isNew);
  
*************** gistScanPage(IndexScanDesc scan, GISTSea
*** 387,392 ****
--- 394,485 ----
  }
  
  /*
+  * Do this tree item's distance values need a recheck?
+  */
+ static bool
+ searchTreeItemNeedDistanceRecheck(IndexScanDesc scan, GISTSearchTreeItem *item)
+ {
+ 	int i;
+ 	for (i = 0; i < scan->numberOfOrderBys; i++)
+ 	{
+ 		if (item->distances[i].recheck)
+ 			return true;
+ 	}
+ 	return false;
+ }
+ 
+ /*
+  * Recheck distance values of item from heap and reinsert it into RB-tree.
+  */
+ static void
+ searchTreeItemDistanceRecheck(IndexScanDesc scan, GISTSearchTreeItem *treeItem,
+ 		GISTSearchItem *item)
+ {
+ 	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ 	GISTSearchTreeItem *tmpItem = so->tmpTreeItem;
+ 	Buffer	buffer;
+ 	bool	got_heap_tuple, all_dead;
+ 	HeapTupleData tup;
+ 	Datum	values[INDEX_MAX_KEYS];
+ 	bool	isnull[INDEX_MAX_KEYS];
+ 	bool	isNew;
+ 	int		i;
+ 
+ 	buffer = ReadBuffer(scan->heapRelation,
+ 			ItemPointerGetBlockNumber(&item->data.heap.heapPtr));
+ 	LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ 	got_heap_tuple = heap_hot_search_buffer(&item->data.heap.heapPtr,
+ 											scan->heapRelation,
+ 											buffer,
+ 											scan->xs_snapshot,
+ 											&tup,
+ 											&all_dead,
+ 											true);
+ 	if (!got_heap_tuple)
+ 	{
+ 		UnlockReleaseBuffer(buffer);
+ 		pfree(item);
+ 		return;
+ 	}
+ 
+ 	memcpy(tmpItem, treeItem,  GSTIHDRSZ +
+ 			sizeof(GISTSearchTreeItemDistance) * scan->numberOfOrderBys);
+ 	tmpItem->head = item;
+ 	tmpItem->lastHeap = item;
+ 	item->next = NULL;
+ 
+ 	ExecStoreTuple(&tup, so->slot, InvalidBuffer, false);
+ 	FormIndexDatum(so->indexInfo, so->slot, so->estate, values, isnull);
+ 
+ 	for (i = 0; i < scan->numberOfOrderBys; i++)
+ 	{
+ 		if (tmpItem->distances[i].recheck)
+ 		{
+ 			ScanKey	key = scan->orderByData + i;
+ 			float8 newDistance;
+ 
+ 			tmpItem->distances[i].recheck = false;
+ 			if (isnull[key->sk_attno - 1])
+ 			{
+ 				tmpItem->distances[i].value = -get_float8_infinity();
+ 				continue;
+ 			}
+ 
+ 			newDistance = DatumGetFloat8(
+ 				FunctionCall2Coll(&so->orderByRechecks[i],
+ 					 key->sk_collation,
+ 					 values[key->sk_attno - 1],
+ 					 key->sk_argument));
+ 
+ 			tmpItem->distances[i].value = newDistance;
+ 
+ 		}
+ 	}
+ 	(void) rb_insert(so->queue, (RBNode *) tmpItem, &isNew);
+ 	UnlockReleaseBuffer(buffer);
+ }
+ 
+ /*
   * Extract next item (in order) from search queue
   *
   * Returns a GISTSearchItem or NULL.  Caller must pfree item when done with it.
*************** gistScanPage(IndexScanDesc scan, GISTSea
*** 396,403 ****
   * the distances value for the item.
   */
  static GISTSearchItem *
! getNextGISTSearchItem(GISTScanOpaque so)
  {
  	for (;;)
  	{
  		GISTSearchItem *item;
--- 489,498 ----
   * the distances value for the item.
   */
  static GISTSearchItem *
! getNextGISTSearchItem(IndexScanDesc scan)
  {
+ 	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ 
  	for (;;)
  	{
  		GISTSearchItem *item;
*************** getNextGISTSearchItem(GISTScanOpaque so)
*** 418,423 ****
--- 513,526 ----
  			so->curTreeItem->head = item->next;
  			if (item == so->curTreeItem->lastHeap)
  				so->curTreeItem->lastHeap = NULL;
+ 
+ 			/* Recheck distance from heap tuple if needed */
+ 			if (GISTSearchItemIsHeap(*item) &&
+ 				searchTreeItemNeedDistanceRecheck(scan, so->curTreeItem))
+ 			{
+ 				searchTreeItemDistanceRecheck(scan, so->curTreeItem, item);
+ 				continue;
+ 			}
  			/* Return item; caller is responsible to pfree it */
  			return item;
  		}
*************** getNextNearest(IndexScanDesc scan)
*** 441,447 ****
  
  	do
  	{
! 		GISTSearchItem *item = getNextGISTSearchItem(so);
  
  		if (!item)
  			break;
--- 544,550 ----
  
  	do
  	{
! 		GISTSearchItem *item = getNextGISTSearchItem(scan);
  
  		if (!item)
  			break;
*************** gistgettuple(PG_FUNCTION_ARGS)
*** 521,527 ****
  			/* find and process the next index page */
  			do
  			{
! 				GISTSearchItem *item = getNextGISTSearchItem(so);
  
  				if (!item)
  					PG_RETURN_BOOL(false);
--- 624,630 ----
  			/* find and process the next index page */
  			do
  			{
! 				GISTSearchItem *item = getNextGISTSearchItem(scan);
  
  				if (!item)
  					PG_RETURN_BOOL(false);
*************** gistgetbitmap(PG_FUNCTION_ARGS)
*** 573,579 ****
  	 */
  	for (;;)
  	{
! 		GISTSearchItem *item = getNextGISTSearchItem(so);
  
  		if (!item)
  			break;
--- 676,682 ----
  	 */
  	for (;;)
  	{
! 		GISTSearchItem *item = getNextGISTSearchItem(scan);
  
  		if (!item)
  			break;
diff --git a/src/backend/access/gist/gistproc.c b/src/backend/access/gist/gistproc.c
new file mode 100644
index 3a45781..afe447f
*** a/src/backend/access/gist/gistproc.c
--- b/src/backend/access/gist/gistproc.c
*************** gist_poly_consistent(PG_FUNCTION_ARGS)
*** 1094,1099 ****
--- 1094,1100 ----
  	PG_RETURN_BOOL(result);
  }
  
+ 
  /**************************************************
   * Circle ops
   **************************************************/
*************** computeDistance(bool isLeaf, BOX *box, P
*** 1270,1275 ****
--- 1271,1337 ----
  	return result;
  }
  
+ static double
+ computeDistanceMBR(BOX *box, Point *point)
+ {
+ 	double		result = 0.0;
+ 
+ 	if (point->x <= box->high.x && point->x >= box->low.x &&
+ 			 point->y <= box->high.y && point->y >= box->low.y)
+ 	{
+ 		/* point inside the box */
+ 		result = 0.0;
+ 	}
+ 	else if (point->x <= box->high.x && point->x >= box->low.x)
+ 	{
+ 		/* point is over or below box */
+ 		Assert(box->low.y <= box->high.y);
+ 		if (point->y > box->high.y)
+ 			result = point->y - box->high.y;
+ 		else if (point->y < box->low.y)
+ 			result = box->low.y - point->y;
+ 		else
+ 			elog(ERROR, "inconsistent point values");
+ 	}
+ 	else if (point->y <= box->high.y && point->y >= box->low.y)
+ 	{
+ 		/* point is to left or right of box */
+ 		Assert(box->low.x <= box->high.x);
+ 		if (point->x > box->high.x)
+ 			result = point->x - box->high.x;
+ 		else if (point->x < box->low.x)
+ 			result = box->low.x - point->x;
+ 		else
+ 			elog(ERROR, "inconsistent point values");
+ 	}
+ 	else
+ 	{
+ 		/* closest point will be a vertex */
+ 		Point		p;
+ 		double		subresult;
+ 
+ 		result = point_point_distance(point, &box->low);
+ 
+ 		subresult = point_point_distance(point, &box->high);
+ 		if (result > subresult)
+ 			result = subresult;
+ 
+ 		p.x = box->low.x;
+ 		p.y = box->high.y;
+ 		subresult = point_point_distance(point, &p);
+ 		if (result > subresult)
+ 			result = subresult;
+ 
+ 		p.x = box->high.x;
+ 		p.y = box->low.y;
+ 		subresult = point_point_distance(point, &p);
+ 		if (result > subresult)
+ 			result = subresult;
+ 	}
+ 
+ 	return result;
+ }
+ 
  static bool
  gist_point_consistent_internal(StrategyNumber strategy,
  							   bool isLeaf, BOX *key, Point *query)
*************** gist_point_distance(PG_FUNCTION_ARGS)
*** 1451,1453 ****
--- 1513,1540 ----
  
  	PG_RETURN_FLOAT8(distance);
  }
+ 
+ Datum
+ gist_poly_distance(PG_FUNCTION_ARGS)
+ {
+ 	GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
+ 	StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2);
+ 	bool *recheck = (bool *) PG_GETARG_POINTER(4);
+ 	double		distance;
+ 	StrategyNumber strategyGroup = strategy / GeoStrategyNumberOffset;
+ 
+ 	*recheck = true;
+ 
+ 	switch (strategyGroup)
+ 	{
+ 		case PointStrategyNumberGroup:
+ 			distance = computeDistanceMBR(DatumGetBoxP(entry->key),
+ 									   PG_GETARG_POINT_P(1));
+ 			break;
+ 		default:
+ 			elog(ERROR, "unknown strategy number: %d", strategy);
+ 			distance = 0.0;		/* keep compiler quiet */
+ 	}
+ 
+ 	PG_RETURN_FLOAT8(distance);
+ }
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
new file mode 100644
index b5553ff..61e5597
*** a/src/backend/access/gist/gistscan.c
--- b/src/backend/access/gist/gistscan.c
***************
*** 17,22 ****
--- 17,25 ----
  #include "access/gist_private.h"
  #include "access/gistscan.h"
  #include "access/relscan.h"
+ #include "catalog/index.h"
+ #include "executor/executor.h"
+ #include "executor/tuptable.h"
  #include "utils/memutils.h"
  #include "utils/rel.h"
  
*************** GISTSearchTreeItemComparator(const RBNod
*** 36,43 ****
  	/* Order according to distance comparison */
  	for (i = 0; i < scan->numberOfOrderBys; i++)
  	{
! 		if (sa->distances[i] != sb->distances[i])
! 			return (sa->distances[i] > sb->distances[i]) ? 1 : -1;
  	}
  
  	return 0;
--- 39,53 ----
  	/* Order according to distance comparison */
  	for (i = 0; i < scan->numberOfOrderBys; i++)
  	{
! 		if (sa->distances[i].value != sb->distances[i].value)
! 			return (sa->distances[i].value > sb->distances[i].value) ? 1 : -1;
! 
! 		/*
! 		 * Items without recheck can be immediately returned. So they are
! 		 * placed first.
! 		 */
! 		if (sa->distances[i].recheck != sb->distances[i].recheck)
! 			return sa->distances[i].recheck ? 1 : -1;
  	}
  
  	return 0;
*************** GISTSearchTreeItemAllocator(void *arg)
*** 83,89 ****
  {
  	IndexScanDesc scan = (IndexScanDesc) arg;
  
! 	return palloc(GSTIHDRSZ + sizeof(double) * scan->numberOfOrderBys);
  }
  
  static void
--- 93,99 ----
  {
  	IndexScanDesc scan = (IndexScanDesc) arg;
  
! 	return palloc(GSTIHDRSZ + sizeof(GISTSearchTreeItemDistance) * scan->numberOfOrderBys);
  }
  
  static void
*************** gistbeginscan(PG_FUNCTION_ARGS)
*** 127,136 ****
  	so->queueCxt = giststate->scanCxt;	/* see gistrescan */
  
  	/* workspaces with size dependent on numberOfOrderBys: */
! 	so->tmpTreeItem = palloc(GSTIHDRSZ + sizeof(double) * scan->numberOfOrderBys);
! 	so->distances = palloc(sizeof(double) * scan->numberOfOrderBys);
  	so->qual_ok = true;			/* in case there are zero keys */
  
  	scan->opaque = so;
  
  	MemoryContextSwitchTo(oldCxt);
--- 137,153 ----
  	so->queueCxt = giststate->scanCxt;	/* see gistrescan */
  
  	/* workspaces with size dependent on numberOfOrderBys: */
! 	so->tmpTreeItem = palloc(GSTIHDRSZ + sizeof(GISTSearchTreeItemDistance) * scan->numberOfOrderBys);
! 	so->distances = palloc(sizeof(GISTSearchTreeItemDistance) * scan->numberOfOrderBys);
  	so->qual_ok = true;			/* in case there are zero keys */
  
+ 	if (scan->numberOfOrderBys > 0)
+ 	{
+ 		so->orderByRechecks = (FmgrInfo *)palloc(sizeof(FmgrInfo) * scan->numberOfOrderBys);
+ 		so->indexInfo = BuildIndexInfo(scan->indexRelation);
+ 		so->estate = CreateExecutorState();
+ 	}
+ 
  	scan->opaque = so;
  
  	MemoryContextSwitchTo(oldCxt);
*************** gistrescan(PG_FUNCTION_ARGS)
*** 186,194 ****
  		first_time = false;
  	}
  
  	/* create new, empty RBTree for search queue */
  	oldCxt = MemoryContextSwitchTo(so->queueCxt);
! 	so->queue = rb_create(GSTIHDRSZ + sizeof(double) * scan->numberOfOrderBys,
  						  GISTSearchTreeItemComparator,
  						  GISTSearchTreeItemCombiner,
  						  GISTSearchTreeItemAllocator,
--- 203,216 ----
  		first_time = false;
  	}
  
+ 	if (scan->numberOfOrderBys > 0 && !so->slot)
+ 	{
+ 		so->slot = MakeSingleTupleTableSlot(RelationGetDescr(scan->heapRelation));
+ 	}
+ 
  	/* create new, empty RBTree for search queue */
  	oldCxt = MemoryContextSwitchTo(so->queueCxt);
! 	so->queue = rb_create(GSTIHDRSZ + sizeof(GISTSearchTreeItemDistance) * scan->numberOfOrderBys,
  						  GISTSearchTreeItemComparator,
  						  GISTSearchTreeItemCombiner,
  						  GISTSearchTreeItemAllocator,
*************** gistrescan(PG_FUNCTION_ARGS)
*** 289,294 ****
--- 311,319 ----
  					 GIST_DISTANCE_PROC, skey->sk_attno,
  					 RelationGetRelationName(scan->indexRelation));
  
+ 			fmgr_info_copy(&so->orderByRechecks[i], &(skey->sk_func),
+ 													so->giststate->scanCxt);
+ 
  			fmgr_info_copy(&(skey->sk_func), finfo, so->giststate->scanCxt);
  
  			/* Restore prior fn_extra pointers, if not first time */
*************** gistendscan(PG_FUNCTION_ARGS)
*** 323,328 ****
--- 348,356 ----
  	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
  	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
  
+ 	if (so->slot)
+ 		ExecDropSingleTupleTableSlot(so->slot);
+ 
  	/*
  	 * freeGISTstate is enough to clean up everything made by gistbeginscan,
  	 * as well as the queueCxt if there is a separate context for it.
diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c
new file mode 100644
index 41178a6..16b60fe
*** a/src/backend/utils/adt/geo_ops.c
--- b/src/backend/utils/adt/geo_ops.c
*************** dist_cpoly(PG_FUNCTION_ARGS)
*** 2664,2669 ****
--- 2664,2715 ----
  	PG_RETURN_FLOAT8(result);
  }
  
+ Datum
+ dist_polyp(PG_FUNCTION_ARGS)
+ {
+ 	POLYGON    *poly = PG_GETARG_POLYGON_P(0);
+ 	Point	   *point = PG_GETARG_POINT_P(1);
+ 	float8		result;
+ 	float8		d;
+ 	int			i;
+ 	LSEG		seg;
+ 
+ 	if (point_inside(point, poly->npts, poly->p) != 0)
+ 	{
+ #ifdef GEODEBUG
+ 		printf("dist_polyp- point inside of polygon\n");
+ #endif
+ 		PG_RETURN_FLOAT8(0.0);
+ 	}
+ 
+ 	/* initialize distance with segment between first and last points */
+ 	seg.p[0].x = poly->p[0].x;
+ 	seg.p[0].y = poly->p[0].y;
+ 	seg.p[1].x = poly->p[poly->npts - 1].x;
+ 	seg.p[1].y = poly->p[poly->npts - 1].y;
+ 	result = dist_ps_internal(point, &seg);
+ #ifdef GEODEBUG
+ 	printf("dist_polyp- segment 0/n distance is %f\n", result);
+ #endif
+ 
+ 	/* check distances for other segments */
+ 	for (i = 0; (i < poly->npts - 1); i++)
+ 	{
+ 		seg.p[0].x = poly->p[i].x;
+ 		seg.p[0].y = poly->p[i].y;
+ 		seg.p[1].x = poly->p[i + 1].x;
+ 		seg.p[1].y = poly->p[i + 1].y;
+ 		d = dist_ps_internal(point, &seg);
+ #ifdef GEODEBUG
+ 		printf("dist_polyp- segment %d distance is %f\n", (i + 1), d);
+ #endif
+ 		if (d < result)
+ 			result = d;
+ 	}
+ 
+ 	PG_RETURN_FLOAT8(result);
+ }
+ 
  
  /*---------------------------------------------------------------------
   *		interpt_
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
new file mode 100644
index cae6dbc..b2572df
*** a/src/include/access/gist_private.h
--- b/src/include/access/gist_private.h
***************
*** 16,22 ****
--- 16,24 ----
  
  #include "access/gist.h"
  #include "access/itup.h"
+ #include "executor/tuptable.h"
  #include "fmgr.h"
+ #include "nodes/execnodes.h"
  #include "storage/bufmgr.h"
  #include "storage/buffile.h"
  #include "utils/rbtree.h"
*************** typedef struct GISTSearchItem
*** 119,124 ****
--- 121,132 ----
  
  #define GISTSearchItemIsHeap(item)	((item).blkno == InvalidBlockNumber)
  
+ typedef struct GISTSearchTreeItemDistance
+ {
+ 	double	value;
+ 	bool	recheck;
+ } GISTSearchTreeItemDistance;
+ 
  /*
   * Within a GISTSearchTreeItem's chain, heap items always appear before
   * index-page items, since we want to visit heap items first.  lastHeap points
*************** typedef struct GISTSearchTreeItem
*** 129,135 ****
  	RBNode		rbnode;			/* this is an RBTree item */
  	GISTSearchItem *head;		/* first chain member */
  	GISTSearchItem *lastHeap;	/* last heap-tuple member, if any */
! 	double		distances[1];	/* array with numberOfOrderBys entries */
  } GISTSearchTreeItem;
  
  #define GSTIHDRSZ offsetof(GISTSearchTreeItem, distances)
--- 137,143 ----
  	RBNode		rbnode;			/* this is an RBTree item */
  	GISTSearchItem *head;		/* first chain member */
  	GISTSearchItem *lastHeap;	/* last heap-tuple member, if any */
! 	GISTSearchTreeItemDistance	distances[1];	/* array with numberOfOrderBys entries */
  } GISTSearchTreeItem;
  
  #define GSTIHDRSZ offsetof(GISTSearchTreeItem, distances)
*************** typedef struct GISTScanOpaqueData
*** 149,160 ****
  
  	/* pre-allocated workspace arrays */
  	GISTSearchTreeItem *tmpTreeItem;	/* workspace to pass to rb_insert */
! 	double	   *distances;		/* output area for gistindex_keytest */
  
  	/* In a non-ordered search, returnable heap items are stored here: */
  	GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
  	OffsetNumber nPageData;		/* number of valid items in array */
  	OffsetNumber curPageData;	/* next item to return */
  } GISTScanOpaqueData;
  
  typedef GISTScanOpaqueData *GISTScanOpaque;
--- 157,172 ----
  
  	/* pre-allocated workspace arrays */
  	GISTSearchTreeItem *tmpTreeItem;	/* workspace to pass to rb_insert */
! 	GISTSearchTreeItemDistance	*distances;		/* output area for gistindex_keytest */
  
  	/* In a non-ordered search, returnable heap items are stored here: */
  	GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
  	OffsetNumber nPageData;		/* number of valid items in array */
  	OffsetNumber curPageData;	/* next item to return */
+ 	FmgrInfo	*orderByRechecks;
+ 	IndexInfo *indexInfo;
+ 	TupleTableSlot *slot;
+ 	EState *estate;
  } GISTScanOpaqueData;
  
  typedef GISTScanOpaqueData *GISTScanOpaque;
diff --git a/src/include/catalog/pg_amop.h b/src/include/catalog/pg_amop.h
new file mode 100644
index c8a548c..e7a79c6
*** a/src/include/catalog/pg_amop.h
--- b/src/include/catalog/pg_amop.h
*************** DATA(insert (	2594   604 604 11 s 2577 7
*** 638,643 ****
--- 638,644 ----
  DATA(insert (	2594   604 604 12 s 2576 783 0 ));
  DATA(insert (	2594   604 604 13 s 2861 783 0 ));
  DATA(insert (	2594   604 604 14 s 2860 783 0 ));
+ DATA(insert (	2594   604 600 15 o 3569 783 1970 ));
  
  /*
   *	gist circle_ops
diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h
new file mode 100644
index 53a3a7a..29c7c09
*** a/src/include/catalog/pg_amproc.h
--- b/src/include/catalog/pg_amproc.h
*************** DATA(insert (	2594   604 604 4 2580 ));
*** 188,193 ****
--- 188,194 ----
  DATA(insert (	2594   604 604 5 2581 ));
  DATA(insert (	2594   604 604 6 2582 ));
  DATA(insert (	2594   604 604 7 2584 ));
+ DATA(insert (	2594   604 604 8 3567 ));
  DATA(insert (	2595   718 718 1 2591 ));
  DATA(insert (	2595   718 718 2 2583 ));
  DATA(insert (	2595   718 718 3 2592 ));
diff --git a/src/include/catalog/pg_operator.h b/src/include/catalog/pg_operator.h
new file mode 100644
index 78efaa5..32ac483
*** a/src/include/catalog/pg_operator.h
--- b/src/include/catalog/pg_operator.h
*************** DATA(insert OID = 709 (  "<->"	   PGNSP 
*** 591,596 ****
--- 591,598 ----
  DESCR("distance between");
  DATA(insert OID = 712 (  "<->"	   PGNSP PGUID b f f 604 604 701 712	 0 poly_distance - - ));
  DESCR("distance between");
+ DATA(insert OID = 3569 (  "<->"	   PGNSP PGUID b f f 604 600 701 0 		 0 dist_polyp - - ));
+ DESCR("distance between");
  
  DATA(insert OID = 713 (  "<>"	   PGNSP PGUID b f f 600 600	16 713 510 point_ne neqsel neqjoinsel ));
  DESCR("not equal");
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
new file mode 100644
index 0117500..85d077b
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 726 (  dist_lb		   PGN
*** 809,814 ****
--- 809,815 ----
  DATA(insert OID = 727 (  dist_sl		   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 701 "601 628" _null_ _null_ _null_ _null_	dist_sl _null_ _null_ _null_ ));
  DATA(insert OID = 728 (  dist_cpoly		   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 701 "718 604" _null_ _null_ _null_ _null_	dist_cpoly _null_ _null_ _null_ ));
  DATA(insert OID = 729 (  poly_distance	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 701 "604 604" _null_ _null_ _null_ _null_	poly_distance _null_ _null_ _null_ ));
+ DATA(insert OID = 3568 (  dist_polyp	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 701 "604 600" _null_ _null_ _null_ _null_	dist_polyp _null_ _null_ _null_ ));
  
  DATA(insert OID = 740 (  text_lt		   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "25 25" _null_ _null_ _null_ _null_ text_lt _null_ _null_ _null_ ));
  DATA(insert OID = 741 (  text_le		   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "25 25" _null_ _null_ _null_ _null_ text_le _null_ _null_ _null_ ));
*************** DATA(insert OID = 2585 (  gist_poly_cons
*** 3937,3942 ****
--- 3938,3945 ----
  DESCR("GiST support");
  DATA(insert OID = 2586 (  gist_poly_compress	PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 "2281" _null_ _null_ _null_ _null_ gist_poly_compress _null_ _null_ _null_ ));
  DESCR("GiST support");
+ DATA(insert OID = 3567 (  gist_poly_distance	PGNSP PGUID 12 1 0 0 0 f f f f t f i 4 0 701 "2281 600 23 26" _null_ _null_ _null_ _null_	gist_poly_distance _null_ _null_ _null_ ));
+ DESCR("GiST support");
  DATA(insert OID = 2591 (  gist_circle_consistent PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 16 "2281 718 23 26 2281" _null_ _null_ _null_ _null_	gist_circle_consistent _null_ _null_ _null_ ));
  DESCR("GiST support");
  DATA(insert OID = 2592 (  gist_circle_compress	PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 2281 "2281" _null_ _null_ _null_ _null_ gist_circle_compress _null_ _null_ _null_ ));
diff --git a/src/include/utils/geo_decls.h b/src/include/utils/geo_decls.h
new file mode 100644
index 1e648c0..b8a04cb
*** a/src/include/utils/geo_decls.h
--- b/src/include/utils/geo_decls.h
*************** extern Datum circle_radius(PG_FUNCTION_A
*** 395,400 ****
--- 395,401 ----
  extern Datum circle_distance(PG_FUNCTION_ARGS);
  extern Datum dist_pc(PG_FUNCTION_ARGS);
  extern Datum dist_cpoly(PG_FUNCTION_ARGS);
+ extern Datum dist_polyp(PG_FUNCTION_ARGS);
  extern Datum circle_center(PG_FUNCTION_ARGS);
  extern Datum cr_circle(PG_FUNCTION_ARGS);
  extern Datum box_circle(PG_FUNCTION_ARGS);
*************** extern Datum gist_circle_consistent(PG_F
*** 418,423 ****
--- 419,425 ----
  extern Datum gist_point_compress(PG_FUNCTION_ARGS);
  extern Datum gist_point_consistent(PG_FUNCTION_ARGS);
  extern Datum gist_point_distance(PG_FUNCTION_ARGS);
+ extern Datum gist_poly_distance(PG_FUNCTION_ARGS);
  
  /* geo_selfuncs.c */
  extern Datum areasel(PG_FUNCTION_ARGS);
#2Andres Freund
andres@2ndquadrant.com
In reply to: Alexander Korotkov (#1)
Re: PoC: Partial sort

Hi,

Cool stuff.

On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:

Currently when we need to get ordered result from table we have to choose
one of two approaches: get results from index in exact order we need or do
sort of tuples. However, it could be useful to mix both methods: get
results from index in order which partially meets our requirements and do
rest of work from heap.

------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
time=0.097..0.099 rows=10 loops=1)
-> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
time=0.096..0.097 rows=10 loops=1)
Sort Key: v1, v2
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47604.42
rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
Total runtime: 0.125 ms
(6 rows)

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

*partial-knn-1.patch*

KNN-GiST provides ability to get ordered results from index, but this order
is based only on index information. For instance, GiST index contains
bounding rectangles for polygons, and we can't get exact distance to
polygon from index (similar situation is in PostGIS). In attached patch,
GiST distance method can set recheck flag (similar to consistent method).
This flag means that distance method returned lower bound of distance and
we should recheck it from heap.

See an example.

create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,1000000) id);
create index test_idx on test using gist (p);

We can get results ordered by distance from polygon to point.

postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
point(0.5,0.5) limit 10;

----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
rows=10 loops=1)
-> Index Scan using test_idx on test (cost=0.29..157672.29
rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
Order By: (p <-> '(0.5,0.5)'::point)
Total runtime: 0.305 ms
(4 rows)

Rechecking from the heap means adding a sort node though, which I don't
see here? Or am I misunderstanding something?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#2)
Re: PoC: Partial sort

Hi!

Thanks for feedback!

On Sat, Dec 14, 2013 at 4:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Hi,

Cool stuff.

On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:

Currently when we need to get ordered result from table we have to choose
one of two approaches: get results from index in exact order we need or do
sort of tuples. However, it could be useful to mix both methods: get
results from index in order which partially meets our requirements and do
rest of work from heap.

------------------------------------------------------------------------------------------------------------------------------------------

Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
time=0.097..0.099 rows=10 loops=1)
-> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
time=0.096..0.097 rows=10 loops=1)
Sort Key: v1, v2
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47604.42
rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
Total runtime: 0.125 ms
(6 rows)

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

In this patch I don't do a full sort of the dataset. For instance, the index
returns data ordered by the first column and we also need to order it by the
second column. This node then sorts groups (assumed to be small) of tuples
with equal values of the first column by the value of the second column. And
with a limit clause, only the required number of such groups will be
processed. But I don't think we should expect pre-sorted values of the second
column inside a group.
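
As a rough illustration of the per-group sorting described above (this is not the patch's executor code), here is a minimal standalone C sketch. The Row type, cmp_v2 and partial_sort are hypothetical names made up for the example; it assumes the input array already arrives ordered by v1, as it would from the index scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row type for illustration: v1 is the indexed column. */
typedef struct Row
{
    int     v1;
    double  v2;
} Row;

/* Compare two rows on v2 only; v1 is equal within a group. */
static int
cmp_v2(const void *a, const void *b)
{
    const Row  *ra = (const Row *) a;
    const Row  *rb = (const Row *) b;

    if (ra->v2 < rb->v2)
        return -1;
    if (ra->v2 > rb->v2)
        return 1;
    return 0;
}

/*
 * Order rows[0..nrows) by (v1, v2), assuming the input is already ordered
 * by v1.  Whole groups are sorted one at a time, and processing stops as
 * soon as at least "limit" rows are in final order (limit == 0 means no
 * limit).  Returns the number of leading rows that are fully sorted.
 */
static size_t
partial_sort(Row *rows, size_t nrows, size_t limit)
{
    size_t      start = 0;

    assert(rows != NULL || nrows == 0);

    while (start < nrows && (limit == 0 || start < limit))
    {
        size_t      end = start + 1;

        /* Find the end of the group sharing rows[start].v1. */
        while (end < nrows && rows[end].v1 == rows[start].v1)
            end++;

        /* Sort just this group on the remaining column. */
        qsort(rows + start, end - start, sizeof(Row), cmp_v2);
        start = end;
    }
    return start;
}
```

With a limit of 10 and groups of about 100 rows each, only the first group is ever sorted, which is why the index scan in the plan above reads so few rows.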

*partial-knn-1.patch*

KNN-GiST provides ability to get ordered results from index, but this order
is based only on index information. For instance, GiST index contains
bounding rectangles for polygons, and we can't get exact distance to
polygon from index (similar situation is in PostGIS). In attached patch,
GiST distance method can set recheck flag (similar to consistent method).
This flag means that distance method returned lower bound of distance and
we should recheck it from heap.

See an example.

create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,1000000) id);
create index test_idx on test using gist (p);

We can get results ordered by distance from polygon to point.

postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
point(0.5,0.5) limit 10;

----------------------------------------------------------------------------------------------------------------------------------

Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
rows=10 loops=1)
-> Index Scan using test_idx on test (cost=0.29..157672.29
rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
Order By: (p <-> '(0.5,0.5)'::point)
Total runtime: 0.305 ms
(4 rows)

Rechecking from the heap means adding a sort node though, which I don't
see here? Or am I misunderstanding something?

KNN-GiST contains an RB-tree of scanned items. In this patch an item is
rechecked inside GiST and reinserted into the same RB-tree. This appears to be
a much easier implementation for a PoC and also looks very efficient. I'm not
sure what the right design for it actually is. This is what I'd like to
discuss.
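
The recheck-and-reinsert loop can be sketched outside of PostgreSQL as follows. This is only an illustration under stated assumptions: it uses a small binary min-heap in place of the RB-tree, and QueueItem, Queue, next_nearest and the demo_exact_distance callback are hypothetical names, not code from the patch. The tie-break (exact items before items needing recheck) mirrors the comparator change in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 64       /* arbitrary bound for this sketch */

typedef struct QueueItem
{
    int     id;                 /* stands in for the heap tuple pointer */
    double  distance;           /* only a lower bound while recheck is set */
    bool    recheck;            /* true: exact distance not yet computed */
} QueueItem;

typedef struct Queue
{
    QueueItem   items[QUEUE_CAPACITY];
    size_t      nitems;
} Queue;

/* Order by distance; on ties, exact items go before items needing recheck. */
static bool
item_less(const QueueItem *a, const QueueItem *b)
{
    if (a->distance != b->distance)
        return a->distance < b->distance;
    return !a->recheck && b->recheck;
}

static void
queue_push(Queue *q, QueueItem item)
{
    size_t      i;

    assert(q->nitems < QUEUE_CAPACITY);
    i = q->nitems++;
    while (i > 0)               /* sift up */
    {
        size_t      parent = (i - 1) / 2;

        if (!item_less(&item, &q->items[parent]))
            break;
        q->items[i] = q->items[parent];
        i = parent;
    }
    q->items[i] = item;
}

static QueueItem
queue_pop(Queue *q)
{
    QueueItem   top, last;
    size_t      i = 0;

    assert(q->nitems > 0);
    top = q->items[0];
    last = q->items[--q->nitems];
    for (;;)                    /* sift down */
    {
        size_t      child = 2 * i + 1;

        if (child >= q->nitems)
            break;
        if (child + 1 < q->nitems &&
            item_less(&q->items[child + 1], &q->items[child]))
            child++;
        if (!item_less(&q->items[child], &last))
            break;
        q->items[i] = q->items[child];
        i = child;
    }
    q->items[i] = last;
    return top;
}

/* Demo-only exact distances for ids 0..2 (hypothetical data). */
static const double demo_exact[] = {5.0, 2.0, 3.0};

static double
demo_exact_distance(int id)
{
    return demo_exact[id];
}

/*
 * Return the id of the next nearest item, or -1 when the queue is empty.
 * An item whose distance is only a lower bound gets its exact distance
 * computed and is pushed back; it is returned only once it reaches the
 * head of the queue again with an exact distance.
 */
static int
next_nearest(Queue *q, double (*exact_distance) (int id))
{
    while (q->nitems > 0)
    {
        QueueItem   item = queue_pop(q);

        if (!item.recheck)
            return item.id;
        item.distance = exact_distance(item.id);
        item.recheck = false;
        queue_push(q, item);
    }
    return -1;
}
```

This only yields correct ordering if the index-side distance really is a lower bound: the exact distance can then only move an item further back in the queue, never ahead of something already returned.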

------
With best regards,
Alexander Korotkov.

#4Jeremy Harris
jgh@wizmail.org
In reply to: Andres Freund (#2)
Re: PoC: Partial sort

On 14/12/13 12:54, Andres Freund wrote:

On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:

Currently when we need to get ordered result from table we have to choose
one of two approaches: get results from index in exact order we need or do
sort of tuples. However, it could be useful to mix both methods: get
results from index in order which partially meets our requirements and do
rest of work from heap.

------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
time=0.097..0.099 rows=10 loops=1)
-> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
time=0.096..0.097 rows=10 loops=1)
Sort Key: v1, v2
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47604.42
rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
Total runtime: 0.125 ms
(6 rows)

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

Eg: /messages/by-id/5291467E.6070807@wizmail.org

Maybe Alexander and I should bash our heads together.

--
Cheers,
Jeremy

#5Martijn van Oosterhout
kleptog@svana.org
In reply to: Alexander Korotkov (#3)
Re: PoC: Partial sort

On Sat, Dec 14, 2013 at 06:21:18PM +0400, Alexander Korotkov wrote:

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

In this patch I don't do a full sort of the dataset. For instance, the index
returns data ordered by the first column and we also need to order it by the
second column. This node then sorts groups (assumed to be small) of tuples
with equal values of the first column by the value of the second column. And
with a limit clause, only the required number of such groups will be
processed. But I don't think we should expect pre-sorted values of the second
column inside a group.

Nice. I imagine this would be mostly beneficial for fast-start plans,
since you no longer need to sort the whole table prior to returning the
first tuple.

Reduced memory usage might be a factor, especially for large sorts
where you otherwise might need to spool to disk.

You can now use an index on (a) to improve sorting for (a,b).

Cost of sorting n groups of size l goes from O(nl log nl) to just O(nl
log l), useful for large n.
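
To make the saving explicit, here is a short worked comparison. It assumes a comparison sort costing roughly c k log k for k tuples, where c is an assumed constant, and n equally sized groups of l tuples each:

```latex
\begin{align*}
T_{\text{full}}    &= c\, nl \log(nl) = c\, nl \log n + c\, nl \log l \\
T_{\text{partial}} &= n \cdot c\, l \log l = c\, nl \log l \\
T_{\text{full}} - T_{\text{partial}} &= c\, nl \log n
\end{align*}
```

For example, with n = 10^4 groups of l = 100 tuples, log(nl) / log(l) = 6 / 2 = 3, so under this model the per-group sort does about a third of the comparison work of a full sort.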

Minor comments:

I find cmpTuple a bad name. That's what it's doing but perhaps
cmpSkipColumns would be clearer.

I think it's worthwhile adding a separate path for the skipCols = 0
case, to avoid extra copies.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#6Andres Freund
andres@2ndquadrant.com
In reply to: Alexander Korotkov (#3)
Re: PoC: Partial sort

Hi,

Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
time=0.097..0.099 rows=10 loops=1)
-> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
time=0.096..0.097 rows=10 loops=1)
Sort Key: v1, v2
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47604.42
rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
Total runtime: 0.125 ms
(6 rows)

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

In this patch I don't do a full sort of the dataset. For instance, the index
returns data ordered by the first column and we also need to order it by the
second column.

Ah, that makes sense.

But I don't think we should expect pre-sorted values of the second
column inside a group.

Yes, if you do it that way, there doesn't seem to be any need to assume
that any more than we usually do.

I think you should make the explain output reflect the fact that we're
assuming v1 is presorted and just sorting v2. I'd be happy enough with:
Sort Key: v1, v2
Partial Sort: v2
or even just
"Partial Sort Key: [v1,] v2"
but I am sure others disagree.

*partial-knn-1.patch*

Rechecking from the heap means adding a sort node though, which I don't
see here? Or am I misunderstanding something?

KNN-GiST contains an RB-tree of scanned items. In this patch an item is rechecked
inside GiST and reinserted into the same RB-tree. It appears to be a much easier
implementation for a PoC and also looks very efficient. I'm not sure what the
right design for it actually is. This is what I'd like to discuss.
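The recheck-and-reinsert idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the patch's code: the Item type, next_nearest and demo_exact_distance are hypothetical names, a plain array with a linear minimum scan stands in for the RB-tree, and the sketch assumes that the distance coming from the index is a lower bound on the exact distance obtained from the heap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue entry; the patch keeps such items in GiST's RB-tree. */
typedef struct Item
{
    int     id;
    double  distance;   /* lower bound until rechecked, exact afterwards */
    bool    rechecked;
} Item;

/* Index of the entry with the smallest distance (linear scan for brevity). */
static size_t
find_min(const Item *queue, size_t n)
{
    size_t  best = 0;

    for (size_t i = 1; i < n; i++)
        if (queue[i].distance < queue[best].distance)
            best = i;
    return best;
}

/* Example callback standing in for the heap fetch (made-up distances). */
double
demo_exact_distance(int id)
{
    static const double exact[] = {5.0, 2.0, 3.0};

    return exact[id];
}

/*
 * Pop the next nearest item into *out. Returns false when the queue is empty.
 * An item is returned only once its distance is exact; otherwise it is
 * rechecked and put back with the exact distance.
 */
bool
next_nearest(Item *queue, size_t *n, double (*exact_distance) (int id),
             Item *out)
{
    assert(queue != NULL && n != NULL && out != NULL);

    while (*n > 0)
    {
        size_t  i = find_min(queue, *n);

        if (queue[i].rechecked)
        {
            *out = queue[i];
            queue[i] = queue[--(*n)];   /* remove by moving the last entry */
            return true;
        }

        /* Recheck in place: same effect as remove plus reinsert. */
        queue[i].distance = exact_distance(queue[i].id);
        queue[i].rechecked = true;
    }
    return false;
}
```

Because an unrechecked entry can only move further away after the recheck, an entry that is both the minimum and already exact is safe to return, so no separate sort node is needed in this scheme.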

I don't have enough clue about gist to say whether it's the right design,
but it doesn't look wrong to my eyes. It'd probably be useful to export
the knowledge that we are rechecking and how often that happens to the
outside.

While I didn't really look into the patch, I noticed in passing that you
pass an all_dead variable to heap_hot_search_buffer without using the
result - just pass NULL instead, that performs a bit less work.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Andreas Karlsson
andreas@proxel.se
In reply to: Alexander Korotkov (#1)
Re: PoC: Partial sort

On 12/14/2013 10:59 AM, Alexander Korotkov wrote:

This patch allows to use index for order-by if order-by clause and index
has non-empty common prefix. So, index gives right ordering for first n
order-by columns. In order to provide right order for rest m columns,
sort node is inserted. This sort node sorts groups of tuples where
values of first n order-by columns are equal.

I recently looked at the same problem. I see that you solved the
rescanning problem by simply forcing the sort to be redone on
ExecReScanSort if you have done a partial sort.

My idea for a solution was to modify tuplesort to allow storing the
already sorted keys in either memtuples or the sort result file, but
setting a field so it does not sort the already sorted tuples again.
This would allow the rescan to work as it used to, but I am unsure how
clean or ugly this code would be. Was this something you considered?

--
Andreas Karlsson


#8Alexander Korotkov
aekorotkov@gmail.com
In reply to: Martijn van Oosterhout (#5)
Re: PoC: Partial sort

On Sat, Dec 14, 2013 at 6:39 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:

On Sat, Dec 14, 2013 at 06:21:18PM +0400, Alexander Korotkov wrote:

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

In this patch I don't do full sort of dataset. For instance, index returns
data ordered by first column and we need to order them also by second
column. Then this node sorts groups (assumed to be small) where values of
the first column are same by value of second column. And with limit clause
only required number of such groups will be processed. But, I don't think
we should expect pre-sorted values of second column inside a group.
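The group-by-group sort described above can be sketched as follows. This is a minimal illustration, not the patch's executor code: the Row type and partial_sort are hypothetical names, the input is a plain array assumed to be already ordered by v1 (as an index scan on v1 would deliver it), and the standard qsort() stands in for tuplesort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row type for illustration; not a PostgreSQL structure. */
typedef struct Row
{
    int     v1;     /* leading key, already ordered by the index scan */
    double  v2;     /* trailing key, unordered within a v1 group */
} Row;

/* qsort comparator on the trailing key only. */
static int
cmp_v2(const void *a, const void *b)
{
    const Row  *ra = (const Row *) a;
    const Row  *rb = (const Row *) b;

    return (ra->v2 > rb->v2) - (ra->v2 < rb->v2);
}

/*
 * Sort rows[0..n) by (v1, v2), assuming the input is already ordered by v1.
 * Stops after the group that brings the sorted prefix to at least "limit"
 * rows (limit == 0 means no limit). Returns the length of the sorted prefix.
 */
size_t
partial_sort(Row *rows, size_t n, size_t limit)
{
    size_t  start = 0;

    assert(rows != NULL || n == 0);

    while (start < n)
    {
        size_t  end = start + 1;

        /* Find the end of the current group of equal v1 values. */
        while (end < n && rows[end].v1 == rows[start].v1)
            end++;

        /* Only this group needs sorting, and only by v2. */
        qsort(rows + start, end - start, sizeof(Row), cmp_v2);
        start = end;

        if (limit > 0 && start >= limit)
            break;
    }
    return start;
}
```

With a limit, groups past the one that satisfies it are never touched, which is where the fast-start behaviour comes from.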

Nice. I imagine this would be mostly beneficial for fast-start plans,
since you no longer need to sort the whole table prior to returning the
first tuple.

Reduced memory usage might be a factor, especially for large sorts
where you otherwise might need to spool to disk.

You can now use an index on (a) to improve sorting for (a,b).

Cost of sorting n groups of size l goes from O(nl log nl) to just O(nl
log l), useful for large n.

Agree. Your reasoning looks correct.
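To put rough numbers on that estimate, here is a worked example with hypothetical round figures (n = 1000 groups of l = 1000 tuples each; these values are chosen for illustration and are not taken from a measurement):

```latex
\begin{align*}
\text{full sort:}    \quad & nl \log_2(nl) = nl \log_2 n + nl \log_2 l
                             \approx 10^6 \times 19.93 \approx 1.99 \times 10^7 \\
\text{partial sort:} \quad & n \cdot (l \log_2 l) = nl \log_2 l
                             \approx 10^6 \times 9.97 \approx 9.97 \times 10^6 \\
\text{saving:}       \quad & nl \log_2 n \approx 9.97 \times 10^6 \text{ comparisons}
\end{align*}
```

The saved term grows with the number of groups n, so the benefit is largest when the presorted prefix splits the input into many small groups.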

Minor comments:

I find cmpTuple a bad name. That's what it's doing but perhaps
cmpSkipColumns would be clearer.

I think it's worthwhile adding a separate path for the skipCols = 0
case, to avoid extra copies.

Thanks. I'll take care of it.

------
With best regards,
Alexander Korotkov.

#9Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#6)
Re: PoC: Partial sort

On Sat, Dec 14, 2013 at 7:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Hi,

Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
time=0.097..0.099 rows=10 loops=1)
-> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
time=0.096..0.097 rows=10 loops=1)
Sort Key: v1, v2
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47604.42
rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
Total runtime: 0.125 ms
(6 rows)

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

In this patch I don't do full sort of dataset. For instance, index returns
data ordered by first column and we need to order them also by second
column.

Ah, that makes sense.

But, I don't think we should expect pre-sorted values of second column
inside a group.

Yes, if you do it that way, there doesn't seem to be any need to assume
that any more than we usually do.

I think you should make the explain output reflect the fact that we're
assuming v1 is presorted and just sorting v2. I'd be happy enough with:
Sort Key: v1, v2
Partial Sort: v2
or even just
"Partial Sort Key: [v1,] v2"
but I am sure others disagree.

Sure, I just haven't changed the explain output yet. It should look like what you
propose.

*partial-knn-1.patch*

Rechecking from the heap means adding a sort node though, which I don't
see here? Or am I misunderstanding something?

KNN-GiST contains an RB-tree of scanned items. In this patch an item is rechecked
inside GiST and reinserted into the same RB-tree. It appears to be a much easier
implementation for a PoC and also looks very efficient. I'm not sure what the
right design for it actually is. This is what I'd like to discuss.

I don't have enough clue about gist to say whether it's the right design,
but it doesn't look wrong to my eyes. It'd probably be useful to export
the knowledge that we are rechecking and how often that happens to the
outside.

While I didn't really look into the patch, I noticed in passing that you
pass an all_dead variable to heap_hot_search_buffer without using the
result - just pass NULL instead, that performs a bit less work.

Useful notice, thanks.

------
With best regards,
Alexander Korotkov.

#10Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andreas Karlsson (#7)
Re: PoC: Partial sort

On Sat, Dec 14, 2013 at 11:47 PM, Andreas Karlsson <andreas@proxel.se> wrote:

On 12/14/2013 10:59 AM, Alexander Korotkov wrote:

This patch allows to use index for order-by if order-by clause and index
has non-empty common prefix. So, index gives right ordering for first n
order-by columns. In order to provide right order for rest m columns,
sort node is inserted. This sort node sorts groups of tuples where
values of first n order-by columns are equal.

I recently looked at the same problem. I see that you solved the
rescanning problem by simply forcing the sort to be redone on
ExecReScanSort if you have done a partial sort.

Honestly, I'm not sure I solved it at all :) I just got a version of the patch
working for very limited use cases.

My idea for a solution was to modify tuplesort to allow storing the
already sorted keys in either memtuples or the sort result file, but
setting a field so it does not sort the already sorted tuples again. This
would allow the rescan to work as it used to, but I am unsure how clean or
ugly this code would be. Was this something you considered?

I'm not sure. I believe the best answer depends on particular parameters:
how much memory we have for the sort, how expensive the underlying node is and
how it performs a rescan, and how big the groups in the partial sort are.

------
With best regards,
Alexander Korotkov.

#11Andreas Karlsson
andreas@proxel.se
In reply to: Alexander Korotkov (#10)
Re: PoC: Partial sort

On 12/18/2013 01:02 PM, Alexander Korotkov wrote:

My idea for a solution was to modify tuplesort to allow storing the
already sorted keys in either memtuples or the sort result file, but
setting a field so it does not sort the already sorted tuples
again. This would allow the rescan to work as it used to, but I am
unsure how clean or ugly this code would be. Was this something you
considered?

I'm not sure. I believe the best answer depends on particular
parameters: how much memory we have for the sort, how expensive the underlying
node is and how it performs a rescan, and how big the groups in the partial
sort are.

Yes, if one does not need a rescan your solution will use less memory
and about the same amount of CPU (if the tuplesort does not spill to
disk). While if we keep all the already sorted tuples in the tuplesort
rescans will be cheap but more memory will be used with an increased
chance of spilling to disk.

--
Andreas Karlsson


#12Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#1)
1 attachment(s)
Re: PoC: Partial sort

Hi!

Next revision. It is expected to work better with the optimizer. It introduces
a presorted_keys argument to the cost_sort function, which represents the number
of keys already sorted in the Path. The function then uses estimate_num_groups to
estimate the number of groups with distinct values of the presorted keys and
assumes that the dataset is divided uniformly among the
groups. get_cheapest_fractional_path_for_pathkeys tries to select the path
that matches the longest prefix of the path keys.
You can see it works pretty well on single-table queries.
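The cost split described above can be written out as follows. This is a reading of the cost_sort changes in the attached patch for the in-memory quicksort branch with no LIMIT (so output_tuples equals tuples), not a general statement of the planner's cost model. The symbols are shorthand introduced here: N is tuples, G is num_groups, c is comparison_cost, and S_in, R_in are the input startup and run costs.

```latex
\begin{align*}
\text{group\_cost} &= c \cdot \frac{N}{G} \cdot \log_2\frac{N}{G} \\
\text{startup\_cost} &= S_{in} + \text{group\_cost} + \frac{R_{in}}{G} \\
\text{run\_cost} &= (G - 1)\,\text{group\_cost}
                    + \text{cpu\_operator\_cost} \cdot N
                    + R_{in} \cdot \frac{G - 1}{G} \\
\text{total\_cost} &= S_{in} + R_{in} + c \cdot N \log_2\frac{N}{G}
                    + \text{cpu\_operator\_cost} \cdot N
\end{align*}
```

With G = 1 this reduces to the old behaviour, where the whole input and the whole sort are charged to startup. For larger G only one group's share is charged to startup, which is what lets the planner prefer the partial sort for queries with a LIMIT.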

create table test as (select id, (random()*5)::int as v1,
(random()*1000)::int as v2 from generate_series(1,1000000) id);
create index test_v1_idx on test (v1);
create index test_v1_v2_idx on test (v1, v2);
create index test_v2_idx on test (v2);
vacuum analyze;

postgres=# explain analyze select * from test order by v1, id;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------
Sort (cost=149244.84..151744.84 rows=1000000 width=12) (actual
time=2111.476..2586.493 rows=1000000 loops=1)
Sort Key: v1, id
Sort Method: external merge Disk: 21512kB
-> Seq Scan on test (cost=0.00..15406.00 rows=1000000 width=12)
(actual time=0.012..113.815 rows=1000000 loops=1)
Total runtime: 2683.011 ms
(5 rows)

postgres=# explain analyze select * from test order by v1, id limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=11441.77..11442.18 rows=10 width=12) (actual
time=79.980..79.982 rows=10 loops=1)
-> Partial sort (cost=11441.77..53140.44 rows=1000000 width=12)
(actual time=79.978..79.978 rows=10 loops=1)
Sort Key: v1, id
Presorted Key: v1
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47038.83
rows=1000000 width=12) (actual time=0.031..38.275 rows=100213 loops=1)
Total runtime: 81.786 ms
(7 rows)

postgres=# explain analyze select * from test order by v1, v2 limit 10;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.90 rows=10 width=12) (actual time=0.031..0.047
rows=10 loops=1)
-> Index Scan using test_v1_v2_idx on test (cost=0.42..47286.28
rows=1000000 width=12) (actual time=0.029..0.043 rows=10 loops=1)
Total runtime: 0.083 ms
(3 rows)

postgres=# explain analyze select * from test order by v2, id;
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
Partial sort (cost=97.75..99925.50 rows=1000000 width=12) (actual
time=1.069..1299.481 rows=1000000 loops=1)
Sort Key: v2, id
Presorted Key: v2
Sort Method: quicksort Memory: 52kB
-> Index Scan using test_v2_idx on test (cost=0.42..47603.79
rows=1000000 width=12) (actual time=0.030..812.083 rows=1000000 loops=1)
Total runtime: 1393.850 ms
(6 rows)

However, handling of joins needs more work.

------
With best regards,
Alexander Korotkov.

Attachments:

partial-sort-2.patchapplication/octet-stream; name=partial-sort-2.patchDownload
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index bd5428d..9edcc44
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_sort_keys(SortState *so
*** 77,83 ****
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_sort_keys_common(PlanState *planstate,
! 					  int nkeys, AttrNumber *keycols,
  					  List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
--- 77,83 ----
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  					   ExplainState *es);
  static void show_sort_keys_common(PlanState *planstate,
! 					  int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					  List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 901,907 ****
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
--- 901,910 ----
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			if (((Sort *) plan)->skipCols > 0)
! 				pname = sname = "Partial sort";
! 			else
! 				pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
*************** show_sort_keys(SortState *sortstate, Lis
*** 1694,1700 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_keys_common((PlanState *) sortstate,
! 						  plan->numCols, plan->sortColIdx,
  						  ancestors, es);
  }
  
--- 1697,1703 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_keys_common((PlanState *) sortstate,
! 						  plan->numCols, plan->skipCols, plan->sortColIdx,
  						  ancestors, es);
  }
  
*************** show_merge_append_keys(MergeAppendState 
*** 1708,1724 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_keys_common((PlanState *) mstate,
! 						  plan->numCols, plan->sortColIdx,
  						  ancestors, es);
  }
  
  static void
! show_sort_keys_common(PlanState *planstate, int nkeys, AttrNumber *keycols,
! 					  List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
! 	List	   *result = NIL;
  	bool		useprefix;
  	int			keyno;
  	char	   *exprstr;
--- 1711,1728 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_keys_common((PlanState *) mstate,
! 						  plan->numCols, 0, plan->sortColIdx,
  						  ancestors, es);
  }
  
  static void
! show_sort_keys_common(PlanState *planstate, int nkeys, int nPresortedKeys,
! 		AttrNumber *keycols, List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
! 	List	   *resultSort = NIL;
! 	List	   *resultPresorted = NIL;
  	bool		useprefix;
  	int			keyno;
  	char	   *exprstr;
*************** show_sort_keys_common(PlanState *plansta
*** 1745,1754 ****
  		/* Deparse the expression, showing any top-level cast */
  		exprstr = deparse_expression((Node *) target->expr, context,
  									 useprefix, true);
! 		result = lappend(result, exprstr);
  	}
  
! 	ExplainPropertyList("Sort Key", result, es);
  }
  
  /*
--- 1749,1763 ----
  		/* Deparse the expression, showing any top-level cast */
  		exprstr = deparse_expression((Node *) target->expr, context,
  									 useprefix, true);
! 
! 		if (keyno < nPresortedKeys)
! 			resultPresorted = lappend(resultPresorted, exprstr);
! 		resultSort = lappend(resultSort, exprstr);
  	}
  
! 	ExplainPropertyList("Sort Key", resultSort, es);
! 	if (nPresortedKeys > 0)
! 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 09b2eb0..e6a9a0c
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,52 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 	SortSupport sortKeys = tuplesort_get_sortkeys(node->tuplesortstate);
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = sortKeys[i].ssup_attno;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		if (ApplySortComparator(datumA, isnullA,
+                                   datumB, isnullB,
+                                   &sortKeys[i]))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 69,75 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	int skipCols = ((Sort *)node->ss.ps.plan)->skipCols;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,131 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
  
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
! 		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols,
! 											  plannode->sortColIdx,
! 											  plannode->sortOperators,
! 											  plannode->collations,
! 											  plannode->nullsFirst,
! 											  work_mem,
! 											  node->randomAccess);
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
! 		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
  		{
- 			slot = ExecProcNode(outerNode);
- 
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
! 	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
--- 82,204 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 	PlanState  *outerNode;
! 	TupleDesc	tupDesc;
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
! 	tuplesortstate = tuplesort_begin_heap(tupDesc,
! 										  plannode->numCols,
! 										  plannode->sortColIdx,
! 										  plannode->sortOperators,
! 										  plannode->collations,
! 										  plannode->nullsFirst,
! 										  work_mem,
! 										  node->randomAccess);
! 	if (node->bounded)
! 		tuplesort_set_bound(tuplesortstate, node->bound);
! 	node->tuplesortstate = (void *) tuplesortstate;
  
! 	/*
! 	 * Put next group of tuples where skipCols" sort values are equal to
! 	 * tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
! 		if (skipCols == 0)
  		{
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
+ 		else if (node->prev)
+ 		{
+ 			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
  
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				bool cmp;
! 				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
! 				node->prev = ExecCopySlotTuple(slot);
! 				if (!cmp)
! 					break;
! 			}
! 		}
! 		else
! 		{
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				node->prev = ExecCopySlotTuple(slot);
! 			}
! 		}
! 	}
  
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
  
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 	node->bound_Done = node->bound;
! 	SO1_printf("ExecSort: %s\n", "sorting done");
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 247,255 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
  
  	/*
  	 * Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index e3edcf6..d698559
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 735,740 ****
--- 735,741 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 50f0852..1a38407
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Plan *runion, Plan 
*** 1281,1295 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1281,1302 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1319,1331 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1326,1367 ----
  		output_bytes = input_bytes;
  	}
  
! 	if (presorted_keys > 0)
! 	{
! 		List *groupExprs = NIL;
! 		ListCell *l;
! 		int i = 0;
! 
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			groupExprs = lappend(groupExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		num_groups = estimate_num_groups(root, groupExprs, tuples);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1335,1341 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1371,1377 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1346,1355 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1382,1391 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1357,1368 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
--- 1393,1404 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
  	/*
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1373,1380 ****
--- 1409,1423 ----
  	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
  	 * counting the LIMIT otherwise.
  	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
  	run_cost += cpu_operator_cost * tuples;
  
+ 	startup_cost += input_run_cost / num_groups;
+ 	run_cost += input_run_cost * ((num_groups - 1.0) / num_groups);
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2075,2080 ****
--- 2118,2125 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->parent->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2101,2106 ****
--- 2146,2153 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->parent->width,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 9c8ede6..cdb9ae7
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
*************** compare_pathkeys(List *keys1, List *keys
*** 312,317 ****
--- 312,343 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_fractional_path_for_pathkey
*** 389,394 ****
--- 415,423 ----
  										  double fraction)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
  
  	foreach(l, paths)
*************** get_cheapest_fractional_path_for_pathkey
*** 399,411 ****
  		 * Since cost comparison is a lot cheaper than pathkey comparison, do
  		 * that first.	(XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
  			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
--- 428,457 ----
  		 * Since cost comparison is a lot cheaper than pathkey comparison, do
  		 * that first.	(XXX is that still true?)
  		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_fractional_path_costs(matched_path, path, fraction);
! 			if (matched_n_common_pathkeys == n_pathkeys && costs_cmp < 0)
! 				continue;
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
! 
! 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
! 		if (n_common_pathkeys == 0)
  			continue;
  
! 		if ((
! 				n_common_pathkeys > matched_n_common_pathkeys
! 				||	(n_common_pathkeys == matched_n_common_pathkeys
! 					 && costs_cmp > 0)) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 		}
  	}
  	return matched_path;
  }
*************** right_merge_direction(PlannerInfo *root,
*** 1457,1472 ****
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
  	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
  		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
  
  	return 0;					/* path ordering not useful */
--- 1503,1522 ----
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
+ 	int n;
+ 
  	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n = pathkeys_common(root->query_pathkeys, pathkeys);
! 
! 	if (n != 0)
  	{
  		/* It's useful ... or at least the first N keys are */
! 		return n;
  	}
  
  	return 0;					/* path ordering not useful */
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index f2c122d..a300342
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 149,154 ****
--- 149,155 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
  static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples);
*************** create_merge_append_plan(PlannerInfo *ro
*** 774,779 ****
--- 775,781 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		subplan = create_plan_recurse(root, subpath);
*************** create_merge_append_plan(PlannerInfo *ro
*** 807,814 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
--- 809,818 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
+ 										 pathkeys, n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2184,2192 ****
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									outer_plan,
! 									best_path->outersortkeys,
! 									-1.0);
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
--- 2188,2198 ----
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								outer_plan,
! 								best_path->outersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys));
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2197,2205 ****
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									inner_plan,
! 									best_path->innersortkeys,
! 									-1.0);
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
--- 2203,2213 ----
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								inner_plan,
! 								best_path->innersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys));
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
*************** make_mergejoin(List *tlist,
*** 3739,3744 ****
--- 3747,3753 ----
   */
  static Sort *
  make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+           List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples)
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 3748,3754 ****
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 3757,3764 ----
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, pathkeys, skipCols,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 3762,3767 ****
--- 3772,3778 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 4090,4096 ****
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 4101,4107 ----
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(PlannerInfo *roo
*** 4110,4116 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
--- 4121,4127 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
*************** make_sort_from_sortclauses(PlannerInfo *
*** 4153,4159 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4164,4170 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, NIL, 0,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
*************** Sort *
*** 4175,4181 ****
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 4186,4193 ----
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 List *pathkeys, int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(PlannerInfo *ro
*** 4208,4214 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4220,4226 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 6670794..56ffb75
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** grouping_planner(PlannerInfo *root, doub
*** 1358,1367 ****
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
  
  			if (root->query_pathkeys == NIL ||
! 				pathkeys_contained_in(root->query_pathkeys,
! 									  cheapest_path->pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
--- 1358,1371 ----
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
+ 			Path		partial_sort_path;	/* dummy for result of cost_sort */
+ 			int			n_common_pathkeys;
+ 
+ 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
+ 												cheapest_path->pathkeys);
  
  			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
*************** grouping_planner(PlannerInfo *root, doub
*** 1371,1382 ****
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			if (compare_fractional_path_costs(sorted_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
--- 1375,1409 ----
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
+ 						  n_common_pathkeys,
+ 						  cheapest_path->startup_cost,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
! 												sorted_path->pathkeys);
! 
! 			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
! 			{
! 				/* No sort needed for sorted path */
! 				partial_sort_path.startup_cost = sorted_path->startup_cost;
! 				partial_sort_path.total_cost = sorted_path->total_cost;
! 			}
! 			else
! 			{
! 				/* Figure cost for sorting */
! 				cost_sort(&partial_sort_path, root, root->query_pathkeys,
! 						  n_common_pathkeys,
! 						  sorted_path->startup_cost,
! 						  sorted_path->total_cost,
! 						  path_rows, path_width,
! 						  0.0, work_mem, root->limit_tuples);
! 			}
! 
! 			if (compare_fractional_path_costs(&partial_sort_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
*************** grouping_planner(PlannerInfo *root, doub
*** 1457,1469 ****
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
  			if (parse->groupClause && !use_hashed_grouping &&
! 			  !pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
--- 1484,1499 ----
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
+ 			int			n_common_pathkeys_grouping;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
+ 			n_common_pathkeys_grouping = pathkeys_common(root->group_pathkeys,
+ 														 current_pathkeys);
  			if (parse->groupClause && !use_hashed_grouping &&
! 				n_common_pathkeys_grouping < list_length(root->group_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
*************** grouping_planner(PlannerInfo *root, doub
*** 1557,1563 ****
  							make_sort_from_groupcols(root,
  													 parse->groupClause,
  													 groupColIdx,
! 													 result_plan);
  						current_pathkeys = root->group_pathkeys;
  					}
  					aggstrategy = AGG_SORTED;
--- 1587,1595 ----
  							make_sort_from_groupcols(root,
  													 parse->groupClause,
  													 groupColIdx,
! 													 result_plan,
! 													 root->group_pathkeys,
! 													n_common_pathkeys_grouping);
  						current_pathkeys = root->group_pathkeys;
  					}
  					aggstrategy = AGG_SORTED;
*************** grouping_planner(PlannerInfo *root, doub
*** 1600,1606 ****
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan);
  					current_pathkeys = root->group_pathkeys;
  				}
  
--- 1632,1640 ----
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan,
! 												 root->group_pathkeys,
! 												 n_common_pathkeys_grouping);
  					current_pathkeys = root->group_pathkeys;
  				}
  
*************** grouping_planner(PlannerInfo *root, doub
*** 1717,1729 ****
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0);
! 					if (!pathkeys_contained_in(window_pathkeys,
! 											   current_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
--- 1751,1767 ----
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
+ 					int			n_common_pathkeys;
+ 
+ 					n_common_pathkeys = pathkeys_common(window_pathkeys,
+ 													    current_pathkeys);
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0,
! 														n_common_pathkeys);
! 					if (n_common_pathkeys < list_length(window_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
*************** grouping_planner(PlannerInfo *root, doub
*** 1869,1887 ****
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					current_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					current_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 current_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															current_pathkeys,
! 															   -1.0);
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
--- 1907,1927 ----
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					needed_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					needed_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 needed_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															   needed_pathkeys,
! 															   -1.0,
! 							pathkeys_common(needed_pathkeys, current_pathkeys));
! 				current_pathkeys = needed_pathkeys;
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
*************** grouping_planner(PlannerInfo *root, doub
*** 1897,1908 ****
  	 */
  	if (parse->sortClause)
  	{
! 		if (!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
--- 1937,1951 ----
  	 */
  	if (parse->sortClause)
  	{
! 		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
! 		
! 		if (common < list_length(root->sort_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples,
! 														   common);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
*************** choose_hashed_grouping(PlannerInfo *root
*** 2647,2652 ****
--- 2690,2696 ----
  	List	   *current_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * Executor doesn't support hashed aggregation with DISTINCT or ORDER BY
*************** choose_hashed_grouping(PlannerInfo *root
*** 2726,2732 ****
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2770,2777 ----
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_grouping(PlannerInfo *root
*** 2742,2750 ****
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 	if (!pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
--- 2787,2798 ----
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 
! 	n_common_pathkeys = pathkeys_common(root->group_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(root->group_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
*************** choose_hashed_grouping(PlannerInfo *root
*** 2759,2768 ****
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
  	/* The Agg or Group node will preserve ordering */
! 	if (target_pathkeys &&
! 		!pathkeys_contained_in(target_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2807,2818 ----
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
+ 
  	/* The Agg or Group node will preserve ordering */
! 	n_common_pathkeys = pathkeys_common(target_pathkeys, current_pathkeys);
! 	if (target_pathkeys && n_common_pathkeys < list_length(target_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 2815,2820 ****
--- 2865,2871 ----
  	List	   *needed_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * If we have a sortable DISTINCT ON clause, we always use sorting. This
*************** choose_hashed_distinct(PlannerInfo *root
*** 2880,2886 ****
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2931,2938 ----
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 2897,2919 ****
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 	if (!pathkeys_contained_in(needed_pathkeys, current_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
  	if (parse->sortClause &&
! 		!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2949,2978 ----
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 
! 	n_common_pathkeys = pathkeys_common(needed_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(needed_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
+ 
+ 
+ 	n_common_pathkeys = pathkeys_common(root->sort_pathkeys, current_pathkeys);
  	if (parse->sortClause &&
! 		n_common_pathkeys < list_length(root->sort_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 3703,3710 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 3762,3770 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index e249628..b0b5471
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 859,865 ****
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 859,866 ----
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index a7169ef..3d0a842
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** create_merge_append_path(PlannerInfo *ro
*** 971,980 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 971,981 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 988,993 ****
--- 989,996 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->parent->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1343,1349 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
--- 1346,1353 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index ea8af9f..29b90f2
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** free_sort_tuple(Tuplesortstate *state, S
*** 3455,3457 ****
--- 3455,3464 ----
  	FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
  	pfree(stup->tuple);
  }
+ 
+ SortSupport
+ tuplesort_get_sortkeys(Tuplesortstate *state)
+ {
+ 	return state->sortKeys;
+ }
+ 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 5a40347..3723a18
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct SortState
*** 1663,1670 ****
--- 1663,1672 ----
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	bool		finished;
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	HeapTuple	prev;
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 101e22c..28b871e
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 582,587 ****
--- 582,588 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;		/* number of leading sort columns already sorted in input */
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 444ab74..e98fb0c
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 88,95 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 88,96 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 999adaa..7c09301
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 157,162 ****
--- 157,163 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
new file mode 100644
index ba7ae7c..d33c615
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*************** extern RecursiveUnion *make_recursive_un
*** 50,60 ****
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
--- 50,61 ----
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples, int skipCols);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 25fa6de..267a988
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
***************
*** 24,29 ****
--- 24,30 ----
  #include "executor/tuptable.h"
  #include "fmgr.h"
  #include "utils/relcache.h"
+ #include "utils/sortsupport.h"
  
  
  /* Tuplesortstate is an opaque type whose details are not known outside
*************** extern void tuplesort_get_stats(Tuplesor
*** 108,113 ****
--- 109,116 ----
  
  extern int	tuplesort_merge_order(int64 allowedMem);
  
+ extern SortSupport tuplesort_get_sortkeys(Tuplesortstate *state);
+ 
  /*
   * These routines may only be called if randomAccess was specified 'true'.
   * Likewise, backwards scan in gettuple/getdatum is only allowed if
#13Martijn van Oosterhout
kleptog@svana.org
In reply to: Alexander Korotkov (#12)
Re: PoC: Partial sort

On Sun, Dec 22, 2013 at 07:38:05PM +0400, Alexander Korotkov wrote:

Hi!

Next revision. It is expected to work better with the optimizer. It introduces
a presorted_keys argument to the cost_sort function, which represents the number
of keys already sorted in the Path. The function then uses estimate_num_groups to
estimate the number of groups with distinct values of the presorted keys and
assumes that the dataset is divided uniformly among the
groups. get_cheapest_fractional_path_for_pathkeys tries to select the path
matching the largest part of the path keys.
You can see it works pretty well on single-table queries.
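
As a rough sketch of the cost model described above, the in-memory quicksort branch of the patched cost_sort can be written as follows. The symbols are shorthand introduced here and are not names from the patch: G is the estimated number of groups, N the input tuples, n the tuples per group, N_out the output tuples, S_in and T_in the input startup and total cost, c_cmp the comparison cost and c_op cpu_operator_cost. The disk-based and bounded-heap branches are left out.

```latex
\begin{aligned}
G &= \texttt{estimate\_num\_groups}(\text{presorted keys}), \qquad n = \frac{N}{G} \\
C_{\text{group}} &= c_{\text{cmp}} \cdot n \cdot \log_2 n \\
C_{\text{startup}} &= S_{\text{in}} + C_{\text{group}} + \frac{T_{\text{in}} - S_{\text{in}}}{G} \\
C_{\text{run}} &= \max\!\left(0,\ \left(G \cdot \frac{N_{\text{out}}}{N} - 1\right) C_{\text{group}}\right)
  + c_{\text{op}} \cdot N
  + \left(T_{\text{in}} - S_{\text{in}}\right) \cdot \frac{G - 1}{G}
\end{aligned}
```

With no presorted keys, G = 1 and the startup cost reduces to T_in + c_cmp * N * log2(N), which is the unpatched quicksort estimate.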

Nice work! The plans look good and the calculated costs seem sane also.

I suppose the problem with the joins is generating the pathkeys?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#14Alexander Korotkov
aekorotkov@gmail.com
In reply to: Martijn van Oosterhout (#13)
Re: PoC: Partial sort

On Sun, Dec 22, 2013 at 8:12 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:

On Sun, Dec 22, 2013 at 07:38:05PM +0400, Alexander Korotkov wrote:

Hi!

Next revision. It is expected to work better with the optimizer. It introduces
a presorted_keys argument to the cost_sort function, which represents the number
of keys already sorted in the Path. The function then uses estimate_num_groups to
estimate the number of groups with distinct values of the presorted keys and
assumes that the dataset is divided uniformly among the
groups. get_cheapest_fractional_path_for_pathkeys tries to select the path
matching the largest part of the path keys.
You can see it works pretty well on single-table queries.

Nice work! The plans look good and the calculated costs seem sane also.

I suppose the problem with the joins is generating the pathkeys?

In general, the problem is that partial sort is an alternative to doing a less
restrictive merge join and filtering its results. As far as I can see, taking
care of this requires some rework of the merge join optimization. For now, I
haven't figured out what it's going to look like. I'll try to dig into the details.

------
With best regards,
Alexander Korotkov.

#15Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jeremy Harris (#4)
Re: PoC: Partial sort

On Sat, Dec 14, 2013 at 6:30 PM, Jeremy Harris <jgh@wizmail.org> wrote:

On 14/12/13 12:54, Andres Freund wrote:

On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:

Currently when we need to get ordered result from table we have to choose
one of two approaches: get results from index in exact order we need or do
sort of tuples. However, it could be useful to mix both methods: get
results from index in order which partially meets our requirements and do
rest of work from heap.

------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=69214.06..69214.08 rows=10 width=16) (actual time=0.097..0.099 rows=10 loops=1)
-> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual time=0.096..0.097 rows=10 loops=1)
Sort Key: v1, v2
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47604.42 rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
Total runtime: 0.125 ms
(6 rows)

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

Eg: /messages/by-id/5291467E.6070807@wizmail.org

Maybe Alexander and I should bash our heads together.

The partial sort patch is mostly an optimizer/executor improvement rather than
an improvement of the sort algorithm itself. But I would appreciate using
enhancements of sorting algorithms in my work.
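
To illustrate the executor-level idea, here is a minimal sketch, assuming rows arrive already ordered by v1 (as from the index scan) and only need ordering by v2 within each run of equal v1. Row, cmp_v2 and partial_sort are hypothetical names used for illustration, and the standard qsort stands in for tuplesort; this is not the patch's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row type: v1 is the presorted key, v2 the remaining key. */
typedef struct Row
{
	int		v1;
	double	v2;
} Row;

/* Compare two rows on v2 only; v1 is equal within a group. */
static int
cmp_v2(const void *a, const void *b)
{
	const Row  *ra = (const Row *) a;
	const Row  *rb = (const Row *) b;

	return (ra->v2 > rb->v2) - (ra->v2 < rb->v2);
}

/*
 * Sort rows that are already ordered by v1 so that they end up ordered
 * by (v1, v2).  Each run of equal v1 values is sorted on its own; the
 * sort routine itself is unchanged.
 */
void
partial_sort(Row *rows, size_t n)
{
	size_t		start = 0;

	assert(rows != NULL || n == 0);

	while (start < n)
	{
		size_t		end = start + 1;

		/* Find the end of the group sharing the presorted key. */
		while (end < n && rows[end].v1 == rows[start].v1)
			end++;

		qsort(rows + start, end - start, sizeof(Row), cmp_v2);
		start = end;
	}
}
```

Because each group is sorted independently, a consumer that needs only the first few rows (a LIMIT) has to wait for the first group only, whichever sort algorithm is used inside a group.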

------
With best regards,
Alexander Korotkov.

#16Andreas Karlsson
andreas@proxel.se
In reply to: Alexander Korotkov (#12)
Re: PoC: Partial sort

On 12/22/2013 04:38 PM, Alexander Korotkov wrote:

postgres=# explain analyze select * from test order by v1, id limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=11441.77..11442.18 rows=10 width=12) (actual time=79.980..79.982 rows=10 loops=1)
-> Partial sort (cost=11441.77..53140.44 rows=1000000 width=12) (actual time=79.978..79.978 rows=10 loops=1)
Sort Key: v1, id
Presorted Key: v1
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47038.83 rows=1000000 width=12) (actual time=0.031..38.275 rows=100213 loops=1)
Total runtime: 81.786 ms
(7 rows)

Have you thought about how you plan to print which sort method was used and
how much memory it took? Several different sort methods may have been
used in the query. Should the largest amount of memory/disk be printed?

However, work with joins needs more improvements.

That would be really nice to have, but the patch seems useful even
without the improvements to joins.

--
Andreas Karlsson

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andreas Karlsson (#16)
1 attachment(s)
Re: PoC: Partial sort

On Tue, Dec 24, 2013 at 6:02 AM, Andreas Karlsson <andreas@proxel.se> wrote:

On 12/22/2013 04:38 PM, Alexander Korotkov wrote:

postgres=# explain analyze select * from test order by v1, id limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=11441.77..11442.18 rows=10 width=12) (actual time=79.980..79.982 rows=10 loops=1)
-> Partial sort (cost=11441.77..53140.44 rows=1000000 width=12) (actual time=79.978..79.978 rows=10 loops=1)
Sort Key: v1, id
Presorted Key: v1
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47038.83 rows=1000000 width=12) (actual time=0.031..38.275 rows=100213 loops=1)
Total runtime: 81.786 ms
(7 rows)

Have you thought about how you plan to print which sort method was used and how
much memory it took? Several different sort methods may have been used in
the query. Should the largest amount of memory/disk be printed?

Currently, the amount of memory used to sort the last group is printed. Your
proposal makes sense: the largest amount of memory/disk should be printed.
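
One way the proposal could look, as a minimal sketch only: keep a running maximum across groups instead of reporting the last group's statistics. GroupSortStats, PeakSortStats and both functions are hypothetical names invented for this illustration; they are not part of the patch or of tuplesort.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-group statistics, as one sort of one group might report. */
typedef struct GroupSortStats
{
	int		on_disk;	/* nonzero if this group spilled to disk */
	long	space_kb;	/* memory or disk space used, in kB */
} GroupSortStats;

/* Hypothetical accumulator kept for the whole partial sort node. */
typedef struct PeakSortStats
{
	long	max_memory_kb;	/* largest in-memory group */
	long	max_disk_kb;	/* largest on-disk group */
	long	groups_sorted;	/* number of groups seen so far */
} PeakSortStats;

void
peak_sort_stats_init(PeakSortStats *peak)
{
	assert(peak != NULL);
	peak->max_memory_kb = 0;
	peak->max_disk_kb = 0;
	peak->groups_sorted = 0;
}

/* Fold one group's statistics into the running maxima. */
void
peak_sort_stats_add(PeakSortStats *peak, const GroupSortStats *group)
{
	assert(peak != NULL && group != NULL);

	if (group->on_disk)
	{
		if (group->space_kb > peak->max_disk_kb)
			peak->max_disk_kb = group->space_kb;
	}
	else
	{
		if (group->space_kb > peak->max_memory_kb)
			peak->max_memory_kb = group->space_kb;
	}
	peak->groups_sorted++;
}
```

Memory and disk maxima are tracked separately here because the two figures are not directly comparable; whether EXPLAIN should print one or both is left open.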

However, work with joins needs more improvements.

That would be really nice to have, but the patch seems useful even without
the improvements to joins.

Attached revision of patch implements partial sort usage in merge joins.

create table test1 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create table test2 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);
create index test1_v1_idx on test1 (v1);
create index test2_v1_idx on test2 (v1);

# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 and t1.v2 = t2.v2;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Merge Join (cost=2257.67..255273.39 rows=983360 width=24)
Merge Cond: ((t1.v1 = t2.v1) AND (t1.v2 = t2.v2))
-> Partial sort (cost=1128.84..116470.79 rows=1000000 width=12)
Sort Key: t1.v1, t1.v2
Presorted Key: t1.v1
-> Index Scan using test1_v1_idx on test1 t1 (cost=0.42..47604.01 rows=1000000 width=12)
-> Materialize (cost=1128.83..118969.00 rows=1000000 width=12)
-> Partial sort (cost=1128.83..116469.00 rows=1000000 width=12)
Sort Key: t2.v1, t2.v2
Presorted Key: t2.v1
-> Index Scan using test2_v1_idx on test2 t2 (cost=0.42..47602.22 rows=1000000 width=12)
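
The "Presorted Key" lines above correspond to the leading sort keys that the input path already provides. In the patch, the count passed to cost_sort comes from pathkeys_common, described as returning the length of the longest common prefix of two pathkey lists. A minimal sketch of that idea follows, assuming keys are represented as plain integers rather than PathKey nodes; common_prefix_length is a hypothetical name, not the patch's function.

```c
#include <assert.h>

/*
 * Return the length of the longest common prefix of two key arrays.
 * Keys are plain integers here purely for illustration.
 */
int
common_prefix_length(const int *keys1, int nkeys1,
					 const int *keys2, int nkeys2)
{
	int			n = 0;

	assert(nkeys1 >= 0 && nkeys2 >= 0);

	/* Advance while both arrays have a key at position n and they match. */
	while (n < nkeys1 && n < nkeys2 && keys1[n] == keys2[n])
		n++;

	return n;
}
```

For the plan above, an index on (v1) compared with required sort keys (v1, v2) gives a prefix length of 1, so one key is presorted and only v2 needs sorting within each group.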

I believe the patch now covers the desired functionality. I'm going to focus on
nailing down details, refactoring and documenting.

------
With best regards,
Alexander Korotkov.

Attachments:

partial-sort-3.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 9969a25..07cb66d
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_agg_keys(AggState *asta
*** 81,87 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
--- 81,87 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 905,911 ****
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
--- 905,914 ----
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			if (((Sort *) plan)->skipCols > 0)
! 				pname = sname = "Partial sort";
! 			else
! 				pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
*************** show_sort_keys(SortState *sortstate, Lis
*** 1705,1711 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
--- 1708,1714 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->skipCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
*************** show_merge_append_keys(MergeAppendState 
*** 1719,1725 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
--- 1722,1728 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 ancestors, es);
  }
  
*************** show_agg_keys(AggState *astate, List *an
*** 1737,1743 ****
  		/* The key columns refer to the tlist of the child plan */
  		ancestors = lcons(astate, ancestors);
  		show_sort_group_keys(outerPlanState(astate), "Group Key",
! 							 plan->numCols, plan->grpColIdx,
  							 ancestors, es);
  		ancestors = list_delete_first(ancestors);
  	}
--- 1740,1746 ----
  		/* The key columns refer to the tlist of the child plan */
  		ancestors = lcons(astate, ancestors);
  		show_sort_group_keys(outerPlanState(astate), "Group Key",
! 							 plan->numCols, 0, plan->grpColIdx,
  							 ancestors, es);
  		ancestors = list_delete_first(ancestors);
  	}
*************** show_group_keys(GroupState *gstate, List
*** 1755,1761 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
  }
--- 1758,1764 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
  }
*************** show_group_keys(GroupState *gstate, List
*** 1765,1777 ****
   * as arrays of targetlist indexes
   */
  static void
! show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
! 	List	   *result = NIL;
  	bool		useprefix;
  	int			keyno;
  	char	   *exprstr;
--- 1768,1781 ----
   * as arrays of targetlist indexes
   */
  static void
! show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
! 	List	   *resultSort = NIL;
! 	List	   *resultPresorted = NIL;
  	bool		useprefix;
  	int			keyno;
  	char	   *exprstr;
*************** show_sort_group_keys(PlanState *planstat
*** 1798,1807 ****
  		/* Deparse the expression, showing any top-level cast */
  		exprstr = deparse_expression((Node *) target->expr, context,
  									 useprefix, true);
! 		result = lappend(result, exprstr);
  	}
  
! 	ExplainPropertyList(qlabel, result, es);
  }
  
  /*
--- 1802,1816 ----
  		/* Deparse the expression, showing any top-level cast */
  		exprstr = deparse_expression((Node *) target->expr, context,
  									 useprefix, true);
! 
! 		if (keyno < nPresortedKeys)
! 			resultPresorted = lappend(resultPresorted, exprstr);
! 		resultSort = lappend(resultSort, exprstr);
  	}
  
! 	ExplainPropertyList(qlabel, resultSort, es);
! 	if (nPresortedKeys > 0)
! 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 09b2eb0..1693d46
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,52 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 	SortSupport sortKeys = tuplesort_get_sortkeys(node->tuplesortstate);
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = sortKeys[i].ssup_attno;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		if (ApplySortComparator(datumA, isnullA,
+                                   datumB, isnullB,
+                                   &sortKeys[i]))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 69,75 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	int skipCols = ((Sort *)node->ss.ps.plan)->skipCols;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,131 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
  
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
! 		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols,
! 											  plannode->sortColIdx,
! 											  plannode->sortOperators,
! 											  plannode->collations,
! 											  plannode->nullsFirst,
! 											  work_mem,
! 											  node->randomAccess);
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
! 		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
  		{
- 			slot = ExecProcNode(outerNode);
- 
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
! 	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
--- 82,206 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 	PlanState  *outerNode;
! 	TupleDesc	tupDesc;
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
! 	if (node->tuplesortstate != NULL)
! 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
! 	tuplesortstate = tuplesort_begin_heap(tupDesc,
! 										  plannode->numCols,
! 										  plannode->sortColIdx,
! 										  plannode->sortOperators,
! 										  plannode->collations,
! 										  plannode->nullsFirst,
! 										  work_mem,
! 										  node->randomAccess);
! 	if (node->bounded)
! 		tuplesort_set_bound(tuplesortstate, node->bound);
! 	node->tuplesortstate = (void *) tuplesortstate;
  
! 	/*
! 	 * Put next group of tuples where "skipCols" sort values are equal to
! 	 * tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
! 		if (skipCols == 0)
  		{
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
+ 		else if (node->prev)
+ 		{
+ 			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
  
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				bool cmp;
! 				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
! 				node->prev = ExecCopySlotTuple(slot);
! 				if (!cmp)
! 					break;
! 			}
! 		}
! 		else
! 		{
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				node->prev = ExecCopySlotTuple(slot);
! 			}
! 		}
! 	}
  
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
  
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 	node->bound_Done = node->bound;
! 	SO1_printf("ExecSort: %s\n", "sorting done");
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 249,257 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
  
  	/*
  	 * Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index e4184c5..b41213a
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 735,740 ****
--- 735,741 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 50f0852..1a38407
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Plan *runion, Plan 
*** 1281,1295 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1281,1302 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1319,1331 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1326,1367 ----
  		output_bytes = input_bytes;
  	}
  
! 	if (presorted_keys > 0)
! 	{
! 		List *groupExprs = NIL;
! 		ListCell *l;
! 		int i = 0;
! 
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			groupExprs = lappend(groupExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		num_groups = estimate_num_groups(root, groupExprs, tuples);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1335,1341 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1371,1377 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1346,1355 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1382,1391 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1357,1368 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
--- 1393,1404 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
  	/*
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1373,1380 ****
--- 1409,1423 ----
  	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
  	 * counting the LIMIT otherwise.
  	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
  	run_cost += cpu_operator_cost * tuples;
  
+ 	startup_cost += input_run_cost / num_groups;
+ 	run_cost += input_run_cost * ((num_groups - 1.0) / num_groups);
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2075,2080 ****
--- 2118,2125 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->parent->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2101,2106 ****
--- 2146,2153 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->parent->width,
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
new file mode 100644
index 5b477e5..5909dfe
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** sort_inner_and_outer(PlannerInfo *root,
*** 662,668 ****
  		cur_mergeclauses = find_mergeclauses_for_pathkeys(root,
  														  outerkeys,
  														  true,
! 														  mergeclause_list);
  
  		/* Should have used them all... */
  		Assert(list_length(cur_mergeclauses) == list_length(mergeclause_list));
--- 662,670 ----
  		cur_mergeclauses = find_mergeclauses_for_pathkeys(root,
  														  outerkeys,
  														  true,
! 														  mergeclause_list,
! 														  NULL,
! 														  NULL);
  
  		/* Should have used them all... */
  		Assert(list_length(cur_mergeclauses) == list_length(mergeclause_list));
*************** match_unsorted_outer(PlannerInfo *root,
*** 832,837 ****
--- 834,840 ----
  		List	   *mergeclauses;
  		List	   *innersortkeys;
  		List	   *trialsortkeys;
+ 		List	   *outersortkeys;
  		Path	   *cheapest_startup_inner;
  		Path	   *cheapest_total_inner;
  		int			num_sortkeys;
*************** match_unsorted_outer(PlannerInfo *root,
*** 937,943 ****
  		mergeclauses = find_mergeclauses_for_pathkeys(root,
  													  outerpath->pathkeys,
  													  true,
! 													  mergeclause_list);
  
  		/*
  		 * Done with this outer path if no chance for a mergejoin.
--- 940,948 ----
  		mergeclauses = find_mergeclauses_for_pathkeys(root,
  													  outerpath->pathkeys,
  													  true,
! 													  mergeclause_list,
! 													  joinrel,
! 													  &outersortkeys);
  
  		/*
  		 * Done with this outer path if no chance for a mergejoin.
*************** match_unsorted_outer(PlannerInfo *root,
*** 961,967 ****
  		/* Compute the required ordering of the inner path */
  		innersortkeys = make_inner_pathkeys_for_merge(root,
  													  mergeclauses,
! 													  outerpath->pathkeys);
  
  		/*
  		 * Generate a mergejoin on the basis of sorting the cheapest inner.
--- 966,972 ----
  		/* Compute the required ordering of the inner path */
  		innersortkeys = make_inner_pathkeys_for_merge(root,
  													  mergeclauses,
! 													  outersortkeys);
  
  		/*
  		 * Generate a mergejoin on the basis of sorting the cheapest inner.
*************** match_unsorted_outer(PlannerInfo *root,
*** 980,986 ****
  						   restrictlist,
  						   merge_pathkeys,
  						   mergeclauses,
! 						   NIL,
  						   innersortkeys);
  
  		/* Can't do anything else if inner path needs to be unique'd */
--- 985,991 ----
  						   restrictlist,
  						   merge_pathkeys,
  						   mergeclauses,
! 						   outersortkeys,
  						   innersortkeys);
  
  		/* Can't do anything else if inner path needs to be unique'd */
*************** match_unsorted_outer(PlannerInfo *root,
*** 1038,1044 ****
  		for (sortkeycnt = num_sortkeys; sortkeycnt > 0; sortkeycnt--)
  		{
  			Path	   *innerpath;
- 			List	   *newclauses = NIL;
  
  			/*
  			 * Look for an inner path ordered well enough for the first
--- 1043,1048 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1055,1073 ****
  				 compare_path_costs(innerpath, cheapest_total_inner,
  									TOTAL_COST) < 0))
  			{
- 				/* Found a cheap (or even-cheaper) sorted path */
- 				/* Select the right mergeclauses, if we didn't already */
- 				if (sortkeycnt < num_sortkeys)
- 				{
- 					newclauses =
- 						find_mergeclauses_for_pathkeys(root,
- 													   trialsortkeys,
- 													   false,
- 													   mergeclauses);
- 					Assert(newclauses != NIL);
- 				}
- 				else
- 					newclauses = mergeclauses;
  				try_mergejoin_path(root,
  								   joinrel,
  								   jointype,
--- 1059,1064 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1078,1086 ****
  								   innerpath,
  								   restrictlist,
  								   merge_pathkeys,
! 								   newclauses,
! 								   NIL,
! 								   NIL);
  				cheapest_total_inner = innerpath;
  			}
  			/* Same on the basis of cheapest startup cost ... */
--- 1069,1077 ----
  								   innerpath,
  								   restrictlist,
  								   merge_pathkeys,
! 								   mergeclauses,
! 								   outersortkeys,
! 								   innersortkeys);
  				cheapest_total_inner = innerpath;
  			}
  			/* Same on the basis of cheapest startup cost ... */
*************** match_unsorted_outer(PlannerInfo *root,
*** 1096,1119 ****
  				/* Found a cheap (or even-cheaper) sorted path */
  				if (innerpath != cheapest_total_inner)
  				{
- 					/*
- 					 * Avoid rebuilding clause list if we already made one;
- 					 * saves memory in big join trees...
- 					 */
- 					if (newclauses == NIL)
- 					{
- 						if (sortkeycnt < num_sortkeys)
- 						{
- 							newclauses =
- 								find_mergeclauses_for_pathkeys(root,
- 															   trialsortkeys,
- 															   false,
- 															   mergeclauses);
- 							Assert(newclauses != NIL);
- 						}
- 						else
- 							newclauses = mergeclauses;
- 					}
  					try_mergejoin_path(root,
  									   joinrel,
  									   jointype,
--- 1087,1092 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1124,1132 ****
  									   innerpath,
  									   restrictlist,
  									   merge_pathkeys,
! 									   newclauses,
! 									   NIL,
! 									   NIL);
  				}
  				cheapest_startup_inner = innerpath;
  			}
--- 1097,1105 ----
  									   innerpath,
  									   restrictlist,
  									   merge_pathkeys,
! 									   mergeclauses,
! 									   outersortkeys,
! 									   innersortkeys);
  				}
  				cheapest_startup_inner = innerpath;
  			}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 9c8ede6..63c0b03
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 26,31 ****
--- 26,32 ----
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static PathKey *make_canonical_pathkey(PlannerInfo *root,
*************** compare_pathkeys(List *keys1, List *keys
*** 312,317 ****
--- 313,344 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 368,373 ****
--- 395,421 ----
  	return matched_path;
  }
  
+ static int
+ compare_bifractional_path_costs(Path *path1, Path *path2,
+ 							  double fraction1, double fraction2)
+ {
+ 	Cost		cost1,
+ 				cost2;
+ 
+ 	if (fraction1 <= 0.0 || fraction1 >= 1.0 ||
+ 			fraction2 <= 0.0 || fraction2 >= 1.0)
+ 		return compare_path_costs(path1, path2, TOTAL_COST);
+ 	cost1 = path1->startup_cost +
+ 		fraction1 * (path1->total_cost - path1->startup_cost);
+ 	cost2 = path2->startup_cost +
+ 		fraction2 * (path2->total_cost - path2->startup_cost);
+ 	if (cost1 < cost2)
+ 		return -1;
+ 	if (cost1 > cost2)
+ 		return +1;
+ 	return 0;
+ }
+ 
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
*************** Path *
*** 386,411 ****
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction)
  {
  	Path	   *matched_path = NULL;
  	ListCell   *l;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
  
  		/*
  		 * Since cost comparison is a lot cheaper than pathkey comparison, do
  		 * that first.	(XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
! 			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
--- 434,508 ----
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *num_groups, matched_fraction;
+ 	int			i;
+ 
+ 	i = 0;
+ 	num_groups = (double *)palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							lfirst(list_head(key->pk_eclass->ec_members));
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		num_groups[i] = estimate_num_groups(root, groupExprs, tuples);
+ 		i++;
+ 	}
+ 
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
+ 		double		current_fraction;
+ 
+ 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+ 		if (n_common_pathkeys < matched_n_common_pathkeys ||
+ 				n_common_pathkeys == 0)
+ 			continue;
+ 
+ 		current_fraction = fraction;
+ 		if (n_common_pathkeys < n_pathkeys)
+ 		{
+ 			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
+ 			current_fraction = Min(current_fraction, 1.0);
+ 		}
  
  		/*
  		 * Since cost comparison is a lot cheaper than pathkey comparison, do
  		 * that first.	(XXX is that still true?)
  		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_bifractional_path_costs(matched_path, path,
! 					matched_fraction, current_fraction);
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
  
! 		if ((
! 				n_common_pathkeys > matched_n_common_pathkeys
! 				||	(n_common_pathkeys == matched_n_common_pathkeys
! 					 && costs_cmp > 0)) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 			matched_fraction = current_fraction;
+ 		}
  	}
  	return matched_path;
  }
*************** List *
*** 965,974 ****
  find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos)
  {
  	List	   *mergeclauses = NIL;
  	ListCell   *i;
  
  	/* make sure we have eclasses cached in the clauses */
  	foreach(i, restrictinfos)
--- 1062,1077 ----
  find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos,
! 							   RelOptInfo *joinrel,
! 							   List **outersortkeys)
  {
  	List	   *mergeclauses = NIL;
  	ListCell   *i;
+ 	bool	   *used = (bool *)palloc0(sizeof(bool) * list_length(restrictinfos));
+ 	int			k;
+ 	List	   *unusedRestrictinfos = NIL;
+ 	List	   *usedPathkeys = NIL;
  
  	/* make sure we have eclasses cached in the clauses */
  	foreach(i, restrictinfos)
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1021,1026 ****
--- 1124,1130 ----
  		 * deal with the case in create_mergejoin_plan().
  		 *----------
  		 */
+ 		k = 0;
  		foreach(j, restrictinfos)
  		{
  			RestrictInfo *rinfo = (RestrictInfo *) lfirst(j);
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1033,1039 ****
--- 1137,1147 ----
  				clause_ec = rinfo->outer_is_left ?
  					rinfo->right_ec : rinfo->left_ec;
  			if (clause_ec == pathkey_ec)
+ 			{
  				matched_restrictinfos = lappend(matched_restrictinfos, rinfo);
+ 				used[k] = true;
+ 			}
+ 			k++;
  		}
  
  		/*
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1044,1049 ****
--- 1152,1159 ----
  		if (matched_restrictinfos == NIL)
  			break;
  
+ 		usedPathkeys = lappend(usedPathkeys, pathkey);
+ 
  		/*
  		 * If we did find usable mergeclause(s) for this sort-key position,
  		 * add them to result list.
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1051,1056 ****
--- 1161,1201 ----
  		mergeclauses = list_concat(mergeclauses, matched_restrictinfos);
  	}
  
+ 	if (outersortkeys)
+ 	{
+ 		List *addPathkeys, *addMergeclauses;
+ 
+ 		*outersortkeys = pathkeys;
+ 
+ 		if (!mergeclauses)
+ 			return mergeclauses;
+ 
+ 		k = 0;
+ 		foreach(i, restrictinfos)
+ 		{
+ 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(i);
+ 			if (!used[k])
+ 				unusedRestrictinfos = lappend(unusedRestrictinfos, rinfo);
+ 			k++;
+ 		}
+ 
+ 		if (!unusedRestrictinfos)
+ 			return mergeclauses;
+ 
+ 		addPathkeys = select_outer_pathkeys_for_merge(root,
+ 												unusedRestrictinfos, joinrel);
+ 
+ 		if (!addPathkeys)
+ 			return mergeclauses;
+ 
+ 		addMergeclauses = find_mergeclauses_for_pathkeys(root,
+ 				addPathkeys, true, unusedRestrictinfos, NULL, NULL);
+ 
+ 		*outersortkeys = list_concat(usedPathkeys, addPathkeys);
+ 		mergeclauses = list_concat(mergeclauses, addMergeclauses);
+ 
+ 	}
+ 
  	return mergeclauses;
  }
  
*************** right_merge_direction(PlannerInfo *root,
*** 1457,1472 ****
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
  	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
  		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
  
  	return 0;					/* path ordering not useful */
--- 1602,1621 ----
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
+ 	int n;
+ 
  	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n = pathkeys_common(root->query_pathkeys, pathkeys);
! 
! 	if (n != 0)
  	{
  		/* It's useful ... or at least the first N keys are */
! 		return n;
  	}
  
  	return 0;					/* path ordering not useful */
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index f2c122d..a300342
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 149,154 ****
--- 149,155 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
  static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples);
*************** create_merge_append_plan(PlannerInfo *ro
*** 774,779 ****
--- 775,781 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		subplan = create_plan_recurse(root, subpath);
*************** create_merge_append_plan(PlannerInfo *ro
*** 807,814 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
--- 809,818 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
+ 										 pathkeys, n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2184,2192 ****
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									outer_plan,
! 									best_path->outersortkeys,
! 									-1.0);
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
--- 2188,2198 ----
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								outer_plan,
! 								best_path->outersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys));
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2197,2205 ****
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									inner_plan,
! 									best_path->innersortkeys,
! 									-1.0);
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
--- 2203,2213 ----
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								inner_plan,
! 								best_path->innersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys));
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
*************** make_mergejoin(List *tlist,
*** 3739,3744 ****
--- 3747,3753 ----
   */
  static Sort *
  make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples)
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 3748,3754 ****
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 3757,3764 ----
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, pathkeys, skipCols,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 3762,3767 ****
--- 3772,3778 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 4090,4096 ****
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 4101,4107 ----
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(PlannerInfo *roo
*** 4110,4116 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
--- 4121,4127 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
*************** make_sort_from_sortclauses(PlannerInfo *
*** 4153,4159 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4164,4170 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, NIL, 0,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
*************** Sort *
*** 4175,4181 ****
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 4186,4193 ----
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 List *pathkeys, int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(PlannerInfo *ro
*** 4208,4214 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4220,4226 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 53fc238..4675402
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
*************** build_minmax_path(PlannerInfo *root, Min
*** 494,500 ****
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction);
  	if (!sorted_path)
  		return false;
  
--- 494,502 ----
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction,
! 												  subroot,
! 												  final_rel->rows);
  	if (!sorted_path)
  		return false;
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 1da4b2f..df5563a
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** grouping_planner(PlannerInfo *root, doub
*** 1349,1355 ****
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
--- 1349,1357 ----
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction,
! 													  root,
! 													  path_rows);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
*************** grouping_planner(PlannerInfo *root, doub
*** 1365,1374 ****
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
  
  			if (root->query_pathkeys == NIL ||
! 				pathkeys_contained_in(root->query_pathkeys,
! 									  cheapest_path->pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
--- 1367,1380 ----
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
+ 			Path		partial_sort_path;	/* dummy for result of cost_sort */
+ 			int			n_common_pathkeys;
+ 
+ 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
+ 												cheapest_path->pathkeys);
  
  			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
*************** grouping_planner(PlannerInfo *root, doub
*** 1378,1389 ****
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			if (compare_fractional_path_costs(sorted_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
--- 1384,1418 ----
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
+ 						  n_common_pathkeys,
+ 						  cheapest_path->startup_cost,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
! 												sorted_path->pathkeys);
! 
! 			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
! 			{
! 				/* No sort needed for presorted path */
! 				partial_sort_path.startup_cost = sorted_path->startup_cost;
! 				partial_sort_path.total_cost = sorted_path->total_cost;
! 			}
! 			else
! 			{
! 				/* Figure cost for sorting */
! 				cost_sort(&partial_sort_path, root, root->query_pathkeys,
! 						  n_common_pathkeys,
! 						  sorted_path->startup_cost,
! 						  sorted_path->total_cost,
! 						  path_rows, path_width,
! 						  0.0, work_mem, root->limit_tuples);
! 			}
! 
! 			if (compare_fractional_path_costs(&partial_sort_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
*************** grouping_planner(PlannerInfo *root, doub
*** 1464,1476 ****
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
  			if (parse->groupClause && !use_hashed_grouping &&
! 			  !pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
--- 1493,1508 ----
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
+ 			int			n_common_pathkeys_grouping;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
+ 			n_common_pathkeys_grouping = pathkeys_common(root->group_pathkeys,
+ 														 current_pathkeys);
  			if (parse->groupClause && !use_hashed_grouping &&
! 				n_common_pathkeys_grouping < list_length(root->group_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
*************** grouping_planner(PlannerInfo *root, doub
*** 1564,1570 ****
  							make_sort_from_groupcols(root,
  													 parse->groupClause,
  													 groupColIdx,
! 													 result_plan);
  						current_pathkeys = root->group_pathkeys;
  					}
  					aggstrategy = AGG_SORTED;
--- 1596,1604 ----
  							make_sort_from_groupcols(root,
  													 parse->groupClause,
  													 groupColIdx,
! 													 result_plan,
! 													 root->group_pathkeys,
! 													n_common_pathkeys_grouping);
  						current_pathkeys = root->group_pathkeys;
  					}
  					aggstrategy = AGG_SORTED;
*************** grouping_planner(PlannerInfo *root, doub
*** 1607,1613 ****
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan);
  					current_pathkeys = root->group_pathkeys;
  				}
  
--- 1641,1649 ----
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan,
! 												 root->group_pathkeys,
! 												 n_common_pathkeys_grouping);
  					current_pathkeys = root->group_pathkeys;
  				}
  
*************** grouping_planner(PlannerInfo *root, doub
*** 1724,1736 ****
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0);
! 					if (!pathkeys_contained_in(window_pathkeys,
! 											   current_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
--- 1760,1776 ----
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
+ 					int			n_common_pathkeys;
+ 
+ 					n_common_pathkeys = pathkeys_common(window_pathkeys,
+ 													    current_pathkeys);
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0,
! 														n_common_pathkeys);
! 					if (n_common_pathkeys < list_length(window_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
*************** grouping_planner(PlannerInfo *root, doub
*** 1876,1894 ****
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					current_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					current_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 current_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															current_pathkeys,
! 															   -1.0);
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
--- 1916,1936 ----
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					needed_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					needed_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 needed_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															   needed_pathkeys,
! 															   -1.0,
! 							pathkeys_common(needed_pathkeys, current_pathkeys));
! 				current_pathkeys = needed_pathkeys;
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
*************** grouping_planner(PlannerInfo *root, doub
*** 1904,1915 ****
  	 */
  	if (parse->sortClause)
  	{
! 		if (!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
--- 1946,1960 ----
  	 */
  	if (parse->sortClause)
  	{
! 		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
! 		
! 		if (common < list_length(root->sort_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples,
! 														   common);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
*************** choose_hashed_grouping(PlannerInfo *root
*** 2654,2659 ****
--- 2699,2705 ----
  	List	   *current_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * Executor doesn't support hashed aggregation with DISTINCT or ORDER BY
*************** choose_hashed_grouping(PlannerInfo *root
*** 2735,2741 ****
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2781,2788 ----
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_grouping(PlannerInfo *root
*** 2751,2759 ****
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 	if (!pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
--- 2798,2809 ----
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 
! 	n_common_pathkeys = pathkeys_common(root->group_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(root->group_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
*************** choose_hashed_grouping(PlannerInfo *root
*** 2768,2777 ****
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
  	/* The Agg or Group node will preserve ordering */
! 	if (target_pathkeys &&
! 		!pathkeys_contained_in(target_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2818,2829 ----
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
+ 
  	/* The Agg or Group node will preserve ordering */
! 	n_common_pathkeys = pathkeys_common(target_pathkeys, current_pathkeys);
! 	if (target_pathkeys && n_common_pathkeys < list_length(target_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 2824,2829 ****
--- 2876,2882 ----
  	List	   *needed_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * If we have a sortable DISTINCT ON clause, we always use sorting. This
*************** choose_hashed_distinct(PlannerInfo *root
*** 2889,2895 ****
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2942,2949 ----
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 2906,2928 ****
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 	if (!pathkeys_contained_in(needed_pathkeys, current_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
  	if (parse->sortClause &&
! 		!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2960,2989 ----
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 
! 	n_common_pathkeys = pathkeys_common(needed_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(needed_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
+ 
+ 
+ 	n_common_pathkeys = pathkeys_common(root->sort_pathkeys, current_pathkeys);
  	if (parse->sortClause &&
! 		n_common_pathkeys < list_length(root->sort_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 3712,3719 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 3773,3781 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index e249628..b0b5471
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 859,865 ****
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 859,866 ----
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index a7169ef..3d0a842
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** create_merge_append_path(PlannerInfo *ro
*** 971,980 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 971,981 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 988,993 ****
--- 989,996 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->parent->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1343,1349 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
--- 1346,1353 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 52f05e6..6a09138
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** free_sort_tuple(Tuplesortstate *state, S
*** 3525,3527 ****
--- 3525,3534 ----
  	FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
  	pfree(stup->tuple);
  }
+ 
+ SortSupport
+ tuplesort_get_sortkeys(Tuplesortstate *state)
+ {
+ 	return state->sortKeys;
+ }
+ 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 2a7b36e..76aab79
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct SortState
*** 1664,1671 ****
--- 1664,1673 ----
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	bool		finished;
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	HeapTuple	prev;
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 101e22c..28b871e
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 582,587 ****
--- 582,588 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 444ab74..e98fb0c
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 88,95 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 88,96 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 999adaa..043641d
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 157,169 ****
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--- 157,172 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
*************** extern void update_mergeclause_eclasses(
*** 185,191 ****
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos);
  extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
  								List *mergeclauses,
  								RelOptInfo *joinrel);
--- 188,196 ----
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos,
! 							   RelOptInfo *joinrel,
! 							   List **outerpathkeys);
  extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
  								List *mergeclauses,
  								RelOptInfo *joinrel);
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
new file mode 100644
index ba7ae7c..d33c615
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*************** extern RecursiveUnion *make_recursive_un
*** 50,60 ****
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
--- 50,61 ----
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples, int skipCols);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5f87254..5a65cd2
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
***************
*** 24,29 ****
--- 24,30 ----
  #include "executor/tuptable.h"
  #include "fmgr.h"
  #include "utils/relcache.h"
+ #include "utils/sortsupport.h"
  
  
  /* Tuplesortstate is an opaque type whose details are not known outside
*************** extern void tuplesort_get_stats(Tuplesor
*** 111,116 ****
--- 112,119 ----
  
  extern int	tuplesort_merge_order(int64 allowedMem);
  
+ extern SortSupport tuplesort_get_sortkeys(Tuplesortstate *state);
+ 
  /*
   * These routines may only be called if randomAccess was specified 'true'.
   * Likewise, backwards scan in gettuple/getdatum is only allowed if
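
The planner hunks above replace the yes/no test pathkeys_contained_in() with a count, n_common_pathkeys, which is then passed to cost_sort(). The body of pathkeys_common() is not part of this excerpt, so the following is only a sketch of the idea, assuming it returns the length of the longest shared prefix of the two key lists. The names common_prefix_len and needs_sort are hypothetical, and plain int arrays stand in for PostgreSQL's List of PathKey nodes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for pathkeys_common(): count how many leading
 * entries the two key lists share.  Plain int arrays are used here in
 * place of PostgreSQL's List of PathKey nodes.
 */
size_t
common_prefix_len(const int *keys1, size_t n1, const int *keys2, size_t n2)
{
    size_t      n = 0;

    /* A NULL array is only acceptable when its length is zero. */
    assert(keys1 != NULL || n1 == 0);
    assert(keys2 != NULL || n2 == 0);

    while (n < n1 && n < n2 && keys1[n] == keys2[n])
        n++;

    return n;
}

/*
 * Mirrors the pattern "n_common_pathkeys < list_length(needed_pathkeys)":
 * some sorting is still required unless every required key is provided.
 */
int
needs_sort(const int *required, size_t nreq,
           const int *provided, size_t nprov)
{
    return common_prefix_len(required, nreq, provided, nprov) < nreq;
}
```

With required keys {1, 2, 3} and provided keys {1, 2}, the common prefix is 2, so a sort is still needed, but only the third key remains unsorted within each group.
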
#18 David Rowley
dgrowleyml@gmail.com
In reply to: Alexander Korotkov (#17)
Re: PoC: Partial sort

On Sat, Dec 28, 2013 at 9:28 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Dec 24, 2013 at 6:02 AM, Andreas Karlsson <andreas@proxel.se> wrote:
The attached revision of the patch implements partial sort usage in merge joins.

I'm looking forward to doing a bit of testing on this patch. I think it is
a really useful feature to get a bit more out of existing indexes.

I was about to test it tonight, but I'm having trouble getting the patch to
compile. I'm really wondering which compiler you are using, as it seems
you're declaring your variables in some strange places. See nodeSort.c
line 101. These variables are declared after an if statement in the same
scope. That's not valid in C89, and Visual Studio rejects it. (The patch
did, however, apply without any complaints.)
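
To illustrate the problem with a minimal sketch (hypothetical code, not taken from nodeSort.c): C89 requires every declaration in a block to come before the first statement of that block, while C99 and later allow the two to be mixed, which is why clang and gcc accept such code by default. The function sum_first below is invented for the example; the rejected form is shown only in a comment so that the block compiles everywhere.

```c
#include <assert.h>

/*
 * Form that a C89-only compiler rejects:
 *
 *     int sum_first(const int *v, int n)
 *     {
 *         if (n <= 0)
 *             return 0;
 *         int total = 0;      <- declaration after a statement
 *         ...
 *     }
 */

/*
 * Portable form: all declarations precede the first statement of the block.
 * Hypothetical helper that sums the first n elements of v.
 */
int
sum_first(const int *v, int n)
{
    int         total = 0;
    int         i;

    /* A null array is only acceptable when there is nothing to sum. */
    assert(v != 0 || n <= 0);

    if (n <= 0)
        return 0;

    for (i = 0; i < n; i++)
        total += v[i];

    return total;
}
```

With gcc, the option -Wdeclaration-after-statement reports the mixed form as a warning, which helps catch it before trying a stricter compiler.
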

Here's a list of the errors I get when compiling with Visual Studio on
Windows.

"D:\Postgres\c\pgsql.sln" (default target) (1) ->
"D:\Postgres\c\postgres.vcxproj" (default target) (2) ->
(ClCompile target) ->
src\backend\executor\nodeSort.c(101): error C2275: 'Sort' : illegal use
of this type as an expression [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(101): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(102): error C2275: 'PlanState' : illegal
use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(102): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(103): error C2275: 'TupleDesc' : illegal
use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(103): error C2146: syntax error : missing
';' before identifier 'tupDesc' [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(103): error C2065: 'tupDesc' : undeclared
identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(120): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(121): error C2065: 'tupDesc' : undeclared
identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(121): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(125): error C2065: 'tupDesc' : undeclared
identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(126): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(126): error C2223: left of '->numCols'
must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(127): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(127): error C2223: left of '->sortColIdx'
must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(128): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(128): error C2223: left of
'->sortOperators' must point to struct/union
[D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(129): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(129): error C2223: left of '->collations'
must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(130): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(130): error C2223: left of '->nullsFirst'
must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(132): error C2198: 'tuplesort_begin_heap'
: too few arguments for call [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(143): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(167): error C2065: 'tupDesc' : undeclared
identifier [D:\Postgres\c\postgres.vcxproj]

13 Warning(s)
24 Error(s)

Regards

David Rowley

#19 Alexander Korotkov
aekorotkov@gmail.com
In reply to: David Rowley (#18)
1 attachment(s)
Re: PoC: Partial sort

On Sat, Dec 28, 2013 at 1:04 PM, David Rowley <dgrowleyml@gmail.com> wrote:

On Sat, Dec 28, 2013 at 9:28 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Dec 24, 2013 at 6:02 AM, Andreas Karlsson <andreas@proxel.se> wrote:
The attached revision of the patch implements partial sort usage in merge joins.

I'm looking forward to doing a bit of testing on this patch. I think it is
a really useful feature to get a bit more out of existing indexes.

I was about to test it tonight, but I'm having trouble getting the patch to
compile. I'm really wondering which compiler you are using, as it seems
you're declaring your variables in some strange places. See nodeSort.c
line 101. These variables are declared after an if statement in the same
scope. That's not valid in C89, and Visual Studio rejects it. (The patch
did, however, apply without any complaints.)

Here's a list of the errors I get when compiling with Visual Studio on
Windows.

"D:\Postgres\c\pgsql.sln" (default target) (1) ->
"D:\Postgres\c\postgres.vcxproj" (default target) (2) ->
(ClCompile target) ->
src\backend\executor\nodeSort.c(101): error C2275: 'Sort' : illegal use
of this type as an expression [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(101): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(102): error C2275: 'PlanState' : illegal
use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(102): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(103): error C2275: 'TupleDesc' : illegal
use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(103): error C2146: syntax error :
missing ';' before identifier 'tupDesc' [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(103): error C2065: 'tupDesc' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(120): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(121): error C2065: 'tupDesc' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(121): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(125): error C2065: 'tupDesc' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(126): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(126): error C2223: left of '->numCols'
must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(127): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(127): error C2223: left of
'->sortColIdx' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(128): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(128): error C2223: left of
'->sortOperators' must point to struct/union
[D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(129): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(129): error C2223: left of
'->collations' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(130): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(130): error C2223: left of
'->nullsFirst' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(132): error C2198:
'tuplesort_begin_heap' : too few arguments for call
[D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(143): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(167): error C2065: 'tupDesc' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]

13 Warning(s)
24 Error(s)

I've compiled it with clang. Yeah, there were mixed declarations. I've
rechecked it with gcc, and now it gives no warnings. I didn't try it with
Visual Studio, but I hope it will be OK.
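
For readers following along, here is a rough standalone illustration of what the attached patch does in ExecSort. It is a simplified sketch under stated assumptions, not the patch's code: the Row type and partial_sort function are hypothetical, and plain qsort stands in for tuplesort. The input is assumed to arrive already ordered by the leading column (the "skipCols" prefix), so each run of rows with equal leading values can be sorted on the remaining column independently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row type standing in for a tuple with two sort columns. */
typedef struct Row
{
    int         v1;     /* presorted column (the "skipCols" prefix) */
    double      v2;     /* column still to be sorted within each group */
} Row;

/* Comparator on the not-yet-sorted column only. */
static int
cmp_v2(const void *a, const void *b)
{
    double      x = ((const Row *) a)->v2;
    double      y = ((const Row *) b)->v2;

    return (x > y) - (x < y);
}

/*
 * Order rows[0..n) by (v1, v2), assuming the input is already ordered by v1.
 * Each group of equal v1 values is sorted separately on v2.
 */
void
partial_sort(Row *rows, size_t n)
{
    size_t      start = 0;
    size_t      end;

    while (start < n)
    {
        /* Find the end of the current group of equal v1 values. */
        end = start + 1;
        while (end < n && rows[end].v1 == rows[start].v1)
            end++;

        /* The input must really be presorted on v1. */
        assert(end == n || rows[end].v1 > rows[start].v1);

        qsort(rows + start, end - start, sizeof(Row), cmp_v2);
        start = end;
    }
}
```

Because each group is finished before the next one is read, a consumer that needs only the first few rows (as with LIMIT) could stop after the first groups instead of sorting the whole input. The sketch sorts everything in one call for simplicity.
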

------
With best regards,
Alexander Korotkov.

Attachments:

partial-sort-4.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 9969a25..07cb66d
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_agg_keys(AggState *asta
*** 81,87 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
--- 81,87 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 905,911 ****
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
--- 905,914 ----
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			if (((Sort *) plan)->skipCols > 0)
! 				pname = sname = "Partial sort";
! 			else
! 				pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
*************** show_sort_keys(SortState *sortstate, Lis
*** 1705,1711 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
--- 1708,1714 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->skipCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
*************** show_merge_append_keys(MergeAppendState 
*** 1719,1725 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
--- 1722,1728 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 ancestors, es);
  }
  
*************** show_agg_keys(AggState *astate, List *an
*** 1737,1743 ****
  		/* The key columns refer to the tlist of the child plan */
  		ancestors = lcons(astate, ancestors);
  		show_sort_group_keys(outerPlanState(astate), "Group Key",
! 							 plan->numCols, plan->grpColIdx,
  							 ancestors, es);
  		ancestors = list_delete_first(ancestors);
  	}
--- 1740,1746 ----
  		/* The key columns refer to the tlist of the child plan */
  		ancestors = lcons(astate, ancestors);
  		show_sort_group_keys(outerPlanState(astate), "Group Key",
! 							 plan->numCols, 0, plan->grpColIdx,
  							 ancestors, es);
  		ancestors = list_delete_first(ancestors);
  	}
*************** show_group_keys(GroupState *gstate, List
*** 1755,1761 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
  }
--- 1758,1764 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
  }
*************** show_group_keys(GroupState *gstate, List
*** 1765,1777 ****
   * as arrays of targetlist indexes
   */
  static void
! show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
! 	List	   *result = NIL;
  	bool		useprefix;
  	int			keyno;
  	char	   *exprstr;
--- 1768,1781 ----
   * as arrays of targetlist indexes
   */
  static void
! show_sort_group_keys(PlanState *planstate,  const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
! 	List	   *resultSort = NIL;
! 	List	   *resultPresorted = NIL;
  	bool		useprefix;
  	int			keyno;
  	char	   *exprstr;
*************** show_sort_group_keys(PlanState *planstat
*** 1798,1807 ****
  		/* Deparse the expression, showing any top-level cast */
  		exprstr = deparse_expression((Node *) target->expr, context,
  									 useprefix, true);
! 		result = lappend(result, exprstr);
  	}
  
! 	ExplainPropertyList(qlabel, result, es);
  }
  
  /*
--- 1802,1816 ----
  		/* Deparse the expression, showing any top-level cast */
  		exprstr = deparse_expression((Node *) target->expr, context,
  									 useprefix, true);
! 
! 		if (keyno < nPresortedKeys)
! 			resultPresorted = lappend(resultPresorted, exprstr);
! 		resultSort = lappend(resultSort, exprstr);
  	}
  
! 	ExplainPropertyList(qlabel, resultSort, es);
! 	if (nPresortedKeys > 0)
! 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 09b2eb0..02dcd7a
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,52 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 	SortSupport sortKeys = tuplesort_get_sortkeys(node->tuplesortstate);
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = sortKeys[i].ssup_attno;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		if (ApplySortComparator(datumA, isnullA,
+                                   datumB, isnullB,
+                                   &sortKeys[i]))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 69,78 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+ 	PlanState  *outerNode;
+ 	TupleDesc	tupDesc;
+ 	int			skipCols = plannode->skipCols;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,131 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
! 
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
! 		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols,
! 											  plannode->sortColIdx,
! 											  plannode->sortOperators,
! 											  plannode->collations,
! 											  plannode->nullsFirst,
! 											  work_mem,
! 											  node->randomAccess);
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
! 		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
  		{
- 			slot = ExecProcNode(outerNode);
- 
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
! 	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
--- 85,205 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
! 	if (node->tuplesortstate != NULL)
! 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
! 	tuplesortstate = tuplesort_begin_heap(tupDesc,
! 										  plannode->numCols,
! 										  plannode->sortColIdx,
! 										  plannode->sortOperators,
! 										  plannode->collations,
! 										  plannode->nullsFirst,
! 										  work_mem,
! 										  node->randomAccess);
! 	if (node->bounded)
! 		tuplesort_set_bound(tuplesortstate, node->bound);
! 	node->tuplesortstate = (void *) tuplesortstate;
  
! 	/*
! 	 * Put next group of tuples where skipCols" sort values are equal to
! 	 * tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
! 		if (skipCols == 0)
  		{
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
+ 		else if (node->prev)
+ 		{
+ 			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
  
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				bool cmp;
! 				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
! 				node->prev = ExecCopySlotTuple(slot);
! 				if (!cmp)
! 					break;
! 			}
! 		}
! 		else
! 		{
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				node->prev = ExecCopySlotTuple(slot);
! 			}
! 		}
! 	}
  
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
  
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 	node->bound_Done = node->bound;
! 	SO1_printf("ExecSort: %s\n", "sorting done");
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 248,256 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
  
  	/*
  	 * Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index e4184c5..b41213a
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 735,740 ****
--- 735,741 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 50f0852..1a38407
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Plan *runion, Plan 
*** 1281,1295 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1281,1302 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1319,1331 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1326,1367 ----
  		output_bytes = input_bytes;
  	}
  
! 	if (presorted_keys > 0)
! 	{
! 		List *groupExprs = NIL;
! 		ListCell *l;
! 		int i = 0;
! 
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			groupExprs = lappend(groupExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		num_groups = estimate_num_groups(root, groupExprs, tuples);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1335,1341 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1371,1377 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1346,1355 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1382,1391 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1357,1368 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
--- 1393,1404 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
  	/*
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1373,1380 ****
--- 1409,1423 ----
  	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
  	 * counting the LIMIT otherwise.
  	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
  	run_cost += cpu_operator_cost * tuples;
  
+ 	startup_cost += input_run_cost / num_groups;
+ 	run_cost += input_run_cost * ((num_groups - 1.0) / num_groups);
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2075,2080 ****
--- 2118,2125 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->parent->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2101,2106 ****
--- 2146,2153 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->parent->width,
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
new file mode 100644
index 5b477e5..5909dfe
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** sort_inner_and_outer(PlannerInfo *root,
*** 662,668 ****
  		cur_mergeclauses = find_mergeclauses_for_pathkeys(root,
  														  outerkeys,
  														  true,
! 														  mergeclause_list);
  
  		/* Should have used them all... */
  		Assert(list_length(cur_mergeclauses) == list_length(mergeclause_list));
--- 662,670 ----
  		cur_mergeclauses = find_mergeclauses_for_pathkeys(root,
  														  outerkeys,
  														  true,
! 														  mergeclause_list,
! 														  NULL,
! 														  NULL);
  
  		/* Should have used them all... */
  		Assert(list_length(cur_mergeclauses) == list_length(mergeclause_list));
*************** match_unsorted_outer(PlannerInfo *root,
*** 832,837 ****
--- 834,840 ----
  		List	   *mergeclauses;
  		List	   *innersortkeys;
  		List	   *trialsortkeys;
+ 		List	   *outersortkeys;
  		Path	   *cheapest_startup_inner;
  		Path	   *cheapest_total_inner;
  		int			num_sortkeys;
*************** match_unsorted_outer(PlannerInfo *root,
*** 937,943 ****
  		mergeclauses = find_mergeclauses_for_pathkeys(root,
  													  outerpath->pathkeys,
  													  true,
! 													  mergeclause_list);
  
  		/*
  		 * Done with this outer path if no chance for a mergejoin.
--- 940,948 ----
  		mergeclauses = find_mergeclauses_for_pathkeys(root,
  													  outerpath->pathkeys,
  													  true,
! 													  mergeclause_list,
! 													  joinrel,
! 													  &outersortkeys);
  
  		/*
  		 * Done with this outer path if no chance for a mergejoin.
*************** match_unsorted_outer(PlannerInfo *root,
*** 961,967 ****
  		/* Compute the required ordering of the inner path */
  		innersortkeys = make_inner_pathkeys_for_merge(root,
  													  mergeclauses,
! 													  outerpath->pathkeys);
  
  		/*
  		 * Generate a mergejoin on the basis of sorting the cheapest inner.
--- 966,972 ----
  		/* Compute the required ordering of the inner path */
  		innersortkeys = make_inner_pathkeys_for_merge(root,
  													  mergeclauses,
! 													  outersortkeys);
  
  		/*
  		 * Generate a mergejoin on the basis of sorting the cheapest inner.
*************** match_unsorted_outer(PlannerInfo *root,
*** 980,986 ****
  						   restrictlist,
  						   merge_pathkeys,
  						   mergeclauses,
! 						   NIL,
  						   innersortkeys);
  
  		/* Can't do anything else if inner path needs to be unique'd */
--- 985,991 ----
  						   restrictlist,
  						   merge_pathkeys,
  						   mergeclauses,
! 						   outersortkeys,
  						   innersortkeys);
  
  		/* Can't do anything else if inner path needs to be unique'd */
*************** match_unsorted_outer(PlannerInfo *root,
*** 1038,1044 ****
  		for (sortkeycnt = num_sortkeys; sortkeycnt > 0; sortkeycnt--)
  		{
  			Path	   *innerpath;
- 			List	   *newclauses = NIL;
  
  			/*
  			 * Look for an inner path ordered well enough for the first
--- 1043,1048 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1055,1073 ****
  				 compare_path_costs(innerpath, cheapest_total_inner,
  									TOTAL_COST) < 0))
  			{
- 				/* Found a cheap (or even-cheaper) sorted path */
- 				/* Select the right mergeclauses, if we didn't already */
- 				if (sortkeycnt < num_sortkeys)
- 				{
- 					newclauses =
- 						find_mergeclauses_for_pathkeys(root,
- 													   trialsortkeys,
- 													   false,
- 													   mergeclauses);
- 					Assert(newclauses != NIL);
- 				}
- 				else
- 					newclauses = mergeclauses;
  				try_mergejoin_path(root,
  								   joinrel,
  								   jointype,
--- 1059,1064 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1078,1086 ****
  								   innerpath,
  								   restrictlist,
  								   merge_pathkeys,
! 								   newclauses,
! 								   NIL,
! 								   NIL);
  				cheapest_total_inner = innerpath;
  			}
  			/* Same on the basis of cheapest startup cost ... */
--- 1069,1077 ----
  								   innerpath,
  								   restrictlist,
  								   merge_pathkeys,
! 								   mergeclauses,
! 								   outersortkeys,
! 								   innersortkeys);
  				cheapest_total_inner = innerpath;
  			}
  			/* Same on the basis of cheapest startup cost ... */
*************** match_unsorted_outer(PlannerInfo *root,
*** 1096,1119 ****
  				/* Found a cheap (or even-cheaper) sorted path */
  				if (innerpath != cheapest_total_inner)
  				{
- 					/*
- 					 * Avoid rebuilding clause list if we already made one;
- 					 * saves memory in big join trees...
- 					 */
- 					if (newclauses == NIL)
- 					{
- 						if (sortkeycnt < num_sortkeys)
- 						{
- 							newclauses =
- 								find_mergeclauses_for_pathkeys(root,
- 															   trialsortkeys,
- 															   false,
- 															   mergeclauses);
- 							Assert(newclauses != NIL);
- 						}
- 						else
- 							newclauses = mergeclauses;
- 					}
  					try_mergejoin_path(root,
  									   joinrel,
  									   jointype,
--- 1087,1092 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1124,1132 ****
  									   innerpath,
  									   restrictlist,
  									   merge_pathkeys,
! 									   newclauses,
! 									   NIL,
! 									   NIL);
  				}
  				cheapest_startup_inner = innerpath;
  			}
--- 1097,1105 ----
  									   innerpath,
  									   restrictlist,
  									   merge_pathkeys,
! 									   mergeclauses,
! 									   outersortkeys,
! 									   innersortkeys);
  				}
  				cheapest_startup_inner = innerpath;
  			}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 9c8ede6..63c0b03
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 26,31 ****
--- 26,32 ----
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static PathKey *make_canonical_pathkey(PlannerInfo *root,
*************** compare_pathkeys(List *keys1, List *keys
*** 312,317 ****
--- 313,344 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns the length of the longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 368,373 ****
--- 395,421 ----
  	return matched_path;
  }
  
+ static int
+ compare_bifractional_path_costs(Path *path1, Path *path2,
+ 							  double fraction1, double fraction2)
+ {
+ 	Cost		cost1,
+ 				cost2;
+ 
+ 	if (fraction1 <= 0.0 || fraction1 >= 1.0 ||
+ 			fraction2 <= 0.0 || fraction2 >= 1.0)
+ 		return compare_path_costs(path1, path2, TOTAL_COST);
+ 	cost1 = path1->startup_cost +
+ 		fraction1 * (path1->total_cost - path1->startup_cost);
+ 	cost2 = path2->startup_cost +
+ 		fraction2 * (path2->total_cost - path2->startup_cost);
+ 	if (cost1 < cost2)
+ 		return -1;
+ 	if (cost1 > cost2)
+ 		return +1;
+ 	return 0;
+ }
+ 
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
*************** Path *
*** 386,411 ****
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction)
  {
  	Path	   *matched_path = NULL;
  	ListCell   *l;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
  
  		/*
  		 * Since cost comparison is a lot cheaper than pathkey comparison, do
  		 * that first.	(XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
! 			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
--- 434,508 ----
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *num_groups, matched_fraction;
+ 	int			i;
+ 
+ 	i = 0;
+ 	num_groups = (double *)palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							lfirst(list_head(key->pk_eclass->ec_members));
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		num_groups[i] = estimate_num_groups(root, groupExprs, tuples);
+ 		i++;
+ 	}
+ 
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
+ 		double		current_fraction;
+ 
+ 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+ 		if (n_common_pathkeys < matched_n_common_pathkeys ||
+ 				n_common_pathkeys == 0)
+ 			continue;
+ 
+ 		current_fraction = fraction;
+ 		if (n_common_pathkeys < n_pathkeys)
+ 		{
+ 			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
! 			current_fraction = Min(current_fraction, 1.0);
+ 		}
  
  		/*
  		 * Since cost comparison is a lot cheaper than pathkey comparison, do
  		 * that first.	(XXX is that still true?)
  		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_bifractional_path_costs(matched_path, path,
! 					matched_fraction, current_fraction);
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
  
! 		if ((
! 				n_common_pathkeys > matched_n_common_pathkeys
! 				||	(n_common_pathkeys == matched_n_common_pathkeys
! 					 && costs_cmp > 0)) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 			matched_fraction = current_fraction;
+ 		}
  	}
  	return matched_path;
  }
*************** List *
*** 965,974 ****
  find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos)
  {
  	List	   *mergeclauses = NIL;
  	ListCell   *i;
  
  	/* make sure we have eclasses cached in the clauses */
  	foreach(i, restrictinfos)
--- 1062,1077 ----
  find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos,
! 							   RelOptInfo *joinrel,
! 							   List **outersortkeys)
  {
  	List	   *mergeclauses = NIL;
  	ListCell   *i;
+ 	bool	   *used = (bool *)palloc0(sizeof(bool) * list_length(restrictinfos));
+ 	int			k;
+ 	List	   *unusedRestrictinfos = NIL;
+ 	List	   *usedPathkeys = NIL;
  
  	/* make sure we have eclasses cached in the clauses */
  	foreach(i, restrictinfos)
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1021,1026 ****
--- 1124,1130 ----
  		 * deal with the case in create_mergejoin_plan().
  		 *----------
  		 */
+ 		k = 0;
  		foreach(j, restrictinfos)
  		{
  			RestrictInfo *rinfo = (RestrictInfo *) lfirst(j);
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1033,1039 ****
--- 1137,1147 ----
  				clause_ec = rinfo->outer_is_left ?
  					rinfo->right_ec : rinfo->left_ec;
  			if (clause_ec == pathkey_ec)
+ 			{
  				matched_restrictinfos = lappend(matched_restrictinfos, rinfo);
+ 				used[k] = true;
+ 			}
+ 			k++;
  		}
  
  		/*
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1044,1049 ****
--- 1152,1159 ----
  		if (matched_restrictinfos == NIL)
  			break;
  
+ 		usedPathkeys = lappend(usedPathkeys, pathkey);
+ 
  		/*
  		 * If we did find usable mergeclause(s) for this sort-key position,
  		 * add them to result list.
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1051,1056 ****
--- 1161,1201 ----
  		mergeclauses = list_concat(mergeclauses, matched_restrictinfos);
  	}
  
+ 	if (outersortkeys)
+ 	{
+ 		List *addPathkeys, *addMergeclauses;
+ 
+ 		*outersortkeys = pathkeys;
+ 
+ 		if (!mergeclauses)
+ 			return mergeclauses;
+ 
+ 		k = 0;
+ 		foreach(i, restrictinfos)
+ 		{
+ 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(i);
+ 			if (!used[k])
+ 				unusedRestrictinfos = lappend(unusedRestrictinfos, rinfo);
+ 			k++;
+ 		}
+ 
+ 		if (!unusedRestrictinfos)
+ 			return mergeclauses;
+ 
+ 		addPathkeys = select_outer_pathkeys_for_merge(root,
+ 												unusedRestrictinfos, joinrel);
+ 
+ 		if (!addPathkeys)
+ 			return mergeclauses;
+ 
+ 		addMergeclauses = find_mergeclauses_for_pathkeys(root,
+ 				addPathkeys, true, unusedRestrictinfos, NULL, NULL);
+ 
+ 		*outersortkeys = list_concat(usedPathkeys, addPathkeys);
+ 		mergeclauses = list_concat(mergeclauses, addMergeclauses);
+ 
+ 	}
+ 
  	return mergeclauses;
  }
  
*************** right_merge_direction(PlannerInfo *root,
*** 1457,1472 ****
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
  	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
  		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
  
  	return 0;					/* path ordering not useful */
--- 1602,1621 ----
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
+ 	int n;
+ 
  	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n = pathkeys_common(root->query_pathkeys, pathkeys);
! 
! 	if (n != 0)
  	{
  		/* It's useful ... or at least the first N keys are */
! 		return n;
  	}
  
  	return 0;					/* path ordering not useful */
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index f2c122d..a300342
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 149,154 ****
--- 149,155 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
  static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples);
*************** create_merge_append_plan(PlannerInfo *ro
*** 774,779 ****
--- 775,781 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		subplan = create_plan_recurse(root, subpath);
*************** create_merge_append_plan(PlannerInfo *ro
*** 807,814 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
--- 809,818 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
+ 										 pathkeys, n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2184,2192 ****
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									outer_plan,
! 									best_path->outersortkeys,
! 									-1.0);
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
--- 2188,2198 ----
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								outer_plan,
! 								best_path->outersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys));
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2197,2205 ****
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									inner_plan,
! 									best_path->innersortkeys,
! 									-1.0);
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
--- 2203,2213 ----
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								inner_plan,
! 								best_path->innersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys));
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
*************** make_mergejoin(List *tlist,
*** 3739,3744 ****
--- 3747,3753 ----
   */
  static Sort *
  make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples)
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 3748,3754 ****
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 3757,3764 ----
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, pathkeys, skipCols,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 3762,3767 ****
--- 3772,3778 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 4090,4096 ****
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 4101,4107 ----
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(PlannerInfo *roo
*** 4110,4116 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
--- 4121,4127 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
*************** make_sort_from_sortclauses(PlannerInfo *
*** 4153,4159 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4164,4170 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, NIL, 0,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
*************** Sort *
*** 4175,4181 ****
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 4186,4193 ----
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 List *pathkeys, int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(PlannerInfo *ro
*** 4208,4214 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4220,4226 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 53fc238..4675402
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
*************** build_minmax_path(PlannerInfo *root, Min
*** 494,500 ****
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction);
  	if (!sorted_path)
  		return false;
  
--- 494,502 ----
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction,
! 												  subroot,
! 												  final_rel->rows);
  	if (!sorted_path)
  		return false;
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 1da4b2f..df5563a
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** grouping_planner(PlannerInfo *root, doub
*** 1349,1355 ****
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
--- 1349,1357 ----
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction,
! 													  root,
! 													  path_rows);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
*************** grouping_planner(PlannerInfo *root, doub
*** 1365,1374 ****
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
  
  			if (root->query_pathkeys == NIL ||
! 				pathkeys_contained_in(root->query_pathkeys,
! 									  cheapest_path->pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
--- 1367,1380 ----
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
+ 			Path		partial_sort_path;	/* dummy for result of cost_sort */
+ 			int			n_common_pathkeys;
+ 
+ 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
+ 												cheapest_path->pathkeys);
  
  			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
*************** grouping_planner(PlannerInfo *root, doub
*** 1378,1389 ****
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			if (compare_fractional_path_costs(sorted_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
--- 1384,1418 ----
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
+ 						  n_common_pathkeys,
+ 						  cheapest_path->startup_cost,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
! 												sorted_path->pathkeys);
! 
! 			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
! 			{
! 				/* No sort needed for cheapest path */
! 				partial_sort_path.startup_cost = sorted_path->startup_cost;
! 				partial_sort_path.total_cost = sorted_path->total_cost;
! 			}
! 			else
! 			{
! 				/* Figure cost for sorting */
! 				cost_sort(&partial_sort_path, root, root->query_pathkeys,
! 						  n_common_pathkeys,
! 						  sorted_path->startup_cost,
! 						  sorted_path->total_cost,
! 						  path_rows, path_width,
! 						  0.0, work_mem, root->limit_tuples);
! 			}
! 
! 			if (compare_fractional_path_costs(&partial_sort_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
*************** grouping_planner(PlannerInfo *root, doub
*** 1464,1476 ****
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
  			if (parse->groupClause && !use_hashed_grouping &&
! 			  !pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
--- 1493,1508 ----
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
+ 			int			n_common_pathkeys_grouping;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
+ 			n_common_pathkeys_grouping = pathkeys_common(root->group_pathkeys,
+ 														 current_pathkeys);
  			if (parse->groupClause && !use_hashed_grouping &&
! 				n_common_pathkeys_grouping < list_length(root->group_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
*************** grouping_planner(PlannerInfo *root, doub
*** 1564,1570 ****
  							make_sort_from_groupcols(root,
  													 parse->groupClause,
  													 groupColIdx,
! 													 result_plan);
  						current_pathkeys = root->group_pathkeys;
  					}
  					aggstrategy = AGG_SORTED;
--- 1596,1604 ----
  							make_sort_from_groupcols(root,
  													 parse->groupClause,
  													 groupColIdx,
! 													 result_plan,
! 													 root->group_pathkeys,
! 													n_common_pathkeys_grouping);
  						current_pathkeys = root->group_pathkeys;
  					}
  					aggstrategy = AGG_SORTED;
*************** grouping_planner(PlannerInfo *root, doub
*** 1607,1613 ****
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan);
  					current_pathkeys = root->group_pathkeys;
  				}
  
--- 1641,1649 ----
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan,
! 												 root->group_pathkeys,
! 												 n_common_pathkeys_grouping);
  					current_pathkeys = root->group_pathkeys;
  				}
  
*************** grouping_planner(PlannerInfo *root, doub
*** 1724,1736 ****
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0);
! 					if (!pathkeys_contained_in(window_pathkeys,
! 											   current_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
--- 1760,1776 ----
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
+ 					int			n_common_pathkeys;
+ 
+ 					n_common_pathkeys = pathkeys_common(window_pathkeys,
+ 													    current_pathkeys);
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0,
! 														n_common_pathkeys);
! 					if (n_common_pathkeys < list_length(window_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
*************** grouping_planner(PlannerInfo *root, doub
*** 1876,1894 ****
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					current_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					current_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 current_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															current_pathkeys,
! 															   -1.0);
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
--- 1916,1936 ----
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					needed_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					needed_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 needed_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															   needed_pathkeys,
! 															   -1.0,
! 							pathkeys_common(needed_pathkeys, current_pathkeys));
! 				current_pathkeys = needed_pathkeys;
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
*************** grouping_planner(PlannerInfo *root, doub
*** 1904,1915 ****
  	 */
  	if (parse->sortClause)
  	{
! 		if (!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
--- 1946,1960 ----
  	 */
  	if (parse->sortClause)
  	{
! 		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
! 		
! 		if (common < list_length(root->sort_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples,
! 														   common);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
*************** choose_hashed_grouping(PlannerInfo *root
*** 2654,2659 ****
--- 2699,2705 ----
  	List	   *current_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * Executor doesn't support hashed aggregation with DISTINCT or ORDER BY
*************** choose_hashed_grouping(PlannerInfo *root
*** 2735,2741 ****
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2781,2788 ----
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_grouping(PlannerInfo *root
*** 2751,2759 ****
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 	if (!pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
--- 2798,2809 ----
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 
! 	n_common_pathkeys = pathkeys_common(root->group_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(root->group_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
*************** choose_hashed_grouping(PlannerInfo *root
*** 2768,2777 ****
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
  	/* The Agg or Group node will preserve ordering */
! 	if (target_pathkeys &&
! 		!pathkeys_contained_in(target_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2818,2829 ----
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
+ 
  	/* The Agg or Group node will preserve ordering */
! 	n_common_pathkeys = pathkeys_common(target_pathkeys, current_pathkeys);
! 	if (target_pathkeys && n_common_pathkeys < list_length(target_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 2824,2829 ****
--- 2876,2882 ----
  	List	   *needed_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * If we have a sortable DISTINCT ON clause, we always use sorting. This
*************** choose_hashed_distinct(PlannerInfo *root
*** 2889,2895 ****
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2942,2949 ----
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 2906,2928 ****
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 	if (!pathkeys_contained_in(needed_pathkeys, current_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
  	if (parse->sortClause &&
! 		!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2960,2989 ----
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 
! 	n_common_pathkeys = pathkeys_common(needed_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(needed_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
+ 
+ 
+ 	n_common_pathkeys = pathkeys_common(root->sort_pathkeys, current_pathkeys);
  	if (parse->sortClause &&
! 		n_common_pathkeys < list_length(root->sort_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 3712,3719 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 3773,3781 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index e249628..b0b5471
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 859,865 ****
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 859,866 ----
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index a7169ef..3d0a842
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** create_merge_append_path(PlannerInfo *ro
*** 971,980 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 971,981 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 988,993 ****
--- 989,996 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->parent->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1343,1349 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
--- 1346,1353 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 52f05e6..6a09138
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** free_sort_tuple(Tuplesortstate *state, S
*** 3525,3527 ****
--- 3525,3534 ----
  	FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
  	pfree(stup->tuple);
  }
+ 
+ SortSupport
+ tuplesort_get_sortkeys(Tuplesortstate *state)
+ {
+ 	return state->sortKeys;
+ }
+ 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 2a7b36e..76aab79
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct SortState
*** 1664,1671 ****
--- 1664,1673 ----
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	bool		finished;
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	HeapTuple	prev;
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 101e22c..28b871e
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 582,587 ****
--- 582,588 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 444ab74..e98fb0c
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 88,95 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 88,96 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 999adaa..043641d
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 157,169 ****
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--- 157,172 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
*************** extern void update_mergeclause_eclasses(
*** 185,191 ****
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos);
  extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
  								List *mergeclauses,
  								RelOptInfo *joinrel);
--- 188,196 ----
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos,
! 							   RelOptInfo *joinrel,
! 							   List **outerpathkeys);
  extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
  								List *mergeclauses,
  								RelOptInfo *joinrel);
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
new file mode 100644
index ba7ae7c..d33c615
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*************** extern RecursiveUnion *make_recursive_un
*** 50,60 ****
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
--- 50,61 ----
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples, int skipCols);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5f87254..5a65cd2
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
***************
*** 24,29 ****
--- 24,30 ----
  #include "executor/tuptable.h"
  #include "fmgr.h"
  #include "utils/relcache.h"
+ #include "utils/sortsupport.h"
  
  
  /* Tuplesortstate is an opaque type whose details are not known outside
*************** extern void tuplesort_get_stats(Tuplesor
*** 111,116 ****
--- 112,119 ----
  
  extern int	tuplesort_merge_order(int64 allowedMem);
  
+ extern SortSupport tuplesort_get_sortkeys(Tuplesortstate *state);
+ 
  /*
   * These routines may only be called if randomAccess was specified 'true'.
   * Likewise, backwards scan in gettuple/getdatum is only allowed if
#20David Rowley
dgrowleyml@gmail.com
In reply to: Alexander Korotkov (#19)
Re: PoC: Partial sort

On Sun, Dec 29, 2013 at 4:51 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

I've compiled it with clang. Yeah, there were mixed declarations. I've
rechecked it with gcc, now it gives no warnings. I didn't try it with
Visual Studio, but I hope it will be OK.

Thanks for the patch. It now compiles without any problems.
I've been testing the patch with a few different workloads. One thing I've
found is that in my test case, when the table contains only 1 tuple for any
given set of presort column values, the query is actually slower than when
there are, say, 100 tuples to sort in each presort group.
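
For readers following along, a "presort group" here is a run of consecutive rows that share the same values in the already-ordered columns. The sketch below is only a toy model of that idea, not the executor code from the patch: the Row type, cmp_v2() and partial_sort_rows() are hypothetical names, and plain ints stand in for tuples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical two-column row; the input is assumed to arrive ordered by v1. */
typedef struct Row
{
	int			v1;				/* presorted column */
	int			v2;				/* column still to be sorted */
} Row;

/* Compare two rows on v2 only; v1 is equal within a group. */
static int
cmp_v2(const void *a, const void *b)
{
	const Row  *x = (const Row *) a;
	const Row  *y = (const Row *) b;

	return (x->v2 > y->v2) - (x->v2 < y->v2);
}

/*
 * Sort each run of equal v1 values by v2, leaving the order of the runs
 * untouched.  Returns the number of groups that were sorted.
 */
size_t
partial_sort_rows(Row *rows, size_t n)
{
	size_t		start = 0;
	size_t		ngroups = 0;

	assert(rows != NULL || n == 0);

	while (start < n)
	{
		size_t		end = start + 1;

		/* Extend the group while the presorted column stays the same. */
		while (end < n && rows[end].v1 == rows[start].v1)
			end++;

		qsort(rows + start, end - start, sizeof(Row), cmp_v2);
		ngroups++;
		start = end;
	}
	return ngroups;
}
```

With one row per v1 value, the loop runs once per row and every sort call handles a single element, which is the degenerate shape my test case exercises.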

Here is my test case:

DROP TABLE IF EXISTS temperature_readings;

CREATE TABLE temperature_readings (
readingid SERIAL NOT NULL,
timestamp TIMESTAMP NOT NULL,
locationid INT NOT NULL,
temperature INT NOT NULL,
PRIMARY KEY (readingid)
);

INSERT INTO temperature_readings (timestamp,locationid,temperature)
SELECT ts.timestamp, loc.locationid, -10 + random() * 40
FROM generate_series('1900-04-01','2000-04-01','1 day'::interval)
ts(timestamp)
CROSS JOIN generate_series(1,1) loc(locationid);

VACUUM ANALYZE temperature_readings;

-- Warm buffers
SELECT AVG(temperature) FROM temperature_readings;

explain (buffers, analyze) select * from temperature_readings order by
timestamp,locationid; -- (seqscan -> sort) 70.805ms

-- create an index to allow presorting on timestamp.
CREATE INDEX temperature_readings_timestamp_idx ON temperature_readings
(timestamp);

-- warm index buffers
SELECT COUNT(*) FROM (SELECT timestamp FROM temperature_readings ORDER BY
timestamp) c;

explain (buffers, analyze) select * from temperature_readings order by
timestamp,locationid; -- index scan -> partial sort 253.032ms

The first query, without the index to presort on, runs in 70.805 ms; the 2nd
query uses the index to presort and runs in 253.032 ms.

I ran the code through a performance profiler and found that about 80% of
the time is spent in tuplesort_end and tuplesort_begin_heap.

If it were possible to devise some way to reuse any previous tuplesortstate,
perhaps by just inventing a reset method which clears out the tuples, then we
could see performance exceed the standard seqscan -> sort. The code as it
stands seems to look up the sort functions from the syscache for each group
and then allocate some sort space, so quite a bit of time is also spent in
palloc0() and pfree().
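
To make the reset idea concrete, here is a minimal sketch under stated assumptions. It is not tuplesort.c and not the attached patch: GroupSorter and the sorter_*() functions are hypothetical, ints stand in for tuples, and malloc() stands in for palloc(). The point is only that setup happens once and each new group reuses the same buffer.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a sort state that survives across groups. */
typedef struct GroupSorter
{
	int		   *items;			/* buffer kept across groups */
	size_t		nitems;			/* items in the current group */
	size_t		capacity;		/* allocated slots in items */
	int			nallocs;		/* how often the buffer was (re)allocated */
} GroupSorter;

static int
cmp_int(const void *a, const void *b)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	return (x > y) - (x < y);
}

/* One-time setup: the part we do not want to repeat for every group. */
GroupSorter *
sorter_begin(size_t initial_capacity)
{
	GroupSorter *s = malloc(sizeof(GroupSorter));

	if (s == NULL)
		return NULL;
	s->capacity = initial_capacity > 0 ? initial_capacity : 1;
	s->items = malloc(s->capacity * sizeof(int));
	if (s->items == NULL)
	{
		free(s);
		return NULL;
	}
	s->nitems = 0;
	s->nallocs = 1;
	return s;
}

/* Add one value to the current group; returns 0 on success, -1 on OOM. */
int
sorter_put(GroupSorter *s, int value)
{
	assert(s != NULL);
	if (s->nitems == s->capacity)
	{
		size_t		newcap = s->capacity * 2;
		int		   *p = realloc(s->items, newcap * sizeof(int));

		if (p == NULL)
			return -1;
		s->items = p;
		s->capacity = newcap;
		s->nallocs++;
	}
	s->items[s->nitems++] = value;
	return 0;
}

/* Sort the current group in place. */
void
sorter_perform(GroupSorter *s)
{
	assert(s != NULL);
	qsort(s->items, s->nitems, sizeof(int), cmp_int);
}

/* Forget the current group's items but keep the buffer for the next group. */
void
sorter_reset(GroupSorter *s)
{
	assert(s != NULL);
	s->nitems = 0;
}

/* Release everything once the last group has been returned. */
void
sorter_end(GroupSorter *s)
{
	assert(s != NULL);
	free(s->items);
	free(s);
}
```

A caller would invoke sorter_begin() once, then loop over put/perform/reset for each group, and call sorter_end() at the very end, instead of pairing a begin and an end around every group.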

If that is not possible, then maybe charging a cost per sort group would be
better, so that the optimization is skipped when there are too many sort
groups.
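
As a rough illustration of what such a per-group charge could look like, here is a sketch with made-up names and constants. It is not cost_sort() from the patch: approx_log2() is a deliberately coarse floor(log2) so the example needs no math library, and per_group_cost is an arbitrary knob standing in for the setup and teardown overhead of each group.

```c
#include <assert.h>

/* Coarse floor(log2(x)) for x >= 1; returns 0 for anything below 2. */
static double
approx_log2(double x)
{
	double		result = 0.0;

	while (x >= 2.0)
	{
		x /= 2.0;
		result += 1.0;
	}
	return result;
}

/* Comparison cost of sorting all tuples in one go: about N * log2(N). */
double
full_sort_cost(double tuples, double cmp_cost)
{
	assert(tuples >= 0.0);
	return cmp_cost * tuples * approx_log2(tuples);
}

/*
 * Cost of sorting the same tuples as 'groups' equally sized groups, with a
 * fixed charge per group on top of the comparisons inside each group.
 */
double
partial_sort_cost(double tuples, double groups,
				  double cmp_cost, double per_group_cost)
{
	double		group_size;

	assert(tuples >= 0.0);
	if (tuples < 1.0)
		return 0.0;
	if (groups < 1.0)
		groups = 1.0;
	if (groups > tuples)
		groups = tuples;

	group_size = tuples / groups;
	return groups * (per_group_cost + full_sort_cost(group_size, cmp_cost));
}

/* Nonzero when the grouped sort is estimated to be cheaper than a full sort. */
int
partial_sort_is_cheaper(double tuples, double groups,
						double cmp_cost, double per_group_cost)
{
	return partial_sort_cost(tuples, groups, cmp_cost, per_group_cost) <
		full_sort_cost(tuples, cmp_cost);
}
```

In this toy model, one tuple per group leaves nothing but the per-group charge, so a large enough charge makes the planner prefer the plain sort, while groups of around 100 tuples can still win.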

Regards

David Rowley

------
With best regards,
Alexander Korotkov.

#21Andreas Karlsson
andreas@proxel.se
In reply to: David Rowley (#20)
2 attachment(s)
Re: PoC: Partial sort

On 12/29/2013 08:24 AM, David Rowley wrote:

If it were possible to devise some way to reuse any
previous tuplesortstate, perhaps by just inventing a reset method which
clears out the tuples, then we could see performance exceed the standard
seqscan -> sort. The code as it stands seems to look up the sort
functions from the syscache for each group and then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree().

If that is not possible, then maybe charging a cost per sort group
would be better, so that the optimization is skipped when there are
too many sort groups.

It should be possible. I have hacked a quick proof of concept for
reusing the tuplesort state. Can you try it and see if the performance
regression is fixed by this?

One thing that has to be fixed in my patch is that we probably want
to close the tuplesort once we have returned the last tuple from ExecSort().

I have attached my patch and the incremental patch on top of Alexander's patch.

--
Andreas Karlsson

Attachments:

partial-sort-4-reset.patch (text/x-patch)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 9969a25..07cb66d
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_agg_keys(AggState *asta
*** 81,87 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
--- 81,87 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 905,911 ****
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
--- 905,914 ----
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			if (((Sort *) plan)->skipCols > 0)
! 				pname = sname = "Partial sort";
! 			else
! 				pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
*************** show_sort_keys(SortState *sortstate, Lis
*** 1705,1711 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
--- 1708,1714 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->skipCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
*************** show_merge_append_keys(MergeAppendState
*** 1719,1725 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
--- 1722,1728 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 ancestors, es);
  }
  
*************** show_agg_keys(AggState *astate, List *an
*** 1737,1743 ****
  		/* The key columns refer to the tlist of the child plan */
  		ancestors = lcons(astate, ancestors);
  		show_sort_group_keys(outerPlanState(astate), "Group Key",
! 							 plan->numCols, plan->grpColIdx,
  							 ancestors, es);
  		ancestors = list_delete_first(ancestors);
  	}
--- 1740,1746 ----
  		/* The key columns refer to the tlist of the child plan */
  		ancestors = lcons(astate, ancestors);
  		show_sort_group_keys(outerPlanState(astate), "Group Key",
! 							 plan->numCols, 0, plan->grpColIdx,
  							 ancestors, es);
  		ancestors = list_delete_first(ancestors);
  	}
*************** show_group_keys(GroupState *gstate, List
*** 1755,1761 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
  }
--- 1758,1764 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
  }
*************** show_group_keys(GroupState *gstate, List
*** 1765,1777 ****
   * as arrays of targetlist indexes
   */
  static void
! show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
! 	List	   *result = NIL;
  	bool		useprefix;
  	int			keyno;
  	char	   *exprstr;
--- 1768,1781 ----
   * as arrays of targetlist indexes
   */
  static void
! show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
! 	List	   *resultSort = NIL;
! 	List	   *resultPresorted = NIL;
  	bool		useprefix;
  	int			keyno;
  	char	   *exprstr;
*************** show_sort_group_keys(PlanState *planstat
*** 1798,1807 ****
  		/* Deparse the expression, showing any top-level cast */
  		exprstr = deparse_expression((Node *) target->expr, context,
  									 useprefix, true);
! 		result = lappend(result, exprstr);
  	}
  
! 	ExplainPropertyList(qlabel, result, es);
  }
  
  /*
--- 1802,1816 ----
  		/* Deparse the expression, showing any top-level cast */
  		exprstr = deparse_expression((Node *) target->expr, context,
  									 useprefix, true);
! 
! 		if (keyno < nPresortedKeys)
! 			resultPresorted = lappend(resultPresorted, exprstr);
! 		resultSort = lappend(resultSort, exprstr);
  	}
  
! 	ExplainPropertyList(qlabel, resultSort, es);
! 	if (nPresortedKeys > 0)
! 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 09b2eb0..c25ed7d
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,52 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check whether the first "skipCols" sort values of two tuples are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 	SortSupport sortKeys = tuplesort_get_sortkeys(node->tuplesortstate);
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = sortKeys[i].ssup_attno;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		if (ApplySortComparator(datumA, isnullA,
+ 								datumB, isnullB,
+ 								&sortKeys[i]))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 69,78 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+ 	PlanState  *outerNode;
+ 	TupleDesc	tupDesc;
+ 	int			skipCols = plannode->skipCols;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,87 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
! 
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
  											  plannode->numCols,
  											  plannode->sortColIdx,
--- 85,128 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
+ 	if (node->tuplesortstate != NULL)
+ 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ 	else
+ 	{
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
  											  plannode->numCols,
  											  plannode->sortColIdx,
*************** ExecSort(SortState *node)
*** 93,131 ****
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
  		{
- 			slot = ExecProcNode(outerNode);
- 
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
! 	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
--- 134,208 ----
  		if (node->bounded)
  			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
+ 	}
  
! 	/*
! 	 * Put the next group of tuples, in which the first "skipCols" sort
! 	 * values are equal, into tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
! 		if (skipCols == 0)
  		{
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
+ 		else if (node->prev)
+ 		{
+ 			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
  
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				bool cmp;
! 				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
! 				node->prev = ExecCopySlotTuple(slot);
! 				if (!cmp)
! 					break;
! 			}
! 		}
! 		else
! 		{
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				node->prev = ExecCopySlotTuple(slot);
! 			}
! 		}
! 	}
  
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
  
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 	node->bound_Done = node->bound;
! 	SO1_printf("ExecSort: %s\n", "sorting done");
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 251,259 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
  
  	/*
  	 * Miscellaneous initialization
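
To make the nodeSort.c change above easier to follow, here is a minimal standalone sketch of the same idea outside the executor. It assumes rows already arrive ordered by the first key and sorts each run of equal first-key values by the second key. The Row type and the function names are hypothetical and are not part of the patch; qsort() stands in for tuplesort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row: v1 plays the presorted key, v2 the remaining sort key. */
typedef struct Row
{
	int			v1;
	double		v2;
} Row;

/* Compare on v2 only; within one group all v1 values are equal. */
static int
cmp_v2(const void *a, const void *b)
{
	const Row  *ra = a;
	const Row  *rb = b;

	return (ra->v2 > rb->v2) - (ra->v2 < rb->v2);
}

/*
 * Order rows by (v1, v2), given that the input is already ordered by v1.
 * Each run of equal v1 values is sorted on its own, so no sort ever sees
 * more than one group at a time.
 */
static void
sort_presorted_groups(Row *rows, size_t n)
{
	size_t		start = 0;

	assert(rows != NULL || n == 0);

	while (start < n)
	{
		size_t		end = start + 1;

		/* Extend the group while the presorted key does not change. */
		while (end < n && rows[end].v1 == rows[start].v1)
			end++;

		qsort(rows + start, end - start, sizeof(Row), cmp_v2);
		start = end;
	}
}
```

Unlike this sketch, the patched ExecSort returns tuples from one group before reading the next, which is what makes LIMIT queries cheap.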
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index e4184c5..b41213a
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 735,740 ****
--- 735,741 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 50f0852..1a38407
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Plan *runion, Plan
*** 1281,1295 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1281,1302 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1319,1331 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1326,1367 ----
  		output_bytes = input_bytes;
  	}
  
! 	if (presorted_keys > 0)
! 	{
! 		List *groupExprs = NIL;
! 		ListCell *l;
! 		int i = 0;
! 
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			groupExprs = lappend(groupExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		num_groups = estimate_num_groups(root, groupExprs, tuples);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1335,1341 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1371,1377 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1346,1355 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1382,1391 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1357,1368 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
--- 1393,1404 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
  	/*
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1373,1380 ****
--- 1409,1423 ----
  	 * here --- the upper LIMIT will pro-rate the run cost so we'd be double
  	 * counting the LIMIT otherwise.
  	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
  	run_cost += cpu_operator_cost * tuples;
  
+ 	startup_cost += input_run_cost / num_groups;
+ 	run_cost += input_run_cost * ((num_groups - 1.0) / num_groups);
+ 
  	path->startup_cost = startup_cost;
  	path->total_cost = startup_cost + run_cost;
  }
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2075,2080 ****
--- 2118,2125 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->parent->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2101,2106 ****
--- 2146,2153 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->parent->width,
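
The tail of the patched cost_sort() splits the work between startup and run cost. The sketch below restates only that arithmetic as a standalone function, taking the per-group sort cost as an input instead of deriving it. The SortCost struct and the function name are hypothetical; the formulas are copied from the hunk above.

```c
/* Hypothetical result type: the two numbers cost_sort() stores in the Path. */
typedef struct SortCost
{
	double		startup;
	double		total;
} SortCost;

/*
 * Split the cost of a partial sort into startup and total cost.
 * Before the first tuple can be returned, one group must be read from the
 * input and sorted; the remaining groups are charged to run cost.
 */
static SortCost
split_partial_sort_cost(double input_startup, double input_total,
						double group_cost, double num_groups,
						double tuples, double output_tuples,
						double cpu_operator_cost)
{
	double		input_run = input_total - input_startup;
	double		startup = input_startup;
	double		run = 0.0;
	double		rest;
	SortCost	result;

	/* Sorting the first group is paid up front. */
	startup += group_cost;

	/* Sorting the other groups that contribute to the output. */
	rest = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
	if (rest > 0.0)
		run += rest;
	run += cpu_operator_cost * tuples;

	/* Reading the input: one group's share up front, the rest while running. */
	startup += input_run / num_groups;
	run += input_run * ((num_groups - 1.0) / num_groups);

	result.startup = startup;
	result.total = startup + run;
	return result;
}
```

With num_groups = 1 (no presorted keys) the whole input run cost and the whole sort cost land in startup, which matches the behaviour of an ordinary full sort.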
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
new file mode 100644
index 5b477e5..5909dfe
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** sort_inner_and_outer(PlannerInfo *root,
*** 662,668 ****
  		cur_mergeclauses = find_mergeclauses_for_pathkeys(root,
  														  outerkeys,
  														  true,
! 														  mergeclause_list);
  
  		/* Should have used them all... */
  		Assert(list_length(cur_mergeclauses) == list_length(mergeclause_list));
--- 662,670 ----
  		cur_mergeclauses = find_mergeclauses_for_pathkeys(root,
  														  outerkeys,
  														  true,
! 														  mergeclause_list,
! 														  NULL,
! 														  NULL);
  
  		/* Should have used them all... */
  		Assert(list_length(cur_mergeclauses) == list_length(mergeclause_list));
*************** match_unsorted_outer(PlannerInfo *root,
*** 832,837 ****
--- 834,840 ----
  		List	   *mergeclauses;
  		List	   *innersortkeys;
  		List	   *trialsortkeys;
+ 		List	   *outersortkeys;
  		Path	   *cheapest_startup_inner;
  		Path	   *cheapest_total_inner;
  		int			num_sortkeys;
*************** match_unsorted_outer(PlannerInfo *root,
*** 937,943 ****
  		mergeclauses = find_mergeclauses_for_pathkeys(root,
  													  outerpath->pathkeys,
  													  true,
! 													  mergeclause_list);
  
  		/*
  		 * Done with this outer path if no chance for a mergejoin.
--- 940,948 ----
  		mergeclauses = find_mergeclauses_for_pathkeys(root,
  													  outerpath->pathkeys,
  													  true,
! 													  mergeclause_list,
! 													  joinrel,
! 													  &outersortkeys);
  
  		/*
  		 * Done with this outer path if no chance for a mergejoin.
*************** match_unsorted_outer(PlannerInfo *root,
*** 961,967 ****
  		/* Compute the required ordering of the inner path */
  		innersortkeys = make_inner_pathkeys_for_merge(root,
  													  mergeclauses,
! 													  outerpath->pathkeys);
  
  		/*
  		 * Generate a mergejoin on the basis of sorting the cheapest inner.
--- 966,972 ----
  		/* Compute the required ordering of the inner path */
  		innersortkeys = make_inner_pathkeys_for_merge(root,
  													  mergeclauses,
! 													  outersortkeys);
  
  		/*
  		 * Generate a mergejoin on the basis of sorting the cheapest inner.
*************** match_unsorted_outer(PlannerInfo *root,
*** 980,986 ****
  						   restrictlist,
  						   merge_pathkeys,
  						   mergeclauses,
! 						   NIL,
  						   innersortkeys);
  
  		/* Can't do anything else if inner path needs to be unique'd */
--- 985,991 ----
  						   restrictlist,
  						   merge_pathkeys,
  						   mergeclauses,
! 						   outersortkeys,
  						   innersortkeys);
  
  		/* Can't do anything else if inner path needs to be unique'd */
*************** match_unsorted_outer(PlannerInfo *root,
*** 1038,1044 ****
  		for (sortkeycnt = num_sortkeys; sortkeycnt > 0; sortkeycnt--)
  		{
  			Path	   *innerpath;
- 			List	   *newclauses = NIL;
  
  			/*
  			 * Look for an inner path ordered well enough for the first
--- 1043,1048 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1055,1073 ****
  				 compare_path_costs(innerpath, cheapest_total_inner,
  									TOTAL_COST) < 0))
  			{
- 				/* Found a cheap (or even-cheaper) sorted path */
- 				/* Select the right mergeclauses, if we didn't already */
- 				if (sortkeycnt < num_sortkeys)
- 				{
- 					newclauses =
- 						find_mergeclauses_for_pathkeys(root,
- 													   trialsortkeys,
- 													   false,
- 													   mergeclauses);
- 					Assert(newclauses != NIL);
- 				}
- 				else
- 					newclauses = mergeclauses;
  				try_mergejoin_path(root,
  								   joinrel,
  								   jointype,
--- 1059,1064 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1078,1086 ****
  								   innerpath,
  								   restrictlist,
  								   merge_pathkeys,
! 								   newclauses,
! 								   NIL,
! 								   NIL);
  				cheapest_total_inner = innerpath;
  			}
  			/* Same on the basis of cheapest startup cost ... */
--- 1069,1077 ----
  								   innerpath,
  								   restrictlist,
  								   merge_pathkeys,
! 								   mergeclauses,
! 								   outersortkeys,
! 								   innersortkeys);
  				cheapest_total_inner = innerpath;
  			}
  			/* Same on the basis of cheapest startup cost ... */
*************** match_unsorted_outer(PlannerInfo *root,
*** 1096,1119 ****
  				/* Found a cheap (or even-cheaper) sorted path */
  				if (innerpath != cheapest_total_inner)
  				{
- 					/*
- 					 * Avoid rebuilding clause list if we already made one;
- 					 * saves memory in big join trees...
- 					 */
- 					if (newclauses == NIL)
- 					{
- 						if (sortkeycnt < num_sortkeys)
- 						{
- 							newclauses =
- 								find_mergeclauses_for_pathkeys(root,
- 															   trialsortkeys,
- 															   false,
- 															   mergeclauses);
- 							Assert(newclauses != NIL);
- 						}
- 						else
- 							newclauses = mergeclauses;
- 					}
  					try_mergejoin_path(root,
  									   joinrel,
  									   jointype,
--- 1087,1092 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1124,1132 ****
  									   innerpath,
  									   restrictlist,
  									   merge_pathkeys,
! 									   newclauses,
! 									   NIL,
! 									   NIL);
  				}
  				cheapest_startup_inner = innerpath;
  			}
--- 1097,1105 ----
  									   innerpath,
  									   restrictlist,
  									   merge_pathkeys,
! 									   mergeclauses,
! 									   outersortkeys,
! 									   innersortkeys);
  				}
  				cheapest_startup_inner = innerpath;
  			}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 9c8ede6..63c0b03
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 26,31 ****
--- 26,32 ----
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static PathKey *make_canonical_pathkey(PlannerInfo *root,
*************** compare_pathkeys(List *keys1, List *keys
*** 312,317 ****
--- 313,344 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 368,373 ****
--- 395,421 ----
  	return matched_path;
  }
  
+ static int
+ compare_bifractional_path_costs(Path *path1, Path *path2,
+ 							  double fraction1, double fraction2)
+ {
+ 	Cost		cost1,
+ 				cost2;
+ 
+ 	if (fraction1 <= 0.0 || fraction1 >= 1.0 ||
+ 			fraction2 <= 0.0 || fraction2 >= 1.0)
+ 		return compare_path_costs(path1, path2, TOTAL_COST);
+ 	cost1 = path1->startup_cost +
+ 		fraction1 * (path1->total_cost - path1->startup_cost);
+ 	cost2 = path2->startup_cost +
+ 		fraction2 * (path2->total_cost - path2->startup_cost);
+ 	if (cost1 < cost2)
+ 		return -1;
+ 	if (cost1 > cost2)
+ 		return +1;
+ 	return 0;
+ }
+ 
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
*************** Path *
*** 386,411 ****
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction)
  {
  	Path	   *matched_path = NULL;
  	ListCell   *l;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
  
  		/*
  		 * Since cost comparison is a lot cheaper than pathkey comparison, do
  		 * that first.	(XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
! 			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
--- 434,508 ----
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *num_groups, matched_fraction;
+ 	int			i;
+ 
+ 	i = 0;
+ 	num_groups = (double *)palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							lfirst(list_head(key->pk_eclass->ec_members));
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		num_groups[i] = estimate_num_groups(root, groupExprs, tuples);
+ 		i++;
+ 	}
+ 
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
+ 		double		current_fraction;
+ 
+ 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+ 		if (n_common_pathkeys < matched_n_common_pathkeys ||
+ 				n_common_pathkeys == 0)
+ 			continue;
+ 
+ 		current_fraction = fraction;
+ 		if (n_common_pathkeys < n_pathkeys)
+ 		{
+ 			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
+ 			current_fraction = Min(current_fraction, 1.0);
+ 		}
  
  		/*
  		 * Since cost comparison is a lot cheaper than pathkey comparison, do
  		 * that first.	(XXX is that still true?)
  		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_bifractional_path_costs(matched_path, path,
! 					matched_fraction, current_fraction);
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
  
! 		if ((
! 				n_common_pathkeys > matched_n_common_pathkeys
! 				||	(n_common_pathkeys == matched_n_common_pathkeys
! 					 && costs_cmp > 0)) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 			matched_fraction = current_fraction;
+ 		}
  	}
  	return matched_path;
  }
*************** List *
*** 965,974 ****
  find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos)
  {
  	List	   *mergeclauses = NIL;
  	ListCell   *i;
  
  	/* make sure we have eclasses cached in the clauses */
  	foreach(i, restrictinfos)
--- 1062,1077 ----
  find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos,
! 							   RelOptInfo *joinrel,
! 							   List **outersortkeys)
  {
  	List	   *mergeclauses = NIL;
  	ListCell   *i;
+ 	bool	   *used = (bool *)palloc0(sizeof(bool) * list_length(restrictinfos));
+ 	int			k;
+ 	List	   *unusedRestrictinfos = NIL;
+ 	List	   *usedPathkeys = NIL;
  
  	/* make sure we have eclasses cached in the clauses */
  	foreach(i, restrictinfos)
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1021,1026 ****
--- 1124,1130 ----
  		 * deal with the case in create_mergejoin_plan().
  		 *----------
  		 */
+ 		k = 0;
  		foreach(j, restrictinfos)
  		{
  			RestrictInfo *rinfo = (RestrictInfo *) lfirst(j);
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1033,1039 ****
--- 1137,1147 ----
  				clause_ec = rinfo->outer_is_left ?
  					rinfo->right_ec : rinfo->left_ec;
  			if (clause_ec == pathkey_ec)
+ 			{
  				matched_restrictinfos = lappend(matched_restrictinfos, rinfo);
+ 				used[k] = true;
+ 			}
+ 			k++;
  		}
  
  		/*
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1044,1049 ****
--- 1152,1159 ----
  		if (matched_restrictinfos == NIL)
  			break;
  
+ 		usedPathkeys = lappend(usedPathkeys, pathkey);
+ 
  		/*
  		 * If we did find usable mergeclause(s) for this sort-key position,
  		 * add them to result list.
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1051,1056 ****
--- 1161,1201 ----
  		mergeclauses = list_concat(mergeclauses, matched_restrictinfos);
  	}
  
+ 	if (outersortkeys)
+ 	{
+ 		List *addPathkeys, *addMergeclauses;
+ 
+ 		*outersortkeys = pathkeys;
+ 
+ 		if (!mergeclauses)
+ 			return mergeclauses;
+ 
+ 		k = 0;
+ 		foreach(i, restrictinfos)
+ 		{
+ 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(i);
+ 			if (!used[k])
+ 				unusedRestrictinfos = lappend(unusedRestrictinfos, rinfo);
+ 			k++;
+ 		}
+ 
+ 		if (!unusedRestrictinfos)
+ 			return mergeclauses;
+ 
+ 		addPathkeys = select_outer_pathkeys_for_merge(root,
+ 												unusedRestrictinfos, joinrel);
+ 
+ 		if (!addPathkeys)
+ 			return mergeclauses;
+ 
+ 		addMergeclauses = find_mergeclauses_for_pathkeys(root,
+ 				addPathkeys, true, unusedRestrictinfos, NULL, NULL);
+ 
+ 		*outersortkeys = list_concat(usedPathkeys, addPathkeys);
+ 		mergeclauses = list_concat(mergeclauses, addMergeclauses);
+ 
+ 	}
+ 
  	return mergeclauses;
  }
  
*************** right_merge_direction(PlannerInfo *root,
*** 1457,1472 ****
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
  	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
  	{
  		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
  	}
  
  	return 0;					/* path ordering not useful */
--- 1602,1621 ----
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
  {
+ 	int n;
+ 
  	if (root->query_pathkeys == NIL)
  		return 0;				/* no special ordering requested */
  
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	n = pathkeys_common(root->query_pathkeys, pathkeys);
! 
! 	if (n != 0)
  	{
  		/* It's useful ... or at least the first N keys are */
! 		return n;
  	}
  
  	return 0;					/* path ordering not useful */
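
pathkeys_common() above is a longest-common-prefix count over two lists. The following standalone sketch shows the same computation on plain arrays. It uses ints in place of canonical PathKey pointers, and the function name is hypothetical.

```c
#include <stddef.h>

/*
 * Return the length of the longest common prefix of a[0..na) and b[0..nb).
 * In the patch the elements are canonical PathKey pointers, so pointer
 * equality is enough there; ints are used here only for illustration.
 */
static int
common_prefix_len(const int *a, size_t na, const int *b, size_t nb)
{
	size_t		i = 0;

	while (i < na && i < nb && a[i] == b[i])
		i++;

	return (int) i;
}
```

A result equal to the length of the requested key list means the input is fully sorted. A smaller non-zero result is the case the patch newly exploits, and zero means the ordering is of no use.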
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 701fe78..8467e0d
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 149,154 ****
--- 149,155 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
  static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples);
*************** create_merge_append_plan(PlannerInfo *ro
*** 774,779 ****
--- 775,781 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		subplan = create_plan_recurse(root, subpath);
*************** create_merge_append_plan(PlannerInfo *ro
*** 807,814 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
--- 809,818 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
+ 										 pathkeys, n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2181,2189 ****
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									outer_plan,
! 									best_path->outersortkeys,
! 									-1.0);
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
--- 2185,2195 ----
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								outer_plan,
! 								best_path->outersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys));
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2194,2202 ****
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									inner_plan,
! 									best_path->innersortkeys,
! 									-1.0);
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
--- 2200,2210 ----
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								inner_plan,
! 								best_path->innersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys));
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
*************** make_mergejoin(List *tlist,
*** 3736,3741 ****
--- 3744,3750 ----
   */
  static Sort *
  make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples)
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 3745,3751 ****
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 3754,3761 ----
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, pathkeys, skipCols,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 3759,3764 ****
--- 3769,3775 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass
*** 4087,4093 ****
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 4098,4104 ----
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(PlannerInfo *roo
*** 4107,4113 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
--- 4118,4124 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
*************** make_sort_from_sortclauses(PlannerInfo *
*** 4150,4156 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4161,4167 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, NIL, 0,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
*************** Sort *
*** 4172,4178 ****
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 4183,4190 ----
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 List *pathkeys, int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(PlannerInfo *ro
*** 4205,4211 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4217,4223 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 53fc238..4675402
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
*************** build_minmax_path(PlannerInfo *root, Min
*** 494,500 ****
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction);
  	if (!sorted_path)
  		return false;
  
--- 494,502 ----
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction,
! 												  subroot,
! 												  final_rel->rows);
  	if (!sorted_path)
  		return false;
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 1da4b2f..df5563a
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** grouping_planner(PlannerInfo *root, doub
*** 1349,1355 ****
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
--- 1349,1357 ----
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction,
! 													  root,
! 													  path_rows);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
*************** grouping_planner(PlannerInfo *root, doub
*** 1365,1374 ****
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
  
  			if (root->query_pathkeys == NIL ||
! 				pathkeys_contained_in(root->query_pathkeys,
! 									  cheapest_path->pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
--- 1367,1380 ----
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
+ 			Path		partial_sort_path;	/* dummy for result of cost_sort */
+ 			int			n_common_pathkeys;
+ 
+ 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
+ 												cheapest_path->pathkeys);
  
  			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
*************** grouping_planner(PlannerInfo *root, doub
*** 1378,1389 ****
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			if (compare_fractional_path_costs(sorted_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
--- 1384,1418 ----
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
+ 						  n_common_pathkeys,
+ 						  cheapest_path->startup_cost,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
! 												sorted_path->pathkeys);
! 
! 			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
! 			{
! 				/* No sort needed for sorted path */
! 				partial_sort_path.startup_cost = sorted_path->startup_cost;
! 				partial_sort_path.total_cost = sorted_path->total_cost;
! 			}
! 			else
! 			{
! 				/* Figure cost for sorting */
! 				cost_sort(&partial_sort_path, root, root->query_pathkeys,
! 						  n_common_pathkeys,
! 						  sorted_path->startup_cost,
! 						  sorted_path->total_cost,
! 						  path_rows, path_width,
! 						  0.0, work_mem, root->limit_tuples);
! 			}
! 
! 			if (compare_fractional_path_costs(&partial_sort_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
*************** grouping_planner(PlannerInfo *root, doub
*** 1464,1476 ****
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
  			if (parse->groupClause && !use_hashed_grouping &&
! 			  !pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
--- 1493,1508 ----
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
+ 			int			n_common_pathkeys_grouping;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
+ 			n_common_pathkeys_grouping = pathkeys_common(root->group_pathkeys,
+ 														 current_pathkeys);
  			if (parse->groupClause && !use_hashed_grouping &&
! 				n_common_pathkeys_grouping < list_length(root->group_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
*************** grouping_planner(PlannerInfo *root, doub
*** 1564,1570 ****
  							make_sort_from_groupcols(root,
  													 parse->groupClause,
  													 groupColIdx,
! 													 result_plan);
  						current_pathkeys = root->group_pathkeys;
  					}
  					aggstrategy = AGG_SORTED;
--- 1596,1604 ----
  							make_sort_from_groupcols(root,
  													 parse->groupClause,
  													 groupColIdx,
! 													 result_plan,
! 													 root->group_pathkeys,
! 													n_common_pathkeys_grouping);
  						current_pathkeys = root->group_pathkeys;
  					}
  					aggstrategy = AGG_SORTED;
*************** grouping_planner(PlannerInfo *root, doub
*** 1607,1613 ****
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan);
  					current_pathkeys = root->group_pathkeys;
  				}
  
--- 1641,1649 ----
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan,
! 												 root->group_pathkeys,
! 												 n_common_pathkeys_grouping);
  					current_pathkeys = root->group_pathkeys;
  				}
  
*************** grouping_planner(PlannerInfo *root, doub
*** 1724,1736 ****
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0);
! 					if (!pathkeys_contained_in(window_pathkeys,
! 											   current_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
--- 1760,1776 ----
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
+ 					int			n_common_pathkeys;
+ 
+ 					n_common_pathkeys = pathkeys_common(window_pathkeys,
+ 													    current_pathkeys);
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0,
! 														n_common_pathkeys);
! 					if (n_common_pathkeys < list_length(window_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
*************** grouping_planner(PlannerInfo *root, doub
*** 1876,1894 ****
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					current_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					current_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 current_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															current_pathkeys,
! 															   -1.0);
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
--- 1916,1936 ----
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					needed_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					needed_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 needed_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															   needed_pathkeys,
! 															   -1.0,
! 							pathkeys_common(needed_pathkeys, current_pathkeys));
! 				current_pathkeys = needed_pathkeys;
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
*************** grouping_planner(PlannerInfo *root, doub
*** 1904,1915 ****
  	 */
  	if (parse->sortClause)
  	{
! 		if (!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
--- 1946,1960 ----
  	 */
  	if (parse->sortClause)
  	{
! 		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
! 		
! 		if (common < list_length(root->sort_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples,
! 														   common);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
*************** choose_hashed_grouping(PlannerInfo *root
*** 2654,2659 ****
--- 2699,2705 ----
  	List	   *current_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * Executor doesn't support hashed aggregation with DISTINCT or ORDER BY
*************** choose_hashed_grouping(PlannerInfo *root
*** 2735,2741 ****
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2781,2788 ----
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_grouping(PlannerInfo *root
*** 2751,2759 ****
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 	if (!pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
--- 2798,2809 ----
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 
! 	n_common_pathkeys = pathkeys_common(root->group_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(root->group_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
*************** choose_hashed_grouping(PlannerInfo *root
*** 2768,2777 ****
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
  	/* The Agg or Group node will preserve ordering */
! 	if (target_pathkeys &&
! 		!pathkeys_contained_in(target_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2818,2829 ----
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
+ 
  	/* The Agg or Group node will preserve ordering */
! 	n_common_pathkeys = pathkeys_common(target_pathkeys, current_pathkeys);
! 	if (target_pathkeys && n_common_pathkeys < list_length(target_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 2824,2829 ****
--- 2876,2882 ----
  	List	   *needed_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * If we have a sortable DISTINCT ON clause, we always use sorting. This
*************** choose_hashed_distinct(PlannerInfo *root
*** 2889,2895 ****
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2942,2949 ----
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 2906,2928 ****
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 	if (!pathkeys_contained_in(needed_pathkeys, current_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
  	if (parse->sortClause &&
! 		!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2960,2989 ----
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 
! 	n_common_pathkeys = pathkeys_common(needed_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(needed_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
+ 
+ 
+ 	n_common_pathkeys = pathkeys_common(root->sort_pathkeys, current_pathkeys);
  	if (parse->sortClause &&
! 		n_common_pathkeys < list_length(root->sort_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** plan_cluster_use_sort(Oid tableOid, Oid
*** 3712,3719 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 3773,3781 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index e249628..b0b5471
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 859,865 ****
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 859,866 ----
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index a7169ef..3d0a842
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** create_merge_append_path(PlannerInfo *ro
*** 971,980 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 971,981 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 988,993 ****
--- 989,996 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->parent->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1343,1349 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
--- 1346,1353 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 52f05e6..8983251
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** tuplesort_end(Tuplesortstate *state)
*** 960,965 ****
--- 960,984 ----
  	MemoryContextDelete(state->sortcontext);
  }
  
+ void
+ tuplesort_reset(Tuplesortstate *state)
+ {
+ 	int i;
+ 
+ 	if (state->tapeset)
+ 		LogicalTapeSetClose(state->tapeset);
+ 
+ 	for (i = 0; i < state->memtupcount; i++)
+ 		free_sort_tuple(state, state->memtuples + i);
+ 
+ 	state->status = TSS_INITIAL;
+ 	state->memtupcount = 0;
+ 	state->boundUsed = false;
+ 	state->tapeset = NULL;
+ 	state->currentRun = 0;
+ 	state->result_tape = -1;
+ }
+ 
  /*
   * Grow the memtuples[] array, if possible within our memory constraint.  We
   * must not exceed INT_MAX tuples in memory or the caller-provided memory
*************** free_sort_tuple(Tuplesortstate *state, S
*** 3525,3527 ****
--- 3544,3553 ----
  	FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
  	pfree(stup->tuple);
  }
+ 
+ SortSupport
+ tuplesort_get_sortkeys(Tuplesortstate *state)
+ {
+ 	return state->sortKeys;
+ }
+ 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 2a7b36e..76aab79
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct SortState
*** 1664,1671 ****
--- 1664,1673 ----
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
+ 	bool		finished;
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	HeapTuple	prev;
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 101e22c..28b871e
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 582,587 ****
--- 582,588 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;		/* number of leading sort-key columns already sorted in input */
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 444ab74..e98fb0c
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 88,95 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 88,96 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index dfe3a22..2b3313b
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 148,160 ****
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--- 148,163 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
*************** extern void update_mergeclause_eclasses(
*** 176,182 ****
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos);
  extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
  								List *mergeclauses,
  								RelOptInfo *joinrel);
--- 179,187 ----
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos,
! 							   RelOptInfo *joinrel,
! 							   List **outerpathkeys);
  extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
  								List *mergeclauses,
  								RelOptInfo *joinrel);
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
new file mode 100644
index ba7ae7c..d33c615
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*************** extern RecursiveUnion *make_recursive_un
*** 50,60 ****
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
--- 50,61 ----
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples, int skipCols);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5f87254..d5bc45e
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
***************
*** 24,29 ****
--- 24,30 ----
  #include "executor/tuptable.h"
  #include "fmgr.h"
  #include "utils/relcache.h"
+ #include "utils/sortsupport.h"
  
  
  /* Tuplesortstate is an opaque type whose details are not known outside
*************** extern bool tuplesort_skiptuples(Tupleso
*** 104,109 ****
--- 105,112 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
*************** extern void tuplesort_get_stats(Tuplesor
*** 111,116 ****
--- 114,121 ----
  
  extern int	tuplesort_merge_order(int64 allowedMem);
  
+ extern SortSupport tuplesort_get_sortkeys(Tuplesortstate *state);
+ 
  /*
   * These routines may only be called if randomAccess was specified 'true'.
   * Likewise, backwards scan in gettuple/getdatum is only allowed if
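
[Editor's illustration, not part of either attachment.] The planner changes above replace each pathkeys_contained_in() test with a count of common leading pathkeys, and pass that count to the Sort node as skipCols. The idea can be sketched outside PostgreSQL as follows. This is only a minimal sketch under stated assumptions: Row, common_prefix_len(), cmp_v2() and partial_sort() are hypothetical names invented for illustration, key lists are modelled as plain int arrays rather than PostgreSQL List structures, and the input is assumed to be already ordered by v1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row type: v1 is the presorted key, v2 the remaining key. */
typedef struct Row
{
	int		v1;
	double	v2;
} Row;

/*
 * Length of the common prefix of two key lists, modelled as int arrays.
 * This only mimics the role pathkeys_common() plays in the patch; it is
 * not the patch's implementation.
 */
static int
common_prefix_len(const int *want, int nwant, const int *have, int nhave)
{
	int		n = 0;

	while (n < nwant && n < nhave && want[n] == have[n])
		n++;
	return n;
}

/* qsort comparator that looks at the remaining key only. */
static int
cmp_v2(const void *a, const void *b)
{
	const Row  *ra = (const Row *) a;
	const Row  *rb = (const Row *) b;

	return (ra->v2 > rb->v2) - (ra->v2 < rb->v2);
}

/*
 * Sort rows that are already ordered by v1: find each run of equal v1
 * values and sort only that run by v2.
 */
static void
partial_sort(Row *rows, size_t nrows)
{
	size_t	start = 0;

	assert(rows != NULL || nrows == 0);

	while (start < nrows)
	{
		size_t	end = start + 1;

		/* extend the run while the presorted key stays the same */
		while (end < nrows && rows[end].v1 == rows[start].v1)
			end++;
		qsort(rows + start, end - start, sizeof(Row), cmp_v2);
		start = end;
	}
}
```

A full sort is needed only when the common prefix is shorter than the requested key list. When the prefix is non-empty, each group can be sorted and emitted as soon as it has been read, which is why a LIMIT query over a partially ordered index scan can stop early.
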
partial-sort-4-resetdiff.patch (text/x-patch)
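
[Editor's illustration, not part of the attachment.] This diff makes ExecSort() call tuplesort_reset() on an existing sort state instead of ending it and beginning a new one for every group. The general reset-and-reuse pattern can be sketched as below. It is a simplified sketch under stated assumptions: SortBuf and the sortbuf_* functions are hypothetical stand-ins and do not model the real Tuplesortstate, its tapes, or its memory context.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a sort state that owns a fixed-size buffer. */
typedef struct SortBuf
{
	int	   *items;
	size_t	count;
	size_t	capacity;
} SortBuf;

/* Allocate a state once; returns NULL on allocation failure. */
static SortBuf *
sortbuf_begin(size_t capacity)
{
	SortBuf *buf = malloc(sizeof(SortBuf));

	if (buf == NULL)
		return NULL;
	buf->items = malloc(capacity * sizeof(int));
	if (buf->items == NULL)
	{
		free(buf);
		return NULL;
	}
	buf->count = 0;
	buf->capacity = capacity;
	return buf;
}

/* Append one item; returns 0 on success, -1 if the buffer is full. */
static int
sortbuf_put(SortBuf *buf, int value)
{
	assert(buf != NULL);
	if (buf->count == buf->capacity)
		return -1;
	buf->items[buf->count++] = value;
	return 0;
}

/* Forget the contents but keep the allocation for the next group. */
static void
sortbuf_reset(SortBuf *buf)
{
	assert(buf != NULL);
	buf->count = 0;
}

/* Release everything; called once, after the last group. */
static void
sortbuf_end(SortBuf *buf)
{
	if (buf != NULL)
	{
		free(buf->items);
		free(buf);
	}
}
```

With many small groups, resetting avoids paying the setup and teardown cost once per group. The cost is that reset must restore every field a fresh state would have.
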
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 02dcd7a..c25ed7d
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
*************** ExecSort(SortState *node)
*** 120,137 ****
  	tupDesc = ExecGetResultType(outerNode);
  
  	if (node->tuplesortstate != NULL)
! 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
! 	tuplesortstate = tuplesort_begin_heap(tupDesc,
! 										  plannode->numCols,
! 										  plannode->sortColIdx,
! 										  plannode->sortOperators,
! 										  plannode->collations,
! 										  plannode->nullsFirst,
! 										  work_mem,
! 										  node->randomAccess);
! 	if (node->bounded)
! 		tuplesort_set_bound(tuplesortstate, node->bound);
! 	node->tuplesortstate = (void *) tuplesortstate;
  
  	/*
  	 * Put next group of tuples where skipCols" sort values are equal to
--- 120,140 ----
  	tupDesc = ExecGetResultType(outerNode);
  
  	if (node->tuplesortstate != NULL)
! 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
! 	else
! 	{
! 		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols,
! 											  plannode->sortColIdx,
! 											  plannode->sortOperators,
! 											  plannode->collations,
! 											  plannode->nullsFirst,
! 											  work_mem,
! 											  node->randomAccess);
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
! 		node->tuplesortstate = (void *) tuplesortstate;
! 	}
  
  	/*
  	 * Put next group of tuples where skipCols" sort values are equal to
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 6a09138..8983251
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** tuplesort_end(Tuplesortstate *state)
*** 960,965 ****
--- 960,984 ----
  	MemoryContextDelete(state->sortcontext);
  }
  
+ void
+ tuplesort_reset(Tuplesortstate *state)
+ {
+ 	int i;
+ 
+ 	if (state->tapeset)
+ 		LogicalTapeSetClose(state->tapeset);
+ 
+ 	for (i = 0; i < state->memtupcount; i++)
+ 		free_sort_tuple(state, state->memtuples + i);
+ 
+ 	state->status = TSS_INITIAL;
+ 	state->memtupcount = 0;
+ 	state->boundUsed = false;
+ 	state->tapeset = NULL;
+ 	state->currentRun = 0;
+ 	state->result_tape = -1;
+ }
+ 
  /*
   * Grow the memtuples[] array, if possible within our memory constraint.  We
   * must not exceed INT_MAX tuples in memory or the caller-provided memory
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5a65cd2..d5bc45e
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern bool tuplesort_skiptuples(Tupleso
*** 105,110 ****
--- 105,112 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
#22David Rowley
dgrowleyml@gmail.com
In reply to: Andreas Karlsson (#21)
Re: PoC: Partial sort

On Tue, Dec 31, 2013 at 2:41 PM, Andreas Karlsson <andreas@proxel.se> wrote:

On 12/29/2013 08:24 AM, David Rowley wrote:

If it was possible to devise some way to reuse any
previous tuplesortstate perhaps just inventing a reset method which
clears out tuples, then we could see performance exceed the standard
seqscan -> sort. The code the way it is seems to lookup the sort
functions from the syscache for each group then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number
of sort groups would be better so that the optimization is skipped if
there are too many sort groups.

It should be possible. I have hacked a quick proof of concept for reusing
the tuplesort state. Can you try it and see if the performance regression
is fixed by this?

One thing that has to be fixed with my patch is that we probably want to
close the tuplesort once we have returned the last tuple from ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.

Thanks, the attached is about 5 times faster than it was previously with my
test case upthread.

The times now look like:

No pre-sortable index:
Total runtime: 86.278 ms

With pre-sortable index with partial sorting
Total runtime: 47.500 ms

With the query where there is no index the sort remained in memory.
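
The gain from reusing the sort state between groups, rather than tearing it down and rebuilding it for each group, can be illustrated in isolation. The following is a minimal C sketch and not PostgreSQL code: SortBuf and the sortbuf_* functions are hypothetical stand-ins for Tuplesortstate and tuplesort_reset(), and the only thing measured is how often the buffer has to be allocated.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a sort state that owns a growable buffer. */
typedef struct SortBuf
{
	int		   *items;
	size_t		count;
	size_t		capacity;
	long		allocations;	/* number of times the buffer was (re)allocated */
} SortBuf;

void
sortbuf_init(SortBuf *buf)
{
	buf->items = NULL;
	buf->count = 0;
	buf->capacity = 0;
	buf->allocations = 0;
}

/* Append one item, doubling the buffer when it is full. */
void
sortbuf_put(SortBuf *buf, int item)
{
	if (buf->count == buf->capacity)
	{
		size_t		newcap = buf->capacity ? buf->capacity * 2 : 8;
		int		   *p = realloc(buf->items, newcap * sizeof(int));

		assert(p != NULL);
		buf->items = p;
		buf->capacity = newcap;
		buf->allocations++;
	}
	buf->items[buf->count++] = item;
}

/* Forget the contents but keep the memory for the next group. */
void
sortbuf_reset(SortBuf *buf)
{
	buf->count = 0;
}

/* Release the memory; the state can be reused after this. */
void
sortbuf_free(SortBuf *buf)
{
	free(buf->items);
	sortbuf_init(buf);
}
```

Calling sortbuf_reset() before each group keeps the allocation count independent of the number of groups, whereas a free/init pair per group would allocate at least once per group.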

I spent some time trying to find a case where the partial sort is slower
than the seqscan -> sort. The only places partial sort seems slower are
when the estimated number of sort groups is around the crossover point
where the planner would start to consider performing a seqscan -> sort
instead. I'm thinking right now that it's not worth raising the costs
around this, as the partial sort is less likely to become a disk sort than
the full sort is.

I'll keep going with trying to break it.

Regards

David Rowley


#23Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Rowley (#18)
Re: PoC: Partial sort

David Rowley wrote:

I was about to test it tonight, but I'm having trouble getting the patch to
compile... I'm really wondering which compiler you are using as it seems
you're declaring your variables in some strange places.. See nodeSort.c
line 101. These variables are declared after there has been an if statement
in the same scope. That's not valid in C. (The patch did however apply
without any complaints).

AFAIR C99 allows mixed declarations and code. Visual Studio only
implements C89 though, which is why it fails to compile there.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Andreas Karlsson
andreas@proxel.se
In reply to: Alexander Korotkov (#19)
Re: PoC: Partial sort

On 12/28/2013 04:51 PM, Alexander Korotkov wrote:

I've compiled it with clang. Yeah, there were mixed declarations. I've
rechecked it with gcc; now it gives no warnings. I didn't try it with
Visual Studio, but I hope it will be OK.

I looked at this version of the patch and noticed a possibility for
improvement. You could decrement the bound for the tuplesort after every
completed sort. Otherwise the optimizations for small limits won't apply
to partial sort.

--
Andreas Karlsson


#25Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andreas Karlsson (#21)
1 attachment(s)
Re: PoC: Partial sort

On Tue, Dec 31, 2013 at 5:41 AM, Andreas Karlsson <andreas@proxel.se> wrote:

On 12/29/2013 08:24 AM, David Rowley wrote:

If it was possible to devise some way to reuse any
previous tuplesortstate perhaps just inventing a reset method which
clears out tuples, then we could see performance exceed the standard
seqscan -> sort. The code the way it is seems to lookup the sort
functions from the syscache for each group then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number
of sort groups would be better so that the optimization is skipped if
there are too many sort groups.

It should be possible. I have hacked a quick proof of concept for reusing
the tuplesort state. Can you try it and see if the performance regression
is fixed by this?

One thing that has to be fixed with my patch is that we probably want to
close the tuplesort once we have returned the last tuple from ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.

Thanks. It's included in the attached version of the patch, as well as
estimation improvements, more comments and a regression test fix.

------
With best regards,
Alexander Korotkov.

Attachments:

partial-sort-5.patch.gzapplication/x-gzip; name=partial-sort-5.patch.gzDownload
#26Marti Raudsepp
marti@juffo.org
In reply to: Alexander Korotkov (#25)
1 attachment(s)
Re: PoC: Partial sort

Hi Alexander,

First, thanks a lot for working on this feature. This PostgreSQL
shortcoming crops up all the time in web applications that implement
paging by multiple sorted columns.

I've been trying it out in a few situations. I implemented a new
enable_partialsort GUC to make it easier to turn on/off, this way it's a
lot easier to test. The attached patch applies on top of
partial-sort-5.patch

I will spend more time reviewing the patch, but some of this planner code
is over my head. If there's any way I can help to make sure this lands in
the next version, let me know.

----

The patch performs just as well as I would expect it to:

marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 9.830 ms
marti=# set enable_partialsort = off;
marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 1442.815 ms

A difference of almost 150x!

There's a missed opportunity in that the code doesn't consider pushing new
Sort steps into subplans. For example, if there's no index on
language(name) then this query cannot take advantage of partial sorts:

marti=# explain select l.name, r.name from language l join release r on (l.id=r.language) order by l.name, r.name limit 1000;
Limit  (cost=123203.20..123205.70 rows=1000 width=32)
  ->  Sort  (cost=123203.20..126154.27 rows=1180430 width=32)
        Sort Key: l.name, r.name
        ->  Hash Join  (cost=229.47..58481.49 rows=1180430 width=32)
              Hash Cond: (r.language = l.id)
              ->  Seq Scan on release r  (cost=0.00..31040.10 rows=1232610 width=26)
              ->  Hash  (cost=131.43..131.43 rows=7843 width=14)
                    ->  Seq Scan on language l  (cost=0.00..131.43 rows=7843 width=14)

But because there are so few languages, it would be a lot faster to
sort languages in advance and then do a partial sort:
Limit  (rows=1000 width=31)
  ->  Partial sort  (rows=1180881 width=31)
        Sort Key: l.name, r.name
        Presorted Key: l.name
        ->  Nested Loop  (rows=1180881 width=31)
              ->  Sort  (rows=7843 width=10)
                    Sort Key: name
                    ->  Seq Scan on language  (rows=7843 width=14)
              ->  Index Scan using release_language_idx on release r  (rows=11246 width=25)
                    Index Cond: (language = l.id)

Even an explicit sorted CTE cannot take advantage of partial sorts:
marti=# explain with sorted_lang as (select id, name from language order by name)
marti-# select l.name, r.name from sorted_lang l join release r on (l.id=r.language) order by l.name, r.name limit 1000;
Limit  (cost=3324368.83..3324371.33 rows=1000 width=240)
  CTE sorted_lang
    ->  Sort  (cost=638.76..658.37 rows=7843 width=14)
          Sort Key: language.name
          ->  Seq Scan on language  (cost=0.00..131.43 rows=7843 width=14)
  ->  Sort  (cost=3323710.46..3439436.82 rows=46290543 width=240)
        Sort Key: l.name, r.name
        ->  Merge Join  (cost=664.62..785649.92 rows=46290543 width=240)
              Merge Cond: (r.language = l.id)
              ->  Index Scan using release_language_idx on release r  (cost=0.43..87546.06 rows=1232610 width=26)
              ->  Sort  (cost=664.19..683.80 rows=7843 width=222)
                    Sort Key: l.id
                    ->  CTE Scan on sorted_lang l  (cost=0.00..156.86 rows=7843 width=222)

But even with these limitations, this will easily be the killer feature of
the next release, for me at least.

Regards,
Marti


Attachments:

0001-Add-enable_partialsort-GUC-for-disabling-partial-sor.patchtext/x-patch; charset=US-ASCII; name=0001-Add-enable_partialsort-GUC-for-disabling-partial-sor.patchDownload
From 3f05447e7feb99583336b381df60ff013a144bab Mon Sep 17 00:00:00 2001
From: Marti Raudsepp <marti@juffo.org>
Date: Mon, 13 Jan 2014 22:24:20 +0200
Subject: [PATCH] Add enable_partialsort GUC for disabling partial sorts

---
 doc/src/sgml/config.sgml                      | 13 +++++++++++++
 src/backend/optimizer/path/costsize.c         |  3 ++-
 src/backend/optimizer/path/pathkeys.c         |  1 +
 src/backend/utils/misc/guc.c                  | 10 ++++++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/optimizer/cost.h                  |  1 +
 6 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0f2f2bf..1995625 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2808,6 +2808,19 @@ include 'filename'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-partialsort" xreflabel="enable_partialsort">
+      <term><varname>enable_partialsort</varname> (<type>boolean</type>)</term>
+      <indexterm>
+       <primary><varname>enable_partialsort</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of partial sort steps.
+        The default is <literal>on</>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-tidscan" xreflabel="enable_tidscan">
       <term><varname>enable_tidscan</varname> (<type>boolean</type>)</term>
       <indexterm>
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index da64825..cefd480 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -112,6 +112,7 @@ bool		enable_indexonlyscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
+bool		enable_partialsort = true;
 bool		enable_hashagg = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
@@ -1329,7 +1330,7 @@ cost_sort(Path *path, PlannerInfo *root,
 	/*
 	 * Estimate number of groups which dataset is divided by presorted keys.
 	 */
-	if (presorted_keys > 0)
+	if (presorted_keys > 0 && enable_partialsort)
 	{
 		List *groupExprs = NIL;
 		ListCell *l;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 55d8ef4..d5a1357 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -22,6 +22,7 @@
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/clauses.h"
+#include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1217098..c3f2f29 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1,3 +1,4 @@
+
 /*--------------------------------------------------------------------
  * guc.c
  *
@@ -724,6 +725,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_partialsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of partial sort steps."),
+			NULL
+		},
+		&enable_partialsort,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 27791cc..20072fb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -270,6 +270,7 @@
 #enable_nestloop = on
 #enable_seqscan = on
 #enable_sort = on
+#enable_partialsort = on
 #enable_tidscan = on
 
 # - Planner Cost Constants -
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 47aef12..30203c7 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -56,6 +56,7 @@ extern bool enable_indexonlyscan;
 extern bool enable_bitmapscan;
 extern bool enable_tidscan;
 extern bool enable_sort;
+extern bool enable_partialsort;
 extern bool enable_hashagg;
 extern bool enable_nestloop;
 extern bool enable_material;
-- 
1.8.5.2

#27Alexander Korotkov
aekorotkov@gmail.com
In reply to: Marti Raudsepp (#26)
Re: PoC: Partial sort

Hi!

On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp <marti@juffo.org> wrote:

First, thanks a lot for working on this feature. This PostgreSQL
shortcoming crops up all the time in web applications that implement
paging by multiple sorted columns.

Thanks!

I've been trying it out in a few situations. I implemented a new
enable_partialsort GUC to make it easier to turn on/off, this way it's a
lot easier to test. The attached patch applies on top of
partial-sort-5.patch

I thought about such an option. Generally not because of testing
convenience, but because of the overhead of planning. The way you
implemented it is quite naive :) For instance, merge join relies on
partial sort, which will be replaced with a simple sort.

I will spend more time reviewing the patch, but some of this planner code
is over my head. If there's any way I can help to make sure this lands in
the next version, let me know.

----

The patch performs just as well as I would expect it to:

marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 9.830 ms
marti=# set enable_partialsort = off;
marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 1442.815 ms

A difference of almost 150x!

There's a missed opportunity in that the code doesn't consider pushing new
Sort steps into subplans. For example, if there's no index on
language(name) then this query cannot take advantage of partial sorts:

marti=# explain select l.name, r.name from language l join release r on (l.id=r.language) order by l.name, r.name limit 1000;
Limit  (cost=123203.20..123205.70 rows=1000 width=32)
  ->  Sort  (cost=123203.20..126154.27 rows=1180430 width=32)
        Sort Key: l.name, r.name
        ->  Hash Join  (cost=229.47..58481.49 rows=1180430 width=32)
              Hash Cond: (r.language = l.id)
              ->  Seq Scan on release r  (cost=0.00..31040.10 rows=1232610 width=26)
              ->  Hash  (cost=131.43..131.43 rows=7843 width=14)
                    ->  Seq Scan on language l  (cost=0.00..131.43 rows=7843 width=14)

But because there are so few languages, it would be a lot faster to
sort languages in advance and then do a partial sort:
Limit  (rows=1000 width=31)
  ->  Partial sort  (rows=1180881 width=31)
        Sort Key: l.name, r.name
        Presorted Key: l.name
        ->  Nested Loop  (rows=1180881 width=31)
              ->  Sort  (rows=7843 width=10)
                    Sort Key: name
                    ->  Seq Scan on language  (rows=7843 width=14)
              ->  Index Scan using release_language_idx on release r  (rows=11246 width=25)
                    Index Cond: (language = l.id)

Even an explicit sorted CTE cannot take advantage of partial sorts:
marti=# explain with sorted_lang as (select id, name from language order by name)
marti-# select l.name, r.name from sorted_lang l join release r on (l.id=r.language) order by l.name, r.name limit 1000;
Limit  (cost=3324368.83..3324371.33 rows=1000 width=240)
  CTE sorted_lang
    ->  Sort  (cost=638.76..658.37 rows=7843 width=14)
          Sort Key: language.name
          ->  Seq Scan on language  (cost=0.00..131.43 rows=7843 width=14)
  ->  Sort  (cost=3323710.46..3439436.82 rows=46290543 width=240)
        Sort Key: l.name, r.name
        ->  Merge Join  (cost=664.62..785649.92 rows=46290543 width=240)
              Merge Cond: (r.language = l.id)
              ->  Index Scan using release_language_idx on release r  (cost=0.43..87546.06 rows=1232610 width=26)
              ->  Sort  (cost=664.19..683.80 rows=7843 width=222)
                    Sort Key: l.id
                    ->  CTE Scan on sorted_lang l  (cost=0.00..156.86 rows=7843 width=222)

But even with these limitations, this will easily be the killer feature of
the next release, for me at least.

I see. But I don't think it can be achieved by small changes in the
planner. Moreover, I didn't check, but I think that even if you remove the
ordering by r.name you will still not get the languages sorted in the
inner node. So this problem is not directly related to partial sort.

------
With best regards,
Alexander Korotkov.

#28Marti Raudsepp
marti@juffo.org
In reply to: Alexander Korotkov (#27)
Re: PoC: Partial sort

On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

I implemented a new
enable_partialsort GUC to make it easier to turn on/off

I thought about such an option. Generally not because of testing
convenience, but because of the overhead of planning. The way you
implemented it is quite naive :) For instance, merge join relies on
partial sort, which will be replaced with a simple sort.

Oh, this actually highlights a performance regression with the partial sort
patch. I assumed the planner will discard the full sort because of higher
costs, but it looks like the new code always assumes that a Partial sort
will be cheaper than a Join Filter without considering costs. When doing a
join USING (unique_indexed_value, something), the new plan is significantly
worse.

Unpatched:
marti=# explain analyze select * from release a join release b using (id, name);
Merge Join  (cost=0.85..179810.75 rows=12 width=158) (actual time=0.011..1279.596 rows=1232610 loops=1)
  Merge Cond: (a.id = b.id)
  Join Filter: ((a.name)::text = (b.name)::text)
  ->  Index Scan using release_id_idx on release a  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.005..211.928 rows=1232610 loops=1)
  ->  Index Scan using release_id_idx on release b  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.004..371.592 rows=1232610 loops=1)
Total runtime: 1309.049 ms

Patched:
Merge Join  (cost=0.98..179810.87 rows=12 width=158) (actual time=0.037..5034.158 rows=1232610 loops=1)
  Merge Cond: ((a.id = b.id) AND ((a.name)::text = (b.name)::text))
  ->  Partial sort  (cost=0.49..82201.56 rows=1232610 width=92) (actual time=0.013..955.938 rows=1232610 loops=1)
        Sort Key: a.id, a.name
        Presorted Key: a.id
        Sort Method: quicksort  Memory: 25kB
        ->  Index Scan using release_id_idx on release a  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.007..449.332 rows=1232610 loops=1)
  ->  Materialize  (cost=0.49..85283.09 rows=1232610 width=92) (actual time=0.019..1352.377 rows=1232610 loops=1)
        ->  Partial sort  (cost=0.49..82201.56 rows=1232610 width=92) (actual time=0.018..1223.251 rows=1232610 loops=1)
              Sort Key: b.id, b.name
              Presorted Key: b.id
              Sort Method: quicksort  Memory: 25kB
              ->  Index Scan using release_id_idx on release b  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.004..597.258 rows=1232610 loops=1)
Total runtime: 5166.906 ms
----

There's another "wishlist" kind of thing with top-N heapsort bounds; if I
do a query with LIMIT 1000 then every sort batch has Tuplesortstate.bound
set to 1000, but it could be reduced after each batch. If the first batch
is 900 rows then the 2nd batch only needs the top 100 rows at most.
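
The arithmetic behind this can be sketched in a few lines of C. This is only an illustration under stated assumptions, not code from the patch: batch_bounds() is a hypothetical helper, and it assumes that each batch emits min(group size, remaining limit) rows before the next batch starts.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given the overall LIMIT and the sizes of the
 * presorted groups, store in bounds[] the bound each batch would need,
 * and return the number of batches that have to be sorted at all.
 */
size_t
batch_bounds(long limit, const long *group_sizes, size_t ngroups, long *bounds)
{
	long		remaining = limit;
	size_t		i;

	assert(limit >= 0);
	for (i = 0; i < ngroups && remaining > 0; i++)
	{
		long		emitted;

		/* This batch never needs more than what is left of the limit. */
		bounds[i] = remaining;
		emitted = group_sizes[i] < remaining ? group_sizes[i] : remaining;
		remaining -= emitted;
	}
	return i;
}
```

With a limit of 1000 and a first group of 900 rows, the second batch gets a bound of 100, and any later groups are never sorted.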

Also, I find the name "partial sort" a bit confusing; this feature is not
actually sorting *partially*, it's finishing the sort of partially-sorted
data. Perhaps "batched sort" would explain the feature better? Because it
does the sort in multiple batches instead of all at once. But maybe that's
just me.

Regards,
Marti

#29Alexander Korotkov
aekorotkov@gmail.com
In reply to: Marti Raudsepp (#28)
Re: PoC: Partial sort

On Tue, Jan 14, 2014 at 11:16 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

I implemented a new
enable_partialsort GUC to make it easier to turn on/off

I thought about such an option. Generally not because of testing
convenience, but because of the overhead of planning. The way you
implemented it is quite naive :) For instance, merge join relies on
partial sort, which will be replaced with a simple sort.

Oh, this actually highlights a performance regression with the partial
sort patch. I assumed the planner will discard the full sort because of
higher costs, but it looks like the new code always assumes that a Partial
sort will be cheaper than a Join Filter without considering costs. When
doing a join USING (unique_indexed_value, something), the new plan is
significantly worse.

Unpatched:
marti=# explain analyze select * from release a join release b using (id, name);
Merge Join  (cost=0.85..179810.75 rows=12 width=158) (actual time=0.011..1279.596 rows=1232610 loops=1)
  Merge Cond: (a.id = b.id)
  Join Filter: ((a.name)::text = (b.name)::text)
  ->  Index Scan using release_id_idx on release a  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.005..211.928 rows=1232610 loops=1)
  ->  Index Scan using release_id_idx on release b  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.004..371.592 rows=1232610 loops=1)
Total runtime: 1309.049 ms

Patched:
Merge Join  (cost=0.98..179810.87 rows=12 width=158) (actual time=0.037..5034.158 rows=1232610 loops=1)
  Merge Cond: ((a.id = b.id) AND ((a.name)::text = (b.name)::text))
  ->  Partial sort  (cost=0.49..82201.56 rows=1232610 width=92) (actual time=0.013..955.938 rows=1232610 loops=1)
        Sort Key: a.id, a.name
        Presorted Key: a.id
        Sort Method: quicksort  Memory: 25kB
        ->  Index Scan using release_id_idx on release a  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.007..449.332 rows=1232610 loops=1)
  ->  Materialize  (cost=0.49..85283.09 rows=1232610 width=92) (actual time=0.019..1352.377 rows=1232610 loops=1)
        ->  Partial sort  (cost=0.49..82201.56 rows=1232610 width=92) (actual time=0.018..1223.251 rows=1232610 loops=1)
              Sort Key: b.id, b.name
              Presorted Key: b.id
              Sort Method: quicksort  Memory: 25kB
              ->  Index Scan using release_id_idx on release b  (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.004..597.258 rows=1232610 loops=1)
Total runtime: 5166.906 ms
----

Interesting. Could you share the dataset?

There's another "wishlist" kind of thing with top-N heapsort bounds; if I
do a query with LIMIT 1000 then every sort batch has Tuplesortstate.bound
set to 1000, but it could be reduced after each batch. If the first batch
is 900 rows then the 2nd batch only needs the top 100 rows at most.

Right. Just didn't implement it yet.

Also, I find the name "partial sort" a bit confusing; this feature is not
actually sorting *partially*, it's finishing the sort of partially-sorted
data. Perhaps "batched sort" would explain the feature better? Because it
does the sort in multiple batches instead of all at once. But maybe that's
just me.

I'm not sure. To me, "batched sort" sounds like we're going to sort in a
batch something that we sorted separately before. Probably I'm wrong,
because I'm far from a native English speaker :)

------
With best regards,
Alexander Korotkov.

#30Marti Raudsepp
marti@juffo.org
In reply to: Alexander Korotkov (#29)
Re: PoC: Partial sort

On Tue, Jan 14, 2014 at 9:28 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Jan 14, 2014 at 11:16 PM, Marti Raudsepp <marti@juffo.org> wrote:

Oh, this actually highlights a performance regression with the partial
sort patch.

Interesting. Could you share the dataset?

It occurs with many datasets if work_mem is sufficiently low (10MB in my
case). Here's a quicker way to reproduce a similar issue:

create table foo as select i, i as j from generate_series(1,10000000) i;
create index on foo(i);
explain analyze select * from foo a join foo b using (i, j);

The real data is from the "release" table from MusicBrainz database dump:
https://musicbrainz.org/doc/MusicBrainz_Database/Download . It's nontrivial
to set up though, so if you still need the real data, I can upload a pgdump
for you.

Regards,
Marti

#31Jeremy Harris
jgh@wizmail.org
In reply to: Alexander Korotkov (#15)
Re: PoC: Partial sort

On 22/12/13 20:26, Alexander Korotkov wrote:

On Sat, Dec 14, 2013 at 6:30 PM, Jeremy Harris <jgh@wizmail.org> wrote:

On 14/12/13 12:54, Andres Freund wrote:

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

Eg: /messages/by-id/5291467E.6070807@wizmail.org

Maybe Alexander and I should bash our heads together.

Partial sort patch is mostly optimizer/executor improvement rather than
improvement of sort algorithm itself.

I finally got as far as understanding Alexander's cleverness, and it
does make the performance advantage (on partially-sorted input) of the
merge-sort irrelevant.

There's a slight tradeoff possible between the code complexity of
the chunking code front-ending the sorter and just using the
enhanced sorter. The chunking does reduce the peak memory usage
quite nicely too.

The implementation of the chunker does O(n) compares using the
keys of the feed-stream index, to identify the chunk boundaries.
Would it be possible to get this information from the Index Scan?
--
Cheers,
Jeremy
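
The chunking described above, with its O(n) boundary comparisons on the presorted key, can be sketched as follows. This is a simplified C illustration rather than the patch's executor code: the Row type, its two integer columns and chunked_sort() are all hypothetical, and qsort() stands in for tuplesort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row: "a" arrives in order from the input, "b" does not. */
typedef struct Row
{
	int			a;
	int			b;
} Row;

/* Compare only the column that is not presorted. */
static int
cmp_b(const void *x, const void *y)
{
	const Row  *l = (const Row *) x;
	const Row  *r = (const Row *) y;

	return (l->b > r->b) - (l->b < r->b);
}

/*
 * Sort rows by (a, b), given that they already arrive ordered by a.
 * Each run of equal "a" values is sorted on its own.  Returns the number
 * of groups (chunks) found.
 */
size_t
chunked_sort(Row *rows, size_t n)
{
	size_t		start = 0;
	size_t		groups = 0;
	size_t		i;

	for (i = 1; i <= n; i++)
	{
		/* The input must really be presorted on "a". */
		if (i < n)
			assert(rows[i - 1].a <= rows[i].a);

		/* One comparison per row finds the chunk boundary. */
		if (i == n || rows[i].a != rows[start].a)
		{
			qsort(rows + start, i - start, sizeof(Row), cmp_b);
			groups++;
			start = i;
		}
	}
	return groups;
}
```

Only one chunk is held by the sorter at a time, which is where the lower peak memory usage comes from; the boundary test is the per-row comparison that the question about the Index Scan is aimed at.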


#32Jeremy Harris
jgh@wizmail.org
In reply to: Alexander Korotkov (#25)
Re: PoC: Partial sort

On 13/01/14 18:01, Alexander Korotkov wrote:

Thanks. It's included in the attached version of the patch, as well as
estimation improvements, more comments and a regression test fix.

Would it be possible to totally separate the two sets of sort-keys,
only giving the non-index set to the tuplesort? At present tuplesort
will, when it has a group to sort, make wasted compares on the
indexed set of keys before starting on the non-indexed set.
--
Cheers,
Jeremy


#33Marti Raudsepp
marti@juffo.org
In reply to: Alexander Korotkov (#25)
1 attachment(s)
Re: PoC: Partial sort

Hi,

There's another small regression with this patch when used with expensive
comparison functions, such as long text fields.

If we go through all this trouble in cmpSortSkipCols to prove that the
first N sortkeys are equal, it would be nice if Tuplesort could skip their
comparisons entirely; that's another nice optimization this patch can
provide.
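
The idea can be shown with a small comparator that starts at the first column not already known to be equal. This is a hedged C sketch, not the attached patch: Tuple, NCOLS and compare_tuples() are hypothetical, integer columns stand in for expensive keys such as long text, and the counter exists only to make the saved work visible.

```c
#include <assert.h>
#include <stddef.h>

#define NCOLS 3

/* Hypothetical fixed-width tuple of integer sort keys. */
typedef struct Tuple
{
	int			cols[NCOLS];
} Tuple;

/*
 * Compare two tuples column by column, starting at column "skip_cols".
 * Columns before skip_cols are assumed equal within the current group.
 * *ncompares is incremented once per column actually compared.
 */
int
compare_tuples(const Tuple *l, const Tuple *r, int skip_cols, long *ncompares)
{
	int			i;

	assert(skip_cols >= 0 && skip_cols <= NCOLS);
	for (i = skip_cols; i < NCOLS; i++)
	{
		(*ncompares)++;
		if (l->cols[i] != r->cols[i])
			return l->cols[i] < r->cols[i] ? -1 : 1;
	}
	return 0;
}
```

For two tuples that agree on the first two columns, a comparison with skip_cols = 2 touches one column instead of three and still returns the same ordering.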

I've implemented that in the attached patch, which applies on top of your
partial-sort-5.patch

Should the "Sort Key" field in EXPLAIN output be changed as well? I'd say
no, I think that makes the partial sort steps harder to read.

Generate test data:
create table longtext as select (select repeat('a', 1000*100)) a,
generate_series(1,1000) i;
create index on longtext(a);

Unpatched (using your original partial-sort-5.patch):
=# explain analyze select * from longtext order by a, i limit 10;
Limit  (cost=2.34..19.26 rows=10 width=1160) (actual time=13477.739..13477.756 rows=10 loops=1)
  ->  Partial sort  (cost=2.34..1694.15 rows=1000 width=1160) (actual time=13477.737..13477.742 rows=10 loops=1)
        Sort Key: a, i
        Presorted Key: a
        Sort Method: top-N heapsort  Memory: 45kB
        ->  Index Scan using longtext_a_idx on longtext  (cost=0.65..1691.65 rows=1000 width=1160) (actual time=0.015..2.364 rows=1000 loops=1)
Total runtime: 13478.158 ms
(7 rows)

=# set enable_indexscan=off;
=# explain analyze select * from longtext order by a, i limit 10;
Limit  (cost=198.61..198.63 rows=10 width=1160) (actual time=6970.439..6970.458 rows=10 loops=1)
  ->  Sort  (cost=198.61..201.11 rows=1000 width=1160) (actual time=6970.438..6970.444 rows=10 loops=1)
        Sort Key: a, i
        Sort Method: top-N heapsort  Memory: 45kB
        ->  Seq Scan on longtext  (cost=0.00..177.00 rows=1000 width=1160) (actual time=0.007..1.763 rows=1000 loops=1)
Total runtime: 6970.491 ms

Patched:
=# explain analyze select * from longtext order by a, i ;
Partial sort  (cost=2.34..1694.15 rows=1000 width=1160) (actual time=0.024..4.603 rows=1000 loops=1)
  Sort Key: a, i
  Presorted Key: a
  Sort Method: quicksort  Memory: 27kB
  ->  Index Scan using longtext_a_idx on longtext  (cost=0.65..1691.65 rows=1000 width=1160) (actual time=0.013..2.094 rows=1000 loops=1)
Total runtime: 5.418 ms

Regards,
Marti

Attachments:

0001-Batched-sort-skip-comparisons-for-known-equal-column.patch (text/x-patch; charset=US-ASCII)
From fbc6c31528018bca64dc54f65e1cd292f8de482a Mon Sep 17 00:00:00 2001
From: Marti Raudsepp <marti@juffo.org>
Date: Sat, 18 Jan 2014 19:16:15 +0200
Subject: [PATCH] Batched sort: skip comparisons for known-equal columns

---
 src/backend/executor/nodeSort.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index cf1f79e..5abda1d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -125,10 +125,10 @@ ExecSort(SortState *node)
 	{
 		tuplesortstate = tuplesort_begin_heap(tupDesc,
 											  plannode->numCols - skipCols,
-											  &(plannode->sortColIdx)[skipCols],
-											  plannode->sortOperators,
-											  plannode->collations,
-											  plannode->nullsFirst,
+											  &(plannode->sortColIdx[skipCols]),
+											  &(plannode->sortOperators[skipCols]),
+											  &(plannode->collations[skipCols]),
+											  &(plannode->nullsFirst[skipCols]),
 											  work_mem,
 											  node->randomAccess);
 		if (node->bounded)
-- 
1.8.5.3

#34Marti Raudsepp
marti@juffo.org
In reply to: Jeremy Harris (#32)
Re: PoC: Partial sort

Funny, I just wrote a patch to do that some minutes ago (didn't see your
email yet).

/messages/by-id/CABRT9RCK=wmFUYZdqU_+MOFW5PDevLxJmZ5B=eTJJNUBvyARxw@mail.gmail.com

Regards,
Marti

On Sat, Jan 18, 2014 at 7:10 PM, Jeremy Harris <jgh@wizmail.org> wrote:

On 13/01/14 18:01, Alexander Korotkov wrote:

Thanks. It's included in the attached version of the patch, as well as estimation
improvements, more comments and a regression test fix.

Would it be possible to totally separate the two sets of sort-keys,
only giving the non-index set to the tuplesort? At present tuplesort
will, when it has a group to sort, make wasted compares on the
indexed set of keys before starting on the non-indexed set.
--
Cheers,
Jeremy


#35Jeremy Harris
jgh@wizmail.org
In reply to: Andreas Karlsson (#21)
Re: PoC: Partial sort

On 31/12/13 01:41, Andreas Karlsson wrote:

On 12/29/2013 08:24 AM, David Rowley wrote:

If it was possible to devise some way to reuse any
previous tuplesortstate perhaps just inventing a reset method which
clears out tuples, then we could see performance exceed the standard
seqscan -> sort. The code the way it is seems to lookup the sort
functions from the syscache for each group then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number
of sort groups would be better so that the optimization is skipped if
there are too many sort groups.

It should be possible. I have hacked a quick proof of concept for
reusing the tuplesort state. Can you try it and see if the performance
regression is fixed by this?
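
To illustrate why a reset method can help, here is a small standalone sketch. It is not the tuplesort code; IntSorter and its functions are hypothetical names. The sorter keeps its buffer across groups, so only the first group pays for allocation and later groups just overwrite the same memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reusable sorter for ints; stands in for a sort state. */
typedef struct IntSorter
{
    int          *items;
    size_t        count;
    size_t        capacity;
    unsigned long allocations;  /* number of times the buffer grew */
} IntSorter;

static void
sorter_init(IntSorter *s)
{
    s->items = NULL;
    s->count = 0;
    s->capacity = 0;
    s->allocations = 0;
}

/* Append one value, growing the buffer if needed. Returns 0 on success. */
static int
sorter_add(IntSorter *s, int v)
{
    if (s->count == s->capacity)
    {
        size_t newcap = s->capacity ? s->capacity * 2 : 8;
        int   *p = realloc(s->items, newcap * sizeof(int));

        if (p == NULL)
            return -1;
        s->items = p;
        s->capacity = newcap;
        s->allocations++;
    }
    s->items[s->count++] = v;
    return 0;
}

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

static void
sorter_sort(IntSorter *s)
{
    if (s->count > 1)
        qsort(s->items, s->count, sizeof(int), cmp_int);
}

/* Reset: forget the stored values but keep the allocated buffer. */
static void
sorter_reset(IntSorter *s)
{
    s->count = 0;
}

/* Release the buffer once the last group has been returned. */
static void
sorter_free(IntSorter *s)
{
    free(s->items);
    sorter_init(s);
}
```

With many small groups, calling sorter_reset between groups avoids one allocate/free cycle per group. sorter_free corresponds to closing the sort once the final tuple has been returned.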

One thing which has to be fixed with my patch is that we probably want
to close the tuplesort once we have returned the last tuple from
ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.

How does this work in combination with randomAccess?
--
Thanks,
Jeremy


#36Marti Raudsepp
marti@juffo.org
In reply to: Marti Raudsepp (#33)
1 attachment(s)
Re: PoC: Partial sort

On Sat, Jan 18, 2014 at 7:22 PM, Marti Raudsepp <marti@juffo.org> wrote:

Total runtime: 5.418 ms

Oops, shouldn't have rushed this. Clearly the timings should have
tipped me off that it's broken. I didn't notice that cmpSortSkipCols
was re-using tuplesort's sortkeys.
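
To make the failure mode concrete: the group-boundary check has to look at the presorted columns, while the sorter should only see the remaining ones. If the boundary check borrows the sorter's keys after those have been trimmed to the remaining columns, it presumably ends up comparing the wrong columns. A minimal standalone sketch of keeping the two key sets separate (Pair, same_group, cmp_v2 and partial_sort are hypothetical names, not code from the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Pair
{
    int v1;     /* presorted column (input already arrives ordered by it) */
    int v2;     /* column that still needs sorting */
} Pair;

/* Prefix check: do two rows belong to the same presorted group? */
static int
same_group(const Pair *a, const Pair *b)
{
    return a->v1 == b->v1;
}

/* Suffix comparison: the only key the per-group sort needs. */
static int
cmp_v2(const void *pa, const void *pb)
{
    const Pair *a = (const Pair *) pa;
    const Pair *b = (const Pair *) pb;

    return (a->v2 > b->v2) - (a->v2 < b->v2);
}

/*
 * Given rows already ordered by v1, produce (v1, v2) order by sorting each
 * run of equal v1 values on v2 alone. Returns the number of groups found.
 */
static size_t
partial_sort(Pair *rows, size_t n)
{
    size_t start = 0;
    size_t groups = 0;

    while (start < n)
    {
        size_t end = start + 1;

        /* Boundary detection uses the prefix key only. */
        while (end < n && same_group(&rows[start], &rows[end]))
            end++;

        /* Sorting uses the suffix key only. */
        qsort(&rows[start], end - start, sizeof(Pair), cmp_v2);
        groups++;
        start = end;
    }
    return groups;
}
```

If same_group looked at v2 instead of v1, almost every row would form its own group and nothing would really be sorted, which would also make the run look suspiciously fast.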

Here's a patch that actually works; I added a new skipKeys attribute
to SortState. I had to extract the SortSupport-creation code from
tuplesort_begin_heap to a new function; but that's fine, because it
was already duplicated in ExecInitMergeAppend too.
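
The shape of that refactoring, as a standalone sketch: KeySpec and make_key_specs are hypothetical stand-ins, not PostgreSQL's SortSupportData or the MakeSortSupportKeys function in the attached patch. One helper turns parallel arrays into an array of per-key structs, and callers pick the prefix or the suffix of the key list simply by offsetting the array pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-key descriptor: one struct per sort column. */
typedef struct KeySpec
{
    int      attno;         /* 1-based column number */
    unsigned sort_op;       /* identifier of the ordering operator */
    bool     nulls_first;
} KeySpec;

/*
 * Build an array of nkeys KeySpec entries from parallel arrays.
 * Returns NULL if nkeys <= 0 or allocation fails; the caller frees it.
 */
static KeySpec *
make_key_specs(int nkeys, const int *attnos,
               const unsigned *sort_ops, const bool *nulls_first)
{
    KeySpec *keys;
    int      i;

    if (nkeys <= 0)
        return NULL;

    keys = calloc((size_t) nkeys, sizeof(KeySpec));
    if (keys == NULL)
        return NULL;

    for (i = 0; i < nkeys; i++)
    {
        assert(attnos[i] != 0);
        keys[i].attno = attnos[i];
        keys[i].sort_op = sort_ops[i];
        keys[i].nulls_first = nulls_first[i];
    }
    return keys;
}
```

With skip presorted columns out of n, make_key_specs(skip, attnos, ops, nf) yields the keys for the boundary check and make_key_specs(n - skip, &attnos[skip], &ops[skip], &nf[skip]) yields the keys for the sorter. All parallel arrays have to be offset by the same amount.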

I reverted the addition of tuplesort_get_sortkeys, which is not needed now.

Now the timings are:
Unpatched partial sort: 13478.158 ms
Full sort: 6802.063 ms
Patched partial sort: 6618.962 ms

Regards,
Marti

Attachments:

0001-Partial-sort-skip-comparisons-for-known-equal-column.patch (text/x-patch; charset=US-ASCII)
From 7d9f34c09e7836504725ff11be7e63a2fc438ae9 Mon Sep 17 00:00:00 2001
From: Marti Raudsepp <marti@juffo.org>
Date: Mon, 13 Jan 2014 20:38:45 +0200
Subject: [PATCH] Partial sort: skip comparisons for known-equal columns

---
 src/backend/executor/nodeMergeAppend.c | 18 +++++-------------
 src/backend/executor/nodeSort.c        | 26 +++++++++++++++++---------
 src/backend/utils/sort/sortsupport.c   | 29 +++++++++++++++++++++++++++++
 src/backend/utils/sort/tuplesort.c     | 31 +++++--------------------------
 src/include/nodes/execnodes.h          |  1 +
 src/include/utils/sortsupport.h        |  3 +++
 src/include/utils/tuplesort.h          |  2 --
 7 files changed, 60 insertions(+), 50 deletions(-)

diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 74fa40d..db6ec23 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -126,19 +126,11 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	 * initialize sort-key information
 	 */
 	mergestate->ms_nkeys = node->numCols;
-	mergestate->ms_sortkeys = palloc0(sizeof(SortSupportData) * node->numCols);
-
-	for (i = 0; i < node->numCols; i++)
-	{
-		SortSupport sortKey = mergestate->ms_sortkeys + i;
-
-		sortKey->ssup_cxt = CurrentMemoryContext;
-		sortKey->ssup_collation = node->collations[i];
-		sortKey->ssup_nulls_first = node->nullsFirst[i];
-		sortKey->ssup_attno = node->sortColIdx[i];
-
-		PrepareSortSupportFromOrderingOp(node->sortOperators[i], sortKey);
-	}
+	mergestate->ms_sortkeys = MakeSortSupportKeys(mergestate->ms_nkeys,
+												  node->sortColIdx,
+												  node->sortOperators,
+												  node->collations,
+												  node->nullsFirst);
 
 	/*
 	 * initialize to show we have not run the subplans yet
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 55cdb05..7645645 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -28,20 +28,19 @@ static bool
 cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
 {
 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
-	SortSupport sortKeys = tuplesort_get_sortkeys(node->tuplesortstate);
 
 	for (i = 0; i < n; i++)
 	{
 		Datum datumA, datumB;
 		bool isnullA, isnullB;
-		AttrNumber attno = sortKeys[i].ssup_attno;
+		AttrNumber attno = node->skipKeys[i].ssup_attno;
 
 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
 		datumB = slot_getattr(b, attno, &isnullB);
 
 		if (ApplySortComparator(datumA, isnullA,
-                                  datumB, isnullB,
-                                  &sortKeys[i]))
+								datumB, isnullB,
+								&node->skipKeys[i]))
 			return false;
 	}
 	return true;
@@ -123,12 +122,21 @@ ExecSort(SortState *node)
 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
 	else
 	{
+		/* Support structures for cmpSortSkipCols - already sorted columns */
+		if (skipCols)
+			node->skipKeys = MakeSortSupportKeys(skipCols,
+												 plannode->sortColIdx,
+												 plannode->sortOperators,
+												 plannode->collations,
+												 plannode->nullsFirst);
+
+		/* Only pass on remaining columns that are unsorted */
 		tuplesortstate = tuplesort_begin_heap(tupDesc,
-											  plannode->numCols,
-											  plannode->sortColIdx,
-											  plannode->sortOperators,
-											  plannode->collations,
-											  plannode->nullsFirst,
+											  plannode->numCols - skipCols,
+											  &(plannode->sortColIdx[skipCols]),
+											  &(plannode->sortOperators[skipCols]),
+											  &(plannode->collations[skipCols]),
+											  &(plannode->nullsFirst[skipCols]),
 											  work_mem,
 											  node->randomAccess);
 		if (node->bounded)
diff --git a/src/backend/utils/sort/sortsupport.c b/src/backend/utils/sort/sortsupport.c
index 347f448..df82f5f 100644
--- a/src/backend/utils/sort/sortsupport.c
+++ b/src/backend/utils/sort/sortsupport.c
@@ -85,6 +85,35 @@ PrepareSortSupportComparisonShim(Oid cmpFunc, SortSupport ssup)
 }
 
 /*
+ * Build an array of SortSupportData structures from separated arrays.
+ */
+SortSupport
+MakeSortSupportKeys(int nkeys, AttrNumber *attNums,
+					Oid *sortOperators, Oid *sortCollations,
+					bool *nullsFirstFlags)
+{
+	SortSupport sortKeys = (SortSupport) palloc0(nkeys * sizeof(SortSupportData));
+	int			i;
+
+	for (i = 0; i < nkeys; i++)
+	{
+		SortSupport sortKey = sortKeys + i;
+
+		AssertArg(attNums[i] != 0);
+		AssertArg(sortOperators[i] != 0);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = sortCollations[i];
+		sortKey->ssup_nulls_first = nullsFirstFlags[i];
+		sortKey->ssup_attno = attNums[i];
+
+		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
+	}
+
+	return sortKeys;
+}
+
+/*
  * Fill in SortSupport given an ordering operator (btree "<" or ">" operator).
  *
  * Caller must previously have zeroed the SortSupportData structure and then
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9fb5a9f..738f7a1 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -604,7 +604,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
 	MemoryContext oldcontext;
-	int			i;
 
 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
 
@@ -632,24 +631,11 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->reversedirection = reversedirection_heap;
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
-
-	/* Prepare SortSupport data for each column */
-	state->sortKeys = (SortSupport) palloc0(nkeys * sizeof(SortSupportData));
-
-	for (i = 0; i < nkeys; i++)
-	{
-		SortSupport sortKey = state->sortKeys + i;
-
-		AssertArg(attNums[i] != 0);
-		AssertArg(sortOperators[i] != 0);
-
-		sortKey->ssup_cxt = CurrentMemoryContext;
-		sortKey->ssup_collation = sortCollations[i];
-		sortKey->ssup_nulls_first = nullsFirstFlags[i];
-		sortKey->ssup_attno = attNums[i];
-
-		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
-	}
+	state->sortKeys = MakeSortSupportKeys(nkeys,
+										  attNums,
+										  sortOperators,
+										  sortCollations,
+										  nullsFirstFlags);
 
 	if (nkeys == 1)
 		state->onlyKey = state->sortKeys;
@@ -3544,10 +3530,3 @@ free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
 	FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
 	pfree(stup->tuple);
 }
-
-SortSupport
-tuplesort_get_sortkeys(Tuplesortstate *state)
-{
-	return state->sortKeys;
-}
-
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9fa1823..13a4f0f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1671,6 +1671,7 @@ typedef struct SortState
 	bool		finished;
 	int64		bound_Done;		/* value of bound we did the sort with */
 	void	   *tuplesortstate; /* private state of tuplesort.c */
+	SortSupport skipKeys;		/* columns already sorted in input */
 	HeapTuple	prev;
 } SortState;
 
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 13d3fbe..cd48a45 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -150,6 +150,9 @@ ApplySortComparator(Datum datum1, bool isNull1,
 #endif   /*-- PG_USE_INLINE || SORTSUPPORT_INCLUDE_DEFINITIONS */
 
 /* Other functions in utils/sort/sortsupport.c */
+extern SortSupport MakeSortSupportKeys(int nkeys, AttrNumber *attNums,
+					Oid *sortOperators, Oid *sortCollations,
+					bool *nullsFirstFlags);
 extern void PrepareSortSupportComparisonShim(Oid cmpFunc, SortSupport ssup);
 extern void PrepareSortSupportFromOrderingOp(Oid orderingOp, SortSupport ssup);
 
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 106c3fd..eb882d3 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -114,8 +114,6 @@ extern void tuplesort_get_stats(Tuplesortstate *state,
 
 extern int	tuplesort_merge_order(int64 allowedMem);
 
-extern SortSupport tuplesort_get_sortkeys(Tuplesortstate *state);
-
 /*
  * These routines may only be called if randomAccess was specified 'true'.
  * Likewise, backwards scan in gettuple/getdatum is only allowed if
-- 
1.8.5.3

#37Andreas Karlsson
andreas@proxel.se
In reply to: Jeremy Harris (#35)
Re: PoC: Partial sort

On 01/18/2014 08:13 PM, Jeremy Harris wrote:

On 31/12/13 01:41, Andreas Karlsson wrote:

On 12/29/2013 08:24 AM, David Rowley wrote:

If it was possible to devise some way to reuse any
previous tuplesortstate perhaps just inventing a reset method which
clears out tuples, then we could see performance exceed the standard
seqscan -> sort. The code the way it is seems to lookup the sort
functions from the syscache for each group then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number
of sort groups would be better so that the optimization is skipped if
there are too many sort groups.

It should be possible. I have hacked a quick proof of concept for
reusing the tuplesort state. Can you try it and see if the performance
regression is fixed by this?

One thing which has to be fixed with my patch is that we probably want
to close the tuplesort once we have returned the last tuple from
ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.

How does this work in combination with randomAccess?

As far as I can tell randomAccess was broken by the partial sort patch
even before my change since it would not iterate over multiple
tuplesorts anyway.

Alexander: Is this true or am I missing something?

--
Andreas Karlsson


#38Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andreas Karlsson (#37)
1 attachment(s)
Re: PoC: Partial sort

On Sun, Jan 19, 2014 at 5:57 AM, Andreas Karlsson <andreas@proxel.se> wrote:

On 01/18/2014 08:13 PM, Jeremy Harris wrote:

On 31/12/13 01:41, Andreas Karlsson wrote:

On 12/29/2013 08:24 AM, David Rowley wrote:

If it was possible to devise some way to reuse any
previous tuplesortstate perhaps just inventing a reset method which
clears out tuples, then we could see performance exceed the standard
seqscan -> sort. The code the way it is seems to lookup the sort
functions from the syscache for each group then allocate some sort
space, so quite a bit of time is also spent in palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number
of sort groups would be better so that the optimization is skipped if
there are too many sort groups.

It should be possible. I have hacked a quick proof of concept for
reusing the tuplesort state. Can you try it and see if the performance
regression is fixed by this?

One thing which has to be fixed with my patch is that we probably want
to close the tuplesort once we have returned the last tuple from
ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.

How does this work in combination with randomAccess?

As far as I can tell randomAccess was broken by the partial sort patch
even before my change since it would not iterate over multiple tuplesorts
anyway.

Alexander: Is this true or am I missing something?

Yes, I decided that the Sort node shouldn't provide randomAccess when
skipCols != 0. See the assert at the beginning of ExecInitSort. I decided that
it would be better to add an explicit Materialize node rather than store extra
tuples in tuplesortstate each time.
I also adjusted ExecSupportsMarkRestore and ExecMaterializesOutput so that
the planner believes so. I found path->pathtype to be absolutely never
T_Sort. Correct me if I'm wrong.

Other changes in this version of the patch:
1) Applied Marti Raudsepp's patch to not compare skipCols in tuplesort.
2) Adjusted the sort bound after processing buckets.
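
The message does not spell out how the bound is adjusted. One plausible reading, assumed here, is that with an overall LIMIT each group only needs to produce as many rows as are still missing after the earlier groups have been returned. A standalone sketch of that arithmetic (plan_group_bounds is a hypothetical name, not a function in the patch):

```c
#include <assert.h>
#include <stddef.h>

/*
 * For each group, in presorted order, record the bound that would be
 * handed to that group's sort: the number of rows still needed to satisfy
 * the overall limit. Groups reached after the limit is satisfied get a
 * bound of 0. Returns the total number of rows emitted.
 */
static size_t
plan_group_bounds(const size_t *group_sizes, size_t ngroups,
                  size_t limit, size_t *bounds_out)
{
    size_t remaining = limit;
    size_t emitted = 0;
    size_t g;

    for (g = 0; g < ngroups; g++)
    {
        size_t take = group_sizes[g] < remaining ? group_sizes[g] : remaining;

        bounds_out[g] = remaining;  /* bound for this group's sort */
        emitted += take;
        remaining -= take;          /* later groups need fewer rows */
    }
    return emitted;
}
```

For group sizes 4, 3 and 10 with a limit of 6, the bounds come out as 6, 2 and 0. The second group only needs a top-2 sort and the third group never needs to be sorted at all.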

------
With best regards,
Alexander Korotkov.

Attachments:

partial-sort-6.patch.gz (application/x-gzip)

#39Marti Raudsepp
marti@juffo.org
In reply to: Alexander Korotkov (#27)
Re: PoC: Partial sort

Hi,

On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp <marti@juffo.org> wrote:

I've been trying it out in a few situations. I implemented a new
enable_partialsort GUC to make it easier to turn on/off, this way it's a lot
easier to test. The attached patch applies on top of partial-sort-5.patch

I thought about such an option. Generally not because of testing convenience,
but because of the overhead of planning. The way you implemented it is quite
naive :)

I don't understand. I had another look at this and cost_sort still
seems like the best place to implement this, since that's where the
patch decides how many pre-sorted columns to use. Both mergejoin and
simple order-by plans call into it. If enable_partialsort=false then I
skip all pre-sorted options except full sort, making cost_sort behave
pretty much like it did before the patch.

I could change pathkeys_common to return 0, but that seems like a
generic function that shouldn't be tied to partialsort. The old code
paths called pathkeys_contained_in anyway, which has similar
complexity. (Apart from initial_cost_mergejoin, but that doesn't seem
special enough to make an exception for.)

Or should I use?:
enable_partialsort ? pathkeys_common(...) : 0
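
(Illustration only, not code from the patch.) The caller-side gating asked about above can be sketched as follows. Sort keys are modelled as plain ints, and common_prefix_len and presorted_keys are hypothetical names; the real pathkeys_common() walks Lists of PathKey nodes, and enable_partialsort here is just a stand-in for the proposed GUC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed enable_partialsort GUC. */
static bool enable_partialsort = true;

/*
 * Count how many leading sort keys two orderings have in common.
 * Generic helper: it knows nothing about partial sort.
 */
static int
common_prefix_len(const int *keys1, size_t n1, const int *keys2, size_t n2)
{
    size_t n = 0;

    assert(keys1 != NULL || n1 == 0);
    assert(keys2 != NULL || n2 == 0);

    while (n < n1 && n < n2 && keys1[n] == keys2[n])
        n++;
    return (int) n;
}

/*
 * Caller-side gating: with the setting off, report zero presorted keys,
 * so the caller falls back to a full sort.
 */
static int
presorted_keys(const int *required, size_t nreq,
               const int *available, size_t navail)
{
    return enable_partialsort
        ? common_prefix_len(required, nreq, available, navail)
        : 0;
}
```

Keeping the check in the caller leaves the prefix helper generic, which is the concern raised above about tying pathkeys_common to partial sort.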

For instance, merge join relies on partial sort, which will be
replaced with a simple sort.

Are you saying that enable_partialsort=off should keep
partialsort-based mergejoins enabled?

Or are you saying that merge joins shouldn't use "simple sort" at all?
But merge join was already able to use full Sort nodes before your
patch.

What am I missing?

Regards,
Marti

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Marti Raudsepp
marti@juffo.org
In reply to: Alexander Korotkov (#38)
1 attachment(s)
Re: PoC: Partial sort

On Mon, Jan 20, 2014 at 2:43 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Other changes in this version of the patch:
1) Applied Marti Raudsepp's patch to not compare skipCols in tuplesort.
2) Adjusted the sort bound after processing buckets.

Hi,

Here's a patch with some whitespace and coding style fixes for
partial-sort-6.patch

I tried to understand the mergejoin regression, but this code still
looks like Chinese to me. Can anyone else have a look at it?

Test case: /messages/by-id/CABRT9RDd-P2RLRdHsMq8rCOB46k4a5O+bGz_up2bRGeeH4R6oQ@mail.gmail.com
Original report:
/messages/by-id/CABRT9RCLLUyJ=bkeB132aVA_mVNx5==LvVvQMvUqDguFZtW+cg@mail.gmail.com

Regards,
Marti

Attachments:

0001-Whitespace-coding-style-fixes.patch (text/x-patch; charset=US-ASCII)
From a3cedb922c5a12e43ee94b9d6f5a2aefba701708 Mon Sep 17 00:00:00 2001
From: Marti Raudsepp <marti@juffo.org>
Date: Sun, 26 Jan 2014 16:25:45 +0200
Subject: [PATCH 1/2] Whitespace & coding style fixes

---
 src/backend/executor/nodeSort.c         | 17 +++++++++--------
 src/backend/optimizer/path/costsize.c   |  8 ++++----
 src/backend/optimizer/path/pathkeys.c   | 18 +++++++++---------
 src/backend/optimizer/plan/createplan.c |  2 +-
 src/backend/optimizer/plan/planner.c    |  6 +++---
 src/backend/utils/sort/tuplesort.c      |  2 +-
 6 files changed, 27 insertions(+), 26 deletions(-)

diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index f38190d..2e50497 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -27,13 +27,14 @@
 static bool
 cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
 {
-	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+	int			i,
+				n = ((Sort *)node->ss.ps.plan)->skipCols;
 
 	for (i = 0; i < n; i++)
 	{
-		Datum datumA, datumB;
-		bool isnullA, isnullB;
-		AttrNumber attno = node->skipKeys[i].ssup_attno;
+		Datum		datumA, datumB;
+		bool		isnullA, isnullB;
+		AttrNumber	attno = node->skipKeys[i].ssup_attno;
 
 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
 		datumB = slot_getattr(b, attno, &isnullB);
@@ -147,7 +148,7 @@ ExecSort(SortState *node)
 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
 
 	/*
-	 * Put next group of tuples where skipCols" sort values are equal to
+	 * Put next group of tuples where skipCols' sort values are equal to
 	 * tuplesort.
 	 */
 	for (;;)
@@ -177,10 +178,10 @@ ExecSort(SortState *node)
 			}
 			else
 			{
-				bool cmp;
-				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
+				bool		equal;
+				equal = cmpSortSkipCols(node, tupDesc, node->prev, slot);
 				node->prev = ExecCopySlotTuple(slot);
-				if (!cmp)
+				if (!equal)
 					break;
 			}
 		}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 9e79c6d..3a18632 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1331,13 +1331,13 @@ cost_sort(Path *path, PlannerInfo *root,
 	 */
 	if (presorted_keys > 0)
 	{
-		List *groupExprs = NIL;
-		ListCell *l;
-		int i = 0;
+		List	   *groupExprs = NIL;
+		ListCell   *l;
+		int			i = 0;
 
 		foreach(l, pathkeys)
 		{
-			PathKey *key = (PathKey *)lfirst(l);
+			PathKey	   *key = (PathKey *) lfirst(l);
 			EquivalenceMember *member = (EquivalenceMember *)
 								lfirst(list_head(key->pk_eclass->ec_members));
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 55d8ef4..1e1a09a 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -319,10 +319,9 @@ compare_pathkeys(List *keys1, List *keys2)
 int
 pathkeys_common(List *keys1, List *keys2)
 {
-	int n;
+	int 		n = 0;
 	ListCell   *key1,
 			   *key2;
-	n = 0;
 
 	forboth(key1, keys1, key2, keys2)
 	{
@@ -460,7 +459,7 @@ get_cheapest_fractional_path_for_pathkeys(List *paths,
 	num_groups = (double *)palloc(sizeof(double) * list_length(pathkeys));
 	foreach(l, pathkeys)
 	{
-		PathKey *key = (PathKey *)lfirst(l);
+		PathKey *key = (PathKey *) lfirst(l);
 		EquivalenceMember *member = (EquivalenceMember *)
 							lfirst(list_head(key->pk_eclass->ec_members));
 
@@ -1085,7 +1084,6 @@ find_mergeclauses_for_pathkeys(PlannerInfo *root,
 	List	   *mergeclauses = NIL;
 	ListCell   *i;
 	bool	   *used = (bool *)palloc0(sizeof(bool) * list_length(restrictinfos));
-	int			k;
 	List	   *unusedRestrictinfos = NIL;
 	List	   *usedPathkeys = NIL;
 
@@ -1103,6 +1101,7 @@ find_mergeclauses_for_pathkeys(PlannerInfo *root,
 		EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
 		List	   *matched_restrictinfos = NIL;
 		ListCell   *j;
+		int			k = 0;
 
 		/*----------
 		 * A mergejoin clause matches a pathkey if it has the same EC.
@@ -1140,7 +1139,6 @@ find_mergeclauses_for_pathkeys(PlannerInfo *root,
 		 * deal with the case in create_mergejoin_plan().
 		 *----------
 		 */
-		k = 0;
 		foreach(j, restrictinfos)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(j);
@@ -1182,7 +1180,9 @@ find_mergeclauses_for_pathkeys(PlannerInfo *root,
 	 */
 	if (outersortkeys)
 	{
-		List *addPathkeys, *addMergeclauses;
+		List	   *addPathkeys,
+				   *addMergeclauses;
+		int			k = 0;
 
 		*outersortkeys = pathkeys;
 
@@ -1192,7 +1192,6 @@ find_mergeclauses_for_pathkeys(PlannerInfo *root,
 		/*
 		 * Find restrictions unused by given pathkeys.
 		 */
-		k = 0;
 		foreach(i, restrictinfos)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(i);
@@ -1208,7 +1207,8 @@ find_mergeclauses_for_pathkeys(PlannerInfo *root,
 		 * Generate pathkeys based on those restrictions.
 		 */
 		addPathkeys = select_outer_pathkeys_for_merge(root,
-												unusedRestrictinfos, joinrel);
+													  unusedRestrictinfos,
+													  joinrel);
 
 		if (!addPathkeys)
 			return mergeclauses;
@@ -1631,7 +1631,7 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
-	int n;
+	int			n;
 
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index d9a65c3..755f5e6 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -3744,7 +3744,7 @@ make_mergejoin(List *tlist,
  */
 static Sort *
 make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
-          List *pathkeys, int skipCols,
+		  List *pathkeys, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst,
 		  double limit_tuples)
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e5cf5a8..0c3d18d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1763,7 +1763,7 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 					int			n_common_pathkeys;
 
 					n_common_pathkeys = pathkeys_common(window_pathkeys,
-													    current_pathkeys);
+														current_pathkeys);
 
 					sort_plan = make_sort_from_pathkeys(root,
 														result_plan,
@@ -1946,8 +1946,8 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 	 */
 	if (parse->sortClause)
 	{
-		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
-		
+		int			common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
+
 		if (common < list_length(root->sort_pathkeys))
 		{
 			result_plan = (Plan *) make_sort_from_pathkeys(root,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index fb5d8b5..4a0ce29 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -949,7 +949,7 @@ tuplesort_end(Tuplesortstate *state)
 void
 tuplesort_reset(Tuplesortstate *state)
 {
-	int i;
+	int			i;
 
 	if (state->tapeset)
 		LogicalTapeSetClose(state->tapeset);
-- 
1.8.5.3

#41Alexander Korotkov
aekorotkov@gmail.com
In reply to: Marti Raudsepp (#39)
1 attachment(s)
Re: PoC: Partial sort

Hi!

On Tue, Jan 21, 2014 at 3:24 AM, Marti Raudsepp <marti@juffo.org> wrote:

On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp <marti@juffo.org> wrote:

I've been trying it out in a few situations. I implemented a new
enable_partialsort GUC to make it easier to turn on/off, this way it's a lot
easier to test. The attached patch applies on top of partial-sort-5.patch

I thought about such an option. Generally not because of testing convenience,
but because of the overhead of planning. The way you implemented it is quite
naive :)

I don't understand. I had another look at this and cost_sort still
seems like the best place to implement this, since that's where the
patch decides how many pre-sorted columns to use. Both mergejoin and
simple order-by plans call into it. If enable_partialsort=false then I
skip all pre-sorted options except full sort, making cost_sort behave
pretty much like it did before the patch.

I could change pathkeys_common to return 0, but that seems like a
generic function that shouldn't be tied to partialsort. The old code
paths called pathkeys_contained_in anyway, which has similar
complexity. (Apart from initial_cost_mergejoin, but that doesn't seem
special enough to make an exception for.)

Or should I use?:
enable_partialsort ? pathkeys_common(...) : 0

For instance, merge join relies on partial sort, which will be
replaced with a simple sort.

Are you saying that enable_partialsort=off should keep
partialsort-based mergejoins enabled?

Or are you saying that merge joins shouldn't use "simple sort" at all?
But merge join was already able to use full Sort nodes before your
patch.

Sorry that I didn't explain it. In particular, I mean the following:
1) With enable_partialsort = off, all mergejoin logic should behave as it
does without the partial sort patch.
2) With the partial sort patch, the get_cheapest_fractional_path_for_pathkeys
function is much more expensive to execute. With enable_partialsort = off
it should be as cheap as without the partial sort patch.
I'll try to implement this option this week.
For now, I have an attempt at fixing the extra columns in mergejoin problem.
It would be nice if you could test it.

------
With best regards,
Alexander Korotkov.

Attachments:

partial-sort-7.patch.gz (application/x-gzip)
#42Marti Raudsepp
marti@juffo.org
In reply to: Alexander Korotkov (#41)
Re: PoC: Partial sort

On Mon, Jan 27, 2014 at 9:26 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

For now, I have an attempt at fixing the extra columns in mergejoin problem.
It would be nice if you could test it.

Yes, it solves the test cases I was trying with, thanks.

1) With enable_partialsort = off, all mergejoin logic should behave as it
does without the partial sort patch.
2) With the partial sort patch, the get_cheapest_fractional_path_for_pathkeys
function is much more expensive to execute. With enable_partialsort = off
it should be as cheap as without the partial sort patch.

When it comes to planning time, I really don't think you should
bother. The planner enable_* settings are meant for troubleshooting,
debugging and learning about the planner. You should not expect people
to disable them in a production setting. It's not worth complicating
the code for that rare case.

This is stated in the documentation
(http://www.postgresql.org/docs/current/static/runtime-config-query.html)
and repeatedly on the mailing lists.

But some benchmarks of planning performance are certainly warranted.

Regards,
Marti


#43Alexander Korotkov
aekorotkov@gmail.com
In reply to: Marti Raudsepp (#42)
Re: PoC: Partial sort

On Tue, Jan 28, 2014 at 7:41 AM, Marti Raudsepp <marti@juffo.org> wrote:

On Mon, Jan 27, 2014 at 9:26 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

For now, I have an attempt at fixing the extra columns in mergejoin problem.
It would be nice if you could test it.

Yes, it solves the test cases I was trying with, thanks.

1) With enable_partialsort = off, all mergejoin logic should behave as it
does without the partial sort patch.
2) With the partial sort patch, the get_cheapest_fractional_path_for_pathkeys
function is much more expensive to execute. With enable_partialsort = off
it should be as cheap as without the partial sort patch.

When it comes to planning time, I really don't think you should
bother. The planner enable_* settings are meant for troubleshooting,
debugging and learning about the planner. You should not expect people
to disable them in a production setting. It's not worth complicating
the code for that rare case.

This is stated in the documentation
(http://www.postgresql.org/docs/current/static/runtime-config-query.html)
and repeatedly on the mailing lists.

But some benchmarks of planning performance are certainly warranted.

I didn't test it, but I worry that the overhead might be high.
If that's true, it could be like the constraint_exclusion option, which is
off by default because of planning overhead.

------
With best regards,
Alexander Korotkov.

#44Marti Raudsepp
marti@juffo.org
In reply to: Alexander Korotkov (#43)
Re: PoC: Partial sort

On Tue, Jan 28, 2014 at 7:51 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

I didn't test it, but I worry that the overhead might be high.
If that's true, it could be like the constraint_exclusion option, which is
off by default because of planning overhead.

I see, that makes sense.

I will try to find the time to run some benchmarks in the coming few days.

Regards,
Marti


#45Marti Raudsepp
marti@juffo.org
In reply to: Alexander Korotkov (#43)
Re: PoC: Partial sort

On Tue, Jan 28, 2014 at 7:51 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Jan 28, 2014 at 7:41 AM, Marti Raudsepp <marti@juffo.org> wrote:

But some benchmarks of planning performance are certainly warranted.

I didn't test it, but I worry that the overhead might be high.
If that's true, it could be like the constraint_exclusion option, which is
off by default because of planning overhead.

Sorry I didn't get around to this before.

I ran some synthetic benchmarks with single-column inner joins between 5
tables, with indexes on both joined columns, using only EXPLAIN (so
measuring planning time, not execution) in 9 scenarios to exercise
different code paths. According to these measurements, the overhead ranges
between 1.0 and 4.5% depending on the scenario.

----
Merge join with partial sort children seems like a fairly obscure use case
(though I'm sure it can help a lot in those cases). The default should
definitely allow partial sort in normal ORDER BY queries. What's under
question here is whether to enable partial sort for mergejoin.

So I see 3 possible resolutions:
1. The overhead is deemed acceptable to enable by default, in which case
we're done here.
2. Add a three-value runtime setting like: enable_partialsort = [ off |
no_mergejoin | on ], defaulting to no_mergejoin (just to get the point
across, clearly we need better naming). This is how constraint_exclusion
works.
3. Remove the partialsort mergejoin code entirely, keeping the rest of the
cases.
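
(Illustration only, not code from any posted patch.) Option 2 can be sketched as a three-value setting plus a predicate that planner code would consult. The names PartialSortMode, parse_partialsort_mode and partialsort_allowed are hypothetical, and the value spellings are the placeholder names used above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical three-value setting: off | no_mergejoin | on. */
typedef enum PartialSortMode
{
    PARTIALSORT_OFF,
    PARTIALSORT_NO_MERGEJOIN,
    PARTIALSORT_ON
} PartialSortMode;

/* Parse the textual setting; returns false for an unknown value. */
static bool
parse_partialsort_mode(const char *value, PartialSortMode *mode)
{
    assert(value != NULL && mode != NULL);

    if (strcmp(value, "off") == 0)
        *mode = PARTIALSORT_OFF;
    else if (strcmp(value, "no_mergejoin") == 0)
        *mode = PARTIALSORT_NO_MERGEJOIN;
    else if (strcmp(value, "on") == 0)
        *mode = PARTIALSORT_ON;
    else
        return false;
    return true;
}

/*
 * Should a partial sort be considered in this context?
 * for_mergejoin is true when the sort would feed a merge join input.
 */
static bool
partialsort_allowed(PartialSortMode mode, bool for_mergejoin)
{
    if (mode == PARTIALSORT_OFF)
        return false;
    if (mode == PARTIALSORT_NO_MERGEJOIN)
        return !for_mergejoin;
    return true;
}
```

With the suggested default of no_mergejoin, plain ORDER BY queries would still get partial sort while the more expensive mergejoin search stays disabled.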

What do you think?

----
All the tests are available here:
https://github.com/intgr/benchjunk/tree/master/partial_sort (using script
run2.sh)

Overhead by test (partial-sort-7.patch.gz):
join5.sql 2.9% (all joins on the same column)
star5.sql 1.7% ("star schema" kind of join)
line5.sql 1.9% (joins chained to each other)
lim_join5.sql 4.5% (same as above, with LIMIT 1)
lim_star5.sql 2.8%
lim_line5.sql 1.8%
limord_join5.sql 4.3% (same as above, with ORDER BY & LIMIT 1)
limord_star5.sql 3.9%
limord_line5.sql 1.0%

Full data:
PostgreSQL @ git ac8bc3b
join5.sql tps = 499.490173 (excluding connections establishing)
join5.sql tps = 503.756335 (excluding connections establishing)
join5.sql tps = 504.814072 (excluding connections establishing)
star5.sql tps = 492.799230 (excluding connections establishing)
star5.sql tps = 492.570615 (excluding connections establishing)
star5.sql tps = 491.949985 (excluding connections establishing)
line5.sql tps = 773.945050 (excluding connections establishing)
line5.sql tps = 773.858068 (excluding connections establishing)
line5.sql tps = 774.551240 (excluding connections establishing)
lim_join5.sql tps = 392.539745 (excluding connections establishing)
lim_join5.sql tps = 391.867549 (excluding connections establishing)
lim_join5.sql tps = 393.361655 (excluding connections establishing)
lim_star5.sql tps = 418.431804 (excluding connections establishing)
lim_star5.sql tps = 419.258985 (excluding connections establishing)
lim_star5.sql tps = 419.434697 (excluding connections establishing)
lim_line5.sql tps = 713.852506 (excluding connections establishing)
lim_line5.sql tps = 713.636694 (excluding connections establishing)
lim_line5.sql tps = 712.971719 (excluding connections establishing)
limord_join5.sql tps = 381.068465 (excluding connections establishing)
limord_join5.sql tps = 380.379359 (excluding connections establishing)
limord_join5.sql tps = 381.182385 (excluding connections establishing)
limord_star5.sql tps = 412.997935 (excluding connections establishing)
limord_star5.sql tps = 411.401352 (excluding connections establishing)
limord_star5.sql tps = 413.209784 (excluding connections establishing)
limord_line5.sql tps = 688.906406 (excluding connections establishing)
limord_line5.sql tps = 689.445483 (excluding connections establishing)
limord_line5.sql tps = 688.758042 (excluding connections establishing)

partial-sort-7.patch.gz
join5.sql tps = 479.508034 (excluding connections establishing)
join5.sql tps = 488.263674 (excluding connections establishing)
join5.sql tps = 490.127433 (excluding connections establishing)
star5.sql tps = 482.106063 (excluding connections establishing)
star5.sql tps = 484.179687 (excluding connections establishing)
star5.sql tps = 483.027372 (excluding connections establishing)
line5.sql tps = 758.092993 (excluding connections establishing)
line5.sql tps = 759.697814 (excluding connections establishing)
line5.sql tps = 759.792792 (excluding connections establishing)
lim_join5.sql tps = 375.517211 (excluding connections establishing)
lim_join5.sql tps = 375.539109 (excluding connections establishing)
lim_join5.sql tps = 375.841645 (excluding connections establishing)
lim_star5.sql tps = 407.683110 (excluding connections establishing)
lim_star5.sql tps = 407.414409 (excluding connections establishing)
lim_star5.sql tps = 407.526613 (excluding connections establishing)
lim_line5.sql tps = 699.905101 (excluding connections establishing)
lim_line5.sql tps = 700.349675 (excluding connections establishing)
lim_line5.sql tps = 700.661762 (excluding connections establishing)
limord_join5.sql tps = 364.607236 (excluding connections establishing)
limord_join5.sql tps = 364.367705 (excluding connections establishing)
limord_join5.sql tps = 363.694065 (excluding connections establishing)
limord_star5.sql tps = 397.036792 (excluding connections establishing)
limord_star5.sql tps = 397.197359 (excluding connections establishing)
limord_star5.sql tps = 395.797940 (excluding connections establishing)
limord_line5.sql tps = 680.907397 (excluding connections establishing)
limord_line5.sql tps = 682.206481 (excluding connections establishing)
limord_line5.sql tps = 681.210267 (excluding connections establishing)
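
(Illustration only.) The overhead percentages listed above are consistent with comparing the best of the three runs per scenario, that is 1 - best_patched / best_baseline. This is an inference from the numbers, not something stated in the message, and the helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Highest tps among a set of runs (assumption: best-of-N is compared). */
static double
best_tps(const double *runs, size_t n)
{
    double best;
    size_t i;

    assert(runs != NULL && n > 0);

    best = runs[0];
    for (i = 1; i < n; i++)
    {
        if (runs[i] > best)
            best = runs[i];
    }
    return best;
}

/* Planning overhead in percent: how much throughput the patched build loses. */
static double
overhead_percent(const double *baseline, size_t nbase,
                 const double *patched, size_t npatched)
{
    return (1.0 - best_tps(patched, npatched) / best_tps(baseline, nbase)) * 100.0;
}
```

For join5.sql this gives (1 - 490.127433 / 504.814072) * 100, roughly 2.9%, matching the table.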

Regards,
Marti

#46Robert Haas
robertmhaas@gmail.com
In reply to: Marti Raudsepp (#45)
Re: PoC: Partial sort

On Wed, Feb 5, 2014 at 6:58 PM, Marti Raudsepp <marti@juffo.org> wrote:

I ran some synthetic benchmarks with single-column inner joins between 5
tables, with indexes on both joined columns, using only EXPLAIN (so
measuring planning time, not execution) in 9 scenarios to exercise
different code paths. According to these measurements, the overhead ranges
between 1.0 and 4.5% depending on the scenario.

Hmm, sounds a little steep. Why is it so expensive? I'm probably
missing something here, because I would have thought that planner
support for partial sorts would consist mostly of considering the same
sorts we consider today, but with the costs reduced by the batching.
Changing the cost estimation that way can't be that much more
expensive than what we're already doing, so the overhead should be
minimal. What the patch is actually doing seems to be something quite
a bit more invasive than that, but I'm not sure what it is exactly, or
why.
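
(Illustration only, not the patch's executor code.) The batching idea can be sketched like this: the input already arrives ordered by the first key, so each run of equal first-key values is sorted on the second key by itself, and a LIMIT lets the work stop after the first few batches. Row, cmp_v2 and partial_sort are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Row
{
    int v1;     /* presorted key, e.g. delivered in order by an index scan */
    int v2;     /* key that still has to be sorted */
} Row;

/* qsort comparator on v2 only; v1 is equal within a batch. */
static int
cmp_v2(const void *a, const void *b)
{
    const Row *ra = a;
    const Row *rb = b;

    return (ra->v2 > rb->v2) - (ra->v2 < rb->v2);
}

/*
 * rows must already be ordered by v1.  Sorts each batch of equal v1 by v2,
 * stopping once at least `limit` rows are in final order.  Returns the
 * number of leading rows the caller may emit (at most `limit`).
 */
static size_t
partial_sort(Row *rows, size_t nrows, size_t limit)
{
    size_t start = 0;

    assert(rows != NULL || nrows == 0);

    while (start < nrows && start < limit)
    {
        size_t end = start + 1;

        /* Find the end of the current batch of equal v1 values. */
        while (end < nrows && rows[end].v1 == rows[start].v1)
            end++;

        qsort(rows + start, end - start, sizeof(Row), cmp_v2);
        start = end;
    }
    return start < limit ? start : limit;
}
```

Each batch is much smaller than the whole input, which is where the reduced sort cost comes from; later batches are never touched when the limit is reached early.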

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#47Marti Raudsepp
marti@juffo.org
In reply to: Robert Haas (#46)
Re: PoC: Partial sort

On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Hmm, sounds a little steep. Why is it so expensive? I'm probably
missing something here, because I would have thought that planner
support for partial sorts would consist mostly of considering the same
sorts we consider today, but with the costs reduced by the batching.

I guess it's because the patch undoes some optimizations in the
mergejoin planner wrt caching merge clauses and adds a whole lot of
code to find_mergeclauses_for_pathkeys. In other code paths the
overhead does seem to be negligible.

Notice the removal of:
/* Select the right mergeclauses, if we didn't already */
/*
* Avoid rebuilding clause list if we already made one;
* saves memory in big join trees...
*/

Regards,
Marti


#48Robert Haas
robertmhaas@gmail.com
In reply to: Marti Raudsepp (#47)
Re: PoC: Partial sort

On Thu, Feb 6, 2014 at 3:39 AM, Marti Raudsepp <marti@juffo.org> wrote:

On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Hmm, sounds a little steep. Why is it so expensive? I'm probably
missing something here, because I would have thought that planner
support for partial sorts would consist mostly of considering the same
sorts we consider today, but with the costs reduced by the batching.

I guess it's because the patch undoes some optimizations in the
mergejoin planner wrt caching merge clauses and adds a whole lot of
code to find_mergeclauses_for_pathkeys. In other code paths the
overhead does seem to be negligible.

Notice the removal of:
/* Select the right mergeclauses, if we didn't already */
/*
* Avoid rebuilding clause list if we already made one;
* saves memory in big join trees...
*/

Yeah, I noticed that. My feeling is that those optimizations got put
in there because someone found them to be important, so I'm skeptical
about removing them. It may be that having the capability to do a
partial sort makes it seem worth spending more CPU looking for merge
joins, but I'd vote for making any such change a separate patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#49Marti Raudsepp
marti@juffo.org
In reply to: Robert Haas (#48)
Re: PoC: Partial sort

On Thu, Feb 6, 2014 at 9:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

It may be that having the capability to do a
partial sort makes it seem worth spending more CPU looking for merge
joins, but I'd vote for making any such change a separate patch.

Agreed.

Alexander, should I work on splitting up the patch in two, or do you
want to do it yourself?

Should I merge my coding style and enable_partialsort patches while at
it, or do you still have reservations about those?

Regards,
Marti


#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#48)
Re: PoC: Partial sort

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Feb 6, 2014 at 3:39 AM, Marti Raudsepp <marti@juffo.org> wrote:

I guess it's because the patch undoes some optimizations in the
mergejoin planner wrt caching merge clauses and adds a whole lot of
code to find_mergeclauses_for_pathkeys. In other code paths the
overhead does seem to be negligible.

Yeah, I noticed that. My feeling is that those optimizations got put
in there because someone found them to be important, so I'm skeptical
about removing them.

I put them in, and yeah they are important. Even with those, and even
with the rather arbitrary heuristic restrictions that joinpath.c puts on
what mergeclause lists to consider, the existing planner spends a whole
lot of effort on mergejoins --- possibly disproportionate to their actual
value. I think that any patch that removes those optimizations is not
going to fly. If anything, it'd be better to reduce the number of
mergejoins considered even further, because a lot of the possible plans
are not usefully different.

It's already the case that we expect indxpath.c to predict the useful
orderings (by reference to query_pathkeys and available mergejoin clauses)
and generate suitable paths, rather than trying to identify the orderings
at join time. Can't that approach be extended to cover this technique?

In any case, the bottom line is that we don't want this patch to cause
the planner to consider large numbers of new but useless sort orderings.

regards, tom lane


#51Alexander Korotkov
aekorotkov@gmail.com
In reply to: Marti Raudsepp (#47)
Re: PoC: Partial sort

On Thu, Feb 6, 2014 at 12:39 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Hmm, sounds a little steep. Why is it so expensive? I'm probably
missing something here, because I would have thought that planner
support for partial sorts would consist mostly of considering the same
sorts we consider today, but with the costs reduced by the batching.

I guess it's because the patch undoes some optimizations in the
mergejoin planner wrt caching merge clauses and adds a whole lot of
code to find_mergeclauses_for_pathkeys. In other code paths the
overhead does seem to be negligible.

Notice the removal of:
/* Select the right mergeclauses, if we didn't already */
/*
* Avoid rebuilding clause list if we already made one;
* saves memory in big join trees...
*/

This is not the only place that worries me about planning overhead.
See get_cheapest_fractional_path_for_pathkeys. I had to estimate the number
of groups for each sorting column in order to get the right fractional path.
For a partial sort path, the cost of the first batch should be included in
the initial (startup) cost.
If we don't do so, the optimizer can pick strange plans based on the
assumption that it needs only a few rows from the inner node. See an example.

create table test1 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create table test2 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create index test1_v1_idx on test1 (v1);

Plan without fraction estimation in
get_cheapest_fractional_path_for_pathkeys:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 order by t1.v1, t1.id limit 10;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Limit  (cost=198956893.20..198956913.33 rows=10 width=24)
   ->  Partial sort  (cost=198956893.20..19909637942.82 rows=9791031169 width=24)
         Sort Key: t1.v1, t1.id
         Presorted Key: t1.v1
         ->  Nested Loop  (cost=0.42..19883065506.84 rows=9791031169 width=24)
               Join Filter: (t1.v1 = t2.v1)
               ->  Index Scan using test1_v1_idx on test1 t1  (cost=0.42..47600.84 rows=1000000 width=12)
               ->  Materialize  (cost=0.00..25289.00 rows=1000000 width=12)
                     ->  Seq Scan on test2 t2  (cost=0.00..15406.00 rows=1000000 width=12)
(9 rows)

Current version of patch:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 order by t1.v1, t1.id limit 10;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Limit  (cost=3699913.43..3699913.60 rows=10 width=24)
   ->  Partial sort  (cost=3699913.43..173638549.67 rows=9791031169 width=24)
         Sort Key: t1.v1, t1.id
         Presorted Key: t1.v1
         ->  Merge Join  (cost=150444.79..147066113.70 rows=9791031169 width=24)
               Merge Cond: (t1.v1 = t2.v1)
               ->  Index Scan using test1_v1_idx on test1 t1  (cost=0.42..47600.84 rows=1000000 width=12)
               ->  Materialize  (cost=149244.84..154244.84 rows=1000000 width=12)
                     ->  Sort  (cost=149244.84..151744.84 rows=1000000 width=12)
                           Sort Key: t2.v1
                           ->  Seq Scan on test2 t2  (cost=0.00..15406.00 rows=1000000 width=12)
(11 rows)

I didn't compare actual execution times because I didn't wait for the first
plan to finish executing :-)
But anyway, its costs are extraordinary, and the inner sequential scan of
1000000 rows is odd.
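
The effect can be illustrated with the cost numbers from the two plans quoted above. The following is only a rough Python sketch, not the patch's code: it assumes a path's cost for fetching a fraction f of its rows is interpolated as startup + f * (total - startup), and it assumes about 101 groups because (random()*100)::int yields the integers 0 through 100. The names fractional_cost, cheapest and NUM_GROUPS are made up for the illustration.

```python
# Toy model of fractional path costing (an illustration, not PostgreSQL code).

def fractional_cost(startup, total, fraction):
    """Cost of fetching the given fraction of a path's rows (linear interpolation)."""
    return startup + fraction * (total - startup)

# (startup, total) join costs copied from the two EXPLAIN outputs above.
NESTED_LOOP = (0.42, 19883065506.84)
MERGE_JOIN = (150444.79, 147066113.70)

ROWS = 9791031169   # estimated join rows from the plans
LIMIT = 10
NUM_GROUPS = 101    # assumption: v1 takes the values 0..100

def cheapest(fraction):
    """Name of the join that is cheaper when only `fraction` of its rows is needed."""
    costs = {
        "nested loop": fractional_cost(*NESTED_LOOP, fraction),
        "merge join": fractional_cost(*MERGE_JOIN, fraction),
    }
    return min(costs, key=costs.get)

# Naive view: LIMIT 10 needs only 10 of the join's rows.
naive_fraction = LIMIT / ROWS
# Batch-aware view: the sort must read at least one whole group first.
batch_fraction = max(naive_fraction, 1.0 / NUM_GROUPS)

print(cheapest(naive_fraction))   # join preferred under the naive fraction
print(cheapest(batch_fraction))   # join preferred once the first batch is counted
```

Under these assumptions the naive fraction favours the nested loop, while counting a whole first group favours the merge join, which matches the difference between the two plans.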

------
With best regards,
Alexander Korotkov.

#52Marti Raudsepp
marti@juffo.org
In reply to: Alexander Korotkov (#51)
Re: PoC: Partial sort

On Sun, Feb 9, 2014 at 7:37 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

This is not the only place that worries me regarding planning overhead. See
get_cheapest_fractional_path_for_pathkeys. I had to estimate the number of
groups for each sorting column in order to get the right fractional path.

AFAICT this only happens once per plan and the overhead is O(n) to the
number of pathkeys? I can't get worried about that, but I guess it's
better to test anyway.

PS: You didn't answer my questions about splitting the patch. I guess
I'll have to do that anyway to run the tests.

Regards,
Marti

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53Alexander Korotkov
aekorotkov@gmail.com
In reply to: Marti Raudsepp (#52)
2 attachment(s)
Re: PoC: Partial sort

On Mon, Feb 10, 2014 at 2:33 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Sun, Feb 9, 2014 at 7:37 PM, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

This is not the only place that worries me regarding planning overhead. See
get_cheapest_fractional_path_for_pathkeys. I had to estimate the number of
groups for each sorting column in order to get the right fractional path.

AFAICT this only happens once per plan and the overhead is O(n) to the
number of pathkeys? I can't get worried about that, but I guess it's
better to test anyway.

PS: You didn't answer my questions about splitting the patch. I guess
I'll have to do that anyway to run the tests.

Done. The patch is split.

------
With best regards,
Alexander Korotkov.

Attachments:

partial-sort-basic-1.patch.gz (application/x-gzip)
partial-sort-merge-1.patch.gz (application/x-gzip)

#54Marti Raudsepp
marti@juffo.org
In reply to: Alexander Korotkov (#53)
1 attachment(s)
Re: PoC: Partial sort

On Mon, Feb 10, 2014 at 8:59 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Done. The patch is split.

Thanks!

I think the 1st patch now has a bug in initial_cost_mergejoin; you
still pass the "presorted_keys" argument to cost_sort, making it
calculate a partial sort cost, but generated plans never use partial
sort. I think 0 should be passed instead. Patch attached, needs to be
applied on top of partial-sort-basic-1 and then reverse-applied on
partial-sort-merge-1.

With partial-sort-basic-1 and this fix on the same test suite, the
planner overhead is now a more manageable 0.5% to 1.3%; one test is
faster by 0.5%. Built with asserts disabled, ran on Intel i5-3570K. In
an effort to reduce variance, I locked the server and pgbench to a
single CPU core (taskset -c 3), but there are still noticeable
run-to-run differences, so these numbers are a bit fuzzy. The faster
result is definitely not a fluke, however; it happens every time.

On Mon, Feb 10, 2014 at 2:33 PM, Marti Raudsepp <marti@juffo.org> wrote:

AFAICT this only happens once per plan and the overhead is O(n) to the
number of pathkeys?

I was of course wrong about that; it also adds extra overhead when
iterating over the paths list.

----
Test "taskset -c 3 run2.sh" from
https://github.com/intgr/benchjunk/tree/master/partial_sort

Overhead percentages (between best of each 3 runs):
join5.sql          0.7
star5.sql          0.8
line5.sql          0.5
lim_join5.sql     -0.5
lim_star5.sql      1.3
lim_line5.sql      0.5
limord_join5.sql   0.6
limord_star5.sql   0.5
limord_line5.sql   0.7

Raw results:
git 48870dd
join5.sql tps = 509.328070 (excluding connections establishing)
join5.sql tps = 509.772190 (excluding connections establishing)
join5.sql tps = 510.651517 (excluding connections establishing)
star5.sql tps = 499.208698 (excluding connections establishing)
star5.sql tps = 498.200314 (excluding connections establishing)
star5.sql tps = 496.269315 (excluding connections establishing)
line5.sql tps = 797.968831 (excluding connections establishing)
line5.sql tps = 797.011690 (excluding connections establishing)
line5.sql tps = 796.379258 (excluding connections establishing)
lim_join5.sql tps = 394.946024 (excluding connections establishing)
lim_join5.sql tps = 395.417689 (excluding connections establishing)
lim_join5.sql tps = 395.482958 (excluding connections establishing)
lim_star5.sql tps = 423.434393 (excluding connections establishing)
lim_star5.sql tps = 423.774305 (excluding connections establishing)
lim_star5.sql tps = 424.386099 (excluding connections establishing)
lim_line5.sql tps = 733.007330 (excluding connections establishing)
lim_line5.sql tps = 731.794731 (excluding connections establishing)
lim_line5.sql tps = 732.356280 (excluding connections establishing)
limord_join5.sql tps = 385.317921 (excluding connections establishing)
limord_join5.sql tps = 385.915870 (excluding connections establishing)
limord_join5.sql tps = 384.747848 (excluding connections establishing)
limord_star5.sql tps = 417.992615 (excluding connections establishing)
limord_star5.sql tps = 416.944685 (excluding connections establishing)
limord_star5.sql tps = 418.262647 (excluding connections establishing)
limord_line5.sql tps = 708.979203 (excluding connections establishing)
limord_line5.sql tps = 710.926866 (excluding connections establishing)
limord_line5.sql tps = 710.928907 (excluding connections establishing)

48870dd + partial-sort-basic-1.patch.gz + fix-cost_sort.patch
join5.sql tps = 505.488181 (excluding connections establishing)
join5.sql tps = 507.222759 (excluding connections establishing)
join5.sql tps = 506.549654 (excluding connections establishing)
star5.sql tps = 495.432915 (excluding connections establishing)
star5.sql tps = 494.906793 (excluding connections establishing)
star5.sql tps = 492.623808 (excluding connections establishing)
line5.sql tps = 789.315968 (excluding connections establishing)
line5.sql tps = 793.875456 (excluding connections establishing)
line5.sql tps = 790.545990 (excluding connections establishing)
lim_join5.sql tps = 396.956732 (excluding connections establishing)
lim_join5.sql tps = 397.515213 (excluding connections establishing)
lim_join5.sql tps = 397.578669 (excluding connections establishing)
lim_star5.sql tps = 417.459963 (excluding connections establishing)
lim_star5.sql tps = 418.024803 (excluding connections establishing)
lim_star5.sql tps = 418.830234 (excluding connections establishing)
lim_line5.sql tps = 729.186915 (excluding connections establishing)
lim_line5.sql tps = 726.288788 (excluding connections establishing)
lim_line5.sql tps = 728.123296 (excluding connections establishing)
limord_join5.sql tps = 383.484767 (excluding connections establishing)
limord_join5.sql tps = 383.021960 (excluding connections establishing)
limord_join5.sql tps = 383.722051 (excluding connections establishing)
limord_star5.sql tps = 414.138460 (excluding connections establishing)
limord_star5.sql tps = 414.063766 (excluding connections establishing)
limord_star5.sql tps = 416.130110 (excluding connections establishing)
limord_line5.sql tps = 706.002589 (excluding connections establishing)
limord_line5.sql tps = 705.632796 (excluding connections establishing)
limord_line5.sql tps = 704.991305 (excluding connections establishing)
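
As a cross-check, the overhead percentages can be recomputed from the raw results above. The Python sketch below takes the best (highest) tps of each three runs and reports the relative drop; that exact formula is an assumption, since the message only says "between best of each 3 runs". Only three of the nine tests are copied in, and the names overhead_pct, baseline and patched are made up for the illustration.

```python
# Recompute planner overhead from pgbench tps figures (illustrative sketch).

def overhead_pct(baseline_runs, patched_runs):
    """Relative tps drop between the best baseline run and the best patched run, in percent."""
    best_base = max(baseline_runs)
    best_patched = max(patched_runs)
    return (best_base - best_patched) / best_base * 100.0

# tps values copied from the raw results for three of the tests.
baseline = {
    "join5.sql": [509.328070, 509.772190, 510.651517],
    "lim_join5.sql": [394.946024, 395.417689, 395.482958],
    "lim_star5.sql": [423.434393, 423.774305, 424.386099],
}
patched = {
    "join5.sql": [505.488181, 507.222759, 506.549654],
    "lim_join5.sql": [396.956732, 397.515213, 397.578669],
    "lim_star5.sql": [417.459963, 418.024803, 418.830234],
}

for name in baseline:
    print(f"{name:<16} {overhead_pct(baseline[name], patched[name]):5.1f}")
```

A negative value means the patched build was faster in that test, as with lim_join5.sql.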

Regards,
Marti

Attachments:

fix-cost_sort.patch (text/x-patch; charset=US-ASCII)

commit c310b6649c4ff9929d4d26ff965b88cbe915dd6c
Author: Marti Raudsepp <marti@juffo.org>
Date:   Wed Feb 12 22:26:26 2014 +0200

    fix cost_sort for partial sorts in initial_cost_mergejoin

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 0e60fd7..c289924 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2083,7 +2083,7 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
-				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  0,
 				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
@@ -2111,7 +2111,7 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
-				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  0,
 				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
#55Marti Raudsepp
marti@juffo.org
In reply to: Marti Raudsepp (#54)
Re: PoC: Partial sort

On Wed, Feb 12, 2014 at 11:54 PM, Marti Raudsepp <marti@juffo.org> wrote:

With partial-sort-basic-1 and this fix on the same test suite, the
planner overhead is now a more manageable 0.5% to 1.3%; one test is
faster by 0.5%.

Ping, Robert or anyone, does this overhead seem bearable or is that
still too much?

Do these numbers look conclusive enough or should I run more tests?

I think the 1st patch now has a bug in initial_cost_mergejoin; you
still pass the "presorted_keys" argument to cost_sort, making it
calculate a partial sort cost

Ping, Alexander?

Regards,
Marti

#56Alexander Korotkov
aekorotkov@gmail.com
In reply to: Marti Raudsepp (#54)
Re: PoC: Partial sort

On Thu, Feb 13, 2014 at 1:54 AM, Marti Raudsepp <marti@juffo.org> wrote:

I think the 1st patch now has a bug in initial_cost_mergejoin; you
still pass the "presorted_keys" argument to cost_sort, making it
calculate a partial sort cost, but generated plans never use partial
sort. I think 0 should be passed instead. Patch attached, needs to be
applied on top of partial-sort-basic-1 and then reverse-applied on
partial-sort-merge-1.

It doesn't look that way to me. Merge join doesn't specifically look for
partial sort. But if a path with some presorted pathkeys happens to be
selected, then partial sort will be used. See the create_mergejoin_plan
function. So, I think this cost_sort call is consistent with
create_mergejoin_plan. If we don't want partial sort to be used in such rare
cases, then we should revert it in both places. However, I doubt that it adds
any overhead, so we can leave it as is.

------
With best regards,
Alexander Korotkov.

#57Robert Haas
robertmhaas@gmail.com
In reply to: Marti Raudsepp (#55)
Re: PoC: Partial sort

On Wed, Feb 19, 2014 at 1:39 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Wed, Feb 12, 2014 at 11:54 PM, Marti Raudsepp <marti@juffo.org> wrote:

With partial-sort-basic-1 and this fix on the same test suite, the
planner overhead is now a more manageable 0.5% to 1.3%; one test is
faster by 0.5%.

Ping, Robert or anyone, does this overhead seem bearable or is that
still too much?

Do these numbers look conclusive enough or should I run more tests?

Tom should really be the one to comment on this, I think. I read
through the patch quickly and it looks much less scary than the early
versions, but it's not obvious to me whether the remaining overhead is
enough to worry about. I'd need to spend more time studying it to
form a really sound opinion on that topic, and unfortunately I don't
have that time right now.

I think it'd be interesting to try to determine specifically where
that overhead is coming from. Pick the test case where it's the worst
(1.3%) and do a "perf" with and without the patch and look at the
difference in the call graph. It's possible we could have changes on
that order of magnitude just from more or less fortuitous code layout
decisions as code shifts around, but it's also possible that there's a
real effect there we should think harder about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#58Peter Geoghegan
pg@heroku.com
In reply to: Alexander Korotkov (#53)
Re: PoC: Partial sort

On Mon, Feb 10, 2014 at 10:59 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Done. The patch is split.

I took a quick look at this.

Have you thought about making your new cmpSortSkipCols() function not
use real comparisons? Since in the circumstances in which this
optimization is expected to be effective (e.g. your original example)
we can also expect a relatively low cardinality for the first n
indexed attributes (again, as in your original example), in general
when cmpSortSkipCols() is called there is a high chance that it will
return true. If any pair of tuples (logically adjacent tuples fed in
to cmpSortSkipCols() by an index scan in logical order) are not fully
equal (i.e. their leading, indexed attributes are not equal) then we
don't care about the details -- we just know that a new sort grouping
is required.
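
For readers following along, the grouping being discussed can be sketched in a few lines of Python. This is an illustration only, not the patch's C code: the function name partial_sort and the use of plain tuples as rows are assumptions. Rows arrive already ordered by their first n_presorted columns, a change in that prefix closes the current group, and each group is sorted by the remaining columns before it is emitted.

```python
from itertools import islice

def partial_sort(rows, n_presorted, limit=None):
    """Sort rows that already arrive ordered by their first n_presorted columns.

    Hypothetical helper for illustration; rows are plain tuples.
    """
    def sorted_groups():
        group = []
        for row in rows:
            # A change in the presorted prefix closes the current group.
            if group and row[:n_presorted] != group[-1][:n_presorted]:
                yield from sorted(group, key=lambda r: r[n_presorted:])
                group = []
            group.append(row)
        # Flush the last group once the input is exhausted.
        yield from sorted(group, key=lambda r: r[n_presorted:])

    out = sorted_groups()
    return list(out) if limit is None else list(islice(out, limit))


rows = [(0, 0.9), (0, 0.1), (1, 0.5), (1, 0.2), (2, 0.7)]
print(partial_sort(rows, 1))                 # fully ordered by both columns
print(partial_sort(iter(rows), 1, limit=2))  # stops after the first group
```

With a limit, the generator stops pulling input as soon as enough rows have been emitted, which is why the approach pays off most for LIMIT queries.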

The idea here is that you can get away with simple binary equality
comparisons, as we do when considering HOT-safety. Of course, you
might find that two bitwise unequal values are equal according to
their ordinary B-Tree support function 1 comparator (e.g. two numerics
that differ only in their display scale). AFAICT this should be okay,
since that just means that you have smaller sort groupings than
strictly necessary. I'm not sure if it's worth more or less
duplicating heap_tuple_attr_equals() to save a "mere" n expensive
comparisons, but it's something to think about (actually, there are
probably fewer than n comparisons in practice because there'll be
a limit).

A similar idea appears in my SortSupport for text ("Poor man's
normalized key"/strxfrm()) patch. A poor man's key comparison didn't
work out, and there may be further differences that aren't captured in
the special simple key representation, so we need to do a "proper
comparison" to figure it out for sure. However, within the sortsupport
routine comparator, we know that we're being called in this context,
as a tie-breaker for a poor man's normalized key comparison that
returned 0, and so are optimistic about the two datums being fully
equal. An optimistic memcmp() is attempted before a strcoll() here if
the lengths also match.

I have not actually added special hints so that we're optimistic about
keys being equal in other places (places that have nothing to do with
the general idea of poor man's normalized keys), but that might not be
a bad idea. Actually, it might not be a bad idea to just always have
varstr_cmp() attempt a memcmp() first when two texts have equal
length, no matter how it's called.
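
The "optimistic memcmp() first" idea can be sketched as follows. This is a Python illustration rather than PostgreSQL code: full_compare is a hypothetical stand-in for an expensive comparator such as strcoll(), here comparing decimal text by numeric value, and the sketch assumes that byte-identical values always compare as equal. Values that differ bytewise still go through the full comparator, so the result is the same as calling it directly.

```python
from decimal import Decimal

def full_compare(a: bytes, b: bytes) -> int:
    """Stand-in for an expensive comparator (assumption): compare decimal text by value."""
    x, y = Decimal(a.decode()), Decimal(b.decode())
    return (x > y) - (x < y)

def optimistic_compare(a: bytes, b: bytes, stats: dict) -> int:
    """Try a cheap bytewise equality test before the expensive comparison."""
    if len(a) == len(b) and a == b:
        stats["fast"] += 1   # identical bytes: equal without the full comparator
        return 0
    stats["full"] += 1       # bytes differ: only the full comparator can decide
    return full_compare(a, b)

stats = {"fast": 0, "full": 0}
print(optimistic_compare(b"1.50", b"1.50", stats))  # identical bytes
print(optimistic_compare(b"1.0", b"1.00", stats))   # different bytes, same value
print(optimistic_compare(b"2", b"10", stats))       # different bytes, different value
print(stats)
```

The fast path only helps when equal inputs are common, which is the situation described above for low-cardinality leading columns.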

--
Peter Geoghegan

#59David Rowley
dgrowleyml@gmail.com
In reply to: Alexander Korotkov (#53)
1 attachment(s)
Re: PoC: Partial sort

On Tue, Feb 11, 2014 at 7:59 AM, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

Done. The patch is split.

I've started to look at this, and for now I'm still finding my way around
the patch, so I'm not quite there yet with understanding everything.
Nevertheless, it seems best to post my comments early, so as to help
maintain concurrency between the review and getting the patch into shape.

I've only been looking at partial-sort-basic-1.patch so far.

The patch no longer applies to master, but this was only due to a tab being
replaced by 2 spaces in a pgindent run. I've attached an updated patch which
currently applies without any issues.

Here's a few notes from reading over the code:

* pathkeys.c

EquivalenceMember *member = (EquivalenceMember *)
lfirst(list_head(key->pk_eclass->ec_members));

You can use linitial() instead of lfirst(list_head()). The same thing
occurs in costsize.c

* pathkeys.c

The following fragment:

	n = pathkeys_common(root->query_pathkeys, pathkeys);

	if (n != 0)
	{
		/* It's useful ... or at least the first N keys are */
		return n;
	}

	return 0; /* path ordering not useful */
}

Could just read:

/* return the number of path keys in common, or 0 if there are none */
return pathkeys_common(root->query_pathkeys, pathkeys);

* execnodes.h

In struct SortState, some new fields don't have a comment.

I've also thrown a few different workloads at the patch and I'm very
impressed with most of the results, especially when LIMIT is used. However,
I've found a regression case which I thought I should highlight, though for
now I can't quite see what could be done to fix it.

create table a (x int not null, y int not null);
insert into a select x.x,y.y from generate_series(1,1000000) x(x) cross join generate_series(1,10) y(y);

Patched:
explain analyze select x,y from a where x+0=1 order by x,y limit 10;
                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=92.42..163.21 rows=10 width=8) (actual time=6239.426..6239.429 rows=10 loops=1)
   ->  Partial sort  (cost=92.42..354064.37 rows=50000 width=8) (actual time=6239.406..6239.407 rows=10 loops=1)
         Sort Key: x, y
         Presorted Key: x
         Sort Method: quicksort  Memory: 25kB
         ->  Index Scan using a_x_idx on a  (cost=0.44..353939.13 rows=50000 width=8) (actual time=0.059..6239.319 rows=10 loops=1)
               Filter: ((x + 0) = 1)
               Rows Removed by Filter: 9999990
 Planning time: 0.212 ms
 Execution time: 6239.505 ms
(10 rows)

Time: 6241.220 ms

Unpatched:
explain analyze select x,y from a where x+0=1 order by x,y limit 10;
                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=195328.26..195328.28 rows=10 width=8) (actual time=3077.759..3077.761 rows=10 loops=1)
   ->  Sort  (cost=195328.26..195453.26 rows=50000 width=8) (actual time=3077.757..3077.757 rows=10 loops=1)
         Sort Key: x, y
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on a  (cost=0.00..194247.77 rows=50000 width=8) (actual time=0.018..3077.705 rows=10 loops=1)
               Filter: ((x + 0) = 1)
               Rows Removed by Filter: 9999990
 Planning time: 0.510 ms
 Execution time: 3077.837 ms
(9 rows)

Time: 3080.201 ms

As you can see, the patched version performs an index scan in order to get
the partially sorted results, but it does end up quite a bit slower than
the seqscan/sort that the unpatched master performs. I'm not quite sure how
realistic the x+0 = 1 WHERE clause is, but perhaps the same would happen if
something like x+y = 1 was performed too. After a bit more analysis on
this, I see that if I change the 50k estimate to 10 in the debugger, then
num_groups is properly estimated at 1 and it performs the seq scan
instead. So it looks like the costings of the patch are not to blame here.
(The 50k row estimate comes from rel tuples / DEFAULT_NUM_DISTINCT.)
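
The arithmetic behind that 50k figure can be written out as a short Python sketch. It assumes DEFAULT_NUM_DISTINCT is 200, the value defined in PostgreSQL's selfuncs.h, and takes the table size from the insert statement above (1,000,000 x values times 10 y values).

```python
# Row estimate for "x+0 = 1" when the planner has no statistics for the expression.
DEFAULT_NUM_DISTINCT = 200        # assumption: PostgreSQL's fallback distinct count

rel_tuples = 1_000_000 * 10       # rows produced by the cross join in the insert
estimated_rows = rel_tuples // DEFAULT_NUM_DISTINCT

print(rel_tuples)       # total rows in table a
print(estimated_rows)   # planner's row estimate for the filter
```

The result matches the rows=50000 shown in both plans, while the filter actually returns only 10 rows.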

That's all I have at the moment... More to follow soon.

Regards

David Rowley

Attachments:

partial-sort-basic-1_rebased.patch (application/octet-stream)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..6b19e7e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -81,7 +81,7 @@ static void show_agg_keys(AggState *astate, List *ancestors,
 static void show_group_keys(GroupState *gstate, List *ancestors,
 				ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
@@ -940,7 +940,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			pname = sname = "Materialize";
 			break;
 		case T_Sort:
-			pname = sname = "Sort";
+			if (((Sort *) plan)->skipCols > 0)
+				pname = sname = "Partial sort";
+			else
+				pname = sname = "Sort";
 			break;
 		case T_Group:
 			pname = sname = "Group";
@@ -1751,7 +1754,7 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
 	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
 
 	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, plan->skipCols, plan->sortColIdx,
 						 ancestors, es);
 }
 
@@ -1765,7 +1768,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
 	show_sort_group_keys((PlanState *) mstate, "Sort Key",
-						 plan->numCols, plan->sortColIdx,
+						 plan->numCols, 0, plan->sortColIdx,
 						 ancestors, es);
 }
 
@@ -1783,7 +1786,7 @@ show_agg_keys(AggState *astate, List *ancestors,
 		/* The key columns refer to the tlist of the child plan */
 		ancestors = lcons(astate, ancestors);
 		show_sort_group_keys(outerPlanState(astate), "Group Key",
-							 plan->numCols, plan->grpColIdx,
+							 plan->numCols, 0, plan->grpColIdx,
 							 ancestors, es);
 		ancestors = list_delete_first(ancestors);
 	}
@@ -1801,7 +1804,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
 	/* The key columns refer to the tlist of the child plan */
 	ancestors = lcons(gstate, ancestors);
 	show_sort_group_keys(outerPlanState(gstate), "Group Key",
-						 plan->numCols, plan->grpColIdx,
+						 plan->numCols, 0, plan->grpColIdx,
 						 ancestors, es);
 	ancestors = list_delete_first(ancestors);
 }
@@ -1811,13 +1814,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  * as arrays of targetlist indexes
  */
 static void
-show_sort_group_keys(PlanState *planstate, const char *qlabel,
-					 int nkeys, AttrNumber *keycols,
+show_sort_group_keys(PlanState *planstate,  const char *qlabel,
+					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
 					 List *ancestors, ExplainState *es)
 {
 	Plan	   *plan = planstate->plan;
 	List	   *context;
-	List	   *result = NIL;
+	List	   *resultSort = NIL;
+	List	   *resultPresorted = NIL;
 	bool		useprefix;
 	int			keyno;
 	char	   *exprstr;
@@ -1844,10 +1848,15 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 		/* Deparse the expression, showing any top-level cast */
 		exprstr = deparse_expression((Node *) target->expr, context,
 									 useprefix, true);
-		result = lappend(result, exprstr);
+
+		if (keyno < nPresortedKeys)
+			resultPresorted = lappend(resultPresorted, exprstr);
+		resultSort = lappend(resultSort, exprstr);
 	}
 
-	ExplainPropertyList(qlabel, result, es);
+	ExplainPropertyList(qlabel, resultSort, es);
+	if (nPresortedKeys > 0)
+		ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 640964c..a8e69d2 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -379,7 +379,7 @@ ExecRestrPos(PlanState *node)
  * and valuesscan support is actually useless code at present.)
  */
 bool
-ExecSupportsMarkRestore(NodeTag plantype)
+ExecSupportsMarkRestore(NodeTag plantype, Plan *node)
 {
 	switch (plantype)
 	{
@@ -389,9 +389,15 @@ ExecSupportsMarkRestore(NodeTag plantype)
 		case T_TidScan:
 		case T_ValuesScan:
 		case T_Material:
-		case T_Sort:
 			return true;
 
+		case T_Sort:
+			/* With skipCols sort node holds only last bucket */
+			if (node && ((Sort *)node)->skipCols == 0)
+				return true;
+			else
+				return false;
+
 		case T_Result:
 
 			/*
@@ -466,10 +472,16 @@ ExecSupportsBackwardScan(Plan *node)
 				TargetListSupportsBackwardScan(node->targetlist);
 
 		case T_Material:
-		case T_Sort:
 			/* these don't evaluate tlist */
 			return true;
 
+		case T_Sort:
+			/* With skipCols sort node holds only last bucket */
+			if (((Sort *)node)->skipCols == 0)
+				return true;
+			else
+				return false;
+
 		case T_LockRows:
 		case T_Limit:
 			/* these don't evaluate tlist */
@@ -535,7 +547,7 @@ IndexSupportsBackwardScan(Oid indexid)
  * very low per-tuple cost.
  */
 bool
-ExecMaterializesOutput(NodeTag plantype)
+ExecMaterializesOutput(NodeTag plantype, Plan *node)
 {
 	switch (plantype)
 	{
@@ -543,9 +555,15 @@ ExecMaterializesOutput(NodeTag plantype)
 		case T_FunctionScan:
 		case T_CteScan:
 		case T_WorkTableScan:
-		case T_Sort:
 			return true;
 
+		case T_Sort:
+			/* With skipCols sort node holds only last bucket */
+			if (node && ((Sort *)node)->skipCols == 0)
+				return true;
+			else
+				return false;
+
 		default:
 			break;
 	}
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 47ed068..c51a144 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -126,19 +126,11 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	 * initialize sort-key information
 	 */
 	mergestate->ms_nkeys = node->numCols;
-	mergestate->ms_sortkeys = palloc0(sizeof(SortSupportData) * node->numCols);
-
-	for (i = 0; i < node->numCols; i++)
-	{
-		SortSupport sortKey = mergestate->ms_sortkeys + i;
-
-		sortKey->ssup_cxt = CurrentMemoryContext;
-		sortKey->ssup_collation = node->collations[i];
-		sortKey->ssup_nulls_first = node->nullsFirst[i];
-		sortKey->ssup_attno = node->sortColIdx[i];
-
-		PrepareSortSupportFromOrderingOp(node->sortOperators[i], sortKey);
-	}
+	mergestate->ms_sortkeys = MakeSortSupportKeys(mergestate->ms_nkeys,
+												  node->sortColIdx,
+												  node->sortOperators,
+												  node->collations,
+												  node->nullsFirst);
 
 	/*
 	 * initialize to show we have not run the subplans yet
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b88571b..f38190d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -15,11 +15,37 @@
 
 #include "postgres.h"
 
+#include "access/htup_details.h"
 #include "executor/execdebug.h"
 #include "executor/nodeSort.h"
 #include "miscadmin.h"
 #include "utils/tuplesort.h"
 
+/*
+ * Check if first "skipCols" sort values are equal.
+ */
+static bool
+cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+{
+	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+
+	for (i = 0; i < n; i++)
+	{
+		Datum datumA, datumB;
+		bool isnullA, isnullB;
+		AttrNumber attno = node->skipKeys[i].ssup_attno;
+
+		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+		datumB = slot_getattr(b, attno, &isnullB);
+
+		if (ApplySortComparator(datumA, isnullA,
+								datumB, isnullB,
+								&node->skipKeys[i]))
+			return false;
+	}
+	return true;
+}
+
 
 /* ----------------------------------------------------------------
  *		ExecSort
@@ -42,6 +68,11 @@ ExecSort(SortState *node)
 	ScanDirection dir;
 	Tuplesortstate *tuplesortstate;
 	TupleTableSlot *slot;
+	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
+	int			skipCols = plannode->skipCols;
+	int64		nTuples = 0;
 
 	/*
 	 * get state info from node
@@ -54,79 +85,148 @@ ExecSort(SortState *node)
 	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
 
 	/*
+	 * Return next tuple from sorted set if any.
+	 */
+	if (node->sort_Done)
+	{
+		slot = node->ss.ps.ps_ResultTupleSlot;
+		if (tuplesort_gettupleslot(tuplesortstate,
+									  ScanDirectionIsForward(dir),
+									  slot) || node->finished)
+			return slot;
+	}
+
+	/*
 	 * If first time through, read all tuples from outer plan and pass them to
 	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
 	 */
 
-	if (!node->sort_Done)
-	{
-		Sort	   *plannode = (Sort *) node->ss.ps.plan;
-		PlanState  *outerNode;
-		TupleDesc	tupDesc;
-
-		SO1_printf("ExecSort: %s\n",
-				   "sorting subplan");
+	SO1_printf("ExecSort: %s\n",
+			   "sorting subplan");
 
-		/*
-		 * Want to scan subplan in the forward direction while creating the
-		 * sorted data.
-		 */
-		estate->es_direction = ForwardScanDirection;
+	/*
+	 * Want to scan subplan in the forward direction while creating the
+	 * sorted data.
+	 */
+	estate->es_direction = ForwardScanDirection;
 
-		/*
-		 * Initialize tuplesort module.
-		 */
-		SO1_printf("ExecSort: %s\n",
-				   "calling tuplesort_begin");
+	/*
+	 * Initialize tuplesort module.
+	 */
+	SO1_printf("ExecSort: %s\n",
+			   "calling tuplesort_begin");
 
-		outerNode = outerPlanState(node);
-		tupDesc = ExecGetResultType(outerNode);
+	outerNode = outerPlanState(node);
+	tupDesc = ExecGetResultType(outerNode);
 
+	if (node->tuplesortstate != NULL)
+		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+	else
+	{
+		/* Support structures for cmpSortSkipCols - already sorted columns */
+		if (skipCols)
+			node->skipKeys = MakeSortSupportKeys(skipCols,
+												 plannode->sortColIdx,
+												 plannode->sortOperators,
+												 plannode->collations,
+												 plannode->nullsFirst);
+
+		/* Only pass on remaining columns that are unsorted */
 		tuplesortstate = tuplesort_begin_heap(tupDesc,
-											  plannode->numCols,
-											  plannode->sortColIdx,
-											  plannode->sortOperators,
-											  plannode->collations,
-											  plannode->nullsFirst,
+											  plannode->numCols - skipCols,
+											  &(plannode->sortColIdx[skipCols]),
+											  &(plannode->sortOperators[skipCols]),
+											  &(plannode->collations[skipCols]),
+											  &(plannode->nullsFirst[skipCols]),
 											  work_mem,
 											  node->randomAccess);
-		if (node->bounded)
-			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
+	}
 
-		/*
-		 * Scan the subplan and feed all the tuples to tuplesort.
-		 */
+	if (node->bounded)
+		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
 
-		for (;;)
-		{
-			slot = ExecProcNode(outerNode);
+	/*
+	 * Put next group of tuples where skipCols" sort values are equal to
+	 * tuplesort.
+	 */
+	for (;;)
+	{
+		slot = ExecProcNode(outerNode);
 
+		if (skipCols == 0)
+		{
 			if (TupIsNull(slot))
+			{
+				node->finished = true;
 				break;
-
+			}
 			tuplesort_puttupleslot(tuplesortstate, slot);
+			nTuples++;
 		}
+		else if (node->prev)
+		{
+			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
+			nTuples++;
 
-		/*
-		 * Complete the sort.
-		 */
-		tuplesort_performsort(tuplesortstate);
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+			else
+			{
+				bool cmp;
+				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
+				node->prev = ExecCopySlotTuple(slot);
+				if (!cmp)
+					break;
+			}
+		}
+		else
+		{
+			if (TupIsNull(slot))
+			{
+				node->finished = true;
+				break;
+			}
+			else
+			{
+				node->prev = ExecCopySlotTuple(slot);
+			}
+		}
+	}
 
-		/*
-		 * restore to user specified direction
-		 */
-		estate->es_direction = dir;
+	/*
+	 * Complete the sort.
+	 */
+	tuplesort_performsort(tuplesortstate);
 
-		/*
-		 * finally set the sorted flag to true
-		 */
-		node->sort_Done = true;
-		node->bounded_Done = node->bounded;
-		node->bound_Done = node->bound;
-		SO1_printf("ExecSort: %s\n", "sorting done");
+	/*
+	 * restore to user specified direction
+	 */
+	estate->es_direction = dir;
+
+	/*
+	 * finally set the sorted flag to true
+	 */
+	node->sort_Done = true;
+	node->bounded_Done = node->bounded;
+
+	/*
+	 * Adjust bound_Done with number of tuples we've actually sorted.
+	 */
+	if (node->bounded)
+	{
+		if (node->finished)
+			node->bound_Done = node->bound;
+		else
+			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
 	}
 
+	SO1_printf("ExecSort: %s\n", "sorting done");
+
 	SO1_printf("ExecSort: %s\n",
 			   "retrieving tuple from tuplesort");
 
@@ -157,6 +257,15 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 			   "initializing sort node");
 
 	/*
+	 * skipCols can't be used with any of EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+	 * or EXEC_FLAG_MARK, because we hold only the current group of tuples in
+	 * tuplesortstate.
+	 */
+	Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+											 EXEC_FLAG_BACKWARD |
+											 EXEC_FLAG_MARK)) == 0);
+
+	/*
 	 * create state structure
 	 */
 	sortstate = makeNode(SortState);
@@ -174,7 +283,10 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 
 	sortstate->bounded = false;
 	sortstate->sort_Done = false;
+	sortstate->finished = false;
 	sortstate->tuplesortstate = NULL;
+	sortstate->prev = NULL;
+	sortstate->bound_Done = 0;
 
 	/*
 	 * Miscellaneous initialization
@@ -316,6 +428,7 @@ ExecReScanSort(SortState *node)
 		node->sort_Done = false;
 		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
 		node->tuplesortstate = NULL;
+		node->bound_Done = 0;
 
 		/*
 		 * if chgParam of subnode is not null then plan will be re-scanned by
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 3088578..43f7089 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -735,6 +735,7 @@ _copySort(const Sort *from)
 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
 
 	COPY_SCALAR_FIELD(numCols);
+	COPY_SCALAR_FIELD(skipCols);
 	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
 	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
 	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 0cdb790..314d3ab 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1235,15 +1235,22 @@ cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm)
  */
 void
 cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples)
 {
-	Cost		startup_cost = input_cost;
-	Cost		run_cost = 0;
+	Cost		startup_cost = input_startup_cost;
+	Cost		run_cost = 0,
+				rest_cost,
+				group_cost,
+				input_run_cost = input_total_cost - input_startup_cost;
 	double		input_bytes = relation_byte_size(tuples, width);
 	double		output_bytes;
 	double		output_tuples;
+	double		num_groups,
+				group_input_bytes,
+				group_tuples;
 	long		sort_mem_bytes = sort_mem * 1024L;
 
 	if (!enable_sort)
@@ -1273,13 +1280,47 @@ cost_sort(Path *path, PlannerInfo *root,
 		output_bytes = input_bytes;
 	}
 
-	if (output_bytes > sort_mem_bytes)
+	/*
+	 * Estimate how many groups the presorted keys divide the dataset into.
+	 */
+	if (presorted_keys > 0)
+	{
+		List *groupExprs = NIL;
+		ListCell *l;
+		int i = 0;
+
+		foreach(l, pathkeys)
+		{
+			PathKey *key = (PathKey *)lfirst(l);
+			EquivalenceMember *member = (EquivalenceMember *)
+								lfirst(list_head(key->pk_eclass->ec_members));
+
+			groupExprs = lappend(groupExprs, member->em_expr);
+
+			i++;
+			if (i >= presorted_keys)
+				break;
+		}
+
+		num_groups = estimate_num_groups(root, groupExprs, tuples);
+	}
+	else
+	{
+		num_groups = 1.0;
+	}
+
+	/*
+	 * Estimate the average cost of sorting one group.
+	 */
+	group_input_bytes = input_bytes / num_groups;
+	group_tuples = tuples / num_groups;
+	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
-		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
+		double		npages = ceil(group_input_bytes / BLCKSZ);
+		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1289,7 +1330,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 *
 		 * Assume about N log2 N comparisons
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 
 		/* Disk costs */
 
@@ -1300,10 +1341,10 @@ cost_sort(Path *path, PlannerInfo *root,
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
 		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
-		startup_cost += npageaccesses *
+		group_cost += npageaccesses *
 			(seq_page_cost * 0.75 + random_page_cost * 0.25);
 	}
-	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
 	{
 		/*
 		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
@@ -1311,15 +1352,26 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * factor is a bit higher than for quicksort.  Tweak it so that the
 		 * cost curve is continuous at the crossover point.
 		 */
-		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
 	}
 	else
 	{
 		/* We'll use plain quicksort on all the input tuples */
-		startup_cost += comparison_cost * tuples * LOG2(tuples);
+		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
 	}
 
 	/*
+	 * We have to sort the first group before the node can start returning
+	 * tuples.  Sorting the remaining groups is needed to return all of them.
+	 */
+	startup_cost += group_cost;
+	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+	if (rest_cost > 0.0)
+		run_cost += rest_cost;
+	startup_cost += input_run_cost / num_groups;
+	run_cost += input_run_cost * ((num_groups - 1.0) / num_groups);
+
+	/*
 	 * Also charge a small amount (arbitrarily set equal to operator cost) per
 	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
 	 * doesn't do qual-checking or projection, so it has less overhead than
@@ -2029,6 +2081,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  outersortkeys,
+				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+				  outer_path->startup_cost,
 				  outer_path->total_cost,
 				  outer_path_rows,
 				  outer_path->parent->width,
@@ -2055,6 +2109,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 		cost_sort(&sort_path,
 				  root,
 				  innersortkeys,
+				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+				  inner_path->startup_cost,
 				  inner_path->total_cost,
 				  inner_path_rows,
 				  inner_path->parent->width,
@@ -2266,7 +2322,7 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
 	 * it off does not entitle us to deliver an invalid plan.
 	 */
 	else if (innersortkeys == NIL &&
-			 !ExecSupportsMarkRestore(inner_path->pathtype))
+			 !ExecSupportsMarkRestore(inner_path->pathtype, NULL))
 		path->materialize_inner = true;
 
 	/*
@@ -2780,7 +2836,7 @@ cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
 		 * every time.
 		 */
 		if (subplan->parParam == NIL &&
-			ExecMaterializesOutput(nodeTag(plan)))
+			ExecMaterializesOutput(nodeTag(plan), plan))
 			sp_cost.startup += plan->startup_cost;
 		else
 			sp_cost.per_tuple += plan->startup_cost;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index be54f3d..7bbad4f 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -820,7 +820,7 @@ match_unsorted_outer(PlannerInfo *root,
 		 * output anyway.
 		 */
 		if (enable_material && inner_cheapest_total != NULL &&
-			!ExecMaterializesOutput(inner_cheapest_total->pathtype))
+			!ExecMaterializesOutput(inner_cheapest_total->pathtype, NULL))
 			matpath = (Path *)
 				create_material_path(innerrel, inner_cheapest_total);
 	}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 5d953df..0a9d6f7 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -21,11 +21,13 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
+#include "optimizer/cost.h"
 #include "optimizer/clauses.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 
 
 static PathKey *make_canonical_pathkey(PlannerInfo *root,
@@ -312,6 +314,32 @@ compare_pathkeys(List *keys1, List *keys2)
 }
 
 /*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+	int n;
+	ListCell   *key1,
+			   *key2;
+	n = 0;
+
+	forboth(key1, keys1, key2, keys2)
+	{
+		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+		if (pathkey1 != pathkey2)
+			return n;
+		n++;
+	}
+
+	return n;
+}
+
+
+/*
  * pathkeys_contained_in
  *	  Common special case of compare_pathkeys: we just want to know
  *	  if keys2 are at least as well sorted as keys1.
@@ -369,9 +397,36 @@ get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 }
 
 /*
+ * Compare the costs of two paths, assuming a different fraction of tuples is
+ * returned from each path.
+ */
+static int
+compare_bifractional_path_costs(Path *path1, Path *path2,
+							  double fraction1, double fraction2)
+{
+	Cost		cost1,
+				cost2;
+
+	if (fraction1 <= 0.0 || fraction1 >= 1.0 ||
+			fraction2 <= 0.0 || fraction2 >= 1.0)
+		return compare_path_costs(path1, path2, TOTAL_COST);
+	cost1 = path1->startup_cost +
+		fraction1 * (path1->total_cost - path1->startup_cost);
+	cost2 = path2->startup_cost +
+		fraction2 * (path2->total_cost - path2->startup_cost);
+	if (cost1 < cost2)
+		return -1;
+	if (cost1 > cost2)
+		return +1;
+	return 0;
+}
+
+/*
  * get_cheapest_fractional_path_for_pathkeys
  *	  Find the cheapest path (for retrieving a specified fraction of all
- *	  the tuples) that satisfies the given pathkeys and parameterization.
+ *	  the tuples) that satisfies the given parameterization and at least
+ *	  partially satisfies the given pathkeys.  Paths are compared using the
+ *	  fraction of tuples each must fetch before a partial sort can start.
  *	  Return NULL if no such path.
  *
  * See compare_fractional_path_costs() for the interpretation of the fraction
@@ -386,26 +441,84 @@ Path *
 get_cheapest_fractional_path_for_pathkeys(List *paths,
 										  List *pathkeys,
 										  Relids required_outer,
-										  double fraction)
+										  double fraction,
+										  PlannerInfo *root,
+										  double tuples)
 {
 	Path	   *matched_path = NULL;
+	int			matched_n_common_pathkeys = 0,
+				costs_cmp, n_common_pathkeys,
+				n_pathkeys = list_length(pathkeys);
 	ListCell   *l;
+	List	   *groupExprs = NIL;
+	double	   *num_groups, matched_fraction;
+	int			i;
+
+	/*
+	 * Get number of groups for each possible partial sort.
+	 */
+	i = 0;
+	num_groups = (double *)palloc(sizeof(double) * list_length(pathkeys));
+	foreach(l, pathkeys)
+	{
+		PathKey *key = (PathKey *)lfirst(l);
+		EquivalenceMember *member = (EquivalenceMember *)
+							lfirst(list_head(key->pk_eclass->ec_members));
+
+		groupExprs = lappend(groupExprs, member->em_expr);
+
+		num_groups[i] = estimate_num_groups(root, groupExprs, tuples);
+		i++;
+	}
+
 
 	foreach(l, paths)
 	{
 		Path	   *path = (Path *) lfirst(l);
+		double		current_fraction;
+
+		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+		if (n_common_pathkeys < matched_n_common_pathkeys ||
+				n_common_pathkeys == 0)
+			continue;
 
 		/*
-		 * Since cost comparison is a lot cheaper than pathkey comparison, do
-		 * that first.  (XXX is that still true?)
+		 * Estimate the fraction of outer tuples that must be fetched before
+		 * the partial sort can start returning tuples.
 		 */
-		if (matched_path != NULL &&
-			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
-			continue;
+		current_fraction = fraction;
+		if (n_common_pathkeys < n_pathkeys)
+		{
+			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
+			current_fraction = Min(current_fraction, 1.0);
+		}
 
-		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
+		/*
+		 * Do cost comparison.
+		 */
+		if (matched_path != NULL)
+		{
+			costs_cmp = compare_bifractional_path_costs(matched_path, path,
+					matched_fraction, current_fraction);
+		}
+		else
+		{
+			costs_cmp = 1;
+		}
+
+		/*
+		 * Always prefer the path with more common pathkeys.
+		 */
+		if ((
+				n_common_pathkeys > matched_n_common_pathkeys
+				||	(n_common_pathkeys == matched_n_common_pathkeys
+					 && costs_cmp > 0)) &&
 			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+		{
 			matched_path = path;
+			matched_n_common_pathkeys = n_common_pathkeys;
+			matched_fraction = current_fraction;
+		}
 	}
 	return matched_path;
 }
@@ -1450,23 +1563,26 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
  *		Count the number of pathkeys that are useful for meeting the
  *		query's requested output ordering.
  *
- * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
- * no good to order by just the first key(s) of the requested ordering.
- * So the result is always either 0 or list_length(root->query_pathkeys).
+ * Returns the number of leading query pathkeys that the given pathkeys match.
+ * The remaining ones can be satisfied by a partial sort.
  */
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+	int n;
+
 	if (root->query_pathkeys == NIL)
 		return 0;				/* no special ordering requested */
 
 	if (pathkeys == NIL)
 		return 0;				/* unordered path */
 
-	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
+	n = pathkeys_common(root->query_pathkeys, pathkeys);
+
+	if (n != 0)
 	{
 		/* It's useful ... or at least the first N keys are */
-		return list_length(root->query_pathkeys);
+		return n;
 	}
 
 	return 0;					/* path ordering not useful */
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 4b641a2..129ea40 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -149,6 +149,7 @@ static MergeJoin *make_mergejoin(List *tlist,
 			   Plan *lefttree, Plan *righttree,
 			   JoinType jointype);
 static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+		  List *pathkeys, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst,
 		  double limit_tuples);
@@ -774,6 +775,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 		Oid		   *sortOperators;
 		Oid		   *collations;
 		bool	   *nullsFirst;
+		int			n_common_pathkeys;
 
 		/* Build the child plan */
 		subplan = create_plan_recurse(root, subpath);
@@ -807,8 +809,10 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path)
 					  numsortkeys * sizeof(bool)) == 0);
 
 		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
-		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
+		if (n_common_pathkeys < list_length(pathkeys))
 			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
+										 pathkeys, n_common_pathkeys,
 										 sortColIdx, sortOperators,
 										 collations, nullsFirst,
 										 best_path->limit_tuples);
@@ -2181,9 +2185,11 @@ create_mergejoin_plan(PlannerInfo *root,
 		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
 		outer_plan = (Plan *)
 			make_sort_from_pathkeys(root,
-									outer_plan,
-									best_path->outersortkeys,
-									-1.0);
+								outer_plan,
+								best_path->outersortkeys,
+								-1.0,
+								pathkeys_common(best_path->outersortkeys,
+									best_path->jpath.outerjoinpath->pathkeys));
 		outerpathkeys = best_path->outersortkeys;
 	}
 	else
@@ -2194,9 +2200,11 @@ create_mergejoin_plan(PlannerInfo *root,
 		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
 		inner_plan = (Plan *)
 			make_sort_from_pathkeys(root,
-									inner_plan,
-									best_path->innersortkeys,
-									-1.0);
+								inner_plan,
+								best_path->innersortkeys,
+								-1.0,
+								pathkeys_common(best_path->innersortkeys,
+									best_path->jpath.innerjoinpath->pathkeys));
 		innerpathkeys = best_path->innersortkeys;
 	}
 	else
@@ -3736,6 +3744,7 @@ make_mergejoin(List *tlist,
  */
 static Sort *
 make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+		  List *pathkeys, int skipCols,
 		  AttrNumber *sortColIdx, Oid *sortOperators,
 		  Oid *collations, bool *nullsFirst,
 		  double limit_tuples)
@@ -3745,7 +3754,8 @@ make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
 	Path		sort_path;		/* dummy for result of cost_sort */
 
 	copy_plan_costsize(plan, lefttree); /* only care about copying size */
-	cost_sort(&sort_path, root, NIL,
+	cost_sort(&sort_path, root, pathkeys, skipCols,
+			  lefttree->startup_cost,
 			  lefttree->total_cost,
 			  lefttree->plan_rows,
 			  lefttree->plan_width,
@@ -3759,6 +3769,7 @@ make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
 	plan->lefttree = lefttree;
 	plan->righttree = NULL;
 	node->numCols = numCols;
+	node->skipCols = skipCols;
 	node->sortColIdx = sortColIdx;
 	node->sortOperators = sortOperators;
 	node->collations = collations;
@@ -4087,7 +4098,7 @@ find_ec_member_for_tle(EquivalenceClass *ec,
  */
 Sort *
 make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
-						double limit_tuples)
+						double limit_tuples, int skipCols)
 {
 	int			numsortkeys;
 	AttrNumber *sortColIdx;
@@ -4107,7 +4118,7 @@ make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
 										  &nullsFirst);
 
 	/* Now build the Sort node */
-	return make_sort(root, lefttree, numsortkeys,
+	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
 					 sortColIdx, sortOperators, collations,
 					 nullsFirst, limit_tuples);
 }
@@ -4150,7 +4161,7 @@ make_sort_from_sortclauses(PlannerInfo *root, List *sortcls, Plan *lefttree)
 		numsortkeys++;
 	}
 
-	return make_sort(root, lefttree, numsortkeys,
+	return make_sort(root, lefttree, numsortkeys, NIL, 0,
 					 sortColIdx, sortOperators, collations,
 					 nullsFirst, -1.0);
 }
@@ -4172,7 +4183,8 @@ Sort *
 make_sort_from_groupcols(PlannerInfo *root,
 						 List *groupcls,
 						 AttrNumber *grpColIdx,
-						 Plan *lefttree)
+						 Plan *lefttree,
+						 List *pathkeys, int skipCols)
 {
 	List	   *sub_tlist = lefttree->targetlist;
 	ListCell   *l;
@@ -4205,7 +4217,7 @@ make_sort_from_groupcols(PlannerInfo *root,
 		numsortkeys++;
 	}
 
-	return make_sort(root, lefttree, numsortkeys,
+	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
 					 sortColIdx, sortOperators, collations,
 					 nullsFirst, -1.0);
 }
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index 94ca92d..7eea24e 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -494,7 +494,9 @@ build_minmax_path(PlannerInfo *root, MinMaxAggInfo *mminfo,
 		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
 												  subroot->query_pathkeys,
 												  NULL,
-												  path_fraction);
+												  path_fraction,
+												  subroot,
+												  final_rel->rows);
 	if (!sorted_path)
 		return false;
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e1480cd..58abc43 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1394,7 +1394,9 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
 													  root->query_pathkeys,
 													  NULL,
-													  tuple_fraction);
+													  tuple_fraction,
+													  root,
+													  path_rows);
 
 		/* Don't consider same path in both guises; just wastes effort */
 		if (sorted_path == cheapest_path)
@@ -1410,10 +1412,14 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 		if (sorted_path)
 		{
 			Path		sort_path;		/* dummy for result of cost_sort */
+			Path		partial_sort_path;	/* dummy for result of cost_sort */
+			int			n_common_pathkeys;
+
+			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
+												cheapest_path->pathkeys);
 
 			if (root->query_pathkeys == NIL ||
-				pathkeys_contained_in(root->query_pathkeys,
-									  cheapest_path->pathkeys))
+					n_common_pathkeys == list_length(root->query_pathkeys))
 			{
 				/* No sort needed for cheapest path */
 				sort_path.startup_cost = cheapest_path->startup_cost;
@@ -1423,12 +1429,35 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 			{
 				/* Figure cost for sorting */
 				cost_sort(&sort_path, root, root->query_pathkeys,
+						  n_common_pathkeys,
+						  cheapest_path->startup_cost,
 						  cheapest_path->total_cost,
 						  path_rows, path_width,
 						  0.0, work_mem, root->limit_tuples);
 			}
 
-			if (compare_fractional_path_costs(sorted_path, &sort_path,
+			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
+												sorted_path->pathkeys);
+
+			if (root->query_pathkeys == NIL ||
+					n_common_pathkeys == list_length(root->query_pathkeys))
+			{
+				/* No sort needed for presorted path */
+				partial_sort_path.startup_cost = sorted_path->startup_cost;
+				partial_sort_path.total_cost = sorted_path->total_cost;
+			}
+			else
+			{
+				/* Figure cost for sorting */
+				cost_sort(&partial_sort_path, root, root->query_pathkeys,
+						  n_common_pathkeys,
+						  sorted_path->startup_cost,
+						  sorted_path->total_cost,
+						  path_rows, path_width,
+						  0.0, work_mem, root->limit_tuples);
+			}
+
+			if (compare_fractional_path_costs(&partial_sort_path, &sort_path,
 											  tuple_fraction) > 0)
 			{
 				/* Presorted path is a loser */
@@ -1509,13 +1538,16 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 			 * results.
 			 */
 			bool		need_sort_for_grouping = false;
+			int			n_common_pathkeys_grouping;
 
 			result_plan = create_plan(root, best_path);
 			current_pathkeys = best_path->pathkeys;
 
 			/* Detect if we'll need an explicit sort for grouping */
+			n_common_pathkeys_grouping = pathkeys_common(root->group_pathkeys,
+														 current_pathkeys);
 			if (parse->groupClause && !use_hashed_grouping &&
-			  !pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
+				n_common_pathkeys_grouping < list_length(root->group_pathkeys))
 			{
 				need_sort_for_grouping = true;
 
@@ -1609,7 +1641,9 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 							make_sort_from_groupcols(root,
 													 parse->groupClause,
 													 groupColIdx,
-													 result_plan);
+													 result_plan,
+													 root->group_pathkeys,
+													n_common_pathkeys_grouping);
 						current_pathkeys = root->group_pathkeys;
 					}
 					aggstrategy = AGG_SORTED;
@@ -1652,7 +1686,9 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 						make_sort_from_groupcols(root,
 												 parse->groupClause,
 												 groupColIdx,
-												 result_plan);
+												 result_plan,
+												 root->group_pathkeys,
+												 n_common_pathkeys_grouping);
 					current_pathkeys = root->group_pathkeys;
 				}
 
@@ -1769,13 +1805,17 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 				if (window_pathkeys)
 				{
 					Sort	   *sort_plan;
+					int			n_common_pathkeys;
+
+					n_common_pathkeys = pathkeys_common(window_pathkeys,
+													    current_pathkeys);
 
 					sort_plan = make_sort_from_pathkeys(root,
 														result_plan,
 														window_pathkeys,
-														-1.0);
-					if (!pathkeys_contained_in(window_pathkeys,
-											   current_pathkeys))
+														-1.0,
+														n_common_pathkeys);
+					if (n_common_pathkeys < list_length(window_pathkeys))
 					{
 						/* we do indeed need to sort */
 						result_plan = (Plan *) sort_plan;
@@ -1921,19 +1961,21 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 			{
 				if (list_length(root->distinct_pathkeys) >=
 					list_length(root->sort_pathkeys))
-					current_pathkeys = root->distinct_pathkeys;
+					needed_pathkeys = root->distinct_pathkeys;
 				else
 				{
-					current_pathkeys = root->sort_pathkeys;
+					needed_pathkeys = root->sort_pathkeys;
 					/* Assert checks that parser didn't mess up... */
 					Assert(pathkeys_contained_in(root->distinct_pathkeys,
-												 current_pathkeys));
+												 needed_pathkeys));
 				}
 
 				result_plan = (Plan *) make_sort_from_pathkeys(root,
 															   result_plan,
-															current_pathkeys,
-															   -1.0);
+															   needed_pathkeys,
+															   -1.0,
+							pathkeys_common(needed_pathkeys, current_pathkeys));
+				current_pathkeys = needed_pathkeys;
 			}
 
 			result_plan = (Plan *) make_unique(result_plan,
@@ -1949,12 +1991,15 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 	 */
 	if (parse->sortClause)
 	{
-		if (!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
+		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
+
+		if (common < list_length(root->sort_pathkeys))
 		{
 			result_plan = (Plan *) make_sort_from_pathkeys(root,
 														   result_plan,
 														 root->sort_pathkeys,
-														   limit_tuples);
+														   limit_tuples,
+														   common);
 			current_pathkeys = root->sort_pathkeys;
 		}
 	}
@@ -2698,6 +2743,7 @@ choose_hashed_grouping(PlannerInfo *root,
 	List	   *current_pathkeys;
 	Path		hashed_p;
 	Path		sorted_p;
+	int			n_common_pathkeys;
 
 	/*
 	 * Executor doesn't support hashed aggregation with DISTINCT or ORDER BY
@@ -2779,7 +2825,8 @@ choose_hashed_grouping(PlannerInfo *root,
 			 path_rows);
 	/* Result of hashed agg is always unsorted */
 	if (target_pathkeys)
-		cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
+		cost_sort(&hashed_p, root, target_pathkeys, 0,
+				  hashed_p.startup_cost, hashed_p.total_cost,
 				  dNumGroups, path_width,
 				  0.0, work_mem, limit_tuples);
 
@@ -2795,9 +2842,12 @@ choose_hashed_grouping(PlannerInfo *root,
 		sorted_p.total_cost = cheapest_path->total_cost;
 		current_pathkeys = cheapest_path->pathkeys;
 	}
-	if (!pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
+
+	n_common_pathkeys = pathkeys_common(root->group_pathkeys, current_pathkeys);
+	if (n_common_pathkeys < list_length(root->group_pathkeys))
 	{
-		cost_sort(&sorted_p, root, root->group_pathkeys, sorted_p.total_cost,
+		cost_sort(&sorted_p, root, root->group_pathkeys,
+				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
 				  path_rows, path_width,
 				  0.0, work_mem, -1.0);
 		current_pathkeys = root->group_pathkeys;
@@ -2812,10 +2862,12 @@ choose_hashed_grouping(PlannerInfo *root,
 		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
 				   sorted_p.startup_cost, sorted_p.total_cost,
 				   path_rows);
+
 	/* The Agg or Group node will preserve ordering */
-	if (target_pathkeys &&
-		!pathkeys_contained_in(target_pathkeys, current_pathkeys))
-		cost_sort(&sorted_p, root, target_pathkeys, sorted_p.total_cost,
+	n_common_pathkeys = pathkeys_common(target_pathkeys, current_pathkeys);
+	if (target_pathkeys && n_common_pathkeys < list_length(target_pathkeys))
+		cost_sort(&sorted_p, root, target_pathkeys, n_common_pathkeys,
+				  sorted_p.startup_cost, sorted_p.total_cost,
 				  dNumGroups, path_width,
 				  0.0, work_mem, limit_tuples);
 
@@ -2868,6 +2920,7 @@ choose_hashed_distinct(PlannerInfo *root,
 	List	   *needed_pathkeys;
 	Path		hashed_p;
 	Path		sorted_p;
+	int			n_common_pathkeys;
 
 	/*
 	 * If we have a sortable DISTINCT ON clause, we always use sorting. This
@@ -2933,7 +2986,8 @@ choose_hashed_distinct(PlannerInfo *root,
 	 * need to charge for the final sort.
 	 */
 	if (parse->sortClause)
-		cost_sort(&hashed_p, root, root->sort_pathkeys, hashed_p.total_cost,
+		cost_sort(&hashed_p, root, root->sort_pathkeys, 0,
+				  hashed_p.startup_cost, hashed_p.total_cost,
 				  dNumDistinctRows, path_width,
 				  0.0, work_mem, limit_tuples);
 
@@ -2950,23 +3004,30 @@ choose_hashed_distinct(PlannerInfo *root,
 		needed_pathkeys = root->sort_pathkeys;
 	else
 		needed_pathkeys = root->distinct_pathkeys;
-	if (!pathkeys_contained_in(needed_pathkeys, current_pathkeys))
+
+	n_common_pathkeys = pathkeys_common(needed_pathkeys, current_pathkeys);
+	if (n_common_pathkeys < list_length(needed_pathkeys))
 	{
 		if (list_length(root->distinct_pathkeys) >=
 			list_length(root->sort_pathkeys))
 			current_pathkeys = root->distinct_pathkeys;
 		else
 			current_pathkeys = root->sort_pathkeys;
-		cost_sort(&sorted_p, root, current_pathkeys, sorted_p.total_cost,
+		cost_sort(&sorted_p, root, current_pathkeys,
+				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
 				  path_rows, path_width,
 				  0.0, work_mem, -1.0);
 	}
 	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
 			   sorted_p.startup_cost, sorted_p.total_cost,
 			   path_rows);
+
+
+	n_common_pathkeys = pathkeys_common(root->sort_pathkeys, current_pathkeys);
 	if (parse->sortClause &&
-		!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
-		cost_sort(&sorted_p, root, root->sort_pathkeys, sorted_p.total_cost,
+		n_common_pathkeys < list_length(root->sort_pathkeys))
+		cost_sort(&sorted_p, root, root->sort_pathkeys, n_common_pathkeys,
+				  sorted_p.startup_cost, sorted_p.total_cost,
 				  dNumDistinctRows, path_width,
 				  0.0, work_mem, limit_tuples);
 
@@ -3756,8 +3817,9 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of seq scan + sort */
 	seqScanPath = create_seqscan_path(root, rel, NULL);
-	cost_sort(&seqScanAndSortPath, root, NIL,
-			  seqScanPath->total_cost, rel->tuples, rel->width,
+	cost_sort(&seqScanAndSortPath, root, NIL, 0,
+			  seqScanPath->startup_cost, seqScanPath->total_cost,
+			  rel->tuples, rel->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
 
 	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3e7dc85..3f7fbd4 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -780,7 +780,7 @@ build_subplan(PlannerInfo *root, Plan *plan, PlannerInfo *subroot,
 		 * unnecessarily, so we don't.
 		 */
 		else if (splan->parParam == NIL && enable_material &&
-				 !ExecMaterializesOutput(nodeTag(plan)))
+				 !ExecMaterializesOutput(nodeTag(plan), plan))
 			plan = materialize_finished_plan(plan);
 
 		result = (Node *) splan;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 0410fdd..0f5fee2 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -860,7 +860,8 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	sorted_p.startup_cost = input_plan->startup_cost;
 	sorted_p.total_cost = input_plan->total_cost;
 	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
-	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
+	cost_sort(&sorted_p, root, NIL, 0,
+			  sorted_p.startup_cost, sorted_p.total_cost,
 			  input_plan->plan_rows, input_plan->plan_width,
 			  0.0, work_mem, -1.0);
 	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 319e8b2..48966df 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -970,10 +970,11 @@ create_merge_append_path(PlannerInfo *root,
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
+		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
 
 		pathnode->path.rows += subpath->rows;
 
-		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+		if (n_common_pathkeys == list_length(pathkeys))
 		{
 			/* Subpath is adequately ordered, we won't need to sort it */
 			input_startup_cost += subpath->startup_cost;
@@ -987,6 +988,8 @@ create_merge_append_path(PlannerInfo *root,
 			cost_sort(&sort_path,
 					  root,
 					  pathkeys,
+					  n_common_pathkeys,
+					  subpath->startup_cost,
 					  subpath->total_cost,
 					  subpath->parent->tuples,
 					  subpath->parent->width,
@@ -1346,7 +1349,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 		/*
 		 * Estimate cost for sort+unique implementation
 		 */
-		cost_sort(&sort_path, root, NIL,
+		cost_sort(&sort_path, root, NIL, 0,
+				  subpath->startup_cost,
 				  subpath->total_cost,
 				  rel->rows,
 				  rel->width,
diff --git a/src/backend/utils/sort/sortsupport.c b/src/backend/utils/sort/sortsupport.c
index 2240fd0..de26b7c 100644
--- a/src/backend/utils/sort/sortsupport.c
+++ b/src/backend/utils/sort/sortsupport.c
@@ -86,6 +86,35 @@ PrepareSortSupportComparisonShim(Oid cmpFunc, SortSupport ssup)
 }
 
 /*
+ * Build an array of SortSupportData structures from separated arrays.
+ */
+SortSupport
+MakeSortSupportKeys(int nkeys, AttrNumber *attNums,
+					Oid *sortOperators, Oid *sortCollations,
+					bool *nullsFirstFlags)
+{
+	SortSupport sortKeys = (SortSupport) palloc0(nkeys * sizeof(SortSupportData));
+	int			i;
+
+	for (i = 0; i < nkeys; i++)
+	{
+		SortSupport sortKey = sortKeys + i;
+
+		AssertArg(attNums[i] != 0);
+		AssertArg(sortOperators[i] != 0);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = sortCollations[i];
+		sortKey->ssup_nulls_first = nullsFirstFlags[i];
+		sortKey->ssup_attno = attNums[i];
+
+		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
+	}
+
+	return sortKeys;
+}
+
+/*
  * Fill in SortSupport given an ordering operator (btree "<" or ">" operator).
  *
  * Caller must previously have zeroed the SortSupportData structure and then
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 8e57505..6e28a40 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -604,7 +604,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
 	MemoryContext oldcontext;
-	int			i;
 
 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
 
@@ -632,24 +631,11 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->reversedirection = reversedirection_heap;
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
-
-	/* Prepare SortSupport data for each column */
-	state->sortKeys = (SortSupport) palloc0(nkeys * sizeof(SortSupportData));
-
-	for (i = 0; i < nkeys; i++)
-	{
-		SortSupport sortKey = state->sortKeys + i;
-
-		AssertArg(attNums[i] != 0);
-		AssertArg(sortOperators[i] != 0);
-
-		sortKey->ssup_cxt = CurrentMemoryContext;
-		sortKey->ssup_collation = sortCollations[i];
-		sortKey->ssup_nulls_first = nullsFirstFlags[i];
-		sortKey->ssup_attno = attNums[i];
-
-		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
-	}
+	state->sortKeys = MakeSortSupportKeys(nkeys,
+										  attNums,
+										  sortOperators,
+										  sortCollations,
+										  nullsFirstFlags);
 
 	if (nkeys == 1)
 		state->onlyKey = state->sortKeys;
@@ -960,6 +946,26 @@ tuplesort_end(Tuplesortstate *state)
 	MemoryContextDelete(state->sortcontext);
 }
 
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+	int i;
+
+	if (state->tapeset)
+		LogicalTapeSetClose(state->tapeset);
+
+	for (i = 0; i < state->memtupcount; i++)
+		free_sort_tuple(state, state->memtuples + i);
+
+	state->status = TSS_INITIAL;
+	state->memtupcount = 0;
+	state->boundUsed = false;
+	state->tapeset = NULL;
+	state->currentRun = 0;
+	state->result_tape = -1;
+	state->bounded = false;
+}
+
 /*
  * Grow the memtuples[] array, if possible within our memory constraint.  We
  * must not exceed INT_MAX tuples in memory or the caller-provided memory
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 239aff3..f0ce4b2 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -102,9 +102,9 @@ extern PGDLLIMPORT ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook;
 extern void ExecReScan(PlanState *node);
 extern void ExecMarkPos(PlanState *node);
 extern void ExecRestrPos(PlanState *node);
-extern bool ExecSupportsMarkRestore(NodeTag plantype);
+extern bool ExecSupportsMarkRestore(NodeTag plantype, Plan *node);
 extern bool ExecSupportsBackwardScan(Plan *node);
-extern bool ExecMaterializesOutput(NodeTag plantype);
+extern bool ExecMaterializesOutput(NodeTag plantype, Plan *node);
 
 /*
  * prototypes from functions in execCurrent.c
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b271f21..5d86cd5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1671,8 +1671,11 @@ typedef struct SortState
 	int64		bound;			/* if bounded, how many tuples are needed */
 	bool		sort_Done;		/* sort completed yet? */
 	bool		bounded_Done;	/* value of bounded we did the sort with */
+	bool		finished;
 	int64		bound_Done;		/* value of bound we did the sort with */
 	void	   *tuplesortstate; /* private state of tuplesort.c */
+	SortSupport skipKeys;		/* columns already sorted in input */
+	HeapTuple	prev;
 } SortState;
 
 /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 3b9c683..f4f01e2 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -582,6 +582,7 @@ typedef struct Sort
 {
 	Plan		plan;
 	int			numCols;		/* number of sort-key columns */
+	int			skipCols;
 	AttrNumber *sortColIdx;		/* their indexes in the target list */
 	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
 	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..bb761f9 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -88,8 +88,9 @@ extern void cost_ctescan(Path *path, PlannerInfo *root,
 			 RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
 extern void cost_sort(Path *path, PlannerInfo *root,
-		  List *pathkeys, Cost input_cost, double tuples, int width,
-		  Cost comparison_cost, int sort_mem,
+		  List *pathkeys, int presorted_keys,
+		  Cost input_startup_cost, Cost input_total_cost,
+		  double tuples, int width, Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9b22fda..9179b4e 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -148,13 +148,16 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 							   Relids required_outer,
 							   CostSelector cost_criterion);
 extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 										  List *pathkeys,
 										  Relids required_outer,
-										  double fraction);
+										  double fraction,
+										  PlannerInfo *root,
+										  double tuples);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 4504250..7b3aa98 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -50,11 +50,12 @@ extern RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree, Plan *righttree, int wtParam,
 					 List *distinctList, long numGroups);
 extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
-						List *pathkeys, double limit_tuples);
+						List *pathkeys, double limit_tuples, int skipCols);
 extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
 						   Plan *lefttree);
 extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
-						 AttrNumber *grpColIdx, Plan *lefttree);
+						 AttrNumber *grpColIdx, Plan *lefttree, List *pathkeys,
+						 int skipCols);
 extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
 		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
 		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 8b6b0de..9c4297c 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -150,6 +150,9 @@ ApplySortComparator(Datum datum1, bool isNull1,
 #endif   /*-- PG_USE_INLINE || SORTSUPPORT_INCLUDE_DEFINITIONS */
 
 /* Other functions in utils/sort/sortsupport.c */
+extern SortSupport MakeSortSupportKeys(int nkeys, AttrNumber *attNums,
+					Oid *sortOperators, Oid *sortCollations,
+					bool *nullsFirstFlags);
 extern void PrepareSortSupportComparisonShim(Oid cmpFunc, SortSupport ssup);
 extern void PrepareSortSupportFromOrderingOp(Oid orderingOp, SortSupport ssup);
 
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 2537883..195e6c1 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -24,6 +24,7 @@
 #include "executor/tuptable.h"
 #include "fmgr.h"
 #include "utils/relcache.h"
+#include "utils/sortsupport.h"
 
 
 /* Tuplesortstate is an opaque type whose details are not known outside
@@ -106,6 +107,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					const char **sortMethod,
 					const char **spaceType,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 56e2c99..d0de260 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1323,10 +1323,11 @@ ORDER BY thousand, tenthous;
  Merge Append
    Sort Key: tenk1.thousand, tenk1.tenthous
    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
-   ->  Sort
+   ->  Partial sort
          Sort Key: tenk1_1.thousand, tenk1_1.thousand
+         Presorted Key: tenk1_1.thousand
          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
-(6 rows)
+(7 rows)
 
 explain (costs off)
 SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
@@ -1407,10 +1408,11 @@ ORDER BY x, y;
  Merge Append
    Sort Key: a.thousand, a.tenthous
    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
-   ->  Sort
+   ->  Partial sort
          Sort Key: b.unique2, b.unique2
+         Presorted Key: b.unique2
          ->  Index Only Scan using tenk1_unique2 on tenk1 b
-(6 rows)
+(7 rows)
 
 -- exercise rescan code path via a repeatedly-evaluated subquery
 explain (costs off)
#60Alexander Korotkov
aekorotkov@gmail.com
In reply to: Peter Geoghegan (#58)
Re: PoC: Partial sort

On Sun, Jul 13, 2014 at 6:45 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Feb 10, 2014 at 10:59 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Done. The patch is split.

I took a quick look at this.

Have you thought about making your new cmpSortSkipCols() function not
use real comparisons? Since in the circumstances in which this
optimization is expected to be effective (e.g. your original example)
we can also expect a relatively low cardinality for the first n
indexed attributes (again, as in your original example), in general
when cmpSortSkipCols() is called there is a high chance that it will
return true. If any pair of tuples (logically adjacent tuples fed in
to cmpSortSkipCols() by an index scan in logical order) are not fully
equal (i.e. their leading, indexed attributes are not equal) then we
don't care about the details -- we just know that a new sort grouping
is required.

Actually, higher cardinality of the skip columns is better. Sorting many
small groups is faster than sorting a few large groups with the same total
number of rows. Also, with smaller groups you hit the limit more precisely
(on average), i.e. you sort fewer rows in total.
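
To make the point about group size concrete, here is a rough
back-of-envelope model in C. It is only a sketch: it assumes equal-sized
groups and about n * ceil(log2(n)) comparisons to sort n rows, which is not
the formula the patch's cost_sort() uses, and all function names are made up
for illustration.

```c
#include <stdint.h>

/* ceil(log2(n)) for n >= 1, using integer arithmetic only. */
unsigned
ceil_log2(uint64_t n)
{
	unsigned	bits = 0;
	uint64_t	v = 1;

	while (v < n && bits < 63)
	{
		v <<= 1;
		bits++;
	}
	return bits;
}

/*
 * Hypothetical model: comparisons needed to sort "tuples" rows that arrive
 * split into "groups" equal-sized presorted groups (groups = 1 means an
 * ordinary full sort).
 */
uint64_t
model_sort_comparisons(uint64_t tuples, uint64_t groups)
{
	uint64_t	per_group = tuples / groups;

	return groups * per_group * ceil_log2(per_group);
}

/*
 * Hypothetical model: rows that have to be read and sorted before "limit"
 * rows can be returned, given that only whole groups can be sorted.
 */
uint64_t
model_rows_sorted_for_limit(uint64_t tuples, uint64_t groups, uint64_t limit)
{
	uint64_t	per_group = tuples / groups;
	uint64_t	groups_needed = (limit + per_group - 1) / per_group;

	return groups_needed * per_group;
}
```

Under these assumptions, with 1,000,000 rows and LIMIT 10, 10,000 groups
mean sorting 100 rows before the limit is reached, while 100 groups mean
sorting 10,000 rows.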

The idea here is that you can get away with simple binary equality
comparisons, as we do when considering HOT-safety. Of course, you
might find that two bitwise unequal values are equal according to
their ordinary B-Tree support function 1 comparator (e.g. two numerics
that differ only in their display scale). AFAICT this should be okay,
since that just means that you have smaller sort groupings than
strictly necessary. I'm not sure if that's worth it to more or less
duplicate heap_tuple_attr_equals() to save a "mere" n expensive
comparisons, but it's something to think about (actually, there are
probably less than even n comparisons in practice because there'll be
a limit).

That's not correct. Smaller groups are not OK. Imagine that two
representations of the same skip column value exist. The index may return
them in any order, even alternating between them row by row. In this case
sorting on the other column never takes place, while it should. But some
optimizations are still possible:

1. Use bitwise comparison first, then recheck with the real comparison.
But there is no guarantee that this gives a speedup.
2. Use an equality check instead of the btree comparison. For the "text"
datatype it would be considerably faster because no locale-aware comparison
is needed.
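
Option 1 could look roughly like the following C sketch. The Dec type is a
hypothetical stand-in for a datatype with several bitwise representations of
the same value (like numerics differing only in display scale); it is not
PostgreSQL's numeric format, and the function names are invented. Overflow
in the rescaling is ignored for brevity.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decimal: value = unscaled / 10^scale (two fields, no padding). */
typedef struct
{
	int64_t		unscaled;
	int64_t		scale;
} Dec;

/* The "real" comparison: bring both values to the same scale, then compare. */
int
dec_cmp(const Dec *a, const Dec *b)
{
	int64_t		x = a->unscaled;
	int64_t		y = b->unscaled;
	int64_t		sa = a->scale;
	int64_t		sb = b->scale;

	for (; sa < sb; sa++)
		x *= 10;
	for (; sb < sa; sb++)
		y *= 10;

	return (x > y) - (x < y);
}

/*
 * Do two skip-column values belong to the same sort group?  Bitwise equality
 * is sufficient to say yes; a bitwise difference must be rechecked with the
 * real comparator, otherwise 1.0 and 1.00 would start separate groups.
 */
bool
dec_same_group(const Dec *a, const Dec *b)
{
	if (memcmp(a, b, sizeof(Dec)) == 0)
		return true;			/* fast path */

	return dec_cmp(a, b) == 0;	/* recheck */
}
```

The fast path only helps when adjacent values are usually bitwise equal;
every mismatch still pays for the full comparison, which is why a speedup is
not guaranteed.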

------
With best regards,
Alexander Korotkov.

#61Alexander Korotkov
aekorotkov@gmail.com
In reply to: David Rowley (#59)
2 attachment(s)
Re: PoC: Partial sort

On Tue, Aug 19, 2014 at 2:02 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Here's a few notes from reading over the code:

* pathkeys.c

EquivalenceMember *member = (EquivalenceMember *)
lfirst(list_head(key->pk_eclass->ec_members));

You can use linitial() instead of lfirst(list_head()). The same thing
occurs in costsize.c

Fixed.

* pathkeys.c

The following fragment:

n = pathkeys_common(root->query_pathkeys, pathkeys);

if (n != 0)
{
/* It's useful ... or at least the first N keys are */
return n;
}

return 0; /* path ordering not useful */
}

Could just read:

/* return the number of path keys in common, or 0 if there are none */
return pathkeys_common(root->query_pathkeys, pathkeys);

Fixed.

* execnodes.h

In struct SortState, some new fields don't have a comment.

Fixed.

I've also thrown a few different workloads at the patch and I'm very
impressed with most of the results, especially when LIMIT is used. However,
I've found a regression case which I thought I should highlight, but for
now I can't quite see what could be done to fix it.

create table a (x int not null, y int not null);
insert into a select x.x,y.y from generate_series(1,1000000) x(x) cross
join generate_series(1,10) y(y);

Patched:
explain analyze select x,y from a where x+0=1 order by x,y limit 10;
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=92.42..163.21 rows=10 width=8) (actual
time=6239.426..6239.429 rows=10 loops=1)
-> Partial sort (cost=92.42..354064.37 rows=50000 width=8) (actual
time=6239.406..6239.407 rows=10 loops=1)
Sort Key: x, y
Presorted Key: x
Sort Method: quicksort Memory: 25kB
-> Index Scan using a_x_idx on a (cost=0.44..353939.13
rows=50000 width=8) (actual time=0.059..6239.319 rows=10 loops=1)
Filter: ((x + 0) = 1)
Rows Removed by Filter: 9999990
Planning time: 0.212 ms
Execution time: 6239.505 ms
(10 rows)

Time: 6241.220 ms

Unpatched:
explain analyze select x,y from a where x+0=1 order by x,y limit 10;
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------
Limit (cost=195328.26..195328.28 rows=10 width=8) (actual
time=3077.759..3077.761 rows=10 loops=1)
-> Sort (cost=195328.26..195453.26 rows=50000 width=8) (actual
time=3077.757..3077.757 rows=10 loops=1)
Sort Key: x, y
Sort Method: quicksort Memory: 25kB
-> Seq Scan on a (cost=0.00..194247.77 rows=50000 width=8)
(actual time=0.018..3077.705 rows=10 loops=1)
Filter: ((x + 0) = 1)
Rows Removed by Filter: 9999990
Planning time: 0.510 ms
Execution time: 3077.837 ms
(9 rows)

Time: 3080.201 ms

As you can see, the patched version performs an index scan in order to get
the partially sorted results, but it does end up quite a bit slower than
the seqscan/sort that the unpatched master performs. I'm not quite sure how
realistic the x+0 = 1 WHERE clause is, but perhaps the same would happen if
something like x+y = 1 was performed too.... After a bit more analysis on
this, I see that if I change the 50k estimate to 10 in the debugger that
the num_groups is properly estimated at 1 and it then performs the seq scan
instead. So it looks like the costings of the patch are not to blame here.
(The 50k row estimate comes from rel tuples / DEFAULT_NUM_DISTINCT)

Yes, the error comes from the incorrect 50k row estimate. I've checked a
similar example where the estimate is fine.
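
For reference, the 50k figure can be reproduced with the arithmetic David
mentions (rel tuples / DEFAULT_NUM_DISTINCT). The sketch below is only an
illustration of that division, not the planner's actual code path; the
function name is invented, and table "a" is taken to hold 1,000,000 * 10
rows as created above.

```c
/* Value of DEFAULT_NUM_DISTINCT in PostgreSQL's selfuncs.h. */
#define DEFAULT_NUM_DISTINCT 200

/*
 * Hypothetical helper: row estimate for "expr = const" when no statistics
 * are available for expr, so every one of the assumed distinct values is
 * taken to be equally likely.
 */
double
default_equality_rows(double rel_tuples)
{
	return rel_tuples / DEFAULT_NUM_DISTINCT;
}
```

For the 10,000,000 rows of table "a" this gives 50000, while the filter
actually passes only 10 rows.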

create table b as (select x.x,y.y,x.x z from generate_series(1,1000000)
x(x) cross join generate_series(1,10) y(y));
create index b_x_idx on b(x);
analyze b;

There is a column z which is neither in the index nor in the "order by"
clause. If we replace "x+0=1" with "z=1", the optimizer doesn't choose
partial sort.

explain analyze select x,y,z from b where z=1 order by x,y limit 10;
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Limit (cost=179056.59..179056.61 rows=10 width=12) (actual
time=1072.498..1072.500 rows=10 loops=1)
-> Sort (cost=179056.59..179056.63 rows=18 width=12) (actual
time=1072.495..1072.495 rows=10 loops=1)
Sort Key: x, y
Sort Method: quicksort Memory: 25kB
-> Seq Scan on b (cost=0.00..179056.21 rows=18 width=12) (actual
time=0.020..1072.454 rows=10 loops=1)
Filter: (z = 1)
Rows Removed by Filter: 9999990
Planning time: 0.501 ms
Execution time: 1072.555 ms
(9 rows)

Even if we force the optimizer to use partial sort, the cost estimation is
fine.

set enable_seqscan = off;
explain analyze select x,y,z from b where z=1 order by x,y limit 10;
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Limit (cost=169374.43..263471.04 rows=10 width=12) (actual
time=2237.082..2237.083 rows=10 loops=1)
-> Partial sort (cost=169374.43..338748.34 rows=18 width=12) (actual
time=2237.082..2237.083 rows=10 loops=1)
Sort Key: x, y
Presorted Key: x
Sort Method: quicksort Memory: 25kB
-> Index Scan using b_x_idx on b (cost=0.43..338748.13 rows=18
width=12) (actual time=0.047..2237.062 rows=10 loops=1)
Filter: (z = 1)
Rows Removed by Filter: 9999990
Planning time: 0.089 ms
Execution time: 2237.133 ms
(10 rows)
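
As a rough cross-check of the Limit costs shown in these plans, the sketch
below charges a Limit node the input's startup cost plus the fraction of its
run cost that corresponds to the rows actually fetched. This is a simplified
model with an invented function name, not a copy of the planner code, and
the inputs are the rounded numbers printed by EXPLAIN.

```c
/*
 * Hypothetical helper: total cost of a Limit node fetching "limit_rows" out
 * of an input expected to produce "input_rows" rows.
 */
double
limit_total_cost(double startup_cost, double total_cost,
				 double input_rows, double limit_rows)
{
	if (limit_rows >= input_rows)
		return total_cost;		/* the whole input is consumed */

	return startup_cost +
		(total_cost - startup_cost) * (limit_rows / input_rows);
}
```

With the 50000-row estimate, the partial sort input costed at
92.42..354064.37 yields about 163.21 for LIMIT 10, far below the
seqscan-plus-sort plan's 195328.28. With the correct 18-row estimate, the
input costed at 169374.43..338748.34 yields about 263471.05, which is above
the 179056.61 of the seqscan plan, so partial sort is not chosen.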

AFAICS, wrong selectivity estimates are a general problem which causes
optimizer failures. But in your "x+y=1" example, if an expression index on
"x+y" existed, then statistics over "x+y" would be collected. So, with an
expression index the estimate will be fine.

------
With best regards,
Alexander Korotkov.

Attachments:

partial-sort-basic-2.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 781a736..6b19e7e
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_agg_keys(AggState *asta
*** 81,87 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
--- 81,87 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
*************** ExplainNode(PlanState *planstate, List *
*** 940,946 ****
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
--- 940,949 ----
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			if (((Sort *) plan)->skipCols > 0)
! 				pname = sname = "Partial sort";
! 			else
! 				pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
*************** show_sort_keys(SortState *sortstate, Lis
*** 1751,1757 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
--- 1754,1760 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->skipCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
*************** show_merge_append_keys(MergeAppendState 
*** 1765,1771 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 ancestors, es);
  }
  
--- 1768,1774 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 ancestors, es);
  }
  
*************** show_agg_keys(AggState *astate, List *an
*** 1783,1789 ****
  		/* The key columns refer to the tlist of the child plan */
  		ancestors = lcons(astate, ancestors);
  		show_sort_group_keys(outerPlanState(astate), "Group Key",
! 							 plan->numCols, plan->grpColIdx,
  							 ancestors, es);
  		ancestors = list_delete_first(ancestors);
  	}
--- 1786,1792 ----
  		/* The key columns refer to the tlist of the child plan */
  		ancestors = lcons(astate, ancestors);
  		show_sort_group_keys(outerPlanState(astate), "Group Key",
! 							 plan->numCols, 0, plan->grpColIdx,
  							 ancestors, es);
  		ancestors = list_delete_first(ancestors);
  	}
*************** show_group_keys(GroupState *gstate, List
*** 1801,1807 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
  }
--- 1804,1810 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
  }
*************** show_group_keys(GroupState *gstate, List
*** 1811,1823 ****
   * as arrays of targetlist indexes
   */
  static void
! show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
! 	List	   *result = NIL;
  	bool		useprefix;
  	int			keyno;
  	char	   *exprstr;
--- 1814,1827 ----
   * as arrays of targetlist indexes
   */
  static void
! show_sort_group_keys(PlanState *planstate,  const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
! 	List	   *resultSort = NIL;
! 	List	   *resultPresorted = NIL;
  	bool		useprefix;
  	int			keyno;
  	char	   *exprstr;
*************** show_sort_group_keys(PlanState *planstat
*** 1844,1853 ****
  		/* Deparse the expression, showing any top-level cast */
  		exprstr = deparse_expression((Node *) target->expr, context,
  									 useprefix, true);
! 		result = lappend(result, exprstr);
  	}
  
! 	ExplainPropertyList(qlabel, result, es);
  }
  
  /*
--- 1848,1862 ----
  		/* Deparse the expression, showing any top-level cast */
  		exprstr = deparse_expression((Node *) target->expr, context,
  									 useprefix, true);
! 
! 		if (keyno < nPresortedKeys)
! 			resultPresorted = lappend(resultPresorted, exprstr);
! 		resultSort = lappend(resultSort, exprstr);
  	}
  
! 	ExplainPropertyList(qlabel, resultSort, es);
! 	if (nPresortedKeys > 0)
! 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 640964c..a8e69d2
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
*************** ExecRestrPos(PlanState *node)
*** 379,385 ****
   * and valuesscan support is actually useless code at present.)
   */
  bool
! ExecSupportsMarkRestore(NodeTag plantype)
  {
  	switch (plantype)
  	{
--- 379,385 ----
   * and valuesscan support is actually useless code at present.)
   */
  bool
! ExecSupportsMarkRestore(NodeTag plantype, Plan *node)
  {
  	switch (plantype)
  	{
*************** ExecSupportsMarkRestore(NodeTag plantype
*** 389,397 ****
  		case T_TidScan:
  		case T_ValuesScan:
  		case T_Material:
- 		case T_Sort:
  			return true;
  
  		case T_Result:
  
  			/*
--- 389,403 ----
  		case T_TidScan:
  		case T_ValuesScan:
  		case T_Material:
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (node && ((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_Result:
  
  			/*
*************** ExecSupportsBackwardScan(Plan *node)
*** 466,475 ****
  				TargetListSupportsBackwardScan(node->targetlist);
  
  		case T_Material:
- 		case T_Sort:
  			/* these don't evaluate tlist */
  			return true;
  
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
--- 472,487 ----
  				TargetListSupportsBackwardScan(node->targetlist);
  
  		case T_Material:
  			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
*************** IndexSupportsBackwardScan(Oid indexid)
*** 535,541 ****
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype)
  {
  	switch (plantype)
  	{
--- 547,553 ----
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype, Plan *node)
  {
  	switch (plantype)
  	{
*************** ExecMaterializesOutput(NodeTag plantype)
*** 543,551 ****
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
- 		case T_Sort:
  			return true;
  
  		default:
  			break;
  	}
--- 555,569 ----
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (node && ((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		default:
  			break;
  	}
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
new file mode 100644
index 47ed068..c51a144
*** a/src/backend/executor/nodeMergeAppend.c
--- b/src/backend/executor/nodeMergeAppend.c
*************** ExecInitMergeAppend(MergeAppend *node, E
*** 126,144 ****
  	 * initialize sort-key information
  	 */
  	mergestate->ms_nkeys = node->numCols;
! 	mergestate->ms_sortkeys = palloc0(sizeof(SortSupportData) * node->numCols);
! 
! 	for (i = 0; i < node->numCols; i++)
! 	{
! 		SortSupport sortKey = mergestate->ms_sortkeys + i;
! 
! 		sortKey->ssup_cxt = CurrentMemoryContext;
! 		sortKey->ssup_collation = node->collations[i];
! 		sortKey->ssup_nulls_first = node->nullsFirst[i];
! 		sortKey->ssup_attno = node->sortColIdx[i];
! 
! 		PrepareSortSupportFromOrderingOp(node->sortOperators[i], sortKey);
! 	}
  
  	/*
  	 * initialize to show we have not run the subplans yet
--- 126,136 ----
  	 * initialize sort-key information
  	 */
  	mergestate->ms_nkeys = node->numCols;
! 	mergestate->ms_sortkeys = MakeSortSupportKeys(mergestate->ms_nkeys,
! 												  node->sortColIdx,
! 												  node->sortOperators,
! 												  node->collations,
! 												  node->nullsFirst);
  
  	/*
  	 * initialize to show we have not run the subplans yet
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index b88571b..f38190d
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,51 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].ssup_attno;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		if (ApplySortComparator(datumA, isnullA,
+ 								datumB, isnullB,
+ 								&node->skipKeys[i]))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 68,78 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+ 	PlanState  *outerNode;
+ 	TupleDesc	tupDesc;
+ 	int			skipCols = plannode->skipCols;
+ 	int64		nTuples = 0;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,132 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
  
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols,
! 											  plannode->sortColIdx,
! 											  plannode->sortOperators,
! 											  plannode->collations,
! 											  plannode->nullsFirst,
  											  work_mem,
  											  node->randomAccess);
- 		if (node->bounded)
- 			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
! 		{
! 			slot = ExecProcNode(outerNode);
  
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
  	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
--- 85,232 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
! 	if (node->tuplesortstate != NULL)
! 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
! 	else
! 	{
! 		/* Support structures for cmpSortSkipCols - already sorted columns */
! 		if (skipCols)
! 			node->skipKeys = MakeSortSupportKeys(skipCols,
! 												 plannode->sortColIdx,
! 												 plannode->sortOperators,
! 												 plannode->collations,
! 												 plannode->nullsFirst);
  
+ 		/* Only pass on remaining columns that are unsorted */
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols - skipCols,
! 											  &(plannode->sortColIdx[skipCols]),
! 											  &(plannode->sortOperators[skipCols]),
! 											  &(plannode->collations[skipCols]),
! 											  &(plannode->nullsFirst[skipCols]),
  											  work_mem,
  											  node->randomAccess);
  		node->tuplesortstate = (void *) tuplesortstate;
+ 	}
  
! 	if (node->bounded)
! 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
  
! 	/*
! 	 * Put the next group of tuples, whose first "skipCols" sort values are
! 	 * equal, into tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
+ 		if (skipCols == 0)
+ 		{
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
+ 			nTuples++;
  		}
+ 		else if (node->prev)
+ 		{
+ 			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
+ 			nTuples++;
  
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				bool cmp;
! 				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
! 				node->prev = ExecCopySlotTuple(slot);
! 				if (!cmp)
! 					break;
! 			}
! 		}
! 		else
! 		{
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				node->prev = ExecCopySlotTuple(slot);
! 			}
! 		}
! 	}
  
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
  
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 
! 	/*
! 	 * Adjust bound_Done with number of tuples we've actually sorted.
! 	 */
! 	if (node->bounded)
! 	{
! 		if (node->finished)
! 			node->bound_Done = node->bound;
! 		else
! 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
  	}
  
+ 	SO1_printf("ExecSort: %s\n", "sorting done");
+ 
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
*************** ExecInitSort(Sort *node, EState *estate,
*** 157,162 ****
--- 257,271 ----
  			   "initializing sort node");
  
  	/*
+ 	 * skipCols can't be used with EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD or
+ 	 * EXEC_FLAG_MARK, because we hold only the current group of tuples in
+ 	 * tuplesortstate.
+ 	 */
+ 	Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+ 											 EXEC_FLAG_BACKWARD |
+ 											 EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
  	 * create state structure
  	 */
  	sortstate = makeNode(SortState);
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 283,292 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
+ 	sortstate->bound_Done = 0;
  
  	/*
  	 * Miscellaneous initialization
*************** ExecReScanSort(SortState *node)
*** 316,321 ****
--- 428,434 ----
  		node->sort_Done = false;
  		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
  		node->tuplesortstate = NULL;
+ 		node->bound_Done = 0;
  
  		/*
  		 * if chgParam of subnode is not null then plan will be re-scanned by
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index aa053a0..e6ee6c9
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 735,740 ****
--- 735,741 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 0cdb790..314d3ab
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Plan *runion, Plan 
*** 1235,1249 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1235,1256 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1273,1285 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1280,1326 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate how many groups the presorted keys divide the dataset into.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List *groupExprs = NIL;
! 		ListCell *l;
! 		int i = 0;
! 
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			groupExprs = lappend(groupExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		num_groups = estimate_num_groups(root, groupExprs, tuples);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1289,1295 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1330,1336 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1300,1309 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1341,1350 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1311,1325 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
  	 * doesn't do qual-checking or projection, so it has less overhead than
--- 1352,1377 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
  	/*
+ 	 * We have to sort the first group before the node can start producing
+ 	 * output. Sorting the remaining groups is required to return all tuples.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 	startup_cost += input_run_cost / num_groups;
+ 	run_cost += input_run_cost * ((num_groups - 1.0) / num_groups);
+ 
+ 	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
  	 * doesn't do qual-checking or projection, so it has less overhead than
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2029,2034 ****
--- 2081,2088 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->parent->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2055,2060 ****
--- 2109,2116 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->parent->width,
*************** final_cost_mergejoin(PlannerInfo *root, 
*** 2266,2272 ****
  	 * it off does not entitle us to deliver an invalid plan.
  	 */
  	else if (innersortkeys == NIL &&
! 			 !ExecSupportsMarkRestore(inner_path->pathtype))
  		path->materialize_inner = true;
  
  	/*
--- 2322,2328 ----
  	 * it off does not entitle us to deliver an invalid plan.
  	 */
  	else if (innersortkeys == NIL &&
! 			 !ExecSupportsMarkRestore(inner_path->pathtype, NULL))
  		path->materialize_inner = true;
  
  	/*
*************** cost_subplan(PlannerInfo *root, SubPlan 
*** 2780,2786 ****
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan)))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
--- 2836,2842 ----
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan), plan))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
new file mode 100644
index be54f3d..7bbad4f
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** match_unsorted_outer(PlannerInfo *root,
*** 820,826 ****
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
--- 820,826 ----
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype, NULL))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 5d953df..6ac28c4
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 21,31 ****
--- 21,33 ----
  #include "nodes/makefuncs.h"
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
+ #include "optimizer/cost.h"
  #include "optimizer/clauses.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static PathKey *make_canonical_pathkey(PlannerInfo *root,
*************** compare_pathkeys(List *keys1, List *keys
*** 312,317 ****
--- 314,345 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 369,377 ****
  }
  
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
   *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
--- 397,432 ----
  }
  
  /*
+  * Compare the costs of two paths, assuming a different fraction of tuples is
+  * returned from each path.
+  */
+ static int
+ compare_bifractional_path_costs(Path *path1, Path *path2,
+ 							  double fraction1, double fraction2)
+ {
+ 	Cost		cost1,
+ 				cost2;
+ 
+ 	if (fraction1 <= 0.0 || fraction1 >= 1.0 ||
+ 			fraction2 <= 0.0 || fraction2 >= 1.0)
+ 		return compare_path_costs(path1, path2, TOTAL_COST);
+ 	cost1 = path1->startup_cost +
+ 		fraction1 * (path1->total_cost - path1->startup_cost);
+ 	cost2 = path2->startup_cost +
+ 		fraction2 * (path2->total_cost - path2->startup_cost);
+ 	if (cost1 < cost2)
+ 		return -1;
+ 	if (cost1 > cost2)
+ 		return +1;
+ 	return 0;
+ }
+ 
+ /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys. Each path is costed with its own
!  *	  fraction, since a partial sort must fetch extra tuples before it starts.
   *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
*************** Path *
*** 386,411 ****
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction)
  {
  	Path	   *matched_path = NULL;
  	ListCell   *l;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
  
  		/*
! 		 * Since cost comparison is a lot cheaper than pathkey comparison, do
! 		 * that first.  (XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
! 			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
--- 441,524 ----
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *num_groups, matched_fraction;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each possible partial sort.
+ 	 */
+ 	i = 0;
+ 	num_groups = (double *)palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		num_groups[i] = estimate_num_groups(root, groupExprs, tuples);
+ 		i++;
+ 	}
+ 
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
+ 		double		current_fraction;
+ 
+ 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+ 		if (n_common_pathkeys < matched_n_common_pathkeys ||
+ 				n_common_pathkeys == 0)
+ 			continue;
  
  		/*
! 		 * Estimate the fraction of outer tuples that must be fetched before
! 		 * partial sort can start returning tuples.
  		 */
! 		current_fraction = fraction;
! 		if (n_common_pathkeys < n_pathkeys)
! 		{
! 			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
! 			current_fraction = Min(current_fraction, 1.0);
! 		}
  
! 		/*
! 		 * Do cost comparison.
! 		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_bifractional_path_costs(matched_path, path,
! 					matched_fraction, current_fraction);
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
! 
! 		/*
! 		 * Always prefer the path with the most common pathkeys.
! 		 */
! 		if ((
! 				n_common_pathkeys > matched_n_common_pathkeys
! 				||	(n_common_pathkeys == matched_n_common_pathkeys
! 					 && costs_cmp > 0)) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 			matched_fraction = current_fraction;
+ 		}
  	}
  	return matched_path;
  }
*************** right_merge_direction(PlannerInfo *root,
*** 1450,1458 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
--- 1563,1570 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of leading query pathkeys that match the given
!  * pathkeys. The rest can be satisfied by partial sort.
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
*************** pathkeys_useful_for_ordering(PlannerInfo
*** 1463,1475 ****
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
! 	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
! 	}
! 
! 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1575,1586 ----
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	/*
! 	 * Return the number of path keys in common, or 0 if there are none. Any
! 	 * leading common pathkeys could be useful for ordering because we can use
! 	 * partial sort.
! 	 */
! 	return pathkeys_common(root->query_pathkeys, pathkeys);
  }
  
  /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 4b641a2..129ea40
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 149,154 ****
--- 149,155 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
  static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples);
*************** create_merge_append_plan(PlannerInfo *ro
*** 774,779 ****
--- 775,781 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		subplan = create_plan_recurse(root, subpath);
*************** create_merge_append_plan(PlannerInfo *ro
*** 807,814 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
--- 809,818 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
+ 										 pathkeys, n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2181,2189 ****
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									outer_plan,
! 									best_path->outersortkeys,
! 									-1.0);
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
--- 2185,2195 ----
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								outer_plan,
! 								best_path->outersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys));
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2194,2202 ****
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									inner_plan,
! 									best_path->innersortkeys,
! 									-1.0);
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
--- 2200,2210 ----
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								inner_plan,
! 								best_path->innersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys));
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
*************** make_mergejoin(List *tlist,
*** 3736,3741 ****
--- 3744,3750 ----
   */
  static Sort *
  make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples)
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 3745,3751 ****
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 3754,3761 ----
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, pathkeys, skipCols,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 3759,3764 ****
--- 3769,3775 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 4087,4093 ****
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 4098,4104 ----
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(PlannerInfo *roo
*** 4107,4113 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
--- 4118,4124 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
*************** make_sort_from_sortclauses(PlannerInfo *
*** 4150,4156 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4161,4167 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, NIL, 0,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
*************** Sort *
*** 4172,4178 ****
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 4183,4190 ----
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 List *pathkeys, int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(PlannerInfo *ro
*** 4205,4211 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4217,4223 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 94ca92d..7eea24e
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
*************** build_minmax_path(PlannerInfo *root, Min
*** 494,500 ****
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction);
  	if (!sorted_path)
  		return false;
  
--- 494,502 ----
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction,
! 												  subroot,
! 												  final_rel->rows);
  	if (!sorted_path)
  		return false;
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index e1480cd..58abc43
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** grouping_planner(PlannerInfo *root, doub
*** 1394,1400 ****
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
--- 1394,1402 ----
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction,
! 													  root,
! 													  path_rows);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
*************** grouping_planner(PlannerInfo *root, doub
*** 1410,1419 ****
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
  
  			if (root->query_pathkeys == NIL ||
! 				pathkeys_contained_in(root->query_pathkeys,
! 									  cheapest_path->pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
--- 1412,1425 ----
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
+ 			Path		partial_sort_path;	/* dummy for partial sort cost */
+ 			int			n_common_pathkeys;
+ 
+ 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
+ 												cheapest_path->pathkeys);
  
  			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
*************** grouping_planner(PlannerInfo *root, doub
*** 1423,1434 ****
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			if (compare_fractional_path_costs(sorted_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
--- 1429,1463 ----
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
+ 						  n_common_pathkeys,
+ 						  cheapest_path->startup_cost,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
! 												sorted_path->pathkeys);
! 
! 			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
! 			{
! 				/* No sort needed for presorted path */
! 				partial_sort_path.startup_cost = sorted_path->startup_cost;
! 				partial_sort_path.total_cost = sorted_path->total_cost;
! 			}
! 			else
! 			{
! 				/* Figure cost for sorting */
! 				cost_sort(&partial_sort_path, root, root->query_pathkeys,
! 						  n_common_pathkeys,
! 						  sorted_path->startup_cost,
! 						  sorted_path->total_cost,
! 						  path_rows, path_width,
! 						  0.0, work_mem, root->limit_tuples);
! 			}
! 
! 			if (compare_fractional_path_costs(&partial_sort_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
*************** grouping_planner(PlannerInfo *root, doub
*** 1509,1521 ****
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
  			if (parse->groupClause && !use_hashed_grouping &&
! 			  !pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
--- 1538,1553 ----
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
+ 			int			n_common_pathkeys_grouping;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
+ 			n_common_pathkeys_grouping = pathkeys_common(root->group_pathkeys,
+ 														 current_pathkeys);
  			if (parse->groupClause && !use_hashed_grouping &&
! 				n_common_pathkeys_grouping < list_length(root->group_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
*************** grouping_planner(PlannerInfo *root, doub
*** 1609,1615 ****
  							make_sort_from_groupcols(root,
  													 parse->groupClause,
  													 groupColIdx,
! 													 result_plan);
  						current_pathkeys = root->group_pathkeys;
  					}
  					aggstrategy = AGG_SORTED;
--- 1641,1649 ----
  							make_sort_from_groupcols(root,
  													 parse->groupClause,
  													 groupColIdx,
! 													 result_plan,
! 													 root->group_pathkeys,
! 													n_common_pathkeys_grouping);
  						current_pathkeys = root->group_pathkeys;
  					}
  					aggstrategy = AGG_SORTED;
*************** grouping_planner(PlannerInfo *root, doub
*** 1652,1658 ****
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan);
  					current_pathkeys = root->group_pathkeys;
  				}
  
--- 1686,1694 ----
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan,
! 												 root->group_pathkeys,
! 												 n_common_pathkeys_grouping);
  					current_pathkeys = root->group_pathkeys;
  				}
  
*************** grouping_planner(PlannerInfo *root, doub
*** 1769,1781 ****
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0);
! 					if (!pathkeys_contained_in(window_pathkeys,
! 											   current_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
--- 1805,1821 ----
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
+ 					int			n_common_pathkeys;
+ 
+ 					n_common_pathkeys = pathkeys_common(window_pathkeys,
+ 													    current_pathkeys);
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0,
! 														n_common_pathkeys);
! 					if (n_common_pathkeys < list_length(window_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
*************** grouping_planner(PlannerInfo *root, doub
*** 1921,1939 ****
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					current_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					current_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 current_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															current_pathkeys,
! 															   -1.0);
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
--- 1961,1981 ----
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					needed_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					needed_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 needed_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															   needed_pathkeys,
! 															   -1.0,
! 							pathkeys_common(needed_pathkeys, current_pathkeys));
! 				current_pathkeys = needed_pathkeys;
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
*************** grouping_planner(PlannerInfo *root, doub
*** 1949,1960 ****
  	 */
  	if (parse->sortClause)
  	{
! 		if (!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
--- 1991,2005 ----
  	 */
  	if (parse->sortClause)
  	{
! 		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
! 		
! 		if (common < list_length(root->sort_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples,
! 														   common);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
*************** choose_hashed_grouping(PlannerInfo *root
*** 2698,2703 ****
--- 2743,2749 ----
  	List	   *current_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * Executor doesn't support hashed aggregation with DISTINCT or ORDER BY
*************** choose_hashed_grouping(PlannerInfo *root
*** 2779,2785 ****
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2825,2832 ----
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_grouping(PlannerInfo *root
*** 2795,2803 ****
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 	if (!pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
--- 2842,2853 ----
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 
! 	n_common_pathkeys = pathkeys_common(root->group_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(root->group_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
*************** choose_hashed_grouping(PlannerInfo *root
*** 2812,2821 ****
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
  	/* The Agg or Group node will preserve ordering */
! 	if (target_pathkeys &&
! 		!pathkeys_contained_in(target_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2862,2873 ----
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
+ 
  	/* The Agg or Group node will preserve ordering */
! 	n_common_pathkeys = pathkeys_common(target_pathkeys, current_pathkeys);
! 	if (target_pathkeys && n_common_pathkeys < list_length(target_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 2868,2873 ****
--- 2920,2926 ----
  	List	   *needed_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * If we have a sortable DISTINCT ON clause, we always use sorting. This
*************** choose_hashed_distinct(PlannerInfo *root
*** 2933,2939 ****
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 2986,2993 ----
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 2950,2972 ****
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 	if (!pathkeys_contained_in(needed_pathkeys, current_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
  	if (parse->sortClause &&
! 		!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3004,3033 ----
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 
! 	n_common_pathkeys = pathkeys_common(needed_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(needed_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
+ 
+ 
+ 	n_common_pathkeys = pathkeys_common(root->sort_pathkeys, current_pathkeys);
  	if (parse->sortClause &&
! 		n_common_pathkeys < list_length(root->sort_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 3756,3763 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 3817,3825 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 3e7dc85..3f7fbd4
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** build_subplan(PlannerInfo *root, Plan *p
*** 780,786 ****
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan)))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
--- 780,786 ----
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan), plan))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index 0410fdd..0f5fee2
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 860,866 ****
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 860,867 ----
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 319e8b2..48966df
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** create_merge_append_path(PlannerInfo *ro
*** 970,979 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 970,980 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 987,992 ****
--- 988,995 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->parent->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1346,1352 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
--- 1349,1356 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
diff --git a/src/backend/utils/sort/sortsupport.c b/src/backend/utils/sort/sortsupport.c
new file mode 100644
index 2240fd0..de26b7c
*** a/src/backend/utils/sort/sortsupport.c
--- b/src/backend/utils/sort/sortsupport.c
*************** PrepareSortSupportComparisonShim(Oid cmp
*** 86,91 ****
--- 86,120 ----
  }
  
  /*
+  * Build an array of SortSupportData structures from separate arrays.
+  */
+ SortSupport
+ MakeSortSupportKeys(int nkeys, AttrNumber *attNums,
+ 					Oid *sortOperators, Oid *sortCollations,
+ 					bool *nullsFirstFlags)
+ {
+ 	SortSupport sortKeys = (SortSupport) palloc0(nkeys * sizeof(SortSupportData));
+ 	int			i;
+ 
+ 	for (i = 0; i < nkeys; i++)
+ 	{
+ 		SortSupport sortKey = sortKeys + i;
+ 
+ 		AssertArg(attNums[i] != 0);
+ 		AssertArg(sortOperators[i] != 0);
+ 
+ 		sortKey->ssup_cxt = CurrentMemoryContext;
+ 		sortKey->ssup_collation = sortCollations[i];
+ 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
+ 		sortKey->ssup_attno = attNums[i];
+ 
+ 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
+ 	}
+ 
+ 	return sortKeys;
+ }
+ 
+ /*
   * Fill in SortSupport given an ordering operator (btree "<" or ">" operator).
   *
   * Caller must previously have zeroed the SortSupportData structure and then
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 8e57505..6e28a40
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 604,610 ****
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
- 	int			i;
  
  	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
--- 604,609 ----
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 632,655 ****
  	state->reversedirection = reversedirection_heap;
  
  	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
! 
! 	/* Prepare SortSupport data for each column */
! 	state->sortKeys = (SortSupport) palloc0(nkeys * sizeof(SortSupportData));
! 
! 	for (i = 0; i < nkeys; i++)
! 	{
! 		SortSupport sortKey = state->sortKeys + i;
! 
! 		AssertArg(attNums[i] != 0);
! 		AssertArg(sortOperators[i] != 0);
! 
! 		sortKey->ssup_cxt = CurrentMemoryContext;
! 		sortKey->ssup_collation = sortCollations[i];
! 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
! 		sortKey->ssup_attno = attNums[i];
! 
! 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
! 	}
  
  	if (nkeys == 1)
  		state->onlyKey = state->sortKeys;
--- 631,641 ----
  	state->reversedirection = reversedirection_heap;
  
  	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
! 	state->sortKeys = MakeSortSupportKeys(nkeys,
! 										  attNums,
! 										  sortOperators,
! 										  sortCollations,
! 										  nullsFirstFlags);
  
  	if (nkeys == 1)
  		state->onlyKey = state->sortKeys;
*************** tuplesort_end(Tuplesortstate *state)
*** 960,965 ****
--- 946,971 ----
  	MemoryContextDelete(state->sortcontext);
  }
  
+ void
+ tuplesort_reset(Tuplesortstate *state)
+ {
+ 	int i;
+ 
+ 	if (state->tapeset)
+ 		LogicalTapeSetClose(state->tapeset);
+ 
+ 	for (i = 0; i < state->memtupcount; i++)
+ 		free_sort_tuple(state, state->memtuples + i);
+ 
+ 	state->status = TSS_INITIAL;
+ 	state->memtupcount = 0;
+ 	state->boundUsed = false;
+ 	state->tapeset = NULL;
+ 	state->currentRun = 0;
+ 	state->result_tape = -1;
+ 	state->bounded = false;
+ }
+ 
  /*
   * Grow the memtuples[] array, if possible within our memory constraint.  We
   * must not exceed INT_MAX tuples in memory or the caller-provided memory
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
new file mode 100644
index 0266135..8e0fe0a
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
*************** extern PGDLLIMPORT ExecutorCheckPerms_ho
*** 102,110 ****
  extern void ExecReScan(PlanState *node);
  extern void ExecMarkPos(PlanState *node);
  extern void ExecRestrPos(PlanState *node);
! extern bool ExecSupportsMarkRestore(NodeTag plantype);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype);
  
  /*
   * prototypes from functions in execCurrent.c
--- 102,110 ----
  extern void ExecReScan(PlanState *node);
  extern void ExecMarkPos(PlanState *node);
  extern void ExecRestrPos(PlanState *node);
! extern bool ExecSupportsMarkRestore(NodeTag plantype, Plan *node);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype, Plan *node);
  
  /*
   * prototypes from functions in execCurrent.c
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index b271f21..9d206f9
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct SortState
*** 1670,1678 ****
--- 1670,1682 ----
  	bool		bounded;		/* is the result set bounded? */
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* have all tuples been fetched from
+ 								   the outer node? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SortSupport skipKeys;		/* columns already sorted in input */
+ 	HeapTuple	prev;			/* previous tuple from outer node */
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 3b9c683..f4f01e2
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 582,587 ****
--- 582,588 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;		/* number of leading columns already sorted */
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 75e2afb..bb761f9
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 88,95 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 88,96 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 9b22fda..9179b4e
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 148,160 ****
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--- 148,163 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
new file mode 100644
index 4504250..7b3aa98
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*************** extern RecursiveUnion *make_recursive_un
*** 50,60 ****
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
--- 50,61 ----
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples, int skipCols);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
new file mode 100644
index 4417143..7d365f2
*** a/src/include/utils/sortsupport.h
--- b/src/include/utils/sortsupport.h
*************** ApplySortComparator(Datum datum1, bool i
*** 150,155 ****
--- 150,158 ----
  #endif   /*-- PG_USE_INLINE || SORTSUPPORT_INCLUDE_DEFINITIONS */
  
  /* Other functions in utils/sort/sortsupport.c */
+ extern SortSupport MakeSortSupportKeys(int nkeys, AttrNumber *attNums,
+ 					Oid *sortOperators, Oid *sortCollations,
+ 					bool *nullsFirstFlags);
  extern void PrepareSortSupportComparisonShim(Oid cmpFunc, SortSupport ssup);
  extern void PrepareSortSupportFromOrderingOp(Oid orderingOp, SortSupport ssup);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 2537883..195e6c1
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
***************
*** 24,29 ****
--- 24,30 ----
  #include "executor/tuptable.h"
  #include "fmgr.h"
  #include "utils/relcache.h"
+ #include "utils/sortsupport.h"
  
  
  /* Tuplesortstate is an opaque type whose details are not known outside
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 56e2c99..d0de260
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** ORDER BY thousand, tenthous;
*** 1323,1332 ****
   Merge Append
     Sort Key: tenk1.thousand, tenk1.tenthous
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1
!    ->  Sort
           Sort Key: tenk1_1.thousand, tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (6 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
--- 1323,1333 ----
   Merge Append
     Sort Key: tenk1.thousand, tenk1.tenthous
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1
!    ->  Partial sort
           Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (7 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
*************** ORDER BY x, y;
*** 1407,1416 ****
   Merge Append
     Sort Key: a.thousand, a.tenthous
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
!    ->  Sort
           Sort Key: b.unique2, b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (6 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
--- 1408,1418 ----
   Merge Append
     Sort Key: a.thousand, a.tenthous
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
!    ->  Partial sort
           Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (7 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
Attachment: partial-sort-merge-2.patch (application/octet-stream)
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
new file mode 100644
index 7bbad4f..c1590aa
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** sort_inner_and_outer(PlannerInfo *root,
*** 662,668 ****
  		cur_mergeclauses = find_mergeclauses_for_pathkeys(root,
  														  outerkeys,
  														  true,
! 														  mergeclause_list);
  
  		/* Should have used them all... */
  		Assert(list_length(cur_mergeclauses) == list_length(mergeclause_list));
--- 662,672 ----
  		cur_mergeclauses = find_mergeclauses_for_pathkeys(root,
  														  outerkeys,
  														  true,
! 														  mergeclause_list,
! 														  NULL,
! 														  NULL,
! 														  NULL,
! 														  outer_path->rows);
  
  		/* Should have used them all... */
  		Assert(list_length(cur_mergeclauses) == list_length(mergeclause_list));
*************** match_unsorted_outer(PlannerInfo *root,
*** 832,837 ****
--- 836,842 ----
  		List	   *mergeclauses;
  		List	   *innersortkeys;
  		List	   *trialsortkeys;
+ 		List	   *outersortkeys;
  		Path	   *cheapest_startup_inner;
  		Path	   *cheapest_total_inner;
  		int			num_sortkeys;
*************** match_unsorted_outer(PlannerInfo *root,
*** 937,943 ****
  		mergeclauses = find_mergeclauses_for_pathkeys(root,
  													  outerpath->pathkeys,
  													  true,
! 													  mergeclause_list);
  
  		/*
  		 * Done with this outer path if no chance for a mergejoin.
--- 942,952 ----
  		mergeclauses = find_mergeclauses_for_pathkeys(root,
  													  outerpath->pathkeys,
  													  true,
! 													  mergeclause_list,
! 													  joinrel,
! 													  &outersortkeys,
! 													  sjinfo,
! 													  outerpath->rows);
  
  		/*
  		 * Done with this outer path if no chance for a mergejoin.
*************** match_unsorted_outer(PlannerInfo *root,
*** 961,967 ****
  		/* Compute the required ordering of the inner path */
  		innersortkeys = make_inner_pathkeys_for_merge(root,
  													  mergeclauses,
! 													  outerpath->pathkeys);
  
  		/*
  		 * Generate a mergejoin on the basis of sorting the cheapest inner.
--- 970,976 ----
  		/* Compute the required ordering of the inner path */
  		innersortkeys = make_inner_pathkeys_for_merge(root,
  													  mergeclauses,
! 													  outersortkeys);
  
  		/*
  		 * Generate a mergejoin on the basis of sorting the cheapest inner.
*************** match_unsorted_outer(PlannerInfo *root,
*** 980,986 ****
  						   restrictlist,
  						   merge_pathkeys,
  						   mergeclauses,
! 						   NIL,
  						   innersortkeys);
  
  		/* Can't do anything else if inner path needs to be unique'd */
--- 989,995 ----
  						   restrictlist,
  						   merge_pathkeys,
  						   mergeclauses,
! 						   outersortkeys,
  						   innersortkeys);
  
  		/* Can't do anything else if inner path needs to be unique'd */
*************** match_unsorted_outer(PlannerInfo *root,
*** 1038,1044 ****
  		for (sortkeycnt = num_sortkeys; sortkeycnt > 0; sortkeycnt--)
  		{
  			Path	   *innerpath;
- 			List	   *newclauses = NIL;
  
  			/*
  			 * Look for an inner path ordered well enough for the first
--- 1047,1052 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1055,1073 ****
  				 compare_path_costs(innerpath, cheapest_total_inner,
  									TOTAL_COST) < 0))
  			{
- 				/* Found a cheap (or even-cheaper) sorted path */
- 				/* Select the right mergeclauses, if we didn't already */
- 				if (sortkeycnt < num_sortkeys)
- 				{
- 					newclauses =
- 						find_mergeclauses_for_pathkeys(root,
- 													   trialsortkeys,
- 													   false,
- 													   mergeclauses);
- 					Assert(newclauses != NIL);
- 				}
- 				else
- 					newclauses = mergeclauses;
  				try_mergejoin_path(root,
  								   joinrel,
  								   jointype,
--- 1063,1068 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1078,1086 ****
  								   innerpath,
  								   restrictlist,
  								   merge_pathkeys,
! 								   newclauses,
! 								   NIL,
! 								   NIL);
  				cheapest_total_inner = innerpath;
  			}
  			/* Same on the basis of cheapest startup cost ... */
--- 1073,1081 ----
  								   innerpath,
  								   restrictlist,
  								   merge_pathkeys,
! 								   mergeclauses,
! 								   outersortkeys,
! 								   innersortkeys);
  				cheapest_total_inner = innerpath;
  			}
  			/* Same on the basis of cheapest startup cost ... */
*************** match_unsorted_outer(PlannerInfo *root,
*** 1096,1119 ****
  				/* Found a cheap (or even-cheaper) sorted path */
  				if (innerpath != cheapest_total_inner)
  				{
- 					/*
- 					 * Avoid rebuilding clause list if we already made one;
- 					 * saves memory in big join trees...
- 					 */
- 					if (newclauses == NIL)
- 					{
- 						if (sortkeycnt < num_sortkeys)
- 						{
- 							newclauses =
- 								find_mergeclauses_for_pathkeys(root,
- 															   trialsortkeys,
- 															   false,
- 															   mergeclauses);
- 							Assert(newclauses != NIL);
- 						}
- 						else
- 							newclauses = mergeclauses;
- 					}
  					try_mergejoin_path(root,
  									   joinrel,
  									   jointype,
--- 1091,1096 ----
*************** match_unsorted_outer(PlannerInfo *root,
*** 1124,1132 ****
  									   innerpath,
  									   restrictlist,
  									   merge_pathkeys,
! 									   newclauses,
! 									   NIL,
! 									   NIL);
  				}
  				cheapest_startup_inner = innerpath;
  			}
--- 1101,1109 ----
  									   innerpath,
  									   restrictlist,
  									   merge_pathkeys,
! 									   mergeclauses,
! 									   outersortkeys,
! 									   innersortkeys);
  				}
  				cheapest_startup_inner = innerpath;
  			}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 6ac28c4..15daba2
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
*************** update_mergeclause_eclasses(PlannerInfo 
*** 1055,1060 ****
--- 1055,1079 ----
  		restrictinfo->right_ec = restrictinfo->right_ec->ec_merged;
  }
  
+ typedef struct
+ {
+ 	Selectivity		selec;
+ 	RestrictInfo   *rinfo;
+ } UnusedRestrictInfo;
+ 
+ static int
+ cmpUnusedRestrictInfo(const void *a1, const void *a2)
+ {
+ 	Selectivity s1 = ((const UnusedRestrictInfo *)a1)->selec;
+ 	Selectivity s2 = ((const UnusedRestrictInfo *)a2)->selec;
+ 
+ 	if (s1 < s2)
+ 		return 1;
+ 	if (s1 > s2)
+ 		return -1;
+ 	return 0;
+ }
+ 
  /*
   * find_mergeclauses_for_pathkeys
   *	  This routine attempts to find a set of mergeclauses that can be
*************** update_mergeclause_eclasses(PlannerInfo 
*** 1066,1071 ****
--- 1085,1091 ----
   *			FALSE if for inner.
   * 'restrictinfos' is a list of mergejoinable restriction clauses for the
   *			join relation being formed.
+  * 'outersortkeys' is additional pathkeys proposed to fit mergeclauses.
   *
   * The restrictinfos must be marked (via outer_is_left) to show which side
   * of each clause is associated with the current outer path.  (See
*************** List *
*** 1078,1087 ****
  find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos)
  {
  	List	   *mergeclauses = NIL;
  	ListCell   *i;
  
  	/* make sure we have eclasses cached in the clauses */
  	foreach(i, restrictinfos)
--- 1098,1115 ----
  find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos,
! 							   RelOptInfo *joinrel,
! 							   List **outersortkeys,
! 							   SpecialJoinInfo *sjinfo,
! 							   double rows)
  {
  	List	   *mergeclauses = NIL;
  	ListCell   *i;
+ 	bool	   *used = (bool *)palloc0(sizeof(bool) * list_length(restrictinfos));
+ 	int			k;
+ 	List	   *usedPathkeys = NIL;
+ 	Selectivity selec = 1.0, targetSelec = 2.0 / rows;
  
  	/* make sure we have eclasses cached in the clauses */
  	foreach(i, restrictinfos)
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1134,1139 ****
--- 1162,1168 ----
  		 * deal with the case in create_mergejoin_plan().
  		 *----------
  		 */
+ 		k = 0;
  		foreach(j, restrictinfos)
  		{
  			RestrictInfo *rinfo = (RestrictInfo *) lfirst(j);
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1146,1152 ****
--- 1175,1186 ----
  				clause_ec = rinfo->outer_is_left ?
  					rinfo->right_ec : rinfo->left_ec;
  			if (clause_ec == pathkey_ec)
+ 			{
+ 				selec *= clause_selectivity(root, (Node *)rinfo, 0, JOIN_INNER, sjinfo);
  				matched_restrictinfos = lappend(matched_restrictinfos, rinfo);
+ 				used[k] = true;
+ 			}
+ 			k++;
  		}
  
  		/*
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1157,1162 ****
--- 1191,1198 ----
  		if (matched_restrictinfos == NIL)
  			break;
  
+ 		usedPathkeys = lappend(usedPathkeys, pathkey);
+ 
  		/*
  		 * If we did find usable mergeclause(s) for this sort-key position,
  		 * add them to result list.
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 1164,1169 ****
--- 1200,1272 ----
  		mergeclauses = list_concat(mergeclauses, matched_restrictinfos);
  	}
  
+ 	/*
+ 	 * Try to fill outersortkeys if caller requires it.
+ 	 */
+ 	if (outersortkeys)
+ 	{
+ 		List *addPathkeys, *addMergeclauses, *addRestrictinfos = NIL;
+ 		UnusedRestrictInfo *unusedRestrictinfos;
+ 		int unusedRestrictinfosCount = 0, j;
+ 
+ 		*outersortkeys = pathkeys;
+ 
+ 		if (!mergeclauses || selec <= targetSelec)
+ 			return mergeclauses;
+ 
+ 		/*
+ 		 * Find restrictions unused by given pathkeys.
+ 		 */
+ 		unusedRestrictinfos = (UnusedRestrictInfo *)palloc(
+ 				sizeof(UnusedRestrictInfo) * list_length(restrictinfos));
+ 		k = 0;
+ 		foreach(i, restrictinfos)
+ 		{
+ 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(i);
+ 			if (!used[k])
+ 			{
+ 				unusedRestrictinfos[unusedRestrictinfosCount].rinfo = rinfo;
+ 				unusedRestrictinfos[unusedRestrictinfosCount++].selec =
+ 					clause_selectivity(root, (Node *)rinfo, 0, JOIN_INNER, sjinfo);
+ 			}
+ 			k++;
+ 		}
+ 		qsort(unusedRestrictinfos, unusedRestrictinfosCount,
+ 				sizeof(UnusedRestrictInfo), cmpUnusedRestrictInfo);
+ 		for (j = 0; j < unusedRestrictinfosCount; j++)
+ 		{
+ 			selec *= unusedRestrictinfos[j].selec;
+ 			addRestrictinfos = lappend(addRestrictinfos,
+ 											unusedRestrictinfos[j].rinfo);
+ 			if (selec <= targetSelec)
+ 				break;
+ 		}
+ 
+ 		if (!addRestrictinfos)
+ 			return mergeclauses;
+ 
+ 		/*
+ 		 * Generate pathkeys based on those restrictions.
+ 		 */
+ 		addPathkeys = select_outer_pathkeys_for_merge(root,
+ 				addRestrictinfos, joinrel);
+ 
+ 		if (!addPathkeys)
+ 			return mergeclauses;
+ 
+ 		/*
+ 		 * Do recursive call to find mergeclauses for additional proposed
+ 		 * pathkeys. We pass NULL to outersortkeys, so there is only one level
+ 		 * of recursion.
+ 		 */
+ 		addMergeclauses = find_mergeclauses_for_pathkeys(root,
+ 				addPathkeys, true, addRestrictinfos, NULL, NULL, NULL, rows);
+ 
+ 		*outersortkeys = list_concat(usedPathkeys, addPathkeys);
+ 		mergeclauses = list_concat(mergeclauses, addMergeclauses);
+ 
+ 	}
+ 
  	return mergeclauses;
  }
  
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 9179b4e..f4ccb3c
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** extern void update_mergeclause_eclasses(
*** 179,185 ****
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos);
  extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
  								List *mergeclauses,
  								RelOptInfo *joinrel);
--- 179,189 ----
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							   List *pathkeys,
  							   bool outer_keys,
! 							   List *restrictinfos,
! 							   RelOptInfo *joinrel,
! 							   List **outerpathkeys,
! 							   SpecialJoinInfo *sjinfo,
! 							   double rows);
  extern List *select_outer_pathkeys_for_merge(PlannerInfo *root,
  								List *mergeclauses,
  								RelOptInfo *joinrel);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
new file mode 100644
index 2501184..f9aed86
*** a/src/test/regress/expected/join.out
--- b/src/test/regress/expected/join.out
*************** select c.*,a.*,ss1.q1,ss2.q1,ss3.* from
*** 4198,4239 ****
      lateral (select q1, coalesce(ss1.x,q2) as y from int8_tbl d) ss2
    ) on c.q2 = ss2.q1,
    lateral (select * from int4_tbl i where ss2.y > f1) ss3;
!                                                QUERY PLAN                                                
! ---------------------------------------------------------------------------------------------------------
!  Nested Loop
     Output: c.q1, c.q2, a.q1, a.q2, b.q1, d.q1, i.f1
!    Join Filter: ((COALESCE((COALESCE(b.q2, (b2.f1)::bigint)), d.q2)) > i.f1)
!    ->  Hash Right Join
!          Output: c.q1, c.q2, a.q1, a.q2, b.q1, d.q1, (COALESCE((COALESCE(b.q2, (b2.f1)::bigint)), d.q2))
!          Hash Cond: (d.q1 = c.q2)
!          ->  Nested Loop
!                Output: a.q1, a.q2, b.q1, d.q1, (COALESCE((COALESCE(b.q2, (b2.f1)::bigint)), d.q2))
!                ->  Hash Right Join
!                      Output: a.q1, a.q2, b.q1, (COALESCE(b.q2, (b2.f1)::bigint))
!                      Hash Cond: (b.q1 = a.q2)
!                      ->  Nested Loop
!                            Output: b.q1, COALESCE(b.q2, (b2.f1)::bigint)
!                            Join Filter: (b.q1 < b2.f1)
!                            ->  Seq Scan on public.int8_tbl b
!                                  Output: b.q1, b.q2
!                            ->  Materialize
                                   Output: b2.f1
!                                  ->  Seq Scan on public.int4_tbl b2
!                                        Output: b2.f1
!                      ->  Hash
                             Output: a.q1, a.q2
!                            ->  Seq Scan on public.int8_tbl a
!                                  Output: a.q1, a.q2
!                ->  Seq Scan on public.int8_tbl d
!                      Output: d.q1, COALESCE((COALESCE(b.q2, (b2.f1)::bigint)), d.q2)
!          ->  Hash
!                Output: c.q1, c.q2
                 ->  Seq Scan on public.int8_tbl c
                       Output: c.q1, c.q2
!    ->  Materialize
!          Output: i.f1
!          ->  Seq Scan on public.int4_tbl i
!                Output: i.f1
  (34 rows)
  
  -- check processing of postponed quals (bug #9041)
--- 4198,4239 ----
      lateral (select q1, coalesce(ss1.x,q2) as y from int8_tbl d) ss2
    ) on c.q2 = ss2.q1,
    lateral (select * from int4_tbl i where ss2.y > f1) ss3;
!                                          QUERY PLAN                                          
! ---------------------------------------------------------------------------------------------
!  Hash Right Join
     Output: c.q1, c.q2, a.q1, a.q2, b.q1, d.q1, i.f1
!    Hash Cond: (d.q1 = c.q2)
!    Filter: ((COALESCE((COALESCE(b.q2, (b2.f1)::bigint)), d.q2)) > i.f1)
!    ->  Nested Loop
!          Output: a.q1, a.q2, b.q1, d.q1, (COALESCE((COALESCE(b.q2, (b2.f1)::bigint)), d.q2))
!          ->  Hash Right Join
!                Output: a.q1, a.q2, b.q1, (COALESCE(b.q2, (b2.f1)::bigint))
!                Hash Cond: (b.q1 = a.q2)
!                ->  Nested Loop
!                      Output: b.q1, COALESCE(b.q2, (b2.f1)::bigint)
!                      Join Filter: (b.q1 < b2.f1)
!                      ->  Seq Scan on public.int8_tbl b
!                            Output: b.q1, b.q2
!                      ->  Materialize
!                            Output: b2.f1
!                            ->  Seq Scan on public.int4_tbl b2
                                   Output: b2.f1
!                ->  Hash
!                      Output: a.q1, a.q2
!                      ->  Seq Scan on public.int8_tbl a
                             Output: a.q1, a.q2
!          ->  Seq Scan on public.int8_tbl d
!                Output: d.q1, COALESCE((COALESCE(b.q2, (b2.f1)::bigint)), d.q2)
!    ->  Hash
!          Output: c.q1, c.q2, i.f1
!          ->  Nested Loop
!                Output: c.q1, c.q2, i.f1
                 ->  Seq Scan on public.int8_tbl c
                       Output: c.q1, c.q2
!                ->  Materialize
!                      Output: i.f1
!                      ->  Seq Scan on public.int4_tbl i
!                            Output: i.f1
  (34 rows)
  
  -- check processing of postponed quals (bug #9041)
#62Peter Geoghegan
pg@heroku.com
In reply to: Alexander Korotkov (#60)
Re: PoC: Partial sort

On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Actually, higher cardinality skip columns are better. Sorting many smaller
groups is faster than sorting fewer larger groups of the same total size.
Also, with smaller groups you hit the limit more accurately (on average),
i.e. you sort a smaller total number of rows.

Higher cardinality leading key columns are better for what? Do you
mean they're better in that those cases are more sympathetic to your
patch, or better in the general sense that they'll perform better for
the user? The first example query that you chose to demonstrate your
patch had a leading, indexed column (column "v1") with much lower
cardinality/n_distinct than the column that had to be sorted on
(column "v2"). That was what my comments were based on.

When this feature is used, there will often be a significantly lower
cardinality in the leading, indexed column (as in your example).
Otherwise, the user might well have been happy to just order on the
indexed/leading column alone. That isn't definitely true, but it's
often true.

I'm not sure if that's worth it to more or less
duplicate heap_tuple_attr_equals() to save a "mere" n expensive
comparisons, but it's something to think about (actually, there are
probably less than even n comparisons in practice because there'll be
a limit).

Not correct. Smaller groups are not OK. Imagine that two representations of
the same skip column value exist. The index may return them in any order,
even alternating them one by one. In this case sorting on the other column
never takes place, while it should.

I think I explained this badly - it wouldn't be okay to make the
grouping smaller, but as you say we could tie-break with a proper
B-Tree support function 1 comparison to make sure we really had
reached the end of our grouping.

FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
first, so the use of the equality operator with text in mind that you
mention may soon not be useful at all. The evidence suggests that
memcmp() is so much cheaper than special preparatory NUL-termination +
strcoll() that we should always try it first when sorting text, even
when we have no good reason to think it will work. The memcmp() is
virtually free. And so, you see why it might be worth thinking about
this when we already have reasonable confidence that many comparisons
will indicate that datums are equal. Other datatypes will have
expensive "real" comparators, not just text or numeric.
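
As an illustration of the memcmp()-first idea, here is a minimal, self-contained C sketch. It is not the varstr_cmp() code: the function name text_cmp_memcmp_first() is made up for this example, it uses plain malloc() and the C library's strcoll() instead of anything from the backend, and it assumes the strings contain no embedded NUL bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Compare two byte strings that are NOT NUL-terminated, in the style of a
 * locale-aware text comparator.  Returns <0, 0 or >0.
 * (Hypothetical helper, for illustration only.)
 */
static int
text_cmp_memcmp_first(const char *a, size_t alen,
                      const char *b, size_t blen)
{
    char   *abuf;
    char   *bbuf;
    int     result;

    assert(a != NULL && b != NULL);

    /* Fast path: identical bytes are equal under any collation. */
    if (alen == blen && memcmp(a, b, alen) == 0)
        return 0;

    /* Slow path: strcoll() needs NUL-terminated copies. */
    abuf = malloc(alen + 1);
    bbuf = malloc(blen + 1);
    if (abuf == NULL || bbuf == NULL)
        abort();
    memcpy(abuf, a, alen);
    abuf[alen] = '\0';
    memcpy(bbuf, b, blen);
    bbuf[blen] = '\0';

    result = strcoll(abuf, bbuf);

    /* Break collation ties bytewise, so only identical strings are equal. */
    if (result == 0)
        result = strcmp(abuf, bbuf);

    free(abuf);
    free(bbuf);
    return result;
}
```

When most comparisons are between equal datums, as with presorted leading columns, nearly every call returns from the fast path without allocating or calling strcoll().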

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#63Peter Geoghegan
pg@heroku.com
In reply to: Alexander Korotkov (#61)
Re: PoC: Partial sort

Some quick comments on partial-sort-basic-2.patch:

*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
***************
*** 24,29 ****
--- 24,30 ----
#include "executor/tuptable.h"
#include "fmgr.h"
#include "utils/relcache.h"
+ #include "utils/sortsupport.h"

Why include sortsupport.h here?

I would like to see more comments, especially within ExecSort(). The
structure of that routine is quite unclear.

I don't really like this MakeSortSupportKeys() stuff within ExecSort():

! /* Support structures for cmpSortSkipCols - already sorted columns */
! if (skipCols)
! node->skipKeys = MakeSortSupportKeys(skipCols,
! plannode->sortColIdx,
! plannode->sortOperators,
! plannode->collations,
! plannode->nullsFirst);

+ /* Only pass on remaining columns that are unsorted */
tuplesortstate = tuplesort_begin_heap(tupDesc,
! plannode->numCols - skipCols,
! &(plannode->sortColIdx[skipCols]),
! &(plannode->sortOperators[skipCols]),
! &(plannode->collations[skipCols]),
! &(plannode->nullsFirst[skipCols]),
work_mem,
node->randomAccess);

You're calling the new function MakeSortSupportKeys() (which
encapsulates setting up sortsupport state for all attributes) twice;
first, to populate the skip keys (the indexed attribute(s)), and
second, when tuplesort_begin_heap() is called, so that there is state
for unindexed sort groups that must be manually sorted. That doesn't
seem great.

I think we might be better off if a tuplesort function was called
shortly after tuplesort_begin_heap() is called. How top-n heap sorts
work is something that largely lives in tuplesort's head. Today, we
call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
top-n heap sort applicable sort". I think that with this patch, we
should then hint (where applicable) "by the way, you won't actually be
required to sort those first n indexed attributes; rather, you can
expect to scan those in logical order. You may work the rest out
yourself, and may be clever about exploiting the sorted-ness of the
first few columns". The idea of managing a bunch of tiny sorts from
within ExecSort(), and calling the new function tuplesort_reset() seems
questionable. tuplesortstate is supposed to be private/opaque to
nodeSort.c, and the current design strains that.

I'd like to keep nodeSort.c simple. I think it's pretty clear that the
guts of this do not belong within ExecSort(), in any case. Also, the
additions there should be much better commented, wherever they finally
end up.

In this struct:

*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct SortState
*** 1670,1678 ****
--- 1670,1682 ----
bool        bounded;        /* is the result set bounded? */
int64       bound;          /* if bounded, how many tuples are needed */
bool        sort_Done;      /* sort completed yet? */
+   bool        finished;       /* fetching tuples from outer node
+                                  is finished ? */
bool        bounded_Done;   /* value of bounded we did the sort with */
int64       bound_Done;     /* value of bound we did the sort with */
void       *tuplesortstate; /* private state of tuplesort.c */
+   SortSupport skipKeys;       /* columns already sorted in input */
+   HeapTuple   prev;           /* previous tuple from outer node */
} SortState;

I think you should be clearer about the scope and duration of fields
like "finished", if this really belongs here. In general, there should
be some high-level comments about how the feature added by the patch
fits together, which there aren't right now.

That's all I have for now.

--
Peter Geoghegan


#64Alexander Korotkov
aekorotkov@gmail.com
In reply to: Peter Geoghegan (#62)
Re: PoC: Partial sort

On Sun, Sep 14, 2014 at 7:39 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Actually, higher cardinality skip columns are better. Sorting many smaller
groups is faster than sorting fewer larger groups of the same total size.
Also, with smaller groups you hit the limit more accurately (on average),
i.e. you sort a smaller total number of rows.

Higher cardinality leading key columns are better for what? Do you
mean they're better in that those cases are more sympathetic to your
patch, or better in the general sense that they'll perform better for
the user? The first example query that you chose to demonstrate your
patch had a leading, indexed column (column "v1") with much lower
cardinality/n_distinct than the column that had to be sorted on
(column "v2"). That was what my comments were based on.

When this feature is used, there will often be a significantly lower
cardinality in the leading, indexed column (as in your example).
Otherwise, the user might well have been happy to just order on the
indexed/leading column alone. That isn't definitely true, but it's
often true.

I just meant that higher cardinality makes partial sort cheaper. We may
have misunderstood each other because the notions of "high" and "low" are
relative. When the cardinality is 1, partial sort seems to be useless. When
the cardinality is small, it could still be cheaper to do sequential scan +
sort rather than index scan + partial sort. When the cardinality is close to
the number of rows then, as you mentioned, the user probably doesn't need to
sort by the rest of the columns. At least one exception is when the user
needs to force uniqueness of the order. So, we need to target something in
the middle between these two corner cases.
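
To make the cardinality point concrete, here is a small toy program that counts the comparisons qsort() performs when the same number of rows is sorted as many small groups versus a few large ones. It models only the sort step; the cost of the index scan versus a sequential scan is ignored. The helper count_group_sort_comparisons() and the pseudo-random data are assumptions made for this sketch, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static unsigned long comparisons;   /* incremented by the comparator */

static int
cmp_counting(const void *a, const void *b)
{
    int     x = *(const int *) a;
    int     y = *(const int *) b;

    comparisons++;
    return (x > y) - (x < y);
}

/*
 * Sort n pseudo-random values as `groups` equal-sized, independent groups
 * (standing in for runs of equal presorted-column values) and return the
 * number of comparisons performed.  n must be a multiple of groups.
 */
static unsigned long
count_group_sort_comparisons(size_t n, size_t groups)
{
    size_t          group_size;
    size_t          i;
    unsigned int    seed = 12345;
    int            *vals;

    assert(n > 0 && groups > 0 && n % groups == 0);

    vals = malloc(n * sizeof(int));
    if (vals == NULL)
        abort();

    /* Small deterministic LCG, so every run sees the same data. */
    for (i = 0; i < n; i++)
    {
        seed = seed * 1103515245u + 12345u;
        vals[i] = (int) ((seed >> 16) & 0x7fff);
    }

    comparisons = 0;
    group_size = n / groups;
    for (i = 0; i < n; i += group_size)
        qsort(vals + i, group_size, sizeof(int), cmp_counting);

    free(vals);
    return comparisons;
}
```

Calling it with the same n and an increasing number of groups should show the comparison count falling, which is the sense in which higher cardinality of the presorted columns makes the partial sort itself cheaper.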

I'm not sure if that's worth it to more or less
duplicate heap_tuple_attr_equals() to save a "mere" n expensive
comparisons, but it's something to think about (actually, there are
probably less than even n comparisons in practice because there'll be
a limit).

Not correct. Smaller groups are not OK. Imagine that two representations of
the same skip column value exist. The index may return them in any order,
even alternating them one by one. In this case sorting on the other column
never takes place, while it should.

I think I explained this badly - it wouldn't be okay to make the
grouping smaller, but as you say we could tie-break with a proper
B-Tree support function 1 comparison to make sure we really had
reached the end of our grouping.

FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
first, so the use of the equality operator with text in mind that you
mention may soon not be useful at all. The evidence suggests that
memcmp() is so much cheaper than special preparatory NUL-termination +
strcoll() that we should always try it first when sorting text, even
when we have no good reason to think it will work. The memcmp() is
virtually free. And so, you see why it might be worth thinking about
this when we already have reasonable confidence that many comparisons
will indicate that datums are equal. Other datatypes will have
expensive "real" comparators, not just text or numeric.

When strings are not equal, bttextcmp still needs to use strcoll, while
texteq doesn't need that. So there would still be an advantage to using the
equality operator over the comparison function: the equality operator
doesn't have to determine the order of unequal values. However, the real
size of this advantage depends on the cardinality of the presorted columns
as well.
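
A rough sketch of what an equality-only check buys when detecting group boundaries: unequal values are rejected by a length test or a memcmp() and never reach a collation-aware comparison. The names TextDatum, text_bytes_equal() and count_groups() are hypothetical. The sketch also assumes that two values are equal only when their bytes are identical; for a type where that does not hold, it would split groups incorrectly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a text value: bytes plus length, not NUL-terminated. */
typedef struct
{
    const char *data;
    size_t      len;
} TextDatum;

/* Equality only: no ordering is computed, so strcoll() is never needed. */
static bool
text_bytes_equal(const TextDatum *a, const TextDatum *b)
{
    assert(a != NULL && b != NULL);

    if (a->len != b->len)
        return false;           /* different lengths can never be equal */
    return memcmp(a->data, b->data, a->len) == 0;
}

/*
 * Count the groups in an array that is already ordered on this column:
 * a new group starts wherever a value differs from its predecessor.
 */
static size_t
count_groups(const TextDatum *vals, size_t n)
{
    size_t  groups = (n > 0) ? 1 : 0;
    size_t  i;

    for (i = 1; i < n; i++)
    {
        if (!text_bytes_equal(&vals[i - 1], &vals[i]))
            groups++;
    }
    return groups;
}
```

With n input rows this performs n - 1 equality checks, and each mismatch costs at most one memcmp(), whatever the collation.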

------
With best regards,
Alexander Korotkov.

#65Alexander Korotkov
aekorotkov@gmail.com
In reply to: Peter Geoghegan (#63)
Re: PoC: Partial sort

On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan <pg@heroku.com> wrote:

I think we might be better off if a tuplesort function was called
shortly after tuplesort_begin_heap() is called. How top-n heap sorts
work is something that largely lives in tuplesort's head. Today, we
call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
top-n heap sort applicable sort". I think that with this patch, we
should then hint (where applicable) "by the way, you won't actually be
required to sort those first n indexed attributes; rather, you can
expect to scan those in logical order. You may work the rest out
yourself, and may be clever about exploiting the sorted-ness of the
first few columns". The idea of managing a bunch of tiny sorts from
within ExecSort(), and calling the new function tuplesort_reset() seems
questionable. tuplesortstate is supposed to be private/opaque to
nodeSort.c, and the current design strains that.

I'd like to keep nodeSort.c simple. I think it's pretty clear that the
guts of this do not belong within ExecSort(), in any case. Also, the
additions there should be much better commented, wherever they finally
end up.

As I understand it, you propose to encapsulate the partial sort algorithm in
tuplesort. However, in that case we would still need a significant change to
its interface: letting tuplesort decide when it is able to return a tuple.
Otherwise, we would miss a significant part of the LIMIT clause
optimization. tuplesort_set_bound() can't solve all the cases. There could
be other planner nodes between the partial sort and LIMIT.
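
The LIMIT benefit can be sketched with a toy, array-based version of the algorithm: rows arrive ordered by the first column, each run of equal values is sorted on the second column, and work stops once enough rows are in final order. This is only an illustration under simplifying assumptions (integer columns, input already in memory); the Row type and partial_sort_limit() are invented for the example and are not the patch's executor code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy row: input arrives already ordered by v1 (e.g. from an index scan). */
typedef struct
{
    int     v1;
    int     v2;
} Row;

static int
cmp_v2(const void *a, const void *b)
{
    const Row  *ra = a;
    const Row  *rb = b;

    return (ra->v2 > rb->v2) - (ra->v2 < rb->v2);
}

/*
 * Put rows[] into (v1, v2) order in place, one v1-group at a time, and stop
 * as soon as at least `limit` rows are in final order.  Returns how many
 * leading rows are now fully sorted; rows past that point are untouched.
 */
static size_t
partial_sort_limit(Row *rows, size_t n, size_t limit)
{
    size_t  start = 0;

    assert(rows != NULL || n == 0);

    while (start < n && start < limit)
    {
        size_t  end = start + 1;

        /* A group ends where the presorted column changes. */
        while (end < n && rows[end].v1 == rows[start].v1)
            end++;

        /* Only the remaining column needs sorting within the group. */
        qsort(rows + start, end - start, sizeof(Row), cmp_v2);
        start = end;
    }
    return start;
}
```

Each completed group can be handed to the parent node immediately, which is why the decision about when a tuple may be returned has to live wherever the group boundaries are detected.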

------
With best regards,
Alexander Korotkov.

#66Andreas Karlsson
andreas@proxel.se
In reply to: Alexander Korotkov (#65)
Re: PoC: Partial sort

On 09/15/2014 01:58 PM, Alexander Korotkov wrote:

On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan <pg@heroku.com> wrote:

I think we might be better off if a tuplesort function was called
shortly after tuplesort_begin_heap() is called. How top-n heap sorts
work is something that largely lives in tuplesort's head. Today, we
call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
top-n heap sort applicable sort". I think that with this patch, we
should then hint (where applicable) "by the way, you won't actually be
required to sort those first n indexed attributes; rather, you can
expect to scan those in logical order. You may work the rest out
yourself, and may be clever about exploiting the sorted-ness of the
first few columns". The idea of managing a bunch of tiny sorts from
within ExecSort(), and calling the new function tuplesort_reset() seems
questionable. tuplesortstate is supposed to be private/opaque to
nodeSort.c, and the current design strains that.

I'd like to keep nodeSort.c simple. I think it's pretty clear that the
guts of this do not belong within ExecSort(), in any case. Also, the
additions there should be much better commented, wherever they finally
end up.

As I understand it, you propose to encapsulate the partial sort algorithm in
tuplesort. However, in that case we would still need a significant change to
its interface: letting tuplesort decide when it is able to return a tuple.
Otherwise, we would miss a significant part of the LIMIT clause
optimization. tuplesort_set_bound() can't solve all the cases. There could
be other planner nodes between the partial sort and LIMIT.

Hi,

Are you planning to work on this patch for 9.6?

I generally agree with Peter that the changes to the sorting probably
belong in the tuplesort code rather than in the executor. This way it
should also be theoretically possible to support mark/restore.

Andreas


#67Peter Geoghegan
pg@heroku.com
In reply to: Andreas Karlsson (#66)
Re: PoC: Partial sort

On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andreas@proxel.se> wrote:

Are you planning to work on this patch for 9.6?

FWIW I hope so. It's a nice patch.

--
Peter Geoghegan


#68Alexander Korotkov
aekorotkov@gmail.com
In reply to: Peter Geoghegan (#67)
2 attachment(s)
Re: PoC: Partial sort

On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andreas@proxel.se>
wrote:

Are you planning to work on this patch for 9.6?

FWIW I hope so. It's a nice patch.

I'm trying to blow the dust off it. A rebased version of the patch is
attached. This patch isn't passing the regression tests because of plan
changes. I'm not yet sure about those changes: why do they happen, and are
they really regressions? Since I'm not very familiar with the planning of
INSERT ON CONFLICT and RLS, any help is appreciated.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

partial-sort-basic-3.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 7fb8a14..05cc125
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_grouping_set_keys(PlanS
*** 88,94 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 88,94 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** ExplainNode(PlanState *planstate, List *
*** 901,907 ****
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
--- 901,910 ----
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			if (((Sort *) plan)->skipCols > 0)
! 				pname = sname = "Partial sort";
! 			else
! 				pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
*************** show_sort_keys(SortState *sortstate, Lis
*** 1756,1762 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1759,1765 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_merge_append_keys(MergeAppendState 
*** 1772,1778 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1775,1781 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1796,1802 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1799,1805 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 1852,1858 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 1855,1861 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 1909,1915 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 1912,1918 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 1922,1934 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 1925,1938 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 1968,1976 ****
--- 1972,1984 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 163650c..b44ce69
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
*************** ExecRestrPos(PlanState *node)
*** 382,388 ****
   * know which plan types support mark/restore.
   */
  bool
! ExecSupportsMarkRestore(Path *pathnode)
  {
  	/*
  	 * For consistency with the routines above, we do not examine the nodeTag
--- 382,388 ----
   * know which plan types support mark/restore.
   */
  bool
! ExecSupportsMarkRestore(Path *pathnode, Plan *node)
  {
  	/*
  	 * For consistency with the routines above, we do not examine the nodeTag
*************** ExecSupportsMarkRestore(Path *pathnode)
*** 394,402 ****
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
- 		case T_Sort:
  			return true;
  
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
--- 394,410 ----
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
  			return true;
  
+ 		case T_Sort:
+ 			/* We shouldn't reach here without having plan node */
+ 			Assert(node);
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (node && ((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
*************** ExecSupportsBackwardScan(Plan *node)
*** 498,507 ****
  			return false;
  
  		case T_Material:
- 		case T_Sort:
  			/* these don't evaluate tlist */
  			return true;
  
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
--- 506,521 ----
  			return false;
  
  		case T_Material:
  			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
*************** IndexSupportsBackwardScan(Oid indexid)
*** 567,573 ****
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype)
  {
  	switch (plantype)
  	{
--- 581,587 ----
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype, Plan *node)
  {
  	switch (plantype)
  	{
*************** ExecMaterializesOutput(NodeTag plantype)
*** 575,583 ****
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
- 		case T_Sort:
  			return true;
  
  		default:
  			break;
  	}
--- 589,605 ----
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
  			return true;
  
+ 		case T_Sort:
+ 			/* We shouldn't reach here without having plan node */
+ 			Assert(node);
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (node && ((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		default:
  			break;
  	}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index af1dccf..7345fcb
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,109 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
+ #include "utils/lsyscache.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ static void
+ prepareSkipCols(Sort *plannode, SortState *node)
+ {
+ 	int skipCols = plannode->skipCols, i;
+ 
+ 	node->skipKeys = (SkipKeyData *)palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 											plannode->sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 126,136 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+ 	PlanState  *outerNode;
+ 	TupleDesc	tupDesc;
+ 	int			skipCols = plannode->skipCols;
+ 	int64		nTuples = 0;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,132 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
  
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols,
! 											  plannode->sortColIdx,
! 											  plannode->sortOperators,
! 											  plannode->collations,
! 											  plannode->nullsFirst,
  											  work_mem,
  											  node->randomAccess);
- 		if (node->bounded)
- 			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
! 		{
! 			slot = ExecProcNode(outerNode);
  
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
  	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
--- 143,286 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
! 	if (node->tuplesortstate != NULL)
! 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
! 	else
! 	{
! 		/* Support structures for cmpSortSkipCols - already sorted columns */
! 		if (skipCols)
! 			prepareSkipCols(plannode, node);
  
+ 		/* Only pass on remaining columns that are unsorted */
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols - skipCols,
! 											  &(plannode->sortColIdx[skipCols]),
! 											  &(plannode->sortOperators[skipCols]),
! 											  &(plannode->collations[skipCols]),
! 											  &(plannode->nullsFirst[skipCols]),
  											  work_mem,
  											  node->randomAccess);
  		node->tuplesortstate = (void *) tuplesortstate;
+ 	}
  
! 	if (node->bounded)
! 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
  
! 	/*
! 	 * Put next group of tuples where "skipCols" sort values are equal to
! 	 * tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
+ 		if (skipCols == 0)
+ 		{
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
+ 			nTuples++;
  		}
+ 		else if (node->prev)
+ 		{
+ 			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
+ 			nTuples++;
  
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				bool cmp;
! 				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
! 				node->prev = ExecCopySlotTuple(slot);
! 				if (!cmp)
! 					break;
! 			}
! 		}
! 		else
! 		{
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				node->prev = ExecCopySlotTuple(slot);
! 			}
! 		}
! 	}
  
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
  
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 
! 	/*
! 	 * Adjust bound_Done with number of tuples we've actually sorted.
! 	 */
! 	if (node->bounded)
! 	{
! 		if (node->finished)
! 			node->bound_Done = node->bound;
! 		else
! 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
  	}
  
+ 	SO1_printf("ExecSort: %s\n", "sorting done");
+ 
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
*************** ExecInitSort(Sort *node, EState *estate,
*** 157,162 ****
--- 311,325 ----
  			   "initializing sort node");
  
  	/*
+ 	 * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+ 	 * or EXEC_FLAG_MARK, because we hold only current bucket in
+ 	 * tuplesortstate.
+ 	 */
+ 	Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+ 											 EXEC_FLAG_BACKWARD |
+ 											 EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
  	 * create state structure
  	 */
  	sortstate = makeNode(SortState);
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 337,346 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
+ 	sortstate->bound_Done = 0;
  
  	/*
  	 * Miscellaneous initialization
*************** ExecReScanSort(SortState *node)
*** 318,323 ****
--- 484,490 ----
  		node->sort_Done = false;
  		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
  		node->tuplesortstate = NULL;
+ 		node->bound_Done = 0;
  
  		/*
  		 * if chgParam of subnode is not null then plan will be re-scanned by
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index c176ff9..c916072
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 826,831 ****
--- 826,832 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 1b61fd9..a6c1c22
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Plan *runion, Plan 
*** 1381,1395 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1381,1402 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1419,1431 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1426,1472 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate number of groups which dataset is divided by presorted keys.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List *groupExprs = NIL;
! 		ListCell *l;
! 		int i = 0;
! 
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			groupExprs = lappend(groupExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		num_groups = estimate_num_groups(root, groupExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate average cost of one group sorting
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1435,1441 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1476,1482 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1446,1455 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1487,1496 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1457,1471 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
  	 * doesn't do qual-checking or projection, so it has less overhead than
--- 1498,1523 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
  	/*
+ 	 * We have to sort the first group before the node can start producing
+ 	 * output. Sorting the remaining groups is required to return all tuples.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 	startup_cost += input_run_cost / num_groups;
+ 	run_cost += input_run_cost * ((num_groups - 1.0) / num_groups);
+ 
+ 	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
  	 * doesn't do qual-checking or projection, so it has less overhead than
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2205,2210 ****
--- 2257,2264 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->parent->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2231,2236 ****
--- 2285,2292 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->parent->width,
*************** final_cost_mergejoin(PlannerInfo *root, 
*** 2442,2448 ****
  	 * it off does not entitle us to deliver an invalid plan.
  	 */
  	else if (innersortkeys == NIL &&
! 			 !ExecSupportsMarkRestore(inner_path))
  		path->materialize_inner = true;
  
  	/*
--- 2498,2504 ----
  	 * it off does not entitle us to deliver an invalid plan.
  	 */
  	else if (innersortkeys == NIL &&
! 			 !ExecSupportsMarkRestore(inner_path, NULL))
  		path->materialize_inner = true;
  
  	/*
*************** cost_subplan(PlannerInfo *root, SubPlan 
*** 2956,2962 ****
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan)))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
--- 3012,3018 ----
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan), plan))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
new file mode 100644
index a35c881..7b762a6
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** match_unsorted_outer(PlannerInfo *root,
*** 870,882 ****
  	}
  	else if (nestjoinOK)
  	{
  		/*
  		 * Consider materializing the cheapest inner path, unless
  		 * enable_material is off or the path in question materializes its
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
--- 870,885 ----
  	}
  	else if (nestjoinOK)
  	{
+ 		if (inner_cheapest_total && inner_cheapest_total->pathtype == T_Sort)
+ 			elog(ERROR, "Sort");
+ 
  		/*
  		 * Consider materializing the cheapest inner path, unless
  		 * enable_material is off or the path in question materializes its
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype, NULL))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index c6b5d78..490f343
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 21,31 ****
--- 21,33 ----
  #include "nodes/makefuncs.h"
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
+ #include "optimizer/cost.h"
  #include "optimizer/clauses.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static PathKey *make_canonical_pathkey(PlannerInfo *root,
*************** compare_pathkeys(List *keys1, List *keys
*** 312,317 ****
--- 314,345 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 369,377 ****
  }
  
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
   *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
--- 397,432 ----
  }
  
  /*
+  * Compare the costs of two paths, assuming a different fraction of tuples is
+  * returned from each path.
+  */
+ static int
+ compare_bifractional_path_costs(Path *path1, Path *path2,
+ 							  double fraction1, double fraction2)
+ {
+ 	Cost		cost1,
+ 				cost2;
+ 
+ 	if (fraction1 <= 0.0 || fraction1 >= 1.0 ||
+ 			fraction2 <= 0.0 || fraction2 >= 1.0)
+ 		return compare_path_costs(path1, path2, TOTAL_COST);
+ 	cost1 = path1->startup_cost +
+ 		fraction1 * (path1->total_cost - path1->startup_cost);
+ 	cost2 = path2->startup_cost +
+ 		fraction2 * (path2->total_cost - path2->startup_cost);
+ 	if (cost1 < cost2)
+ 		return -1;
+ 	if (cost1 > cost2)
+ 		return +1;
+ 	return 0;
+ }
+ 
+ /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies given parameterization and at least partially
!  *	  satisfies the given pathkeys. Compares paths according to different
!  *	  fraction of tuples be extracted to start with partial sort.
   *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
*************** Path *
*** 386,411 ****
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction)
  {
  	Path	   *matched_path = NULL;
  	ListCell   *l;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
  
  		/*
! 		 * Since cost comparison is a lot cheaper than pathkey comparison, do
! 		 * that first.  (XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
! 			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
--- 441,524 ----
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *num_groups, matched_fraction;
+ 	int			i;
+ 
+ 	/*
+ 	 * Estimate the number of groups for each possible partial sort prefix.
+ 	 */
+ 	i = 0;
+ 	num_groups = (double *)palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		num_groups[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
+ 		double		current_fraction;
+ 
+ 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+ 		if (n_common_pathkeys < matched_n_common_pathkeys ||
+ 				n_common_pathkeys == 0)
+ 			continue;
  
  		/*
! 		 * Estimate the fraction of outer tuples that must be fetched before
! 		 * the partial sort can start returning tuples.
  		 */
! 		current_fraction = fraction;
! 		if (n_common_pathkeys < n_pathkeys)
! 		{
! 			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
! 			current_fraction = Min(current_fraction, 1.0);
! 		}
  
! 		/*
! 		 * Do cost comparison.
! 		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_bifractional_path_costs(matched_path, path,
! 					matched_fraction, current_fraction);
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
! 
! 		/*
! 		 * Always prefer the path with the largest number of common pathkeys.
! 		 */
! 		if ((
! 				n_common_pathkeys > matched_n_common_pathkeys
! 				||	(n_common_pathkeys == matched_n_common_pathkeys
! 					 && costs_cmp > 0)) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 			matched_fraction = current_fraction;
+ 		}
  	}
  	return matched_path;
  }
*************** right_merge_direction(PlannerInfo *root,
*** 1450,1458 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
--- 1563,1570 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of leading pathkeys that match the given argument.
!  * The remaining ones can be satisfied by a partial sort.
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
*************** pathkeys_useful_for_ordering(PlannerInfo
*** 1463,1475 ****
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
! 	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
! 	}
! 
! 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1575,1586 ----
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	/*
! 	 * Return the number of pathkeys in common, or 0 if there are none.  Any
! 	 * leading common pathkeys can be useful for ordering because the rest
! 	 * can be handled by a partial sort.
! 	 */
! 	return pathkeys_common(root->query_pathkeys, pathkeys);
  }
  
  /*
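The cost comparison added to pathkeys.c above can be sketched on its own, outside the planner. This is only an illustration, not code from the patch: SketchPath and the sketch_* functions are hypothetical names, the struct carries just the two cost fields the comparison reads, and the total-cost fallback is a simplified stand-in for compare_path_costs(path1, path2, TOTAL_COST).

```c
#include <assert.h>

/* Hypothetical stand-in for the two Path fields the comparison reads. */
typedef struct SketchPath
{
	double		startup_cost;
	double		total_cost;
} SketchPath;

/* Cost of fetching the given fraction of a path's output. */
static double
sketch_fractional_cost(const SketchPath *path, double fraction)
{
	return path->startup_cost +
		fraction * (path->total_cost - path->startup_cost);
}

/*
 * Compare two paths when each must deliver a different fraction of its
 * tuples.  If either fraction is outside (0, 1), compare total costs and
 * break ties on startup cost (an assumed simplification of the planner's
 * total-cost comparison).
 */
static int
sketch_compare_bifractional(const SketchPath *path1, const SketchPath *path2,
							double fraction1, double fraction2)
{
	double		cost1,
				cost2;

	if (fraction1 <= 0.0 || fraction1 >= 1.0 ||
		fraction2 <= 0.0 || fraction2 >= 1.0)
	{
		if (path1->total_cost != path2->total_cost)
			return (path1->total_cost < path2->total_cost) ? -1 : +1;
		if (path1->startup_cost != path2->startup_cost)
			return (path1->startup_cost < path2->startup_cost) ? -1 : +1;
		return 0;
	}
	cost1 = sketch_fractional_cost(path1, fraction1);
	cost2 = sketch_fractional_cost(path2, fraction2);
	if (cost1 < cost2)
		return -1;
	if (cost1 > cost2)
		return +1;
	return 0;
}

/*
 * Fraction of input a path must supply before output can begin.  A path
 * matching only a prefix of the requested keys has to read one whole group
 * (about 1/num_groups of the input) on top of the requested fraction.
 */
static double
sketch_partial_sort_fraction(double fraction, double num_groups,
							 int n_common_pathkeys, int n_pathkeys)
{
	assert(num_groups >= 1.0);
	if (n_common_pathkeys < n_pathkeys)
	{
		fraction += 1.0 / num_groups;
		if (fraction > 1.0)
			fraction = 1.0;
	}
	return fraction;
}
```

With one matching key out of two and an estimated 10000 groups, a requested fraction of 0.01 becomes 0.0101, so a partially ordered path is charged only slightly more input than a fully ordered one.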
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 791b64e..4d610cf
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 163,168 ****
--- 163,169 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
  static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples);
*************** create_merge_append_plan(PlannerInfo *ro
*** 810,815 ****
--- 811,817 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		subplan = create_plan_recurse(root, subpath);
*************** create_merge_append_plan(PlannerInfo *ro
*** 843,850 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
--- 845,854 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
+ 										 pathkeys, n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2436,2444 ****
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									outer_plan,
! 									best_path->outersortkeys,
! 									-1.0);
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
--- 2440,2450 ----
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								outer_plan,
! 								best_path->outersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys));
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2449,2457 ****
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									inner_plan,
! 									best_path->innersortkeys,
! 									-1.0);
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
--- 2455,2465 ----
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								inner_plan,
! 								best_path->innersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys));
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
*************** make_mergejoin(List *tlist,
*** 4021,4026 ****
--- 4029,4035 ----
   */
  static Sort *
  make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples)
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 4030,4036 ****
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4039,4046 ----
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, pathkeys, skipCols,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 4044,4049 ****
--- 4054,4060 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 4372,4378 ****
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 4383,4389 ----
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(PlannerInfo *roo
*** 4392,4398 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
--- 4403,4409 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
*************** make_sort_from_sortclauses(PlannerInfo *
*** 4435,4441 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4446,4452 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, NIL, 0,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
*************** Sort *
*** 4457,4463 ****
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 4468,4475 ----
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 List *pathkeys, int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(PlannerInfo *ro
*** 4490,4496 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4502,4508 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
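Throughout the createplan.c changes above, the old yes/no test pathkeys_contained_in() gives way to a count from pathkeys_common(), which is passed to the Sort node as skipCols. A rough model of that idea follows. It is not the patch's implementation, which is defined elsewhere in the patch and works on lists of PathKey pointers. Here keys are plain ints, and sketch_pathkeys_common and sketch_needs_sort are hypothetical names.

```c
#include <assert.h>

/*
 * Count how many leading keys of the requested ordering are already
 * provided by the input ordering.  Keys are modeled as ints (an assumption
 * made for illustration only).
 */
static int
sketch_pathkeys_common(const int *wanted, int n_wanted,
					   const int *provided, int n_provided)
{
	int			n = 0;

	assert(n_wanted >= 0 && n_provided >= 0);
	while (n < n_wanted && n < n_provided && wanted[n] == provided[n])
		n++;
	return n;
}

/*
 * A Sort node is still needed when the common prefix is shorter than the
 * requested key list; the prefix length is what the patch calls skipCols.
 */
static int
sketch_needs_sort(int n_common_pathkeys, int n_wanted)
{
	return n_common_pathkeys < n_wanted;
}
```

In the thread's example, an input ordered by (v1) and a request for (v1, v2) share a prefix of length 1, so a sort is still inserted, but it only has to order tuples within each group of equal v1.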
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index a761cfd..87efec2
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
*************** build_minmax_path(PlannerInfo *root, Min
*** 505,511 ****
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction);
  	if (!sorted_path)
  		return false;
  
--- 505,513 ----
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction,
! 												  subroot,
! 												  final_rel->rows);
  	if (!sorted_path)
  		return false;
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index e1ee67c..906e8e7
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** static Plan *build_grouping_chain(Planne
*** 133,139 ****
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan);
  
  /*****************************************************************************
   *
--- 133,141 ----
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan,
! 					 List *path_keys,
! 					 int n_common_pathkeys);
  
  /*****************************************************************************
   *
*************** grouping_planner(PlannerInfo *root, doub
*** 1752,1758 ****
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
--- 1754,1762 ----
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction,
! 													  root,
! 													  path_rows);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
*************** grouping_planner(PlannerInfo *root, doub
*** 1768,1777 ****
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
  
  			if (root->query_pathkeys == NIL ||
! 				pathkeys_contained_in(root->query_pathkeys,
! 									  cheapest_path->pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
--- 1772,1785 ----
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
+ 			Path		partial_sort_path;	/* dummy for result of cost_sort */
+ 			int			n_common_pathkeys;
+ 
+ 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
+ 												cheapest_path->pathkeys);
  
  			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
*************** grouping_planner(PlannerInfo *root, doub
*** 1781,1792 ****
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			if (compare_fractional_path_costs(sorted_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
--- 1789,1823 ----
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
+ 						  n_common_pathkeys,
+ 						  cheapest_path->startup_cost,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
! 												sorted_path->pathkeys);
! 
! 			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
! 			{
! 				/* No sort needed for presorted path */
! 				partial_sort_path.startup_cost = sorted_path->startup_cost;
! 				partial_sort_path.total_cost = sorted_path->total_cost;
! 			}
! 			else
! 			{
! 				/* Figure cost for sorting */
! 				cost_sort(&partial_sort_path, root, root->query_pathkeys,
! 						  n_common_pathkeys,
! 						  sorted_path->startup_cost,
! 						  sorted_path->total_cost,
! 						  path_rows, path_width,
! 						  0.0, work_mem, root->limit_tuples);
! 			}
! 
! 			if (compare_fractional_path_costs(&partial_sort_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
*************** grouping_planner(PlannerInfo *root, doub
*** 1878,1890 ****
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
  			if (parse->groupClause && !use_hashed_grouping &&
! 			  !pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
--- 1909,1924 ----
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
+ 			int			n_common_pathkeys_grouping;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
+ 			n_common_pathkeys_grouping = pathkeys_common(root->group_pathkeys,
+ 														 current_pathkeys);
  			if (parse->groupClause && !use_hashed_grouping &&
! 				n_common_pathkeys_grouping < list_length(root->group_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
*************** grouping_planner(PlannerInfo *root, doub
*** 2010,2016 ****
  												   groupColIdx,
  												   &agg_costs,
  												   numGroups,
! 												   result_plan);
  
  				/*
  				 * these are destroyed by build_grouping_chain, so make sure
--- 2044,2052 ----
  												   groupColIdx,
  												   &agg_costs,
  												   numGroups,
! 												   result_plan,
! 												   root->group_pathkeys,
! 												   n_common_pathkeys_grouping);
  
  				/*
  				 * these are destroyed by build_grouping_chain, so make sure
*************** grouping_planner(PlannerInfo *root, doub
*** 2034,2040 ****
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan);
  					current_pathkeys = root->group_pathkeys;
  				}
  
--- 2070,2078 ----
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan,
! 												 root->group_pathkeys,
! 												 n_common_pathkeys_grouping);
  					current_pathkeys = root->group_pathkeys;
  				}
  
*************** grouping_planner(PlannerInfo *root, doub
*** 2172,2184 ****
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0);
! 					if (!pathkeys_contained_in(window_pathkeys,
! 											   current_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
--- 2210,2226 ----
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
+ 					int			n_common_pathkeys;
+ 
+ 					n_common_pathkeys = pathkeys_common(window_pathkeys,
+ 													    current_pathkeys);
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0,
! 														n_common_pathkeys);
! 					if (n_common_pathkeys < list_length(window_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
*************** grouping_planner(PlannerInfo *root, doub
*** 2325,2343 ****
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					current_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					current_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 current_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															current_pathkeys,
! 															   -1.0);
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
--- 2367,2387 ----
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					needed_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					needed_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 needed_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															   needed_pathkeys,
! 															   -1.0,
! 							pathkeys_common(needed_pathkeys, current_pathkeys));
! 				current_pathkeys = needed_pathkeys;
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
*************** grouping_planner(PlannerInfo *root, doub
*** 2353,2364 ****
  	 */
  	if (parse->sortClause)
  	{
! 		if (!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
--- 2397,2411 ----
  	 */
  	if (parse->sortClause)
  	{
! 		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
! 		
! 		if (common < list_length(root->sort_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples,
! 														   common);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
*************** build_grouping_chain(PlannerInfo *root,
*** 2462,2468 ****
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan)
  {
  	AttrNumber *top_grpColIdx = groupColIdx;
  	List	   *chain = NIL;
--- 2509,2517 ----
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan,
! 					 List *path_keys,
! 					 int n_common_pathkeys)
  {
  	AttrNumber *top_grpColIdx = groupColIdx;
  	List	   *chain = NIL;
*************** build_grouping_chain(PlannerInfo *root,
*** 2483,2489 ****
  			make_sort_from_groupcols(root,
  									 llast(rollup_groupclauses),
  									 top_grpColIdx,
! 									 result_plan);
  	}
  
  	/*
--- 2532,2540 ----
  			make_sort_from_groupcols(root,
  									 llast(rollup_groupclauses),
  									 top_grpColIdx,
! 									 result_plan,
! 									 path_keys,
! 									 n_common_pathkeys);
  	}
  
  	/*
*************** build_grouping_chain(PlannerInfo *root,
*** 2507,2513 ****
  			make_sort_from_groupcols(root,
  									 groupClause,
  									 new_grpColIdx,
! 									 result_plan);
  
  		/*
  		 * sort_plan includes the cost of result_plan over again, which is not
--- 2558,2566 ----
  			make_sort_from_groupcols(root,
  									 groupClause,
  									 new_grpColIdx,
! 									 result_plan,
! 									 NIL,
! 									 0);
  
  		/*
  		 * sort_plan includes the cost of result_plan over again, which is not
*************** choose_hashed_grouping(PlannerInfo *root
*** 3623,3628 ****
--- 3676,3682 ----
  	List	   *current_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * Executor doesn't support hashed aggregation with DISTINCT or ORDER BY
*************** choose_hashed_grouping(PlannerInfo *root
*** 3704,3710 ****
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3758,3765 ----
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_grouping(PlannerInfo *root
*** 3720,3728 ****
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 	if (!pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
--- 3775,3786 ----
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 
! 	n_common_pathkeys = pathkeys_common(root->group_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(root->group_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
*************** choose_hashed_grouping(PlannerInfo *root
*** 3737,3746 ****
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
  	/* The Agg or Group node will preserve ordering */
! 	if (target_pathkeys &&
! 		!pathkeys_contained_in(target_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3795,3806 ----
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
+ 
  	/* The Agg or Group node will preserve ordering */
! 	n_common_pathkeys = pathkeys_common(target_pathkeys, current_pathkeys);
! 	if (target_pathkeys && n_common_pathkeys < list_length(target_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 3793,3798 ****
--- 3853,3859 ----
  	List	   *needed_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * If we have a sortable DISTINCT ON clause, we always use sorting. This
*************** choose_hashed_distinct(PlannerInfo *root
*** 3858,3864 ****
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3919,3926 ----
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 3875,3897 ****
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 	if (!pathkeys_contained_in(needed_pathkeys, current_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
  	if (parse->sortClause &&
! 		!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3937,3966 ----
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 
! 	n_common_pathkeys = pathkeys_common(needed_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(needed_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
+ 
+ 
+ 	n_common_pathkeys = pathkeys_common(root->sort_pathkeys, current_pathkeys);
  	if (parse->sortClause &&
! 		n_common_pathkeys < list_length(root->sort_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 4681,4688 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 4750,4758 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 82414d4..474f20e
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** build_subplan(PlannerInfo *root, Plan *p
*** 823,829 ****
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan)))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
--- 823,829 ----
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan), plan))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index 8884fb1..4e84de5
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 863,869 ****
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 863,870 ----
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 1895a68..d895907
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** create_merge_append_path(PlannerInfo *ro
*** 996,1005 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 996,1006 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1013,1018 ****
--- 1014,1021 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->parent->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1229,1235 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
--- 1232,1239 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index d532e87..e12624e
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** tuplesort_end(Tuplesortstate *state)
*** 1068,1073 ****
--- 1068,1093 ----
  	MemoryContextDelete(state->sortcontext);
  }
  
+ void
+ tuplesort_reset(Tuplesortstate *state)
+ {
+ 	int i;
+ 
+ 	if (state->tapeset)
+ 		LogicalTapeSetClose(state->tapeset);
+ 
+ 	for (i = 0; i < state->memtupcount; i++)
+ 		free_sort_tuple(state, state->memtuples + i);
+ 
+ 	state->status = TSS_INITIAL;
+ 	state->memtupcount = 0;
+ 	state->boundUsed = false;
+ 	state->tapeset = NULL;
+ 	state->currentRun = 0;
+ 	state->result_tape = -1;
+ 	state->bounded = false;
+ }
+ 
  /*
   * Grow the memtuples[] array, if possible within our memory constraint.  We
   * must not exceed INT_MAX tuples in memory or the caller-provided memory
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
new file mode 100644
index 4f77692..95ff06f
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
*************** struct Path;					/* avoid including rela
*** 104,112 ****
  extern void ExecReScan(PlanState *node);
  extern void ExecMarkPos(PlanState *node);
  extern void ExecRestrPos(PlanState *node);
! extern bool ExecSupportsMarkRestore(struct Path *pathnode);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype);
  
  /*
   * prototypes from functions in execCurrent.c
--- 104,112 ----
  extern void ExecReScan(PlanState *node);
  extern void ExecMarkPos(PlanState *node);
  extern void ExecRestrPos(PlanState *node);
! extern bool ExecSupportsMarkRestore(struct Path *pathnode, Plan *node);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype, Plan *node);
  
  /*
   * prototypes from functions in execCurrent.c
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 23670e1..bb10360
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1784,1789 ****
--- 1784,1796 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ typedef struct SkipKeyData
+ {
+ 	FunctionCallInfoData	fcinfo;
+ 	FmgrInfo				flinfo;
+ 	OffsetNumber			attno;
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1795,1803 ****
--- 1802,1814 ----
  	bool		bounded;		/* is the result set bounded? */
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* fetching tuples from outer node
+ 								   is finished ? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;
+ 	HeapTuple	prev;			/* previous tuple from outer node */
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 92fd8e4..fc4ac40
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 675,680 ****
--- 675,681 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 25a7303..2740302
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 95,102 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 95,103 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 87123a5..b3d0dbf
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 165,177 ****
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--- 165,180 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
new file mode 100644
index 1fb8504..791563c
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*************** extern RecursiveUnion *make_recursive_un
*** 51,61 ****
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
--- 51,62 ----
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples, int skipCols);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index de6fc56..7b47f2f
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
***************
*** 24,29 ****
--- 24,30 ----
  #include "executor/tuptable.h"
  #include "fmgr.h"
  #include "utils/relcache.h"
+ #include "utils/sortsupport.h"
  
  
  /* Tuplesortstate is an opaque type whose details are not known outside
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 89b6c1c..450c251
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** ORDER BY thousand, tenthous;
*** 1357,1366 ****
   Merge Append
     Sort Key: tenk1.thousand, tenk1.tenthous
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1
!    ->  Sort
           Sort Key: tenk1_1.thousand, tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (6 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
--- 1357,1367 ----
   Merge Append
     Sort Key: tenk1.thousand, tenk1.tenthous
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1
!    ->  Partial sort
           Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (7 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
*************** ORDER BY x, y;
*** 1441,1450 ****
   Merge Append
     Sort Key: a.thousand, a.tenthous
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
!    ->  Sort
           Sort Key: b.unique2, b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (6 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
--- 1442,1452 ----
   Merge Append
     Sort Key: a.thousand, a.tenthous
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
!    ->  Partial sort
           Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (7 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
regression.diffs (application/octet-stream)
*** /Users/smagen/projects/postgresql/postgresql/src/test/regress/expected/insert_conflict.out	2015-10-16 17:47:37.000000000 +0300
--- /Users/smagen/projects/postgresql/postgresql/src/test/regress/results/insert_conflict.out	2015-10-16 18:29:12.000000000 +0300
***************
*** 53,63 ****
     Conflict Filter: (alternatives: SubPlan 1 or hashed SubPlan 2)
     ->  Result
     SubPlan 1
!      ->  Index Only Scan using both_index_expr_key on insertconflicttest ii
!            Index Cond: (key = excluded.key)
     SubPlan 2
       ->  Seq Scan on insertconflicttest ii_1
! (10 rows)
  
  -- Neither collation nor operator class specifications are required --
  -- supplying them merely *limits* matches to indexes with matching opclasses
--- 53,65 ----
     Conflict Filter: (alternatives: SubPlan 1 or hashed SubPlan 2)
     ->  Result
     SubPlan 1
!      ->  Bitmap Heap Scan on insertconflicttest ii
!            Recheck Cond: (key = excluded.key)
!            ->  Bitmap Index Scan on both_index_expr_key
!                  Index Cond: (key = excluded.key)
     SubPlan 2
       ->  Seq Scan on insertconflicttest ii_1
! (12 rows)
  
  -- Neither collation nor operator class specifications are required --
  -- supplying them merely *limits* matches to indexes with matching opclasses

======================================================================

*** /Users/smagen/projects/postgresql/postgresql/src/test/regress/expected/rowsecurity.out	2015-10-16 17:47:37.000000000 +0300
--- /Users/smagen/projects/postgresql/postgresql/src/test/regress/results/rowsecurity.out	2015-10-16 18:29:21.000000000 +0300
***************
*** 902,909 ****
  (3 rows)
  
  EXPLAIN (COSTS OFF) SELECT (SELECT x FROM s1 LIMIT 1) xx, * FROM s2 WHERE y like '%28%';
!                              QUERY PLAN                             
! --------------------------------------------------------------------
   Subquery Scan on s2
     Filter: (s2.y ~~ '%28%'::text)
     ->  Seq Scan on s2 s2_1
--- 902,909 ----
  (3 rows)
  
  EXPLAIN (COSTS OFF) SELECT (SELECT x FROM s1 LIMIT 1) xx, * FROM s2 WHERE y like '%28%';
!                                 QUERY PLAN                                
! --------------------------------------------------------------------------
   Subquery Scan on s2
     Filter: (s2.y ~~ '%28%'::text)
     ->  Seq Scan on s2 s2_1
***************
*** 911,925 ****
     SubPlan 1
       ->  Limit
             ->  Subquery Scan on s1
!                  ->  Nested Loop Semi Join
!                        Join Filter: (s1_1.a = s2_2.x)
                         ->  Seq Scan on s1 s1_1
!                        ->  Materialize
!                              ->  Subquery Scan on s2_2
!                                    Filter: (s2_2.y ~~ '%af%'::text)
!                                    ->  Seq Scan on s2 s2_3
!                                          Filter: ((x % 2) = 0)
! (15 rows)
  
  SET SESSION AUTHORIZATION rls_regress_user0;
  ALTER POLICY p2 ON s2 USING (x in (select a from s1 where b like '%d2%'));
--- 911,927 ----
     SubPlan 1
       ->  Limit
             ->  Subquery Scan on s1
!                  ->  Hash Join
!                        Hash Cond: (s1_1.a = s2_2.x)
                         ->  Seq Scan on s1 s1_1
!                        ->  Hash
!                              ->  HashAggregate
!                                    Group Key: s2_2.x
!                                    ->  Subquery Scan on s2_2
!                                          Filter: (s2_2.y ~~ '%af%'::text)
!                                          ->  Seq Scan on s2 s2_3
!                                                Filter: ((x % 2) = 0)
! (17 rows)
  
  SET SESSION AUTHORIZATION rls_regress_user0;
  ALTER POLICY p2 ON s2 USING (x in (select a from s1 where b like '%d2%'));

======================================================================

#69Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#68)
1 attachment(s)
Re: PoC: Partial sort

On Fri, Oct 16, 2015 at 7:11 PM, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andreas@proxel.se>
wrote:

Are you planning to work on this patch for 9.6?

FWIW I hope so. It's a nice patch.

I'm trying to whisk the dust off it. A rebased version of the patch is
attached. This patch doesn't pass the regression tests because of plan
changes. I'm not yet sure about those changes: why they happen, and whether
they are really regressions. Since I'm not very familiar with the planning
of INSERT ON CONFLICT and RLS, any help is appreciated.

The planner regression is fixed in the attached version of the patch. It
appears that get_cheapest_fractional_path_for_pathkeys() behaved incorrectly
when no ordering is required.
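
For readers following the thread, here is a minimal standalone sketch of the
idea behind partial sort. It is not code from the patch. It assumes the input
already arrives ordered by the first key (as an index scan on v1 would
deliver it) and sorts each run of equal v1 values by v2 only. The names Row,
cmp_v2, partial_sort_rows and the limit parameter are hypothetical, and the
standard qsort() stands in for tuplesort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row: v1 is the presorted key, v2 still needs sorting. */
typedef struct Row
{
	int			v1;
	double		v2;
} Row;

/* Compare rows by the non-presorted key only. */
static int
cmp_v2(const void *a, const void *b)
{
	const Row  *ra = (const Row *) a;
	const Row  *rb = (const Row *) b;

	return (ra->v2 > rb->v2) - (ra->v2 < rb->v2);
}

/*
 * Order rows[] by (v1, v2), assuming the input is already ascending by v1.
 * Each group of equal v1 values is sorted separately by v2.  If limit > 0,
 * stop as soon as at least "limit" leading rows are in final order.
 * Returns the number of leading rows that are in final order.
 */
static size_t
partial_sort_rows(Row *rows, size_t nrows, size_t limit)
{
	size_t		start = 0;

	while (start < nrows)
	{
		size_t		end = start + 1;

		/* Extend the group while the presorted key stays the same. */
		while (end < nrows && rows[end].v1 == rows[start].v1)
			end++;

		/* Precondition of this sketch: input is ascending by v1. */
		assert(end == nrows || rows[end].v1 > rows[start].v1);

		/* Sort just this group by the remaining key. */
		qsort(rows + start, end - start, sizeof(Row), cmp_v2);
		start = end;

		/* Enough rows are final; later groups need not be touched. */
		if (limit > 0 && start >= limit)
			break;
	}

	return start;
}
```

With a small limit only the first group or two are ever sorted, which is why
an ORDER BY v1, v2 LIMIT 10 query can finish quickly when an index on v1
exists. Calling partial_sort_rows(rows, n, 0) orders the whole array.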

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

partial-sort-basic-4.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 7fb8a14..05cc125
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_grouping_set_keys(PlanS
*** 88,94 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 88,94 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** ExplainNode(PlanState *planstate, List *
*** 901,907 ****
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
--- 901,910 ----
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			if (((Sort *) plan)->skipCols > 0)
! 				pname = sname = "Partial sort";
! 			else
! 				pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
*************** show_sort_keys(SortState *sortstate, Lis
*** 1756,1762 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1759,1765 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_merge_append_keys(MergeAppendState 
*** 1772,1778 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1775,1781 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1796,1802 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1799,1805 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 1852,1858 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 1855,1861 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 1909,1915 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 1912,1918 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 1922,1934 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 1925,1938 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 1968,1976 ****
--- 1972,1984 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 163650c..b44ce69
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
*************** ExecRestrPos(PlanState *node)
*** 382,388 ****
   * know which plan types support mark/restore.
   */
  bool
! ExecSupportsMarkRestore(Path *pathnode)
  {
  	/*
  	 * For consistency with the routines above, we do not examine the nodeTag
--- 382,388 ----
   * know which plan types support mark/restore.
   */
  bool
! ExecSupportsMarkRestore(Path *pathnode, Plan *node)
  {
  	/*
  	 * For consistency with the routines above, we do not examine the nodeTag
*************** ExecSupportsMarkRestore(Path *pathnode)
*** 394,402 ****
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
- 		case T_Sort:
  			return true;
  
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
--- 394,410 ----
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
  			return true;
  
+ 		case T_Sort:
+ 			/* We shouldn't reach here without having plan node */
+ 			Assert(node);
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (node && ((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
*************** ExecSupportsBackwardScan(Plan *node)
*** 498,507 ****
  			return false;
  
  		case T_Material:
- 		case T_Sort:
  			/* these don't evaluate tlist */
  			return true;
  
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
--- 506,521 ----
  			return false;
  
  		case T_Material:
  			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
*************** IndexSupportsBackwardScan(Oid indexid)
*** 567,573 ****
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype)
  {
  	switch (plantype)
  	{
--- 581,587 ----
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype, Plan *node)
  {
  	switch (plantype)
  	{
*************** ExecMaterializesOutput(NodeTag plantype)
*** 575,583 ****
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
- 		case T_Sort:
  			return true;
  
  		default:
  			break;
  	}
--- 589,605 ----
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
  			return true;
  
+ 		case T_Sort:
+ 			/* We shouldn't reach here without having plan node */
+ 			Assert(node);
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (node && ((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		default:
  			break;
  	}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index af1dccf..7345fcb
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,109 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
+ #include "utils/lsyscache.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ static void
+ prepareSkipCols(Sort *plannode, SortState *node)
+ {
+ 	int skipCols = plannode->skipCols, i;
+ 
+ 	node->skipKeys = (SkipKeyData *)palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 											plannode->sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 126,136 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+ 	PlanState  *outerNode;
+ 	TupleDesc	tupDesc;
+ 	int			skipCols = plannode->skipCols;
+ 	int64		nTuples = 0;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,132 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
  
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols,
! 											  plannode->sortColIdx,
! 											  plannode->sortOperators,
! 											  plannode->collations,
! 											  plannode->nullsFirst,
  											  work_mem,
  											  node->randomAccess);
- 		if (node->bounded)
- 			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
! 		{
! 			slot = ExecProcNode(outerNode);
  
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
  	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
--- 143,286 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
! 	if (node->tuplesortstate != NULL)
! 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
! 	else
! 	{
! 		/* Support structures for cmpSortSkipCols - already sorted columns */
! 		if (skipCols)
! 			prepareSkipCols(plannode, node);
  
+ 		/* Only pass on remaining columns that are unsorted */
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols - skipCols,
! 											  &(plannode->sortColIdx[skipCols]),
! 											  &(plannode->sortOperators[skipCols]),
! 											  &(plannode->collations[skipCols]),
! 											  &(plannode->nullsFirst[skipCols]),
  											  work_mem,
  											  node->randomAccess);
  		node->tuplesortstate = (void *) tuplesortstate;
+ 	}
  
! 	if (node->bounded)
! 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
  
! 	/*
! 	 * Put the next group of tuples, whose first skipCols sort values are
! 	 * equal, into the tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
+ 		if (skipCols == 0)
+ 		{
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
+ 			nTuples++;
  		}
+ 		else if (node->prev)
+ 		{
+ 			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
+ 			nTuples++;
  
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				bool cmp;
! 				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
! 				node->prev = ExecCopySlotTuple(slot);
! 				if (!cmp)
! 					break;
! 			}
! 		}
! 		else
! 		{
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				node->prev = ExecCopySlotTuple(slot);
! 			}
! 		}
! 	}
  
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
  
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 
! 	/*
! 	 * Adjust bound_Done with number of tuples we've actually sorted.
! 	 */
! 	if (node->bounded)
! 	{
! 		if (node->finished)
! 			node->bound_Done = node->bound;
! 		else
! 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
  	}
  
+ 	SO1_printf("ExecSort: %s\n", "sorting done");
+ 
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
*************** ExecInitSort(Sort *node, EState *estate,
*** 157,162 ****
--- 311,325 ----
  			   "initializing sort node");
  
  	/*
+ 	 * skipCols can't be used with EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD or
+ 	 * EXEC_FLAG_MARK, because we hold only the current group of tuples in
+ 	 * tuplesortstate.
+ 	 */
+ 	Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+ 											 EXEC_FLAG_BACKWARD |
+ 											 EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
  	 * create state structure
  	 */
  	sortstate = makeNode(SortState);
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 337,346 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
+ 	sortstate->bound_Done = 0;
  
  	/*
  	 * Miscellaneous initialization
*************** ExecReScanSort(SortState *node)
*** 318,323 ****
--- 484,490 ----
  		node->sort_Done = false;
  		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
  		node->tuplesortstate = NULL;
+ 		node->bound_Done = 0;
  
  		/*
  		 * if chgParam of subnode is not null then plan will be re-scanned by
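The executor change above boils down to sorting one group of equal-prefix tuples at a time. Below is a minimal standalone sketch of that idea, not the patch's code: the Row type, cmp_v2 and partial_sort are hypothetical names, the input is assumed to arrive already ordered by v1 (as from the index scan), and a plain qsort stands in for tuplesort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row: v1 is the presorted key, v2 is the key still to sort. */
typedef struct Row
{
	int			v1;
	double		v2;
} Row;

/* Compare two rows on the remaining (not yet sorted) column only. */
static int
cmp_v2(const void *a, const void *b)
{
	const Row  *ra = a;
	const Row  *rb = b;

	return (ra->v2 > rb->v2) - (ra->v2 < rb->v2);
}

/*
 * Order rows[0..n) by (v1, v2), assuming the input is already ordered by v1.
 * Each run of equal v1 values is sorted on its own.  With limit > 0 we stop
 * as soon as at least "limit" leading rows are in final order; limit == 0
 * means sort everything.  Returns the number of leading rows in final order.
 */
static size_t
partial_sort(Row *rows, size_t n, size_t limit)
{
	size_t		start = 0;

	assert(rows != NULL || n == 0);

	while (start < n)
	{
		size_t		end = start + 1;

		/* Find the end of the current group of equal v1 values. */
		while (end < n && rows[end].v1 == rows[start].v1)
			end++;

		qsort(rows + start, end - start, sizeof(Row), cmp_v2);
		start = end;

		if (limit > 0 && start >= limit)
			break;
	}

	return start;
}
```

With a limit, only the groups needed to cover it are sorted, which is roughly why the LIMIT query in the example can finish after reading a small part of the index.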
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index c176ff9..c916072
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 826,831 ****
--- 826,832 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 1b61fd9..a6c1c22
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Plan *runion, Plan 
*** 1381,1395 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1381,1402 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1419,1431 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1426,1472 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups the presorted keys divide the dataset into.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List *groupExprs = NIL;
! 		ListCell *l;
! 		int i = 0;
! 
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			groupExprs = lappend(groupExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		num_groups = estimate_num_groups(root, groupExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1435,1441 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1476,1482 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1446,1455 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1487,1496 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1457,1471 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
  	 * doesn't do qual-checking or projection, so it has less overhead than
--- 1498,1523 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
  	/*
+ 	 * We have to sort the first group before the node can start output.
+ 	 * Sorting the remaining groups is required to return all the tuples.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 	startup_cost += input_run_cost / num_groups;
+ 	run_cost += input_run_cost * ((num_groups - 1.0) / num_groups);
+ 
+ 	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
  	 * doesn't do qual-checking or projection, so it has less overhead than
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2205,2210 ****
--- 2257,2264 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->parent->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2231,2236 ****
--- 2285,2292 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->parent->width,
*************** final_cost_mergejoin(PlannerInfo *root, 
*** 2442,2448 ****
  	 * it off does not entitle us to deliver an invalid plan.
  	 */
  	else if (innersortkeys == NIL &&
! 			 !ExecSupportsMarkRestore(inner_path))
  		path->materialize_inner = true;
  
  	/*
--- 2498,2504 ----
  	 * it off does not entitle us to deliver an invalid plan.
  	 */
  	else if (innersortkeys == NIL &&
! 			 !ExecSupportsMarkRestore(inner_path, NULL))
  		path->materialize_inner = true;
  
  	/*
*************** cost_subplan(PlannerInfo *root, SubPlan 
*** 2956,2962 ****
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan)))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
--- 3012,3018 ----
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan), plan))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
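The startup/run split done at the end of cost_sort above can be looked at in isolation. The following is only a sketch of that arithmetic under simplifying assumptions: the per-group sort cost is taken as an input rather than derived from work_mem, and split_sort_cost is a hypothetical name that does not exist in the patch.

```c
#include <assert.h>

/*
 * Split the cost of a partial sort into startup and run parts.
 *
 * group_cost is the estimated cost of sorting one average group.  Only the
 * first group has to be read and sorted before the first tuple comes out,
 * so startup gets one group_cost plus 1/num_groups of the input run cost.
 * The remaining groups are charged to run cost, scaled by the share of
 * tuples actually returned (output_tuples / tuples).
 */
static void
split_sort_cost(double input_startup_cost, double input_total_cost,
				double group_cost, double num_groups,
				double tuples, double output_tuples,
				double *startup_cost, double *run_cost)
{
	double		input_run_cost = input_total_cost - input_startup_cost;
	double		rest_cost;

	assert(num_groups >= 1.0 && tuples > 0.0);

	*startup_cost = input_startup_cost + group_cost;
	*run_cost = 0.0;

	/* Groups beyond the first one that we expect to sort. */
	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
	if (rest_cost > 0.0)
		*run_cost += rest_cost;

	/* Reading the input is spread over the groups in the same way. */
	*startup_cost += input_run_cost / num_groups;
	*run_cost += input_run_cost * ((num_groups - 1.0) / num_groups);
}
```

For example, with 8 groups, an input run cost of 80 and a group cost of 2, startup comes out at 12 instead of the full 96 that a plain sort would have to pay before returning anything.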
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
new file mode 100644
index a35c881..7b762a6
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** match_unsorted_outer(PlannerInfo *root,
*** 870,882 ****
  	}
  	else if (nestjoinOK)
  	{
  		/*
  		 * Consider materializing the cheapest inner path, unless
  		 * enable_material is off or the path in question materializes its
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
--- 870,885 ----
  	}
  	else if (nestjoinOK)
  	{
+ 		if (inner_cheapest_total && inner_cheapest_total->pathtype == T_Sort)
+ 			elog(ERROR, "Sort");
+ 
  		/*
  		 * Consider materializing the cheapest inner path, unless
  		 * enable_material is off or the path in question materializes its
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype, NULL))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index c6b5d78..1bc4619
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 21,31 ****
--- 21,33 ----
  #include "nodes/makefuncs.h"
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
+ #include "optimizer/cost.h"
  #include "optimizer/clauses.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static PathKey *make_canonical_pathkey(PlannerInfo *root,
*************** compare_pathkeys(List *keys1, List *keys
*** 312,317 ****
--- 314,345 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 369,377 ****
  }
  
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
   *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
--- 397,433 ----
  }
  
  /*
+  * Compare the costs of two paths, assuming a different fraction of tuples is
+  * returned from each path.
+  */
+ static int
+ compare_bifractional_path_costs(Path *path1, Path *path2,
+ 							  double fraction1, double fraction2)
+ {
+ 	Cost		cost1,
+ 				cost2;
+ 
+ 	if (fraction1 <= 0.0 || fraction1 >= 1.0)
+ 		fraction1 = 1.0;
+ 	if (fraction2 <= 0.0 || fraction2 >= 1.0)
+ 		fraction2 = 1.0;
+ 	cost1 = path1->startup_cost +
+ 		fraction1 * (path1->total_cost - path1->startup_cost);
+ 	cost2 = path2->startup_cost +
+ 		fraction2 * (path2->total_cost - path2->startup_cost);
+ 	if (cost1 < cost2)
+ 		return -1;
+ 	if (cost1 > cost2)
+ 		return +1;
+ 	return 0;
+ }
+ 
+ /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Each path is costed at its own
!  *	  fraction, since a partial sort must fetch extra tuples before it starts.
   *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
*************** Path *
*** 386,412 ****
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction)
  {
  	Path	   *matched_path = NULL;
  	ListCell   *l;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
  
  		/*
! 		 * Since cost comparison is a lot cheaper than pathkey comparison, do
! 		 * that first.  (XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
! 			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
  
--- 442,521 ----
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *num_groups, matched_fraction;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each possible partial sort.
+ 	 */
+ 	i = 0;
+ 	num_groups = (double *)palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		num_groups[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
+ 		double		current_fraction;
+ 
+ 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+ 
+ 		if (n_pathkeys != 0 && n_common_pathkeys == 0)
+ 			continue;
  
  		/*
! 		 * Estimate the fraction of outer tuples that must be fetched before
! 		 * the partial sort can start returning tuples.
  		 */
! 		current_fraction = fraction;
! 		if (n_common_pathkeys < n_pathkeys)
! 		{
! 			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
! 			current_fraction = Min(current_fraction, 1.0);
! 		}
  
! 		/*
! 		 * Do cost comparison.
! 		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_bifractional_path_costs(matched_path, path,
! 					matched_fraction, current_fraction);
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
! 
! 		if (costs_cmp > 0 &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 			matched_fraction = current_fraction;
+ 		}
  	}
+ 
  	return matched_path;
  }
  
*************** right_merge_direction(PlannerInfo *root,
*** 1450,1458 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
--- 1559,1566 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of leading pathkeys that match the requested ordering.
!  * The remaining ones can be satisfied by a partial sort.
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
*************** pathkeys_useful_for_ordering(PlannerInfo
*** 1463,1475 ****
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
! 	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
! 	}
! 
! 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1571,1582 ----
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	/*
! 	 * Return the number of path keys in common, or 0 if there are none.  Any
! 	 * leading common pathkeys can be useful for ordering, because the rest
! 	 * can be handled by a partial sort.
! 	 */
! 	return pathkeys_common(root->query_pathkeys, pathkeys);
  }
  
  /*
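Two pieces of the pathkeys.c changes above are easy to try outside the planner: the common-prefix length and the adjusted fraction used when comparing paths. This is a rough sketch, assuming sort keys are represented as plain integers; common_prefix and effective_fraction are hypothetical helpers, and num_groups[i] is assumed to hold the estimated number of groups for the first i + 1 keys.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the longest common prefix of two key lists. */
static size_t
common_prefix(const int *a, size_t na, const int *b, size_t nb)
{
	size_t		n = 0;

	while (n < na && n < nb && a[n] == b[n])
		n++;
	return n;
}

/*
 * Fraction of the input that must be read to return "fraction" of the
 * output.  A fully sorted path needs no adjustment.  A partially sorted
 * path has to read one more whole group, about 1/num_groups of the input,
 * before the partial sort can start returning tuples.
 */
static double
effective_fraction(double fraction, size_t n_common, size_t n_keys,
				   const double *num_groups)
{
	if (n_common >= n_keys)
		return fraction;

	/* Paths sharing no leading key are skipped before this point. */
	assert(n_common > 0);

	fraction += 1.0 / num_groups[n_common - 1];
	return fraction < 1.0 ? fraction : 1.0;
}
```

So a path matching two of three requested keys, with 8 estimated groups on those two keys, is compared at fraction + 0.125 rather than at fraction itself.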
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 791b64e..4d610cf
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 163,168 ****
--- 163,169 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
  static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples);
*************** create_merge_append_plan(PlannerInfo *ro
*** 810,815 ****
--- 811,817 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		subplan = create_plan_recurse(root, subpath);
*************** create_merge_append_plan(PlannerInfo *ro
*** 843,850 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
--- 845,854 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
+ 										 pathkeys, n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2436,2444 ****
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									outer_plan,
! 									best_path->outersortkeys,
! 									-1.0);
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
--- 2440,2450 ----
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								outer_plan,
! 								best_path->outersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys));
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2449,2457 ****
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									inner_plan,
! 									best_path->innersortkeys,
! 									-1.0);
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
--- 2455,2465 ----
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								inner_plan,
! 								best_path->innersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys));
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
*************** make_mergejoin(List *tlist,
*** 4021,4026 ****
--- 4029,4035 ----
   */
  static Sort *
  make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+           List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples)
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 4030,4036 ****
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4039,4046 ----
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, pathkeys, skipCols,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 4044,4049 ****
--- 4054,4060 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 4372,4378 ****
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 4383,4389 ----
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(PlannerInfo *roo
*** 4392,4398 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
--- 4403,4409 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
*************** make_sort_from_sortclauses(PlannerInfo *
*** 4435,4441 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4446,4452 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, NIL, 0,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
*************** Sort *
*** 4457,4463 ****
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 4468,4475 ----
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 List *pathkeys, int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(PlannerInfo *ro
*** 4490,4496 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4502,4508 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index a761cfd..87efec2
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
*************** build_minmax_path(PlannerInfo *root, Min
*** 505,511 ****
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction);
  	if (!sorted_path)
  		return false;
  
--- 505,513 ----
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction,
! 												  subroot,
! 												  final_rel->rows);
  	if (!sorted_path)
  		return false;
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 536b55e..2dbb27b
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** static Plan *build_grouping_chain(Planne
*** 134,140 ****
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan);
  
  /*****************************************************************************
   *
--- 134,142 ----
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan,
! 					 List *path_keys,
! 					 int n_common_pathkeys);
  
  /*****************************************************************************
   *
*************** grouping_planner(PlannerInfo *root, doub
*** 1762,1768 ****
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
--- 1764,1772 ----
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction,
! 													  root,
! 													  path_rows);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
*************** grouping_planner(PlannerInfo *root, doub
*** 1778,1787 ****
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
  
  			if (root->query_pathkeys == NIL ||
! 				pathkeys_contained_in(root->query_pathkeys,
! 									  cheapest_path->pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
--- 1782,1795 ----
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
+ 			Path		partial_sort_path;	/* dummy for result of cost_sort */
+ 			int			n_common_pathkeys;
+ 
+ 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
+ 												cheapest_path->pathkeys);
  
  			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
*************** grouping_planner(PlannerInfo *root, doub
*** 1791,1802 ****
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			if (compare_fractional_path_costs(sorted_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
--- 1799,1833 ----
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
+ 						  n_common_pathkeys,
+ 						  cheapest_path->startup_cost,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
! 												sorted_path->pathkeys);
! 
! 			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
! 			{
! 				/* No sort needed for presorted path */
! 				partial_sort_path.startup_cost = sorted_path->startup_cost;
! 				partial_sort_path.total_cost = sorted_path->total_cost;
! 			}
! 			else
! 			{
! 				/* Figure cost for sorting */
! 				cost_sort(&partial_sort_path, root, root->query_pathkeys,
! 						  n_common_pathkeys,
! 						  sorted_path->startup_cost,
! 						  sorted_path->total_cost,
! 						  path_rows, path_width,
! 						  0.0, work_mem, root->limit_tuples);
! 			}
! 
! 			if (compare_fractional_path_costs(&partial_sort_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
*************** grouping_planner(PlannerInfo *root, doub
*** 1888,1900 ****
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
  			if (parse->groupClause && !use_hashed_grouping &&
! 			  !pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
--- 1919,1934 ----
  			 * results.
  			 */
  			bool		need_sort_for_grouping = false;
+ 			int			n_common_pathkeys_grouping;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
+ 			n_common_pathkeys_grouping = pathkeys_common(root->group_pathkeys,
+ 														 current_pathkeys);
  			if (parse->groupClause && !use_hashed_grouping &&
! 				n_common_pathkeys_grouping < list_length(root->group_pathkeys))
  			{
  				need_sort_for_grouping = true;
  
*************** grouping_planner(PlannerInfo *root, doub
*** 2020,2026 ****
  												   groupColIdx,
  												   &agg_costs,
  												   numGroups,
! 												   result_plan);
  
  				/*
  				 * these are destroyed by build_grouping_chain, so make sure
--- 2054,2062 ----
  												   groupColIdx,
  												   &agg_costs,
  												   numGroups,
! 												   result_plan,
! 												   root->group_pathkeys,
! 												   n_common_pathkeys_grouping);
  
  				/*
  				 * these are destroyed by build_grouping_chain, so make sure
*************** grouping_planner(PlannerInfo *root, doub
*** 2044,2050 ****
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan);
  					current_pathkeys = root->group_pathkeys;
  				}
  
--- 2080,2088 ----
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan,
! 												 root->group_pathkeys,
! 												 n_common_pathkeys_grouping);
  					current_pathkeys = root->group_pathkeys;
  				}
  
*************** grouping_planner(PlannerInfo *root, doub
*** 2182,2194 ****
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0);
! 					if (!pathkeys_contained_in(window_pathkeys,
! 											   current_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
--- 2220,2236 ----
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
+ 					int			n_common_pathkeys;
+ 
+ 					n_common_pathkeys = pathkeys_common(window_pathkeys,
+ 													    current_pathkeys);
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0,
! 														n_common_pathkeys);
! 					if (n_common_pathkeys < list_length(window_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
*************** grouping_planner(PlannerInfo *root, doub
*** 2335,2353 ****
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					current_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					current_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 current_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															current_pathkeys,
! 															   -1.0);
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
--- 2377,2397 ----
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					needed_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					needed_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 needed_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															   needed_pathkeys,
! 															   -1.0,
! 							pathkeys_common(needed_pathkeys, current_pathkeys));
! 				current_pathkeys = needed_pathkeys;
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
*************** grouping_planner(PlannerInfo *root, doub
*** 2363,2374 ****
  	 */
  	if (parse->sortClause)
  	{
! 		if (!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
--- 2407,2421 ----
  	 */
  	if (parse->sortClause)
  	{
! 		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
! 		
! 		if (common < list_length(root->sort_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples,
! 														   common);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
*************** build_grouping_chain(PlannerInfo *root,
*** 2472,2478 ****
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan)
  {
  	AttrNumber *top_grpColIdx = groupColIdx;
  	List	   *chain = NIL;
--- 2519,2527 ----
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan,
! 					 List *path_keys,
! 					 int n_common_pathkeys)
  {
  	AttrNumber *top_grpColIdx = groupColIdx;
  	List	   *chain = NIL;
*************** build_grouping_chain(PlannerInfo *root,
*** 2493,2499 ****
  			make_sort_from_groupcols(root,
  									 llast(rollup_groupclauses),
  									 top_grpColIdx,
! 									 result_plan);
  	}
  
  	/*
--- 2542,2550 ----
  			make_sort_from_groupcols(root,
  									 llast(rollup_groupclauses),
  									 top_grpColIdx,
! 									 result_plan,
! 									 path_keys,
! 									 n_common_pathkeys);
  	}
  
  	/*
*************** build_grouping_chain(PlannerInfo *root,
*** 2517,2523 ****
  			make_sort_from_groupcols(root,
  									 groupClause,
  									 new_grpColIdx,
! 									 result_plan);
  
  		/*
  		 * sort_plan includes the cost of result_plan over again, which is not
--- 2568,2576 ----
  			make_sort_from_groupcols(root,
  									 groupClause,
  									 new_grpColIdx,
! 									 result_plan,
! 									 NIL,
! 									 0);
  
  		/*
  		 * sort_plan includes the cost of result_plan over again, which is not
*************** choose_hashed_grouping(PlannerInfo *root
*** 3633,3638 ****
--- 3686,3692 ----
  	List	   *current_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * Executor doesn't support hashed aggregation with DISTINCT or ORDER BY
*************** choose_hashed_grouping(PlannerInfo *root
*** 3714,3720 ****
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3768,3775 ----
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_grouping(PlannerInfo *root
*** 3730,3738 ****
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 	if (!pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
--- 3785,3796 ----
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 
! 	n_common_pathkeys = pathkeys_common(root->group_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(root->group_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
*************** choose_hashed_grouping(PlannerInfo *root
*** 3747,3756 ****
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
  	/* The Agg or Group node will preserve ordering */
! 	if (target_pathkeys &&
! 		!pathkeys_contained_in(target_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3805,3816 ----
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
+ 
  	/* The Agg or Group node will preserve ordering */
! 	n_common_pathkeys = pathkeys_common(target_pathkeys, current_pathkeys);
! 	if (target_pathkeys && n_common_pathkeys < list_length(target_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 3803,3808 ****
--- 3863,3869 ----
  	List	   *needed_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * If we have a sortable DISTINCT ON clause, we always use sorting. This
*************** choose_hashed_distinct(PlannerInfo *root
*** 3868,3874 ****
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3929,3936 ----
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 3885,3907 ****
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 	if (!pathkeys_contained_in(needed_pathkeys, current_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
  	if (parse->sortClause &&
! 		!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3947,3976 ----
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 
! 	n_common_pathkeys = pathkeys_common(needed_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(needed_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
+ 
+ 
+ 	n_common_pathkeys = pathkeys_common(root->sort_pathkeys, current_pathkeys);
  	if (parse->sortClause &&
! 		n_common_pathkeys < list_length(root->sort_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 4691,4698 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 4760,4768 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 82414d4..474f20e
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** build_subplan(PlannerInfo *root, Plan *p
*** 823,829 ****
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan)))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
--- 823,829 ----
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan), plan))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index 8884fb1..4e84de5
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 863,869 ****
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 863,870 ----
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 1895a68..d895907
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** create_merge_append_path(PlannerInfo *ro
*** 996,1005 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 996,1006 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1013,1018 ****
--- 1014,1021 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->parent->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1229,1235 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
--- 1232,1239 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index d532e87..e12624e
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** tuplesort_end(Tuplesortstate *state)
*** 1068,1073 ****
--- 1068,1093 ----
  	MemoryContextDelete(state->sortcontext);
  }
  
+ void
+ tuplesort_reset(Tuplesortstate *state)
+ {
+ 	int i;
+ 
+ 	if (state->tapeset)
+ 		LogicalTapeSetClose(state->tapeset);
+ 
+ 	for (i = 0; i < state->memtupcount; i++)
+ 		free_sort_tuple(state, state->memtuples + i);
+ 
+ 	state->status = TSS_INITIAL;
+ 	state->memtupcount = 0;
+ 	state->boundUsed = false;
+ 	state->tapeset = NULL;
+ 	state->currentRun = 0;
+ 	state->result_tape = -1;
+ 	state->bounded = false;
+ }
+ 
  /*
   * Grow the memtuples[] array, if possible within our memory constraint.  We
   * must not exceed INT_MAX tuples in memory or the caller-provided memory
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
new file mode 100644
index 4f77692..95ff06f
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
*************** struct Path;					/* avoid including rela
*** 104,112 ****
  extern void ExecReScan(PlanState *node);
  extern void ExecMarkPos(PlanState *node);
  extern void ExecRestrPos(PlanState *node);
! extern bool ExecSupportsMarkRestore(struct Path *pathnode);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype);
  
  /*
   * prototypes from functions in execCurrent.c
--- 104,112 ----
  extern void ExecReScan(PlanState *node);
  extern void ExecMarkPos(PlanState *node);
  extern void ExecRestrPos(PlanState *node);
! extern bool ExecSupportsMarkRestore(struct Path *pathnode, Plan *node);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype, Plan *node);
  
  /*
   * prototypes from functions in execCurrent.c
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 4fcdcc4..05bada1
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1784,1789 ****
--- 1784,1796 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ typedef struct SkipKeyData
+ {
+ 	FunctionCallInfoData	fcinfo;
+ 	FmgrInfo				flinfo;
+ 	OffsetNumber			attno;
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1795,1803 ****
--- 1802,1814 ----
  	bool		bounded;		/* is the result set bounded? */
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* fetching tuples from outer node
+ 								   is finished ? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;
+ 	HeapTuple	prev;			/* previous tuple from outer node */
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 92fd8e4..fc4ac40
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 675,680 ****
--- 675,681 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 25a7303..2740302
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 95,102 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 95,103 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 87123a5..b3d0dbf
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 165,177 ****
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--- 165,180 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
new file mode 100644
index 1fb8504..791563c
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*************** extern RecursiveUnion *make_recursive_un
*** 51,61 ****
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
--- 51,62 ----
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples, int skipCols);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index de6fc56..7b47f2f
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
***************
*** 24,29 ****
--- 24,30 ----
  #include "executor/tuptable.h"
  #include "fmgr.h"
  #include "utils/relcache.h"
+ #include "utils/sortsupport.h"
  
  
  /* Tuplesortstate is an opaque type whose details are not known outside
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 89b6c1c..bc49011
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** SELECT thousand, thousand FROM tenk1
*** 1354,1366 ****
  ORDER BY thousand, tenthous;
                                 QUERY PLAN                                
  -------------------------------------------------------------------------
!  Merge Append
     Sort Key: tenk1.thousand, tenk1.tenthous
!    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
!    ->  Sort
!          Sort Key: tenk1_1.thousand, tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (6 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
--- 1354,1367 ----
  ORDER BY thousand, tenthous;
                                 QUERY PLAN                                
  -------------------------------------------------------------------------
!  Partial sort
     Sort Key: tenk1.thousand, tenk1.tenthous
!    Presorted Key: tenk1.thousand
!    ->  Merge Append
!          Sort Key: tenk1.thousand
!          ->  Index Only Scan using tenk1_thous_tenthous on tenk1
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (7 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
*************** SELECT x, y FROM
*** 1436,1450 ****
     UNION ALL
     SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
  ORDER BY x, y;
!                          QUERY PLAN                          
! -------------------------------------------------------------
!  Merge Append
     Sort Key: a.thousand, a.tenthous
!    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
!    ->  Sort
!          Sort Key: b.unique2, b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (6 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
--- 1437,1452 ----
     UNION ALL
     SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
  ORDER BY x, y;
!                             QUERY PLAN                             
! -------------------------------------------------------------------
!  Partial sort
     Sort Key: a.thousand, a.tenthous
!    Presorted Key: a.thousand
!    ->  Merge Append
!          Sort Key: a.thousand
!          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (7 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
#70Peter Geoghegan
pg@heroku.com
In reply to: Alexander Korotkov (#69)
Re: PoC: Partial sort

On Tue, Oct 20, 2015 at 4:17 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Planner regression is fixed in the attached version of patch. It appears
that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no
ordering is required.

I don't see an entry in the CF app for this. This seems like something
I should review, though.

--
Peter Geoghegan


#71Peter Geoghegan
pg@heroku.com
In reply to: Alexander Korotkov (#69)
Re: PoC: Partial sort

On Tue, Oct 20, 2015 at 4:17 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Planner regression is fixed in the attached version of patch. It appears
that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no
ordering is required.

I took a look at this. My remarks are not comprehensive, but are just
what I noticed having looked at this for the first time in well over a
year.

Before anything else, I would like to emphasize that I think that this
is pretty important work; it's not just a "nice to have". I'm very
glad you picked it up, because we need it. In the real world, there
will be *lots* of cases that this helps.

Explain output
-------------------

A query like your original test query looks like this for me:

postgres=# explain analyze select * from test order by v1, v2 limit 100;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=429.54..434.53 rows=100 width=16) (actual time=15125.819..22414.038 rows=100 loops=1)
   ->  Partial sort  (cost=429.54..50295.52 rows=1000000 width=16) (actual time=15125.799..22413.297 rows=100 loops=1)
         Sort Key: v1, v2
         Presorted Key: v1
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.43 rows=1000000 width=16) (actual time=1.663..13.066 rows=151 loops=1)
 Planning time: 0.948 ms
 Execution time: 22414.895 ms
(8 rows)
(8 rows)

I thought about it for a while, and decided that you have the basic
shape of the explain output right here. I see where you are going by
having the sort node drive things.

I don't think the node should be called "Partial sort", though. I
think that this is better presented as just a "Sort" node with a
presorted key.

I think it might be a good idea to also have a "Sort Groups: 2" field
above. That illustrates that you are in fact performing 2 small sorts,
one per group, which is the reality. As you said, it's good to have this
be high, because then the sort operations don't need to do too many
comparisons, which could be expensive.
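
To make the "Sort Groups" idea concrete, here is a minimal standalone
sketch, not the patch's executor code. It assumes the input already
arrives ordered by v1 (as from the index scan) and sorts each run of
equal v1 by v2. The Row type and the partial_sort_rows() function are
hypothetical names invented for this illustration; the return value is
the number a "Sort Groups" field would report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row type for illustration; mirrors the thread's test table. */
typedef struct Row
{
    int     v1;     /* presorted key: input already arrives ordered by v1 */
    double  v2;     /* remaining key: must be sorted within each v1 group */
} Row;

/* qsort comparator on the non-presorted key only */
static int
cmp_v2(const void *a, const void *b)
{
    double  x = ((const Row *) a)->v2;
    double  y = ((const Row *) b)->v2;

    return (x > y) - (x < y);
}

/*
 * Order rows by (v1, v2), given that they are already ordered by v1, by
 * sorting each run of equal v1 separately.  Returns the number of groups
 * sorted.
 */
size_t
partial_sort_rows(Row *rows, size_t nrows)
{
    size_t  start = 0;
    size_t  ngroups = 0;

    while (start < nrows)
    {
        size_t  end = start + 1;

        /* extend the group while the presorted key stays the same */
        while (end < nrows && rows[end].v1 == rows[start].v1)
            end++;

        /* the input must really be ordered by v1 at each group boundary */
        assert(end == nrows || rows[end].v1 > rows[start].v1);

        /* one small sort per group, comparing only the remaining key */
        qsort(rows + start, end - start, sizeof(Row), cmp_v2);
        ngroups++;
        start = end;
    }
    return ngroups;
}
```

Each qsort() call only ever compares v2, and the more groups there are
for a given input size, the fewer comparisons each call needs.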

Sort Method
----------------

Even though the explain analyze above shows "top-N heapsort" as its
sort method, that isn't really true. I actually ran this through a
debugger, which is why the above plan took so long to execute, in case
you wondered. I saw that in practice the first sort executed for the
first group uses a quicksort, while only the second sort (needed for
the second and last group in this example) used a top-N heapsort.

Is it really worth using a top-N heapsort to avoid sorting the last
little bit of tuples in the last group? Maybe we should never tell an
individual sort operation (we have one tuplesort_performsort() per
group) that it should be bounded. Our
quicksort falls back on insertion sort in the event of only having 7
elements (at that level of recursion), so having this almost always
use quicksort may be no bad thing.
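
A rough sketch of how the bound shrinks from group to group may help
here. This is standalone C, not patch code: the function name and the
group sizes in the usage note are hypothetical. It mirrors the idea that
each group's sort only needs a bound of "limit minus tuples already
returned", and that reading stops once the limit is covered.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given the sizes of consecutive presorted groups and a LIMIT, return how
 * many input tuples have to be fed to a sort before the limit is satisfied.
 * *last_bound receives the bound that applied to the final group sorted
 * (0 if no group was sorted at all).
 */
static size_t
tuples_sorted_for_limit(const size_t *group_sizes, size_t ngroups,
						size_t limit, size_t *last_bound)
{
	size_t		done = 0;		/* tuples already returned to the caller */
	size_t		sorted = 0;		/* tuples fed to a sort so far */
	size_t		i;

	assert(last_bound != NULL);
	*last_bound = 0;

	for (i = 0; i < ngroups && done < limit; i++)
	{
		size_t		bound = limit - done;	/* bound for this group's sort */

		*last_bound = bound;
		sorted += group_sizes[i];
		done += (group_sizes[i] < bound) ? group_sizes[i] : bound;
	}
	return sorted;
}
```

For example, with hypothetical group sizes {50, 100, 100} and LIMIT 100,
the first group is sorted with a bound of 100 and the second with a bound
of 50; the third group is never read. Only that last, bounded sort saves
any work, which is the case being questioned above.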

Even if you don't like that, the "Sort Method" shown above is just
misleading. I wonder, also, if you need to be more careful about
whether or not "Memory" is really the high watermark, as opposed to
the memory used by the last sort operation of many. There could be
many large tuples in one grouping, for example. Note that the current
code will not show any "Memory" in explain analyze for cases that have
memory freed before execution is done, which this is not consistent
with. Maybe that's not so important. Unsure.
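
The distinction between "memory used by the last sort" and "high
watermark over all sorts" can be sketched as follows. This is standalone
C with hypothetical names (SortStats, sort_stats_record), not anything
from tuplesort.c; it only illustrates the bookkeeping that would be
needed if "Memory" is meant to be the peak.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accumulator for per-group sort statistics. */
typedef struct SortStats
{
	size_t		ngroups;		/* number of group sorts completed */
	size_t		last_bytes;		/* memory used by the most recent group sort */
	size_t		peak_bytes;		/* high watermark over all group sorts */
} SortStats;

static void
sort_stats_init(SortStats *stats)
{
	assert(stats != NULL);
	stats->ngroups = 0;
	stats->last_bytes = 0;
	stats->peak_bytes = 0;
}

/* Record the memory used by one finished group sort. */
static void
sort_stats_record(SortStats *stats, size_t bytes_used)
{
	assert(stats != NULL);
	stats->ngroups++;
	stats->last_bytes = bytes_used;
	if (bytes_used > stats->peak_bytes)
		stats->peak_bytes = bytes_used;
}
```

If one grouping in the middle holds many large tuples, last_bytes and
peak_bytes diverge, and reporting only the former would understate what
the node actually needed.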

trace_sort output shows that these queries often use a large number of
tiny individual sorts. Maybe that's okay, or maybe we should make it
clearer that the sorts are related. I now use trace_sort a lot.

Abbreviated Keys
-----------------------

It could be very bad for performance that the first non-presorted key
uses abbreviated keys. There needs to be a way to tell tuplesort to
not waste its time with them, just as there currently is for bounded
(top-N heapsort) sorts. They're almost certainly the wrong way to go,
unless you have huge groups (where partial sorting is unlikely to win
in the first place).

Other issues in executor
--------------------------------

This is sort of an optimizer issue, but code lives in execAmi.c.
Assert is redundant here:

+               case T_Sort:
+                       /* We shouldn't reach here without having plan node */
+                       Assert(node);
+                       /* With skipCols sort node holds only last bucket */
+                       if (node && ((Sort *)node)->skipCols == 0)
+                               return true;
+                       else
+                               return false;

I don't like that you've added a Plan node argument to
ExecMaterializesOutput() in this function, too.

There is similar assert/pointer test code within
ExecSupportsBackwardScan() and ExecSupportsMarkRestore(). In general,
I have concerns about the way the determination of a sort's ability to
do stuff like be scanned backwards is now made dynamic, which this new
code demonstrates:

        /*
+        * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+        * or EXEC_FLAG_MARK, because we hold only current bucket in
+        * tuplesortstate.
+        */
+       Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+                                                EXEC_FLAG_BACKWARD |
+                                                EXEC_FLAG_MARK)) == 0);
+

I need to think some more about this general issue.

Misc. issues
----------------

_readSort() needs READ_INT_FIELD(). _outSort() similarly needs
WRITE_INT_FIELD(). You've mostly missed this stuff.

Please be more careful about this. It's always a good idea to run the
regression tests with "#define COPY_PARSE_PLAN_TREES" from time to
time, which tends to highlight these problems.

tuplesort.h should not include sortsupport.h. It's a modularity
violation, and besides, it is unnecessary. Similarly, pathkeys.c
should not include optimizer/cost.h.

What is this?

+               if (inner_cheapest_total && inner_cheapest_total->pathtype == T_Sort)
+                       elog(ERROR, "Sort");

Optimizer
-------------

I am not an expert on the optimizer, but I do have some feedback.

* cost_sort() needs way way more comments. Doesn't even mention
indexes. Not worth commenting further on until I know what it's
*supposed* to do, though.

* pathkeys_useful_for_ordering() now looks like a private convenience
wrapper for the new public function pathkeys_common(). I think that
comments should make this quite clear.

* compare_bifractional_path_costs() should live beside
compare_fractional_path_costs() and be public, I think. The existing
compare_fractional_path_costs() also only has a small number of
possible clients, and is still not static.

* Think it's not okay that there are new arguments, such as the
"tuples" argument for get_cheapest_fractional_path_for_pathkeys().

It seems a bad sign, design-wise, that a new argument of "PlannerInfo
*root" was added at the end, for the narrow purpose of passing stuff to
estimate the number of groups for the benefit of this patch. ISTM that
the grouping_planner() caller should do the work itself as and when it
alone needs to.

* New loop within get_cheapest_fractional_path_for_pathkeys() requires
far more explanation.

Explain theory behind derivation of compare_bifractional_path_costs()
fraction arguments, please. I think there might be simple heuristics
like this elsewhere in the optimizer or selfuncs.c, but you need to
share why you did things that way in the code.

* Within planner.c, "partial_sort_path" variable name in
grouping_planner() might be called something else.

Its purpose isn't clear. Also, the way that you mix path costs from
the new get_cheapest_fractional_path_for_pathkeys() into the new
cost_sort() needs to be explained in detail (as I already said,
cost_sort() is currently very under-documented).
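
For readers trying to follow what the new cost_sort() appears to be
doing, here is a hedged standalone sketch of the CPU part only: the
input is split into num_groups equal groups and each group costs about
N log2 N comparisons. The function names are hypothetical, disk costs and
LIMIT handling are ignored, and approx_log2() is a crude stand-in used
only to keep the sketch free of libm (it is exact for powers of two);
real code would call log2().

```c
#include <assert.h>

/*
 * Crude log2 approximation: exact for powers of two, linearly
 * interpolated in between.  Assumption made for this sketch only.
 */
static double
approx_log2(double x)
{
	double		result = 0.0;

	assert(x >= 1.0);
	while (x >= 2.0)
	{
		x /= 2.0;
		result += 1.0;
	}
	/* x is now in [1, 2) */
	return result + (x - 1.0);
}

/* About N log2 N comparisons for a single sort of "tuples" rows. */
static double
sort_cpu_cost(double comparison_cost, double tuples)
{
	/* Clamp so that tiny inputs never get a zero or negative cost. */
	if (tuples < 2.0)
		tuples = 2.0;
	return comparison_cost * tuples * approx_log2(tuples);
}

/* The same input sorted as num_groups independent, equally sized groups. */
static double
partial_sort_cpu_cost(double comparison_cost, double tuples,
					  double num_groups)
{
	assert(num_groups >= 1.0);
	return num_groups * sort_cpu_cost(comparison_cost, tuples / num_groups);
}
```

Under these assumptions, 2^20 tuples in 2^10 groups cost half as many
comparisons as one full sort (10 levels instead of 20), and the saving
grows with the number of groups. Whether the patch intends exactly this
is what the requested comments should spell out.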

Obviously the optimizer part of this needs the most work -- no
surprises there. I wonder if we cover all useful cases? I haven't yet
got around to using "#define OPTIMIZER_DEBUG" to see what's really
going on, which seems essential to understanding what is really
happening, but it looks like merge append paths can currently use the
optimization, whereas unique paths cannot. Have you thought about
that?

That's all I have for now...

--
Peter Geoghegan


#72Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#69)
Re: PoC: Partial sort

Hi,

On 10/20/2015 01:17 PM, Alexander Korotkov wrote:

On Fri, Oct 16, 2015 at 7:11 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andreas@proxel.se> wrote:

Are you planning to work on this patch for 9.6?

FWIW I hope so. It's a nice patch.

I'm trying to whisk the dust off. A rebased version of the patch is attached.
This patch isn't passing regression tests because of plan changes.
I'm not yet sure about those changes: why do they happen, and are they
really regressions?
Since I'm not very familiar with planning of INSERT ON CONFLICT and
RLS, any help is appreciated.

The planner regression is fixed in the attached version of the patch. It
appears that get_cheapest_fractional_path_for_pathkeys() behaved
incorrectly when no ordering is required.

Alexander, are you working on this patch? I'd like to look at the patch,
but the last available version (v4) no longer applies - there's plenty
of bitrot. Do you plan to send an updated / rebased version?

The main thing I'm particularly interested in is how tightly this is
coupled with the Sort node, and whether it's possible to feed partially
sorted tuples into other nodes.

I'm particularly thinking about Hash Aggregate, because the partial sort
allows keeping only the "current group" in a hash table, making it much
more memory efficient / faster. What do you think?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#73Peter Geoghegan
pg@heroku.com
In reply to: Tomas Vondra (#72)
Re: PoC: Partial sort

On Sat, Jan 23, 2016 at 4:07 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The main thing I'm particularly interested in is how tightly this is
coupled with the Sort node, and whether it's possible to feed partially
sorted tuples into other nodes.

That's cool, but I'm particularly interested in seeing Alexander get
back to this because it's an important project on its own. We should
really have this.

--
Peter Geoghegan


#74Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Tomas Vondra (#72)
1 attachment(s)
Re: PoC: Partial sort

Hi, Tomas!

On Sat, Jan 23, 2016 at 3:07 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 10/20/2015 01:17 PM, Alexander Korotkov wrote:

On Fri, Oct 16, 2015 at 7:11 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andreas@proxel.se> wrote:

Are you planning to work on this patch for 9.6?

FWIW I hope so. It's a nice patch.

I'm trying to whisk the dust off. A rebased version of the patch is attached.
This patch isn't passing regression tests because of plan changes.
I'm not yet sure about those changes: why do they happen, and are they
really regressions?
Since I'm not very familiar with planning of INSERT ON CONFLICT and
RLS, any help is appreciated.

The planner regression is fixed in the attached version of the patch. It
appears that get_cheapest_fractional_path_for_pathkeys() behaved
incorrectly when no ordering is required.

Alexander, are you working on this patch? I'd like to look at the patch,
but the last available version (v4) no longer applies - there's plenty of
bitrot. Do you plan to send an updated / rebased version?

I'm sorry that I haven't found time for this yet. I'm certainly planning to
get back to this in the near future. The attached version is just rebased,
without any optimization.

The main thing I'm particularly interested in is how tightly this is
coupled with the Sort node, and whether it's possible to feed partially
sorted tuples into other nodes.

I'm particularly thinking about Hash Aggregate, because the partial sort
allows keeping only the "current group" in a hash table, making it much
more memory efficient / faster. What do you think?

This seems to me a very reasonable optimization. And it would be nice to
implement some generalized way of processing presorted groups. For instance,
we could have some special node, say "Group Scan", which has 2 children:
a source, and a node which processes every group. For "partial sort" the
second node would be a Sort node. But it could be a Hash Aggregate node as well.
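
The shape of such a generalized "Group Scan" can be sketched outside the
executor as a driver that finds group boundaries in a presorted stream
and hands each group to a pluggable consumer. Everything below is
hypothetical (GroupFn, group_scan, track_largest_group); it is plain C
over an int array, not a proposal for the actual node API.

```c
#include <assert.h>
#include <stddef.h>

/* Callback invoked once per group of consecutive equal keys. */
typedef void (*GroupFn) (const int *keys, size_t start, size_t len,
						 void *arg);

/*
 * Walk a presorted key array, detect group boundaries, and pass each
 * group to "process".  Returns the number of groups found.
 */
static size_t
group_scan(const int *keys, size_t n, GroupFn process, void *arg)
{
	size_t		start = 0;
	size_t		ngroups = 0;

	assert(process != NULL);

	while (start < n)
	{
		size_t		end = start + 1;

		while (end < n && keys[end] == keys[start])
			end++;

		/* Only this one group needs to be held by the consumer. */
		process(keys, start, end - start, arg);
		ngroups++;
		start = end;
	}
	return ngroups;
}

/* Example consumer: remember the size of the largest group seen. */
static void
track_largest_group(const int *keys, size_t start, size_t len, void *arg)
{
	size_t	   *largest = (size_t *) arg;

	(void) keys;
	(void) start;
	if (len > *largest)
		*largest = len;
}
```

A sorting consumer would sort each group by the remaining keys, while an
aggregating consumer would build and then discard a per-group hash table;
in both cases the driver stays the same, and the largest group bounds the
memory needed at any one time.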

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

partial-sort-basic-5.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 25d8ca0..60081cb
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_grouping_set_keys(PlanS
*** 88,94 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 88,94 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** ExplainNode(PlanState *planstate, List *
*** 902,908 ****
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
--- 902,911 ----
  			pname = sname = "Materialize";
  			break;
  		case T_Sort:
! 			if (((Sort *) plan)->skipCols > 0)
! 				pname = sname = "Partial sort";
! 			else
! 				pname = sname = "Sort";
  			break;
  		case T_Group:
  			pname = sname = "Group";
*************** show_sort_keys(SortState *sortstate, Lis
*** 1738,1744 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1741,1747 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_merge_append_keys(MergeAppendState 
*** 1754,1760 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1757,1763 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1778,1784 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1781,1787 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 1834,1840 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 1837,1843 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 1891,1897 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 1894,1900 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 1904,1916 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 1907,1920 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 1950,1958 ****
--- 1954,1966 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 35864c1..951ea69
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
*************** ExecRestrPos(PlanState *node)
*** 383,389 ****
   * know which plan types support mark/restore.
   */
  bool
! ExecSupportsMarkRestore(Path *pathnode)
  {
  	/*
  	 * For consistency with the routines above, we do not examine the nodeTag
--- 383,389 ----
   * know which plan types support mark/restore.
   */
  bool
! ExecSupportsMarkRestore(Path *pathnode, Plan *node)
  {
  	/*
  	 * For consistency with the routines above, we do not examine the nodeTag
*************** ExecSupportsMarkRestore(Path *pathnode)
*** 395,403 ****
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
- 		case T_Sort:
  			return true;
  
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
--- 395,411 ----
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
  			return true;
  
+ 		case T_Sort:
+ 			/* We shouldn't reach here without having plan node */
+ 			Assert(node);
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (node && ((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
*************** ExecSupportsBackwardScan(Plan *node)
*** 508,517 ****
  			return false;
  
  		case T_Material:
- 		case T_Sort:
  			/* these don't evaluate tlist */
  			return true;
  
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
--- 516,531 ----
  			return false;
  
  		case T_Material:
  			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
*************** IndexSupportsBackwardScan(Oid indexid)
*** 572,578 ****
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype)
  {
  	switch (plantype)
  	{
--- 586,592 ----
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype, Plan *node)
  {
  	switch (plantype)
  	{
*************** ExecMaterializesOutput(NodeTag plantype)
*** 580,588 ****
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
- 		case T_Sort:
  			return true;
  
  		default:
  			break;
  	}
--- 594,610 ----
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
  			return true;
  
+ 		case T_Sort:
+ 			/* We shouldn't reach here without having plan node */
+ 			Assert(node);
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (node && ((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		default:
  			break;
  	}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index 102dbdf..e92ddfa
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,109 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
+ #include "utils/lsyscache.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ static void
+ prepareSkipCols(Sort *plannode, SortState *node)
+ {
+ 	int skipCols = plannode->skipCols, i;
+ 
+ 	node->skipKeys = (SkipKeyData *)palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 											plannode->sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 126,136 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+ 	PlanState  *outerNode;
+ 	TupleDesc	tupDesc;
+ 	int			skipCols = plannode->skipCols;
+ 	int64		nTuples = 0;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,132 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
  
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols,
! 											  plannode->sortColIdx,
! 											  plannode->sortOperators,
! 											  plannode->collations,
! 											  plannode->nullsFirst,
  											  work_mem,
  											  node->randomAccess);
- 		if (node->bounded)
- 			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
! 		{
! 			slot = ExecProcNode(outerNode);
  
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
  	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
--- 143,286 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
! 	if (node->tuplesortstate != NULL)
! 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
! 	else
! 	{
! 		/* Support structures for cmpSortSkipCols - already sorted columns */
! 		if (skipCols)
! 			prepareSkipCols(plannode, node);
  
+ 		/* Only pass on remaining columns that are unsorted */
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols - skipCols,
! 											  &(plannode->sortColIdx[skipCols]),
! 											  &(plannode->sortOperators[skipCols]),
! 											  &(plannode->collations[skipCols]),
! 											  &(plannode->nullsFirst[skipCols]),
  											  work_mem,
  											  node->randomAccess);
  		node->tuplesortstate = (void *) tuplesortstate;
+ 	}
  
! 	if (node->bounded)
! 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
  
! 	/*
! 	 * Put the next group of tuples, whose first "skipCols" sort values are
! 	 * equal, into tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
+ 		if (skipCols == 0)
+ 		{
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
+ 			nTuples++;
  		}
+ 		else if (node->prev)
+ 		{
+ 			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
+ 			nTuples++;
  
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				bool cmp;
! 				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
! 				node->prev = ExecCopySlotTuple(slot);
! 				if (!cmp)
! 					break;
! 			}
! 		}
! 		else
! 		{
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				node->prev = ExecCopySlotTuple(slot);
! 			}
! 		}
! 	}
  
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
  
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 
! 	/*
! 	 * Adjust bound_Done with number of tuples we've actually sorted.
! 	 */
! 	if (node->bounded)
! 	{
! 		if (node->finished)
! 			node->bound_Done = node->bound;
! 		else
! 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
  	}
  
+ 	SO1_printf("ExecSort: %s\n", "sorting done");
+ 
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
*************** ExecInitSort(Sort *node, EState *estate,
*** 157,162 ****
--- 311,325 ----
  			   "initializing sort node");
  
  	/*
+ 	 * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+ 	 * or EXEC_FLAG_MARK, because we hold only current bucket in
+ 	 * tuplesortstate.
+ 	 */
+ 	Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+ 											 EXEC_FLAG_BACKWARD |
+ 											 EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
  	 * create state structure
  	 */
  	sortstate = makeNode(SortState);
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 337,346 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
+ 	sortstate->bound_Done = 0;
  
  	/*
  	 * Miscellaneous initialization
*************** ExecReScanSort(SortState *node)
*** 318,323 ****
--- 484,490 ----
  		node->sort_Done = false;
  		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
  		node->tuplesortstate = NULL;
+ 		node->bound_Done = 0;
  
  		/*
  		 * if chgParam of subnode is not null then plan will be re-scanned by
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 5877037..ef9ce04
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 827,832 ****
--- 827,833 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 5fc80e7..b6bdc9a
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Plan *runion, Plan 
*** 1424,1438 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1424,1445 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1462,1474 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1469,1515 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate number of groups into which presorted keys divide the dataset.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List *groupExprs = NIL;
! 		ListCell *l;
! 		int i = 0;
! 
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			groupExprs = lappend(groupExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		num_groups = estimate_num_groups(root, groupExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1478,1484 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1519,1525 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1489,1498 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1530,1539 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1500,1514 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
  	 * doesn't do qual-checking or projection, so it has less overhead than
--- 1541,1566 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
  	/*
+ 	 * We have to sort the first group before the node can start output. Sorting
+ 	 * the rest of the groups is required to return all the tuples.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 	startup_cost += input_run_cost / num_groups;
+ 	run_cost += input_run_cost * ((num_groups - 1.0) / num_groups);
+ 
+ 	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
  	 * doesn't do qual-checking or projection, so it has less overhead than
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2248,2253 ****
--- 2300,2307 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->parent->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2274,2279 ****
--- 2328,2335 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->parent->width,
*************** final_cost_mergejoin(PlannerInfo *root, 
*** 2485,2491 ****
  	 * it off does not entitle us to deliver an invalid plan.
  	 */
  	else if (innersortkeys == NIL &&
! 			 !ExecSupportsMarkRestore(inner_path))
  		path->materialize_inner = true;
  
  	/*
--- 2541,2547 ----
  	 * it off does not entitle us to deliver an invalid plan.
  	 */
  	else if (innersortkeys == NIL &&
! 			 !ExecSupportsMarkRestore(inner_path, NULL))
  		path->materialize_inner = true;
  
  	/*
*************** cost_subplan(PlannerInfo *root, SubPlan 
*** 2999,3005 ****
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan)))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
--- 3055,3061 ----
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan), plan))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
new file mode 100644
index e61fa58..a9187c6
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** match_unsorted_outer(PlannerInfo *root,
*** 883,895 ****
  	}
  	else if (nestjoinOK)
  	{
  		/*
  		 * Consider materializing the cheapest inner path, unless
  		 * enable_material is off or the path in question materializes its
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
--- 883,898 ----
  	}
  	else if (nestjoinOK)
  	{
+ 		if (inner_cheapest_total && inner_cheapest_total->pathtype == T_Sort)
+ 			elog(ERROR, "Sort");
+ 
  		/*
  		 * Consider materializing the cheapest inner path, unless
  		 * enable_material is off or the path in question materializes its
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype, NULL))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index eed39b9..cce13b9
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 21,31 ****
--- 21,33 ----
  #include "nodes/makefuncs.h"
  #include "nodes/nodeFuncs.h"
  #include "nodes/plannodes.h"
+ #include "optimizer/cost.h"
  #include "optimizer/clauses.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 309,314 ****
--- 311,342 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 366,374 ****
  }
  
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
   *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
--- 394,430 ----
  }
  
  /*
+  * Compare the costs of two paths, assuming a different fraction of tuples is
+  * returned from each path.
+  */
+ static int
+ compare_bifractional_path_costs(Path *path1, Path *path2,
+ 							  double fraction1, double fraction2)
+ {
+ 	Cost		cost1,
+ 				cost2;
+ 
+ 	if (fraction1 <= 0.0 || fraction1 >= 1.0)
+ 		fraction1 = 1.0;
+ 	if (fraction2 <= 0.0 || fraction2 >= 1.0)
+ 		fraction2 = 1.0;
+ 	cost1 = path1->startup_cost +
+ 		fraction1 * (path1->total_cost - path1->startup_cost);
+ 	cost2 = path2->startup_cost +
+ 		fraction2 * (path2->total_cost - path2->startup_cost);
+ 	if (cost1 < cost2)
+ 		return -1;
+ 	if (cost1 > cost2)
+ 		return +1;
+ 	return 0;
+ }
+ 
+ /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Paths are compared using the
!  *	  different fractions of tuples each must fetch when partially sorted.
   *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
*************** Path *
*** 383,409 ****
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction)
  {
  	Path	   *matched_path = NULL;
  	ListCell   *l;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
  
  		/*
! 		 * Since cost comparison is a lot cheaper than pathkey comparison, do
! 		 * that first.  (XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
! 			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
  
--- 439,518 ----
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *num_groups, matched_fraction;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each possible partial sort.
+ 	 */
+ 	i = 0;
+ 	num_groups = (double *)palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		num_groups[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
+ 		double		current_fraction;
+ 
+ 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+ 
+ 		if (n_pathkeys != 0 && n_common_pathkeys == 0)
+ 			continue;
  
  		/*
! 		 * Estimate the fraction of outer tuples that must be fetched before the
! 		 * partial sort can start returning tuples.
  		 */
! 		current_fraction = fraction;
! 		if (n_common_pathkeys < n_pathkeys)
! 		{
! 			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
! 			current_fraction = Min(current_fraction, 1.0);
! 		}
  
! 		/*
! 		 * Do cost comparison.
! 		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_bifractional_path_costs(matched_path, path,
! 					matched_fraction, current_fraction);
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
! 
! 		if (costs_cmp > 0 &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 			matched_fraction = current_fraction;
+ 		}
  	}
+ 
  	return matched_path;
  }
  
*************** right_merge_direction(PlannerInfo *root,
*** 1447,1455 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
--- 1556,1563 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of leading pathkeys that match the given argument.  The
!  * others can be satisfied by a partial sort.
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
*************** pathkeys_useful_for_ordering(PlannerInfo
*** 1460,1472 ****
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
! 	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
! 	}
! 
! 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1568,1579 ----
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	/*
! 	 * Return the number of path keys in common, or 0 if there are none. Any
! 	 * leading common pathkeys can be useful for ordering because we can use
! 	 * a partial sort.
! 	 */
! 	return pathkeys_common(root->query_pathkeys, pathkeys);
  }
  
  /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index fda4df6..1b52b2e
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 163,168 ****
--- 163,169 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
  static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples);
*************** create_merge_append_plan(PlannerInfo *ro
*** 811,816 ****
--- 812,818 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		subplan = create_plan_recurse(root, subpath);
*************** create_merge_append_plan(PlannerInfo *ro
*** 844,851 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
--- 846,855 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
+ 										 pathkeys, n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2448,2456 ****
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									outer_plan,
! 									best_path->outersortkeys,
! 									-1.0);
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
--- 2452,2462 ----
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								outer_plan,
! 								best_path->outersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys));
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2461,2469 ****
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									inner_plan,
! 									best_path->innersortkeys,
! 									-1.0);
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
--- 2467,2477 ----
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								inner_plan,
! 								best_path->innersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys));
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
*************** make_mergejoin(List *tlist,
*** 4038,4043 ****
--- 4046,4052 ----
   */
  static Sort *
  make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+           List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples)
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 4047,4053 ****
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4056,4063 ----
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, pathkeys, skipCols,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 4061,4066 ****
--- 4071,4077 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 4389,4395 ****
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 4400,4406 ----
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(PlannerInfo *roo
*** 4409,4415 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
--- 4420,4426 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
*************** make_sort_from_sortclauses(PlannerInfo *
*** 4452,4458 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4463,4469 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, NIL, 0,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
*************** Sort *
*** 4474,4480 ****
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 4485,4492 ----
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 List *pathkeys, int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(PlannerInfo *ro
*** 4507,4513 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4519,4525 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 373e6cc..f475e99
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
*************** build_minmax_path(PlannerInfo *root, Min
*** 504,510 ****
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction);
  	if (!sorted_path)
  		return false;
  
--- 504,512 ----
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction,
! 												  subroot,
! 												  final_rel->rows);
  	if (!sorted_path)
  		return false;
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index c0ec905..66671f0
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** static Plan *build_grouping_chain(Planne
*** 134,140 ****
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan);
  
  /*****************************************************************************
   *
--- 134,142 ----
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan,
! 					 List *path_keys,
! 					 int n_common_pathkeys);
  
  /*****************************************************************************
   *
*************** grouping_planner(PlannerInfo *root, doub
*** 1767,1773 ****
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
--- 1769,1777 ----
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction,
! 													  root,
! 													  path_rows);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
*************** grouping_planner(PlannerInfo *root, doub
*** 1783,1792 ****
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
  
  			if (root->query_pathkeys == NIL ||
! 				pathkeys_contained_in(root->query_pathkeys,
! 									  cheapest_path->pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
--- 1787,1800 ----
  		if (sorted_path)
  		{
  			Path		sort_path;		/* dummy for result of cost_sort */
+ 			Path		partial_sort_path;	/* dummy for result of cost_sort */
+ 			int			n_common_pathkeys;
+ 
+ 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
+ 												cheapest_path->pathkeys);
  
  			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
*************** grouping_planner(PlannerInfo *root, doub
*** 1796,1807 ****
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			if (compare_fractional_path_costs(sorted_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
--- 1804,1838 ----
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
+ 						  n_common_pathkeys,
+ 						  cheapest_path->startup_cost,
  						  cheapest_path->total_cost,
  						  path_rows, path_width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
! 												sorted_path->pathkeys);
! 
! 			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
! 			{
! 				/* No sort needed for presorted path */
! 				partial_sort_path.startup_cost = sorted_path->startup_cost;
! 				partial_sort_path.total_cost = sorted_path->total_cost;
! 			}
! 			else
! 			{
! 				/* Figure cost for sorting */
! 				cost_sort(&partial_sort_path, root, root->query_pathkeys,
! 						  n_common_pathkeys,
! 						  sorted_path->startup_cost,
! 						  sorted_path->total_cost,
! 						  path_rows, path_width,
! 						  0.0, work_mem, root->limit_tuples);
! 			}
! 
! 			if (compare_fractional_path_costs(&partial_sort_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
*************** grouping_planner(PlannerInfo *root, doub
*** 1896,1908 ****
  			AttrNumber *groupColIdx = NULL;
  			bool		need_tlist_eval = true;
  			bool		need_sort_for_grouping = false;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
  			if (parse->groupClause && !use_hashed_grouping &&
! 			  !pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  				need_sort_for_grouping = true;
  
  			/*
--- 1927,1942 ----
  			AttrNumber *groupColIdx = NULL;
  			bool		need_tlist_eval = true;
  			bool		need_sort_for_grouping = false;
+ 			int			n_common_pathkeys_grouping;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
+ 			n_common_pathkeys_grouping = pathkeys_common(root->group_pathkeys,
+ 														 current_pathkeys);
  			if (parse->groupClause && !use_hashed_grouping &&
! 				n_common_pathkeys_grouping < list_length(root->group_pathkeys))
  				need_sort_for_grouping = true;
  
  			/*
*************** grouping_planner(PlannerInfo *root, doub
*** 2036,2042 ****
  												   groupColIdx,
  												   &agg_costs,
  												   numGroups,
! 												   result_plan);
  			}
  			else if (parse->groupClause)
  			{
--- 2070,2078 ----
  												   groupColIdx,
  												   &agg_costs,
  												   numGroups,
! 												   result_plan,
! 												   root->group_pathkeys,
! 												   n_common_pathkeys_grouping);
  			}
  			else if (parse->groupClause)
  			{
*************** grouping_planner(PlannerInfo *root, doub
*** 2053,2059 ****
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan);
  					current_pathkeys = root->group_pathkeys;
  				}
  
--- 2089,2097 ----
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan,
! 												 root->group_pathkeys,
! 												 n_common_pathkeys_grouping);
  					current_pathkeys = root->group_pathkeys;
  				}
  
*************** grouping_planner(PlannerInfo *root, doub
*** 2191,2203 ****
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0);
! 					if (!pathkeys_contained_in(window_pathkeys,
! 											   current_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
--- 2229,2245 ----
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
+ 					int			n_common_pathkeys;
+ 
+ 					n_common_pathkeys = pathkeys_common(window_pathkeys,
+ 													    current_pathkeys);
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0,
! 														n_common_pathkeys);
! 					if (n_common_pathkeys < list_length(window_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
*************** grouping_planner(PlannerInfo *root, doub
*** 2346,2364 ****
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					current_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					current_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 current_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															current_pathkeys,
! 															   -1.0);
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
--- 2388,2408 ----
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					needed_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					needed_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 needed_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															   needed_pathkeys,
! 															   -1.0,
! 							pathkeys_common(needed_pathkeys, current_pathkeys));
! 				current_pathkeys = needed_pathkeys;
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
*************** grouping_planner(PlannerInfo *root, doub
*** 2374,2385 ****
  	 */
  	if (parse->sortClause)
  	{
! 		if (!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
--- 2418,2432 ----
  	 */
  	if (parse->sortClause)
  	{
! 		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
! 		
! 		if (common < list_length(root->sort_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples,
! 														   common);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
*************** build_grouping_chain(PlannerInfo *root,
*** 2481,2487 ****
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan)
  {
  	AttrNumber *top_grpColIdx = groupColIdx;
  	List	   *chain = NIL;
--- 2528,2536 ----
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan,
! 					 List *path_keys,
! 					 int n_common_pathkeys)
  {
  	AttrNumber *top_grpColIdx = groupColIdx;
  	List	   *chain = NIL;
*************** build_grouping_chain(PlannerInfo *root,
*** 2502,2508 ****
  			make_sort_from_groupcols(root,
  									 llast(rollup_groupclauses),
  									 top_grpColIdx,
! 									 result_plan);
  	}
  
  	/*
--- 2551,2559 ----
  			make_sort_from_groupcols(root,
  									 llast(rollup_groupclauses),
  									 top_grpColIdx,
! 									 result_plan,
! 									 path_keys,
! 									 n_common_pathkeys);
  	}
  
  	/*
*************** build_grouping_chain(PlannerInfo *root,
*** 2533,2539 ****
  				make_sort_from_groupcols(root,
  										 groupClause,
  										 new_grpColIdx,
! 										 result_plan);
  
  			/*
  			 * sort_plan includes the cost of result_plan, which is not what
--- 2584,2592 ----
  				make_sort_from_groupcols(root,
  										 groupClause,
  										 new_grpColIdx,
! 										 result_plan,
! 										 NIL,
! 										 0);
  
  			/*
  			 * sort_plan includes the cost of result_plan, which is not what
*************** choose_hashed_grouping(PlannerInfo *root
*** 3648,3653 ****
--- 3701,3707 ----
  	List	   *current_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * Executor doesn't support hashed aggregation with DISTINCT or ORDER BY
*************** choose_hashed_grouping(PlannerInfo *root
*** 3729,3735 ****
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3783,3790 ----
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_grouping(PlannerInfo *root
*** 3745,3753 ****
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 	if (!pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
--- 3800,3811 ----
  		sorted_p.total_cost = cheapest_path->total_cost;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 
! 	n_common_pathkeys = pathkeys_common(root->group_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(root->group_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
*************** choose_hashed_grouping(PlannerInfo *root
*** 3762,3771 ****
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
  	/* The Agg or Group node will preserve ordering */
! 	if (target_pathkeys &&
! 		!pathkeys_contained_in(target_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3820,3831 ----
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
+ 
  	/* The Agg or Group node will preserve ordering */
! 	n_common_pathkeys = pathkeys_common(target_pathkeys, current_pathkeys);
! 	if (target_pathkeys && n_common_pathkeys < list_length(target_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumGroups, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 3818,3823 ****
--- 3878,3884 ----
  	List	   *needed_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * If we have a sortable DISTINCT ON clause, we always use sorting. This
*************** choose_hashed_distinct(PlannerInfo *root
*** 3883,3889 ****
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3944,3951 ----
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 3900,3922 ****
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 	if (!pathkeys_contained_in(needed_pathkeys, current_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
  	if (parse->sortClause &&
! 		!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 3962,3991 ----
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 
! 	n_common_pathkeys = pathkeys_common(needed_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(needed_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
+ 
+ 
+ 	n_common_pathkeys = pathkeys_common(root->sort_pathkeys, current_pathkeys);
  	if (parse->sortClause &&
! 		n_common_pathkeys < list_length(root->sort_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumDistinctRows, path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 4706,4713 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 4775,4783 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 31db35c..66e82ec
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** build_subplan(PlannerInfo *root, Plan *p
*** 823,829 ****
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan)))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
--- 823,829 ----
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan), plan))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index e509a1a..24caf36
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 865,871 ****
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 865,872 ----
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 1097a18..a3547dd
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** create_merge_append_path(PlannerInfo *ro
*** 1269,1280 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1269,1281 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1288,1293 ****
--- 1289,1296 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->parent->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1513,1519 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
--- 1516,1523 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  rel->width,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index a30e170..ea4684f
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** tuplesort_end(Tuplesortstate *state)
*** 1076,1081 ****
--- 1076,1101 ----
  	MemoryContextDelete(state->sortcontext);
  }
  
+ void
+ tuplesort_reset(Tuplesortstate *state)
+ {
+ 	int i;
+ 
+ 	if (state->tapeset)
+ 		LogicalTapeSetClose(state->tapeset);
+ 
+ 	for (i = 0; i < state->memtupcount; i++)
+ 		free_sort_tuple(state, state->memtuples + i);
+ 
+ 	state->status = TSS_INITIAL;
+ 	state->memtupcount = 0;
+ 	state->boundUsed = false;
+ 	state->tapeset = NULL;
+ 	state->currentRun = 0;
+ 	state->result_tape = -1;
+ 	state->bounded = false;
+ }
+ 
  /*
   * Grow the memtuples[] array, if possible within our memory constraint.  We
   * must not exceed INT_MAX tuples in memory or the caller-provided memory
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
new file mode 100644
index 1a44085..0075be5
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
*************** struct Path;					/* avoid including rela
*** 104,112 ****
  extern void ExecReScan(PlanState *node);
  extern void ExecMarkPos(PlanState *node);
  extern void ExecRestrPos(PlanState *node);
! extern bool ExecSupportsMarkRestore(struct Path *pathnode);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype);
  
  /*
   * prototypes from functions in execCurrent.c
--- 104,112 ----
  extern void ExecReScan(PlanState *node);
  extern void ExecMarkPos(PlanState *node);
  extern void ExecRestrPos(PlanState *node);
! extern bool ExecSupportsMarkRestore(struct Path *pathnode, Plan *node);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype, Plan *node);
  
  /*
   * prototypes from functions in execCurrent.c
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 07cd20a..fa9ae2e
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1789,1794 ****
--- 1789,1801 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ typedef struct SkipKeyData
+ {
+ 	FunctionCallInfoData	fcinfo;
+ 	FmgrInfo				flinfo;
+ 	OffsetNumber			attno;
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1800,1808 ****
--- 1807,1819 ----
  	bool		bounded;		/* is the result set bounded? */
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* fetching tuples from outer node
+ 								   is finished ? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;
+ 	HeapTuple	prev;			/* previous tuple from outer node */
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index e823c83..7a9b57c
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 679,684 ****
--- 679,685 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 78c7cae..e7ae3ea
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 95,102 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 95,103 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 20474c3..f45382f
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 169,181 ****
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--- 169,184 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  PlannerInfo *root,
! 										  double tuples);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
new file mode 100644
index 7ae7367..56c6fcd
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*************** extern RecursiveUnion *make_recursive_un
*** 52,62 ****
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
--- 52,63 ----
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples, int skipCols);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index d31d994..cf65e1e
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
***************
*** 24,29 ****
--- 24,30 ----
  #include "executor/tuptable.h"
  #include "fmgr.h"
  #include "utils/relcache.h"
+ #include "utils/sortsupport.h"
  
  
  /* Tuplesortstate is an opaque type whose details are not known outside
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 89b6c1c..bc49011
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** SELECT thousand, thousand FROM tenk1
*** 1354,1366 ****
  ORDER BY thousand, tenthous;
                                 QUERY PLAN                                
  -------------------------------------------------------------------------
!  Merge Append
     Sort Key: tenk1.thousand, tenk1.tenthous
!    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
!    ->  Sort
!          Sort Key: tenk1_1.thousand, tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (6 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
--- 1354,1367 ----
  ORDER BY thousand, tenthous;
                                 QUERY PLAN                                
  -------------------------------------------------------------------------
!  Partial sort
     Sort Key: tenk1.thousand, tenk1.tenthous
!    Presorted Key: tenk1.thousand
!    ->  Merge Append
!          Sort Key: tenk1.thousand
!          ->  Index Only Scan using tenk1_thous_tenthous on tenk1
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (7 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
*************** SELECT x, y FROM
*** 1436,1450 ****
     UNION ALL
     SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
  ORDER BY x, y;
!                          QUERY PLAN                          
! -------------------------------------------------------------
!  Merge Append
     Sort Key: a.thousand, a.tenthous
!    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
!    ->  Sort
!          Sort Key: b.unique2, b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (6 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
--- 1437,1452 ----
     UNION ALL
     SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
  ORDER BY x, y;
!                             QUERY PLAN                             
! -------------------------------------------------------------------
!  Partial sort
     Sort Key: a.thousand, a.tenthous
!    Presorted Key: a.thousand
!    ->  Merge Append
!          Sort Key: a.thousand
!          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (7 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
#75Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Peter Geoghegan (#73)
Re: PoC: Partial sort

Hi!

On Sat, Jan 23, 2016 at 10:07 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Sat, Jan 23, 2016 at 4:07 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

The main thing I'm particularly interested in is how much is this coupled
with the Sort node, and whether it's possible to feed partially sorted
tuples into other nodes.

That's cool, but I'm particularly interested in seeing Alexander get
back to this because it's an important project on its own. We should
really have this.

Thank you for your review and interest in this patch! I'm sorry for the huge
delay on my side. I'm going to get back to this soon.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#76Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alexander Korotkov (#74)
Re: PoC: Partial sort

Alexander Korotkov wrote:

I'm sorry that I haven't found time for this yet. I'm certainly planning to
get back to this in the near future. The attached version is just rebased
without any optimization.

Great to have a new version -- there seems to be a lot of interest in
this patch. I'm moving this one to the next commitfest, thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#77Peter Geoghegan
pg@heroku.com
In reply to: Alvaro Herrera (#76)
Re: PoC: Partial sort

On Sun, Jan 31, 2016 at 4:16 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Great to have a new version -- there seems to be a lot of interest in
this patch. I'm moving this one to the next commitfest, thanks.

I am signed up to review this patch.

I was very surprised to see it in "Needs Review" state in the CF app
(Alexander just rebased the patch, and didn't do anything with the CF
app entry). Once again, this seems to have happened just because
Alvaro moved the patch to the next CF.

I've marked it "Waiting on Author" once more. Hopefully the CF app
will be fixed soon, so moving a patch to the next commitfest no longer
clobbers its state.

--
Peter Geoghegan


#78Alexander Korotkov
aekorotkov@gmail.com
In reply to: Peter Geoghegan (#71)
1 attachment(s)
Re: PoC: Partial sort

Hi, Peter!

I finally went over your review.

On Wed, Nov 4, 2015 at 4:44 AM, Peter Geoghegan <pg@heroku.com> wrote:

Explain output
-------------------

A query like your original test query looks like this for me:

postgres=# explain analyze select * from test order by v1, v2 limit 100;
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=429.54..434.53 rows=100 width=16) (actual time=15125.819..22414.038 rows=100 loops=1)
   ->  Partial sort  (cost=429.54..50295.52 rows=1000000 width=16) (actual time=15125.799..22413.297 rows=100 loops=1)
         Sort Key: v1, v2
         Presorted Key: v1
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.43 rows=1000000 width=16) (actual time=1.663..13.066 rows=151 loops=1)
 Planning time: 0.948 ms
 Execution time: 22414.895 ms
(8 rows)

I thought about it for a while, and decided that you have the basic
shape of the explain output right here. I see where you are going by
having the sort node drive things.

I don't think the node should be called "Partial sort", though. I
think that this is better presented as just a "Sort" node with a
presorted key.

I think it might be a good idea to also have a "Sort Groups: 2" field
above. That illustrates that you are in fact performing 2 small sorts
per group, which is the reality. As you said, it's good to have this
be high, because then the sort operations don't need to do too many
comparisons, which could be expensive.

I agree with your notes. In the attached version of the patch, the explain
output is revised as you proposed.
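
For readers following the thread, the group-wise behaviour behind a presorted
key and the proposed "Sort Groups" field can be sketched in a few lines of
standalone C. This is only an illustration under stated assumptions, not code
from the patch: the Row type, cmp_v2 and partial_sort are hypothetical names,
and the real executor feeds tuples through tuplesort rather than calling qsort
on an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row: v1 is the presorted key, v2 still needs sorting. */
typedef struct Row
{
    int     v1;
    double  v2;
} Row;

/* qsort comparator on the non-presorted key only. */
static int
cmp_v2(const void *a, const void *b)
{
    double  x = ((const Row *) a)->v2;
    double  y = ((const Row *) b)->v2;

    return (x > y) - (x < y);
}

/*
 * Order rows[0..n) by (v1, v2), assuming the input already arrives ordered
 * by v1.  Each run of equal v1 values is sorted on its own.  Returns the
 * number of groups, i.e. the number of separate small sorts performed.
 */
size_t
partial_sort(Row *rows, size_t n)
{
    size_t  start = 0;
    size_t  groups = 0;

    assert(rows != NULL || n == 0);

    while (start < n)
    {
        size_t  end = start + 1;

        /* Extend the group while the presorted key does not change. */
        while (end < n && rows[end].v1 == rows[start].v1)
            end++;

        qsort(rows + start, end - start, sizeof(Row), cmp_v2);
        groups++;
        start = end;
    }
    return groups;
}
```

The return value corresponds to what a "Sort Groups" field would report. Many
groups mean many small sorts, each doing few comparisons.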

Sort Method
----------------

Even though the explain analyze above shows "top-N heapsort" as its
sort method, that isn't really true. I actually ran this through a
debugger, which is why the above plan took so long to execute, in case
you wondered. I saw that in practice the first sort executed for the
first group uses a quicksort, while only the second sort (needed for
the second and last group in this example) used a top-N heapsort.

Is it really worth using a top-N heapsort to avoid sorting the last
little bit of tuples in the last group? Maybe we should never push
down to an individual sort operation (we have one
tuplesort_performsort() per group) that it should be bounded. Our
quicksort falls back on insertion sort in the event of only having 7
elements (at that level of recursion), so having this almost always
use quicksort may be no bad thing.

Even if you don't like that, the "Sort Method" shown above is just
misleading. I wonder, also, if you need to be more careful about
whether or not "Memory" is really the high watermark, as opposed to
the memory used by the last sort operation of many. There could be
many large tuples in one grouping, for example. Note that the current
code will not show any "Memory" in explain analyze for cases that have
memory freed before execution is done, which this is not consistent
with. Maybe that's not so important. Unsure.

trace_sort output shows that these queries often use a large number of
tiny individual sorts. Maybe that's okay, or maybe we should make it
clearer that the sorts are related. I now use trace_sort a lot.

With partial sort we run multiple sorts in the same node. Ideally, we would
provide some aggregated information over those runs.
This situation looks very similar to a subplan which is called multiple
times. I checked how that works currently.

# explain analyze select (select sum(x.i) from (select i from generate_series(1,t * 1000) i order by i desc) x) from generate_series(1, 20, 1) t;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series t  (cost=0.00..74853.92 rows=1000 width=4) (actual time=0.777..83.498 rows=20 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=74.83..74.84 rows=1 width=4) (actual time=4.173..4.173 rows=1 loops=20)
           ->  Sort  (cost=59.83..62.33 rows=1000 width=4) (actual time=2.822..3.361 rows=10500 loops=20)
                 Sort Key: i.i
                 Sort Method: quicksort  Memory: 1706kB
                 ->  Function Scan on generate_series i  (cost=0.01..10.01 rows=1000 width=4) (actual time=0.499..1.106 rows=10500 loops=20)
 Planning time: 0.080 ms
 Execution time: 83.625 ms
(9 rows)

# explain analyze select (select sum(x.i) from (select i from generate_series(1,t * 1000) i order by i desc) x) from generate_series(20, 1, -1) t;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series t  (cost=0.00..74853.92 rows=1000 width=4) (actual time=11.414..86.127 rows=20 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=74.83..74.84 rows=1 width=4) (actual time=4.305..4.305 rows=1 loops=20)
           ->  Sort  (cost=59.83..62.33 rows=1000 width=4) (actual time=2.944..3.486 rows=10500 loops=20)
                 Sort Key: i.i
                 Sort Method: quicksort  Memory: 71kB
                 ->  Function Scan on generate_series i  (cost=0.01..10.01 rows=1000 width=4) (actual time=0.527..1.125 rows=10500 loops=20)
 Planning time: 0.080 ms
 Execution time: 86.165 ms
(9 rows)

In the case of a subplan, explain analyze gives us information about just the
last subplan run. This makes me uneasy. On one hand, it's probably OK that
partial sort behaves like a subplan and shows information about just the last
sort run. On the other hand, we need a better solution for this in the
general case.
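
One possible shape for such aggregated information is an accumulator that is
updated after every per-group sort, so that both the peak and the last run
stay visible. The sketch below is an assumption for illustration only:
SortRunStats and its functions are hypothetical names and do not exist in the
patch or in tuplesort.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node accumulator; not a structure from the patch. */
typedef struct SortRunStats
{
    size_t  nruns;          /* number of per-group sorts performed */
    size_t  peak_mem_kb;    /* largest memory use seen in any single run */
    size_t  last_mem_kb;    /* memory use of the most recent run */
} SortRunStats;

/* Reset the accumulator before the node starts executing. */
void
sort_stats_init(SortRunStats *s)
{
    assert(s != NULL);
    s->nruns = 0;
    s->peak_mem_kb = 0;
    s->last_mem_kb = 0;
}

/* Record one finished sort run and keep the high-water mark. */
void
sort_stats_add_run(SortRunStats *s, size_t mem_kb)
{
    assert(s != NULL);
    s->nruns++;
    s->last_mem_kb = mem_kb;
    if (mem_kb > s->peak_mem_kb)
        s->peak_mem_kb = mem_kb;
}
```

In the two subplan examples above, reporting only last_mem_kb gives 1706kB
when the largest input comes last and 71kB when it comes first, while a
peak_mem_kb field would not depend on the order of the runs.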

Abbreviated Keys
-----------------------

It could be very bad for performance that the first non-presorted key
uses abbreviated keys. There needs to be a way to tell tuplesort to
not waste its time with them, just as there currently is for bounded
(top-N heapsort) sorts. They're almost certainly the wrong way to go,
unless you have huge groups (where partial sorting is unlikely to win
in the first place).

Agree. I made

Other issues in executor
--------------------------------

This is sort of an optimizer issue, but code lives in execAmi.c.
Assert is redundant here:

+               case T_Sort:
+                       /* We shouldn't reach here without having plan node */
+                       Assert(node);
+                       /* With skipCols sort node holds only last bucket */
+                       if (node && ((Sort *)node)->skipCols == 0)
+                               return true;
+                       else
+                               return false;

I don't like that you've added a Plan node argument to
ExecMaterializesOutput() in this function, too.

I don't like it either, but I didn't find a better solution that avoids a
significant rework of the planner.
However, "Upper planner pathification" by Tom Lane seems to include such a
rework. It's likely that sort becomes a separate path node there.
Then ExecMaterializesOutput() could read the parameters of the path node.

There is similar assert/pointer test code within
ExecSupportsBackwardScan() and ExecSupportsMarkRestore(). In general,
I have concerns about the way the determination of a sort's ability to
do stuff like be scanned backwards is now made dynamic, which this new
code demonstrates:

/*
+        * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+        * or EXEC_FLAG_MARK, because we hold only current bucket in
+        * tuplesortstate.
+        */
+       Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+                                                EXEC_FLAG_BACKWARD |
+                                                EXEC_FLAG_MARK)) == 0);
+

I need to think some more about this general issue.

It has to be dynamic if we want to keep full sort and partial sort in the
same node. If the properties of full sort and partial sort differ and they
share the same node, then these properties of the Sort node have to be
dynamic.
An alternative idea is for the Sort node to fall back to a full sort if
it sees any of the above flags. But I'm not sure this is right. In some cases
it might be cheaper to do a partial sort and then materialize than to fall
back to a full sort.
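
To illustrate why only the current group is available at any moment, here is
a minimal standalone sketch of the technique in plain C. It is not code from
the patch: the Row type and partial_sort() are hypothetical names, and an
in-memory array stands in for the outer plan. The input is assumed to be
already ordered by k1; each run of equal k1 values is sorted by k2 on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row type: k1 is the presorted key, k2 the remaining key. */
typedef struct Row
{
	int			k1;
	int			k2;
} Row;

/* Compare by k2 only; k1 is equal within a group. */
static int
cmp_k2(const void *a, const void *b)
{
	const Row  *ra = (const Row *) a;
	const Row  *rb = (const Row *) b;

	return (ra->k2 > rb->k2) - (ra->k2 < rb->k2);
}

/*
 * Sort rows[] by (k1, k2), assuming it is already ordered by k1.
 * Returns the number of groups that were sorted.
 */
static size_t
partial_sort(Row *rows, size_t nrows)
{
	size_t		start = 0;
	size_t		ngroups = 0;

	while (start < nrows)
	{
		size_t		end = start + 1;

		/* Extend the group while the presorted key stays the same. */
		while (end < nrows && rows[end].k1 == rows[start].k1)
			end++;

		/* Only this one group needs to be held and sorted at a time. */
		qsort(rows + start, end - start, sizeof(Row), cmp_k2);
		ngroups++;
		start = end;
	}

	/* Sanity check: the result must be ordered by (k1, k2). */
	for (size_t i = 1; i < nrows; i++)
		assert(rows[i - 1].k1 < rows[i].k1 ||
			   (rows[i - 1].k1 == rows[i].k1 &&
				rows[i - 1].k2 <= rows[i].k2));

	return ngroups;
}
```

In an executor that streams its input, earlier groups are gone once the next
group has been loaded, which is why rewind, backward scan and mark/restore
would need the output to be materialized separately.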

Misc. issues
----------------

_readSort() needs READ_INT_FIELD(). _outSort() similarly needs
WRITE_INT_FIELD(). You've mostly missed this stuff.

Please be more careful about this. It's always a good idea to run the
regression tests with "#define COPY_PARSE_PLAN_TREES" from time to
time, which tends to highlight these problems.

Fixed. I've tried "#define COPY_PARSE_PLAN_TREES", and the regression tests
now pass with it.

tuplesort.h should not include sortsupport.h. It's a modularity
violation, and besides which is unnecessary. Similarly, pathkeys.c
should not include optimizer/cost.h.

Fixed.

What is this?

+               if (inner_cheapest_total && inner_cheapest_total->pathtype == T_Sort)
+                       elog(ERROR, "Sort");

It's just a piece of junk I used for debugging. Deleted.

Optimizer
-------------

I am not an expert on the optimizer, but I do have some feedback.

* cost_sort() needs way way more comments. Doesn't even mention
indexes. Not worth commenting further on until I know what it's
*supposed* to do, though.

I've added some comments.
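
As a rough illustration of the per-group estimate, the comparison cost can be
sketched as follows. This is a simplified standalone model, not the patch's
cost_sort(): it ignores input cost, disk-based sorts and LIMIT, assumes that
groups are of uniform size, and uses an integer ceil(log2 n) in place of
LOG2(). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest e such that 2^e >= n (0 for n <= 1). */
static unsigned
ceil_log2(uint64_t n)
{
	unsigned	e = 0;
	uint64_t	p = 1;

	while (e < 64 && p < n)
	{
		p <<= 1;
		e++;
	}
	return e;
}

/*
 * Approximate number of comparisons needed to sort "tuples" rows that the
 * presorted keys split into "num_groups" equally sized groups.
 * num_groups = 1 models a full sort.
 */
static uint64_t
sort_comparisons(uint64_t tuples, uint64_t num_groups)
{
	uint64_t	group_tuples;

	assert(num_groups >= 1);
	group_tuples = tuples / num_groups;

	/* About N log2 N comparisons per group, summed over all groups. */
	return num_groups * group_tuples * ceil_log2(group_tuples);
}
```

For 1,000,000 tuples split into 10,000 groups this model gives 7,000,000
comparisons, versus 20,000,000 for a full sort of the same input.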

* pathkeys_useful_for_ordering() now looks like a private convenience
wrapper for the new public function pathkeys_common(). I think that
comments should make this quite clear.

That's right. An explicit comment about that has been added.

* compare_bifractional_path_costs() should live beside
compare_fractional_path_costs() and be public, I think. The existing
compare_fractional_path_costs() also only has a small number of
possible clients, and is still not static.

Now compare_bifractional_path_costs() is together with

* Think it's not okay that there are new arguments, such as the
"tuples" argument for get_cheapest_fractional_path_for_pathkeys().

It seems a bad sign, design-wise, that a new argument of "PlannerInfo
*root" was added at end, for the narrow purpose of passing stuff to
estimate number of groups for the benefit of this patch. ISTM that
grouping_planner() caller should do the
work itself as and when it alone needs to.

grouping_planner() now calls a separate function,
estimate_pathkeys_groups(), which is responsible for estimating the number of
groups.

* New loop within get_cheapest_fractional_path_for_pathkeys() requires
far more explanation.

Explain theory behind derivation of compare_bifractional_path_costs()
fraction arguments, please. I think there might be simple heuristics
like this elsewhere in the optimizer or selfuncs.c, but you need to
share why you did things that way in the code.

The idea is that, since partial sort fetches data group by group, it has to
fetch more data than a fully presorted path would.
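
A small standalone sketch of that idea, under the assumption of uniformly
sized groups. It is not the derivation used in the patch, and the function
names are hypothetical. A fully presorted path can stop after "limit" tuples,
while a partial sort has to read every group that contributes at least one
row in full.

```c
#include <assert.h>
#include <stdint.h>

/* Tuples a fully presorted path must read to return "limit" rows. */
static uint64_t
tuples_fetched_presorted(uint64_t tuples, uint64_t limit)
{
	return limit < tuples ? limit : tuples;
}

/*
 * Tuples a partial sort must read to return "limit" rows when the input
 * consists of groups of "group_tuples" rows each: every group that
 * contributes at least one row has to be read completely.
 */
static uint64_t
tuples_fetched_partial(uint64_t tuples, uint64_t group_tuples, uint64_t limit)
{
	uint64_t	groups_needed;
	uint64_t	fetched;

	assert(group_tuples >= 1);
	groups_needed = (limit + group_tuples - 1) / group_tuples;	/* ceil */
	fetched = groups_needed * group_tuples;

	return fetched < tuples ? fetched : tuples;
}
```

With 1,000,000 tuples in groups of 100 and LIMIT 10, the presorted path reads
10 tuples and the partial sort reads 100, so the input cost fraction applied
to the two paths has to differ.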

* Within planner.c, "partial_sort_path" variable name in
grouping_planner() might be called something else.

Its purpose isn't clear. Also, the way that you mix path costs from
the new get_cheapest_fractional_path_for_pathkeys() into the new
cost_sort() needs to be explained in detail (as I already said,
cost_sort() is currently very under-documented).

I've tried to make it clearer. partial_sort_path has been renamed
to presorted_path.

Obviously the optimizer part of this needs the most work -- no
surprises there. I wonder if we cover all useful cases? I haven't yet
got around to using "#define OPTIMIZER_DEBUG" to see what's really
going on, which seems essential to understanding what is really
happening, but it looks like merge append paths can currently use the
optimization, whereas unique paths cannot. Have you thought about
that?

Unique paths can use this optimization in some cases.

# create table test as (select id, random() as v from
generate_series(1,1000000) id);
# create index test_v_idx on test(v);

# explain select distinct v, id from test;
QUERY PLAN
----------------------------------------------------------------------------------------------
Unique (cost=0.47..55104.41 rows=1000000 width=12)
-> Sort (cost=0.47..50104.41 rows=1000000 width=12)
Sort Key: v, id
Presorted Key: v
-> Index Scan using test_v_idx on test (cost=0.42..47604.41
rows=1000000 width=12)
(5 rows)

# explain select distinct id, v from test;
QUERY PLAN
---------------------------------------------------------------------------
Unique (cost=132154.34..139654.34 rows=1000000 width=12)
-> Sort (cost=132154.34..134654.34 rows=1000000 width=12)
Sort Key: id, v
-> Seq Scan on test (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

But it depends on attribute order. I could handle this case as well, but I
would prefer to get a simple version committed first. I have already thrown
away the merge join optimization for the sake of simplicity.
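
The dependence on attribute order comes from the fact that only a common
prefix of the two orderings is usable. A minimal standalone sketch, with
column names as plain strings instead of pathkeys (common_prefix_len() is a
hypothetical name and is not the patch's pathkeys_common()):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count how many leading columns of the required ordering are already
 * provided, in the same positions, by the input ordering.
 */
static size_t
common_prefix_len(const char *const *required, size_t nrequired,
				  const char *const *provided, size_t nprovided)
{
	size_t		n = 0;

	assert(required != NULL && provided != NULL);

	while (n < nrequired && n < nprovided &&
		   strcmp(required[n], provided[n]) == 0)
		n++;

	return n;
}
```

With an index on (v), the ordering (v, id) has a common prefix of length 1,
so only id has to be sorted within each group, while (id, v) has a common
prefix of length 0 and needs a full sort.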

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

partial-sort-basic-6.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index ee13136..f5621df
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_grouping_set_keys(PlanS
*** 89,95 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 89,95 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** show_sort_keys(SortState *sortstate, Lis
*** 1750,1756 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1750,1756 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_merge_append_keys(MergeAppendState 
*** 1766,1772 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1766,1772 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1790,1796 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1790,1796 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 1846,1852 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 1846,1852 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 1903,1909 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 1903,1909 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 1916,1928 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 1916,1929 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 1962,1970 ****
--- 1963,1975 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2112,2123 ****
--- 2117,2137 ----
  			appendStringInfoSpaces(es->str, es->indent * 2);
  			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
  							 sortMethod, spaceType, spaceUsed);
+ 			if (sortstate->skipKeys)
+ 			{
+ 				appendStringInfoSpaces(es->str, es->indent * 2);
+ 				appendStringInfo(es->str, "Sort groups: %ld\n",
+ 								 sortstate->groupsCount);
+ 			}
  		}
  		else
  		{
  			ExplainPropertyText("Sort Method", sortMethod, es);
  			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
  			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			if (sortstate->skipKeys)
+ 				ExplainPropertyLong("Sort groups: %ld",
+ 									sortstate->groupsCount, es);
  		}
  	}
  }
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 35864c1..951ea69
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
*************** ExecRestrPos(PlanState *node)
*** 383,389 ****
   * know which plan types support mark/restore.
   */
  bool
! ExecSupportsMarkRestore(Path *pathnode)
  {
  	/*
  	 * For consistency with the routines above, we do not examine the nodeTag
--- 383,389 ----
   * know which plan types support mark/restore.
   */
  bool
! ExecSupportsMarkRestore(Path *pathnode, Plan *node)
  {
  	/*
  	 * For consistency with the routines above, we do not examine the nodeTag
*************** ExecSupportsMarkRestore(Path *pathnode)
*** 395,403 ****
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
- 		case T_Sort:
  			return true;
  
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
--- 395,411 ----
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
  			return true;
  
+ 		case T_Sort:
+ 			/* We shouldn't reach here without having plan node */
+ 			Assert(node);
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (node && ((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
*************** ExecSupportsBackwardScan(Plan *node)
*** 508,517 ****
  			return false;
  
  		case T_Material:
- 		case T_Sort:
  			/* these don't evaluate tlist */
  			return true;
  
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
--- 516,531 ----
  			return false;
  
  		case T_Material:
  			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
*************** IndexSupportsBackwardScan(Oid indexid)
*** 572,578 ****
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype)
  {
  	switch (plantype)
  	{
--- 586,592 ----
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype, Plan *node)
  {
  	switch (plantype)
  	{
*************** ExecMaterializesOutput(NodeTag plantype)
*** 580,588 ****
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
- 		case T_Sort:
  			return true;
  
  		default:
  			break;
  	}
--- 594,610 ----
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
  			return true;
  
+ 		case T_Sort:
+ 			/* We shouldn't reach here without having plan node */
+ 			Assert(node);
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (node && ((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		default:
  			break;
  	}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index 03aa20f..c22610c
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 556,561 ****
--- 556,562 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 634,640 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 635,641 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index a34dcc5..d9d0f61
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,112 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
+ #include "utils/lsyscache.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(Sort *plannode, SortState *node)
+ {
+ 	int skipCols = plannode->skipCols, i;
+ 
+ 	node->skipKeys = (SkipKeyData *)palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 											plannode->sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 129,139 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+ 	PlanState  *outerNode;
+ 	TupleDesc	tupDesc;
+ 	int			skipCols = plannode->skipCols;
+ 	int64		nTuples = 0;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,132 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
  
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols,
! 											  plannode->sortColIdx,
! 											  plannode->sortOperators,
! 											  plannode->collations,
! 											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
! 		{
! 			slot = ExecProcNode(outerNode);
  
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
  	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
--- 146,300 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
! 	if (node->tuplesortstate != NULL)
! 	{
! 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
! 		node->groupsCount++;
! 	}
! 	else
! 	{
! 		/* Support structures for cmpSortSkipCols - already sorted columns */
! 		if (skipCols)
! 			prepareSkipCols(plannode, node);
  
+ 		/*
+ 		 * Only pass on remaining columns that are unsorted.  Skip abbreviated
+ 		 * keys usage for partial sort.  We unlikely will have huge groups
+ 		 * with partial sort.  Therefore usage of abbreviated keys would be
+ 		 * likely a waste of time.
+ 		 */
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols - skipCols,
! 											  &(plannode->sortColIdx[skipCols]),
! 											  &(plannode->sortOperators[skipCols]),
! 											  &(plannode->collations[skipCols]),
! 											  &(plannode->nullsFirst[skipCols]),
  											  work_mem,
! 											  node->randomAccess,
! 											  skipCols > 0 ? true : false);
  		node->tuplesortstate = (void *) tuplesortstate;
+ 		node->groupsCount++;
+ 	}
  
! 	if (node->bounded)
! 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
  
! 	/*
! 	 * Put next group of tuples where skipCols sort values are equal to
! 	 * tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
+ 		if (skipCols == 0)
+ 		{
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
+ 			nTuples++;
  		}
+ 		else if (node->prev)
+ 		{
+ 			/* Put previous tuple into tuplesort */
+ 			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
+ 			nTuples++;
  
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				bool cmp;
! 				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
! 				node->prev = ExecCopySlotTuple(slot);
! 				if (!cmp)
! 					break;
! 			}
! 		}
! 		else
! 		{
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				node->prev = ExecCopySlotTuple(slot);
! 			}
! 		}
! 	}
  
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
  
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 
! 	/*
! 	 * Adjust bound_Done with number of tuples we've actually sorted.
! 	 */
! 	if (node->bounded)
! 	{
! 		if (node->finished)
! 			node->bound_Done = node->bound;
! 		else
! 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
  	}
  
+ 	SO1_printf("ExecSort: %s\n", "sorting done");
+ 
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
*************** ExecInitSort(Sort *node, EState *estate,
*** 157,162 ****
--- 325,339 ----
  			   "initializing sort node");
  
  	/*
+ 	 * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+ 	 * or EXEC_FLAG_MARK, because we hold only current bucket in
+ 	 * tuplesortstate.
+ 	 */
+ 	Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+ 											 EXEC_FLAG_BACKWARD |
+ 											 EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
  	 * create state structure
  	 */
  	sortstate = makeNode(SortState);
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 351,362 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
+ 	sortstate->bound_Done = 0;
+ 	sortstate->groupsCount = 0;
+ 	sortstate->skipKeys = NULL;
  
  	/*
  	 * Miscellaneous initialization
*************** ExecReScanSort(SortState *node)
*** 318,323 ****
--- 500,506 ----
  		node->sort_Done = false;
  		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
  		node->tuplesortstate = NULL;
+ 		node->bound_Done = 0;
  
  		/*
  		 * if chgParam of subnode is not null then plan will be re-scanned by
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index a9e9cc3..86b9c01
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 830,835 ****
--- 830,836 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index 85acce8..def5520
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outSort(StringInfo str, const Sort *nod
*** 798,803 ****
--- 798,804 ----
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
+ 	WRITE_INT_FIELD(skipCols);
  
  	appendStringInfoString(str, " :sortColIdx");
  	for (i = 0; i < node->numCols; i++)
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index e6e6f29..aadde14
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readSort(void)
*** 1961,1966 ****
--- 1961,1967 ----
  	ReadCommonPlan(&local_node->plan);
  
  	READ_INT_FIELD(numCols);
+ 	READ_INT_FIELD(skipCols);
  	READ_ATTRNUMBER_ARRAY(sortColIdx, local_node->numCols);
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 5fc2f9c..42c9c64
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Plan *runion, Plan 
*** 1414,1419 ****
--- 1414,1426 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * Sort could be either full sort of relation or partial sort when we already
+  * have data presorted by some of required pathkeys.  In the second case
+  * we estimate number of groups which source data is divided to by presorted
+  * pathkeys.  And then estimate cost of sorting each individual group assuming
+  * data is divided into group uniformly.  Also, if LIMIT is specified then
+  * we have to pull from source and sort only some of total groups.
+  *
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Plan *runion, Plan 
*** 1440,1446 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1447,1454 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Plan *runion, Plan 
*** 1456,1470 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1464,1485 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1494,1506 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1509,1558 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups into which presorted keys divide the data.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group in which the presorted
! 	 * keys are equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1510,1516 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1562,1568 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1521,1530 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1573,1582 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1532,1545 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1584,1609 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
+ 	/* Add per group cost of fetching tuples from input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group before the node can start its output.
+ 	 * Sorting the rest of the groups is required to return the other tuples.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2284,2289 ****
--- 2348,2355 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2310,2315 ****
--- 2376,2383 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
*************** final_cost_mergejoin(PlannerInfo *root, 
*** 2521,2527 ****
  	 * it off does not entitle us to deliver an invalid plan.
  	 */
  	else if (innersortkeys == NIL &&
! 			 !ExecSupportsMarkRestore(inner_path))
  		path->materialize_inner = true;
  
  	/*
--- 2589,2595 ----
  	 * it off does not entitle us to deliver an invalid plan.
  	 */
  	else if (innersortkeys == NIL &&
! 			 !ExecSupportsMarkRestore(inner_path, NULL))
  		path->materialize_inner = true;
  
  	/*
*************** cost_subplan(PlannerInfo *root, SubPlan 
*** 3044,3050 ****
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan)))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
--- 3112,3118 ----
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan), plan))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
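As a reading aid for the cost_sort() changes above, here is a minimal standalone sketch of how the patch appears to split the sort cost between startup and run cost. It is not the patch's code: SketchSortCost and sketch_partial_sort_cost() are hypothetical names, the per-group cost is passed in rather than derived from the quicksort/heapsort/disk formulas, and the per-tuple output charge that cost_sort() adds afterwards is left out.

```c
#include <assert.h>

/* Hypothetical container for the two cost components; not a PostgreSQL type. */
typedef struct SketchSortCost
{
	double		startup;	/* cost paid before the first tuple is returned */
	double		run;		/* cost of sorting the groups after the first one */
} SketchSortCost;

/*
 * group_cost is assumed to already include the per-group share of the
 * input run cost, as in the patch (input_run_cost / num_groups).
 */
SketchSortCost
sketch_partial_sort_cost(double input_startup_cost, double group_cost,
						 double num_groups, double tuples,
						 double output_tuples)
{
	SketchSortCost cost;
	double		rest_cost;

	assert(num_groups >= 1.0 && tuples > 0.0);

	/* Only the first group has to be sorted before output can start. */
	cost.startup = input_startup_cost + group_cost;
	cost.run = 0.0;

	/* Groups needed beyond the first one, scaled by the output fraction. */
	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
	if (rest_cost > 0.0)
		cost.run += rest_cost;

	return cost;
}
```

With num_groups = 1 (no presorted keys) the whole sort lands in the startup cost, which matches the behaviour of the unpatched function.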
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
new file mode 100644
index 3b898da..6cdd6ea
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** match_unsorted_outer(PlannerInfo *root,
*** 889,895 ****
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
--- 889,895 ----
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype, NULL))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index eed39b9..4ae1309
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 26,31 ****
--- 26,32 ----
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 309,314 ****
--- 310,341 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 368,375 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 395,406 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Return NULL if no such path.
!  *	  If the pathkeys are only partially satisfied, a partial sort is needed
!  *	  to satisfy them completely.  Since partial sort consumes data by
!  *	  presorted groups, more data has to be consumed than in the case of a
!  *	  fully presorted path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 378,409 ****
   * 'pathkeys' represents a required ordering (in canonical form!)
   * 'required_outer' denotes allowable outer relations for parameterized paths
   * 'fraction' is the fraction of the total tuples expected to be retrieved
   */
  Path *
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction)
  {
  	Path	   *matched_path = NULL;
  	ListCell   *l;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
  
  		/*
! 		 * Since cost comparison is a lot cheaper than pathkey comparison, do
! 		 * that first.  (XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
! 			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
  
--- 409,480 ----
   * 'pathkeys' represents a required ordering (in canonical form!)
   * 'required_outer' denotes allowable outer relations for parameterized paths
   * 'fraction' is the fraction of the total tuples expected to be retrieved
+  * 'num_groups' is an array in which element i - 1 estimates the number of
+  *	  groups formed by the first i pathkeys, see estimate_pathkeys_groups().
   */
  Path *
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  double *num_groups)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
+ 	double		matched_fraction;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
+ 		double		current_fraction;
+ 
+ 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+ 
+ 		if (n_pathkeys != 0 && n_common_pathkeys == 0)
+ 			continue;
  
  		/*
! 		 * Partial sort consumes data not per tuple but per presorted group.
! 		 * Increase fraction of tuples we have to read from source path by
! 		 * one presorted group.
  		 */
! 		current_fraction = fraction;
! 		if (n_common_pathkeys < n_pathkeys)
! 		{
! 			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
! 			current_fraction = Min(current_fraction, 1.0);
! 		}
  
! 		/*
! 		 * Do cost comparison assuming paths could have different number
! 		 * of required pathkeys and therefore different fraction of tuples
! 		 * to fetch.
! 		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_bifractional_path_costs(matched_path, path,
! 					matched_fraction, current_fraction);
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
! 
! 		/*
! 		 * Cheaper path with matching outer becomes a new leader.
! 		 */
! 		if (costs_cmp > 0 &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 			matched_fraction = current_fraction;
+ 		}
  	}
+ 
  	return matched_path;
  }
  
*************** right_merge_direction(PlannerInfo *root,
*** 1447,1455 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
--- 1518,1525 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of leading pathkeys that match the requested ordering.
!  * The remaining ones can be satisfied by a partial sort.
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
*************** pathkeys_useful_for_ordering(PlannerInfo
*** 1460,1472 ****
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
! 	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
! 	}
! 
! 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1530,1541 ----
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	/*
! 	 * Return the number of path keys in common, or 0 if there are none. Any
! 	 * common leading pathkeys can be useful for ordering because we can use
! 	 * partial sort.
! 	 */
! 	return pathkeys_common(root->query_pathkeys, pathkeys);
  }
  
  /*
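To make the prefix rule behind pathkeys_common() concrete, here is a small standalone sketch over plain arrays. It assumes, as the patch does, that canonical pathkeys can be compared by pointer identity; sketch_common_prefix() is a hypothetical helper, not a planner function, and it uses arrays in place of PostgreSQL's List.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Length of the longest common prefix of two key arrays.  Keys are compared
 * by pointer identity, mirroring the "pathkey1 != pathkey2" test in the patch.
 */
int
sketch_common_prefix(const void *const *keys1, size_t n1,
					 const void *const *keys2, size_t n2)
{
	size_t		n = 0;

	assert(keys1 != NULL || n1 == 0);
	assert(keys2 != NULL || n2 == 0);

	/* Stop at the first mismatch or when the shorter array is exhausted. */
	while (n < n1 && n < n2 && keys1[n] == keys2[n])
		n++;

	return (int) n;
}
```

A result equal to the length of the required key list means no sort is needed, a result of 0 means a full sort, and anything in between is the case the partial sort is meant for.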
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 198b06b..3da150d
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 163,168 ****
--- 163,169 ----
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
  static Sort *make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples);
*************** create_merge_append_plan(PlannerInfo *ro
*** 815,820 ****
--- 816,822 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		subplan = create_plan_recurse(root, subpath);
*************** create_merge_append_plan(PlannerInfo *ro
*** 848,855 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
--- 850,859 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  			subplan = (Plan *) make_sort(root, subplan, numsortkeys,
+ 										 pathkeys, n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst,
  										 best_path->limit_tuples);
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2469,2477 ****
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									outer_plan,
! 									best_path->outersortkeys,
! 									-1.0);
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
--- 2473,2483 ----
  		disuse_physical_tlist(root, outer_plan, best_path->jpath.outerjoinpath);
  		outer_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								outer_plan,
! 								best_path->outersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys));
  		outerpathkeys = best_path->outersortkeys;
  	}
  	else
*************** create_mergejoin_plan(PlannerInfo *root,
*** 2482,2490 ****
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 									inner_plan,
! 									best_path->innersortkeys,
! 									-1.0);
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
--- 2488,2498 ----
  		disuse_physical_tlist(root, inner_plan, best_path->jpath.innerjoinpath);
  		inner_plan = (Plan *)
  			make_sort_from_pathkeys(root,
! 								inner_plan,
! 								best_path->innersortkeys,
! 								-1.0,
! 								pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys));
  		innerpathkeys = best_path->innersortkeys;
  	}
  	else
*************** make_mergejoin(List *tlist,
*** 4059,4064 ****
--- 4067,4073 ----
   */
  static Sort *
  make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
+ 		  List *pathkeys, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst,
  		  double limit_tuples)
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 4068,4074 ****
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4077,4084 ----
  	Path		sort_path;		/* dummy for result of cost_sort */
  
  	copy_plan_costsize(plan, lefttree); /* only care about copying size */
! 	cost_sort(&sort_path, root, pathkeys, skipCols,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_sort(PlannerInfo *root, Plan *leftt
*** 4082,4087 ****
--- 4092,4098 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 4410,4416 ****
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 4421,4427 ----
   */
  Sort *
  make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
! 						double limit_tuples, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(PlannerInfo *roo
*** 4430,4436 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
--- 4441,4447 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, limit_tuples);
  }
*************** make_sort_from_sortclauses(PlannerInfo *
*** 4473,4479 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4484,4490 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, NIL, 0,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
*************** Sort *
*** 4495,4501 ****
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 4506,4513 ----
  make_sort_from_groupcols(PlannerInfo *root,
  						 List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 List *pathkeys, int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(PlannerInfo *ro
*** 4528,4534 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
--- 4540,4546 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(root, lefttree, numsortkeys, pathkeys, skipCols,
  					 sortColIdx, sortOperators, collations,
  					 nullsFirst, -1.0);
  }
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 373e6cc..b68443f
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 43,48 ****
--- 43,49 ----
  #include "parser/parsetree.h"
  #include "parser/parse_clause.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
*************** build_minmax_path(PlannerInfo *root, Min
*** 409,414 ****
--- 410,416 ----
  	Path	   *sorted_path;
  	Cost		path_cost;
  	double		path_fraction;
+ 	double	   *psort_num_groups;
  
  	/*----------
  	 * Generate modified query of the form
*************** build_minmax_path(PlannerInfo *root, Min
*** 500,510 ****
  	else
  		path_fraction = 1.0;
  
  	sorted_path =
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction);
  	if (!sorted_path)
  		return false;
  
--- 502,516 ----
  	else
  		path_fraction = 1.0;
  
+ 	psort_num_groups = estimate_pathkeys_groups(subroot->query_pathkeys,
+ 												subroot,
+ 												final_rel->rows);
  	sorted_path =
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction,
! 												  psort_num_groups);
  	if (!sorted_path)
  		return false;
  
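The fraction adjustment that get_cheapest_fractional_path_for_pathkeys() applies to partially sorted paths can be sketched in isolation as follows. This is an illustration under assumptions: sketch_adjusted_fraction() is a hypothetical name, and num_groups is taken to be indexed by prefix length minus one, which is how the patch reads it (num_groups[n_common_pathkeys - 1]).

```c
#include <assert.h>

/*
 * A partial sort reads its input one presorted group at a time, so a path
 * that matches only a prefix of the required keys is charged for one extra
 * group on top of the requested fraction, capped at 1.0.
 */
double
sketch_adjusted_fraction(double fraction, int n_common, int n_required,
						 const double *num_groups)
{
	assert(n_common >= 0 && n_common <= n_required);

	/*
	 * Paths with no common keys are skipped by the patch before this step,
	 * so the guard below only protects the array index.
	 */
	if (n_common > 0 && n_common < n_required)
	{
		fraction += 1.0 / num_groups[n_common - 1];
		if (fraction > 1.0)
			fraction = 1.0;
	}

	return fraction;
}
```

The fewer groups the common prefix produces, the larger each group is and the more a partially sorted path is penalised relative to a fully sorted one.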
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 65b99e2..23d0aa4
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** static Plan *build_grouping_chain(Planne
*** 141,147 ****
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan);
  
  /*****************************************************************************
   *
--- 141,149 ----
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan,
! 					 List *path_keys,
! 					 int n_common_pathkeys);
  
  /*****************************************************************************
   *
*************** grouping_planner(PlannerInfo *root, doub
*** 1481,1486 ****
--- 1483,1489 ----
  		Path	   *cheapest_path;
  		Path	   *sorted_path;
  		Path	   *best_path;
+ 		double	   *psort_num_groups;
  
  		MemSet(&agg_costs, 0, sizeof(AggClauseCosts));
  
*************** grouping_planner(PlannerInfo *root, doub
*** 1815,1825 ****
  		 */
  		cheapest_path = final_rel->cheapest_total_path;
  
  		sorted_path =
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
--- 1818,1832 ----
  		 */
  		cheapest_path = final_rel->cheapest_total_path;
  
+ 		psort_num_groups = estimate_pathkeys_groups(root->query_pathkeys,
+ 													root,
+ 													path_rows);
  		sorted_path =
  			get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  													  root->query_pathkeys,
  													  NULL,
! 													  tuple_fraction,
! 													  psort_num_groups);
  
  		/* Don't consider same path in both guises; just wastes effort */
  		if (sorted_path == cheapest_path)
*************** grouping_planner(PlannerInfo *root, doub
*** 1834,1844 ****
  		 */
  		if (sorted_path)
  		{
! 			Path		sort_path;		/* dummy for result of cost_sort */
  
  			if (root->query_pathkeys == NIL ||
! 				pathkeys_contained_in(root->query_pathkeys,
! 									  cheapest_path->pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
--- 1841,1860 ----
  		 */
  		if (sorted_path)
  		{
! 			/* dummy for result of cost_sort */
! 			Path		sort_path;
! 			/*
! 			 * dummy for original cost of fully presorted path or
! 			 * result of cost_sort in case of partial sort
! 			 */
! 			Path		presorted_path;
! 			int			n_common_pathkeys;
! 
! 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
! 												cheapest_path->pathkeys);
  
  			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
  			{
  				/* No sort needed for cheapest path */
  				sort_path.startup_cost = cheapest_path->startup_cost;
*************** grouping_planner(PlannerInfo *root, doub
*** 1848,1859 ****
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
  						  cheapest_path->total_cost,
  						  path_rows, cheapest_path->pathtarget->width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			if (compare_fractional_path_costs(sorted_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
--- 1864,1904 ----
  			{
  				/* Figure cost for sorting */
  				cost_sort(&sort_path, root, root->query_pathkeys,
+ 						  n_common_pathkeys,
+ 						  cheapest_path->startup_cost,
  						  cheapest_path->total_cost,
  						  path_rows, cheapest_path->pathtarget->width,
  						  0.0, work_mem, root->limit_tuples);
  			}
  
! 			n_common_pathkeys = pathkeys_common(root->query_pathkeys,
! 												sorted_path->pathkeys);
! 
! 			if (root->query_pathkeys == NIL ||
! 					n_common_pathkeys == list_length(root->query_pathkeys))
! 			{
! 				/*
! 				 * The presorted path fully matches the query pathkeys.
! 				 * No partial sort is needed.
! 				 */
! 				presorted_path.startup_cost = sorted_path->startup_cost;
! 				presorted_path.total_cost = sorted_path->total_cost;
! 			}
! 			else
! 			{
! 				/*
! 				 * Figure the cost of sorting when the presorted path only
! 				 * partially matches the query pathkeys.
! 				 */
! 				cost_sort(&presorted_path, root, root->query_pathkeys,
! 						  n_common_pathkeys,
! 						  sorted_path->startup_cost,
! 						  sorted_path->total_cost,
! 						  path_rows, sorted_path->pathtarget->width,
! 						  0.0, work_mem, root->limit_tuples);
! 			}
! 
! 			if (compare_fractional_path_costs(&presorted_path, &sort_path,
  											  tuple_fraction) > 0)
  			{
  				/* Presorted path is a loser */
*************** grouping_planner(PlannerInfo *root, doub
*** 1950,1962 ****
  			AttrNumber *groupColIdx = NULL;
  			bool		need_tlist_eval = true;
  			bool		need_sort_for_grouping = false;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
  			if (parse->groupClause && !use_hashed_grouping &&
! 			  !pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  				need_sort_for_grouping = true;
  
  			/*
--- 1995,2010 ----
  			AttrNumber *groupColIdx = NULL;
  			bool		need_tlist_eval = true;
  			bool		need_sort_for_grouping = false;
+ 			int			n_common_pathkeys_grouping;
  
  			result_plan = create_plan(root, best_path);
  			current_pathkeys = best_path->pathkeys;
  
  			/* Detect if we'll need an explicit sort for grouping */
+ 			n_common_pathkeys_grouping = pathkeys_common(root->group_pathkeys,
+ 														 current_pathkeys);
  			if (parse->groupClause && !use_hashed_grouping &&
! 				n_common_pathkeys_grouping < list_length(root->group_pathkeys))
  				need_sort_for_grouping = true;
  
  			/*
*************** grouping_planner(PlannerInfo *root, doub
*** 2090,2096 ****
  												   groupColIdx,
  												   &agg_costs,
  												   numGroups,
! 												   result_plan);
  			}
  			else if (parse->groupClause)
  			{
--- 2138,2146 ----
  												   groupColIdx,
  												   &agg_costs,
  												   numGroups,
! 												   result_plan,
! 												   root->group_pathkeys,
! 												   n_common_pathkeys_grouping);
  			}
  			else if (parse->groupClause)
  			{
*************** grouping_planner(PlannerInfo *root, doub
*** 2107,2113 ****
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan);
  					current_pathkeys = root->group_pathkeys;
  				}
  
--- 2157,2165 ----
  						make_sort_from_groupcols(root,
  												 parse->groupClause,
  												 groupColIdx,
! 												 result_plan,
! 												 root->group_pathkeys,
! 												 n_common_pathkeys_grouping);
  					current_pathkeys = root->group_pathkeys;
  				}
  
*************** grouping_planner(PlannerInfo *root, doub
*** 2245,2257 ****
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0);
! 					if (!pathkeys_contained_in(window_pathkeys,
! 											   current_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
--- 2297,2313 ----
  				if (window_pathkeys)
  				{
  					Sort	   *sort_plan;
+ 					int			n_common_pathkeys;
+ 
+ 					n_common_pathkeys = pathkeys_common(window_pathkeys,
+ 													    current_pathkeys);
  
  					sort_plan = make_sort_from_pathkeys(root,
  														result_plan,
  														window_pathkeys,
! 														-1.0,
! 														n_common_pathkeys);
! 					if (n_common_pathkeys < list_length(window_pathkeys))
  					{
  						/* we do indeed need to sort */
  						result_plan = (Plan *) sort_plan;
*************** grouping_planner(PlannerInfo *root, doub
*** 2401,2419 ****
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					current_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					current_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 current_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															current_pathkeys,
! 															   -1.0);
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
--- 2457,2477 ----
  			{
  				if (list_length(root->distinct_pathkeys) >=
  					list_length(root->sort_pathkeys))
! 					needed_pathkeys = root->distinct_pathkeys;
  				else
  				{
! 					needed_pathkeys = root->sort_pathkeys;
  					/* Assert checks that parser didn't mess up... */
  					Assert(pathkeys_contained_in(root->distinct_pathkeys,
! 												 needed_pathkeys));
  				}
  
  				result_plan = (Plan *) make_sort_from_pathkeys(root,
  															   result_plan,
! 															   needed_pathkeys,
! 															   -1.0,
! 							pathkeys_common(needed_pathkeys, current_pathkeys));
! 				current_pathkeys = needed_pathkeys;
  			}
  
  			result_plan = (Plan *) make_unique(result_plan,
*************** grouping_planner(PlannerInfo *root, doub
*** 2429,2440 ****
  	 */
  	if (parse->sortClause)
  	{
! 		if (!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
--- 2487,2501 ----
  	 */
  	if (parse->sortClause)
  	{
! 		int common = pathkeys_common(root->sort_pathkeys, current_pathkeys);
! 		
! 		if (common < list_length(root->sort_pathkeys))
  		{
  			result_plan = (Plan *) make_sort_from_pathkeys(root,
  														   result_plan,
  														 root->sort_pathkeys,
! 														   limit_tuples,
! 														   common);
  			current_pathkeys = root->sort_pathkeys;
  		}
  	}
*************** build_grouping_chain(PlannerInfo *root,
*** 2536,2542 ****
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan)
  {
  	AttrNumber *top_grpColIdx = groupColIdx;
  	List	   *chain = NIL;
--- 2597,2605 ----
  					 AttrNumber *groupColIdx,
  					 AggClauseCosts *agg_costs,
  					 long numGroups,
! 					 Plan *result_plan,
! 					 List *path_keys,
! 					 int n_common_pathkeys)
  {
  	AttrNumber *top_grpColIdx = groupColIdx;
  	List	   *chain = NIL;
*************** build_grouping_chain(PlannerInfo *root,
*** 2557,2563 ****
  			make_sort_from_groupcols(root,
  									 llast(rollup_groupclauses),
  									 top_grpColIdx,
! 									 result_plan);
  	}
  
  	/*
--- 2620,2628 ----
  			make_sort_from_groupcols(root,
  									 llast(rollup_groupclauses),
  									 top_grpColIdx,
! 									 result_plan,
! 									 path_keys,
! 									 n_common_pathkeys);
  	}
  
  	/*
*************** build_grouping_chain(PlannerInfo *root,
*** 2588,2594 ****
  				make_sort_from_groupcols(root,
  										 groupClause,
  										 new_grpColIdx,
! 										 result_plan);
  
  			/*
  			 * sort_plan includes the cost of result_plan, which is not what
--- 2653,2661 ----
  				make_sort_from_groupcols(root,
  										 groupClause,
  										 new_grpColIdx,
! 										 result_plan,
! 										 NIL,
! 										 0);
  
  			/*
  			 * sort_plan includes the cost of result_plan, which is not what
*************** choose_hashed_grouping(PlannerInfo *root
*** 3859,3864 ****
--- 3926,3932 ----
  	List	   *current_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  	int			sorted_p_width;
  
  	/*
*************** choose_hashed_grouping(PlannerInfo *root
*** 3942,3948 ****
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
  				  dNumGroups, cheapest_path->pathtarget->width,
  				  0.0, work_mem, limit_tuples);
  
--- 4010,4017 ----
  			 path_rows);
  	/* Result of hashed agg is always unsorted */
  	if (target_pathkeys)
! 		cost_sort(&hashed_p, root, target_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumGroups, cheapest_path->pathtarget->width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_grouping(PlannerInfo *root
*** 3960,3968 ****
  		sorted_p_width = cheapest_path->pathtarget->width;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 	if (!pathkeys_contained_in(root->group_pathkeys, current_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys, sorted_p.total_cost,
  				  path_rows, sorted_p_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
--- 4029,4040 ----
  		sorted_p_width = cheapest_path->pathtarget->width;
  		current_pathkeys = cheapest_path->pathkeys;
  	}
! 
! 	n_common_pathkeys = pathkeys_common(root->group_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(root->group_pathkeys))
  	{
! 		cost_sort(&sorted_p, root, root->group_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, sorted_p_width,
  				  0.0, work_mem, -1.0);
  		current_pathkeys = root->group_pathkeys;
*************** choose_hashed_grouping(PlannerInfo *root
*** 3977,3986 ****
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
  	/* The Agg or Group node will preserve ordering */
! 	if (target_pathkeys &&
! 		!pathkeys_contained_in(target_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, sorted_p.total_cost,
  				  dNumGroups, sorted_p_width,
  				  0.0, work_mem, limit_tuples);
  
--- 4049,4060 ----
  		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
  				   sorted_p.startup_cost, sorted_p.total_cost,
  				   path_rows);
+ 
  	/* The Agg or Group node will preserve ordering */
! 	n_common_pathkeys = pathkeys_common(target_pathkeys, current_pathkeys);
! 	if (target_pathkeys && n_common_pathkeys < list_length(target_pathkeys))
! 		cost_sort(&sorted_p, root, target_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumGroups, sorted_p_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 4035,4040 ****
--- 4109,4115 ----
  	List	   *needed_pathkeys;
  	Path		hashed_p;
  	Path		sorted_p;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * If we have a sortable DISTINCT ON clause, we always use sorting. This
*************** choose_hashed_distinct(PlannerInfo *root
*** 4101,4107 ****
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, hashed_p.total_cost,
  				  dNumDistinctRows, cheapest_path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 4176,4183 ----
  	 * need to charge for the final sort.
  	 */
  	if (parse->sortClause)
! 		cost_sort(&hashed_p, root, root->sort_pathkeys, 0,
! 				  hashed_p.startup_cost, hashed_p.total_cost,
  				  dNumDistinctRows, cheapest_path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** choose_hashed_distinct(PlannerInfo *root
*** 4118,4140 ****
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 	if (!pathkeys_contained_in(needed_pathkeys, current_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys, sorted_p.total_cost,
  				  path_rows, sorted_path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
  	if (parse->sortClause &&
! 		!pathkeys_contained_in(root->sort_pathkeys, current_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, sorted_p.total_cost,
  				  dNumDistinctRows, sorted_path_width,
  				  0.0, work_mem, limit_tuples);
  
--- 4194,4223 ----
  		needed_pathkeys = root->sort_pathkeys;
  	else
  		needed_pathkeys = root->distinct_pathkeys;
! 
! 	n_common_pathkeys = pathkeys_common(needed_pathkeys, current_pathkeys);
! 	if (n_common_pathkeys < list_length(needed_pathkeys))
  	{
  		if (list_length(root->distinct_pathkeys) >=
  			list_length(root->sort_pathkeys))
  			current_pathkeys = root->distinct_pathkeys;
  		else
  			current_pathkeys = root->sort_pathkeys;
! 		cost_sort(&sorted_p, root, current_pathkeys,
! 				  n_common_pathkeys, sorted_p.startup_cost, sorted_p.total_cost,
  				  path_rows, sorted_path_width,
  				  0.0, work_mem, -1.0);
  	}
  	cost_group(&sorted_p, root, numDistinctCols, dNumDistinctRows,
  			   sorted_p.startup_cost, sorted_p.total_cost,
  			   path_rows);
+ 
+ 
+ 	n_common_pathkeys = pathkeys_common(root->sort_pathkeys, current_pathkeys);
  	if (parse->sortClause &&
! 		n_common_pathkeys < list_length(root->sort_pathkeys))
! 		cost_sort(&sorted_p, root, root->sort_pathkeys, n_common_pathkeys,
! 				  sorted_p.startup_cost, sorted_p.total_cost,
  				  dNumDistinctRows, sorted_path_width,
  				  0.0, work_mem, limit_tuples);
  
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 4924,4931 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget.width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 5007,5015 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget.width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 31db35c..66e82ec
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** build_subplan(PlannerInfo *root, Plan *p
*** 823,829 ****
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan)))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
--- 823,829 ----
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan), plan))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index e509a1a..24caf36
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 865,871 ****
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 865,872 ----
  	sorted_p.startup_cost = input_plan->startup_cost;
  	sorted_p.total_cost = input_plan->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_plan->plan_rows, input_plan->plan_width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 9417587..0418406
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** compare_fractional_path_costs(Path *path
*** 124,129 ****
--- 124,170 ----
  }
  
  /*
+  * compare_bifractional_path_costs
+  *	  Return -1, 0, or +1 according as fetching the fraction1 tuples of path1 is
+  *	  cheaper, the same cost, or more expensive than fetching fraction2 tuples
+  *	  of path2.
+  *
+  * fraction1 and fraction2 are fractions of total tuples between 0 and 1.
+  * If fraction is <= 0 or > 1, we interpret it as 1, ie, we select the
+  * path with the cheaper total_cost.
+  */
+ 
+ /*
+  * Compare the costs of two paths, assuming that a different fraction of
+  * tuples is returned from each path.
+  */
+ int
+ compare_bifractional_path_costs(Path *path1, Path *path2,
+ 								double fraction1, double fraction2)
+ {
+ 	Cost		cost1,
+ 				cost2;
+ 
+ 	if (fraction1 <= 0.0 || fraction1 >= 1.0)
+ 		fraction1 = 1.0;
+ 	if (fraction2 <= 0.0 || fraction2 >= 1.0)
+ 		fraction2 = 1.0;
+ 
+ 	if (fraction1 == 1.0 && fraction2 == 1.0)
+ 		return compare_path_costs(path1, path2, TOTAL_COST);
+ 
+ 	cost1 = path1->startup_cost +
+ 		fraction1 * (path1->total_cost - path1->startup_cost);
+ 	cost2 = path2->startup_cost +
+ 		fraction2 * (path2->total_cost - path2->startup_cost);
+ 	if (cost1 < cost2)
+ 		return -1;
+ 	if (cost1 > cost2)
+ 		return +1;
+ 	return 0;
+ }
+ 
+ /*
   * compare_path_costs_fuzzily
   *	  Compare the costs of two paths to see if either can be said to
   *	  dominate the other.
*************** create_merge_append_path(PlannerInfo *ro
*** 1278,1289 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1319,1331 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1297,1302 ****
--- 1339,1346 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1528,1534 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1572,1579 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index fe44d56..8d1717c
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 276,282 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 276,282 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 46c95b0..9df7d1e
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3463,3468 ****
--- 3463,3504 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups into which the
+  * 							  dataset is divided by pathkeys.
+  *
+  * Returns an array of group counts.  Element i (0-based) is the number of
+  * groups into which the first i+1 pathkeys divide the dataset.  This is a
+  * convenience wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucketsize fraction (ie, number of entries in a bucket
   * divided by total tuples in relation) if the specified expression is used
   * as a hash key.
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 67d86ed..6260e44
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 614,620 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
--- 614,621 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 662,668 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 663,669 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_end(Tuplesortstate *state)
*** 1076,1081 ****
--- 1077,1102 ----
  	MemoryContextDelete(state->sortcontext);
  }
  
+ void
+ tuplesort_reset(Tuplesortstate *state)
+ {
+ 	int i;
+ 
+ 	if (state->tapeset)
+ 		LogicalTapeSetClose(state->tapeset);
+ 
+ 	for (i = 0; i < state->memtupcount; i++)
+ 		free_sort_tuple(state, state->memtuples + i);
+ 
+ 	state->status = TSS_INITIAL;
+ 	state->memtupcount = 0;
+ 	state->boundUsed = false;
+ 	state->tapeset = NULL;
+ 	state->currentRun = 0;
+ 	state->result_tape = -1;
+ 	state->bounded = false;
+ }
+ 
  /*
   * Grow the memtuples[] array, if possible within our memory constraint.  We
   * must not exceed INT_MAX tuples in memory or the caller-provided memory
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
new file mode 100644
index 1a44085..0075be5
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
*************** struct Path;					/* avoid including rela
*** 104,112 ****
  extern void ExecReScan(PlanState *node);
  extern void ExecMarkPos(PlanState *node);
  extern void ExecRestrPos(PlanState *node);
! extern bool ExecSupportsMarkRestore(struct Path *pathnode);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype);
  
  /*
   * prototypes from functions in execCurrent.c
--- 104,112 ----
  extern void ExecReScan(PlanState *node);
  extern void ExecMarkPos(PlanState *node);
  extern void ExecRestrPos(PlanState *node);
! extern bool ExecSupportsMarkRestore(struct Path *pathnode, Plan *node);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype, Plan *node);
  
  /*
   * prototypes from functions in execCurrent.c
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index 064a050..c3c1692
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1801,1806 ****
--- 1801,1813 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ typedef struct SkipKeyData
+ {
+ 	FunctionCallInfoData	fcinfo;
+ 	FmgrInfo				flinfo;
+ 	OffsetNumber			attno;
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1812,1820 ****
--- 1819,1832 ----
  	bool		bounded;		/* is the result set bounded? */
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* is fetching tuples from the outer
+ 								   node finished? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
  	int64		bound_Done;		/* value of bound we did the sort with */
+ 	long		groupsCount;
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;
+ 	HeapTuple	prev;			/* previous tuple from outer node */
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index ae224cf..2fa20ed
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 680,685 ****
--- 680,686 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 78c7cae..e7ae3ea
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 95,102 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 95,103 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
new file mode 100644
index f479981..e698e2c
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
*************** extern int compare_path_costs(Path *path
*** 24,29 ****
--- 24,31 ----
  				   CostSelector criterion);
  extern int compare_fractional_path_costs(Path *path1, Path *path2,
  							  double fraction);
+ extern int compare_bifractional_path_costs(Path *path1, Path *path2,
+ 							  double fraction1, double fraction2);
  extern void set_cheapest(RelOptInfo *parent_rel);
  extern void add_path(RelOptInfo *parent_rel, Path *new_path);
  extern bool add_path_precheck(RelOptInfo *parent_rel,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 20474c3..cd117ca
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 169,181 ****
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--- 169,183 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  double *num_groups);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
new file mode 100644
index eaa642b..c8ff0bf
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*************** extern RecursiveUnion *make_recursive_un
*** 61,71 ****
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
--- 61,72 ----
  					 Plan *lefttree, Plan *righttree, int wtParam,
  					 List *distinctList, long numGroups);
  extern Sort *make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree,
! 						List *pathkeys, double limit_tuples, int skipCols);
  extern Sort *make_sort_from_sortclauses(PlannerInfo *root, List *sortcls,
  						   Plan *lefttree);
  extern Sort *make_sort_from_groupcols(PlannerInfo *root, List *groupcls,
! 						 AttrNumber *grpColIdx, Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  extern Agg *make_agg(PlannerInfo *root, List *tlist, List *qual,
  		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
  		 int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 06fbca7..3ee58ed
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 188,193 ****
--- 188,196 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
  						 double nbuckets);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5cecd6d..6476504
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 62,69 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index e434c5d..86a15a1
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t
*** 897,911 ****
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                       QUERY PLAN                       
! -------------------------------------------------------
!  HashAggregate
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Merge Join
!          Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!          ->  Index Scan using t1_pkey on t1
!          ->  Index Scan using t2_pkey on t2
! (6 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
--- 897,914 ----
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                          QUERY PLAN                          
! -------------------------------------------------------------
!  Group
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Sort
!          Sort Key: t1.a, t1.b, t2.x, t2.z
!          Presorted Key: t1.a, t1.b
!          ->  Merge Join
!                Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!                ->  Index Scan using t1_pkey on t1
!                ->  Index Scan using t2_pkey on t2
! (9 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 89b6c1c..3c2b0ad
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** SELECT thousand, thousand FROM tenk1
*** 1354,1366 ****
  ORDER BY thousand, tenthous;
                                 QUERY PLAN                                
  -------------------------------------------------------------------------
!  Merge Append
     Sort Key: tenk1.thousand, tenk1.tenthous
!    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
!    ->  Sort
!          Sort Key: tenk1_1.thousand, tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (6 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
--- 1354,1367 ----
  ORDER BY thousand, tenthous;
                                 QUERY PLAN                                
  -------------------------------------------------------------------------
!  Sort
     Sort Key: tenk1.thousand, tenk1.tenthous
!    Presorted Key: tenk1.thousand
!    ->  Merge Append
!          Sort Key: tenk1.thousand
!          ->  Index Only Scan using tenk1_thous_tenthous on tenk1
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (7 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
*************** SELECT x, y FROM
*** 1436,1450 ****
     UNION ALL
     SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
  ORDER BY x, y;
!                          QUERY PLAN                          
! -------------------------------------------------------------
!  Merge Append
     Sort Key: a.thousand, a.tenthous
!    ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
!    ->  Sort
!          Sort Key: b.unique2, b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (6 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
--- 1437,1452 ----
     UNION ALL
     SELECT unique2 AS x, unique2 AS y FROM tenk1 b) s
  ORDER BY x, y;
!                             QUERY PLAN                             
! -------------------------------------------------------------------
!  Sort
     Sort Key: a.thousand, a.tenthous
!    Presorted Key: a.thousand
!    ->  Merge Append
!          Sort Key: a.thousand
!          ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (7 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
#79Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#78)
1 attachment(s)
Re: PoC: Partial sort

Hi!

Tom committed the upper planner pathification patch.
The partial sort patch, rebased to master, is attached. The rebase was quite
large in the planner part of the patch, but I think the patch is now better
and much more logical.
Something was probably missed during the rebase. I'm going to examine
this patch more carefully next week.
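
For reference, the core idea can be sketched in plain C outside the executor. This is only an illustration under simplifying assumptions: the `Row` type and the `partial_sort` function are hypothetical names and are not part of the patch, and the whole input sits in an array rather than being pulled tuple by tuple from a subplan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row type for illustration; not a structure from the patch. */
typedef struct Row
{
	int		v1;		/* presorted column: input arrives ordered by it */
	double	v2;		/* column still to be sorted within each v1 group */
} Row;

/* qsort comparator on the remaining (not presorted) column only */
static int
cmp_v2(const void *a, const void *b)
{
	double	x = ((const Row *) a)->v2;
	double	y = ((const Row *) b)->v2;

	return (x > y) - (x < y);
}

/*
 * Sort rows by (v1, v2), assuming they are already ordered by v1.
 * Each run of equal v1 values is sorted on its own.
 * Returns the number of groups that were sorted.
 */
static size_t
partial_sort(Row *rows, size_t nrows)
{
	size_t	start = 0;
	size_t	ngroups = 0;

	assert(rows != NULL || nrows == 0);

	while (start < nrows)
	{
		size_t	end = start + 1;

		/* extend the group while the presorted column is unchanged */
		while (end < nrows && rows[end].v1 == rows[start].v1)
			end++;

		qsort(rows + start, end - start, sizeof(Row), cmp_v2);
		ngroups++;
		start = end;
	}
	return ngroups;
}
```

With a LIMIT, a loop like this could stop as soon as enough rows have been produced, so only the leading groups would need to be read and sorted.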

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

partial-sort-basic-7.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index ee13136..f5621df
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_grouping_set_keys(PlanS
*** 89,95 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 89,95 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** show_sort_keys(SortState *sortstate, Lis
*** 1750,1756 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1750,1756 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_merge_append_keys(MergeAppendState 
*** 1766,1772 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1766,1772 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1790,1796 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1790,1796 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 1846,1852 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 1846,1852 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 1903,1909 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 1903,1909 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 1916,1928 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 1916,1929 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 1962,1970 ****
--- 1963,1975 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2112,2123 ****
--- 2117,2137 ----
  			appendStringInfoSpaces(es->str, es->indent * 2);
  			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
  							 sortMethod, spaceType, spaceUsed);
+ 			if (sortstate->skipKeys)
+ 			{
+ 				appendStringInfoSpaces(es->str, es->indent * 2);
+ 				appendStringInfo(es->str, "Sort groups: %ld\n",
+ 								 sortstate->groupsCount);
+ 			}
  		}
  		else
  		{
  			ExplainPropertyText("Sort Method", sortMethod, es);
  			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
  			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			if (sortstate->skipKeys)
+ 				ExplainPropertyLong("Sort Groups",
+ 									sortstate->groupsCount, es);
  		}
  	}
  }
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 0c8e939..eaf54d7
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
*************** ExecSupportsMarkRestore(Path *pathnode)
*** 395,403 ****
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
- 		case T_Sort:
  			return true;
  
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
--- 395,409 ----
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (((SortPath *)pathnode)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
*************** ExecSupportsBackwardScan(Plan *node)
*** 511,520 ****
  			return false;
  
  		case T_Material:
- 		case T_Sort:
  			/* these don't evaluate tlist */
  			return true;
  
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
--- 517,532 ----
  			return false;
  
  		case T_Material:
  			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
*************** IndexSupportsBackwardScan(Oid indexid)
*** 575,581 ****
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype)
  {
  	switch (plantype)
  	{
--- 587,593 ----
   * very low per-tuple cost.
   */
  bool
! ExecMaterializesOutput(NodeTag plantype, Plan *node)
  {
  	switch (plantype)
  	{
*************** ExecMaterializesOutput(NodeTag plantype)
*** 583,591 ****
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
- 		case T_Sort:
  			return true;
  
  		default:
  			break;
  	}
--- 595,611 ----
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
  			return true;
  
+ 		case T_Sort:
+ 			/* We shouldn't reach here without having plan node */
+ 			Assert(node);
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (node && ((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		default:
  			break;
  	}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index 03aa20f..c22610c
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 556,561 ****
--- 556,562 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 634,640 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 635,641 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index a34dcc5..d9d0f61
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,112 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
+ #include "utils/lsyscache.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(Sort *plannode, SortState *node)
+ {
+ 	int skipCols = plannode->skipCols, i;
+ 
+ 	node->skipKeys = (SkipKeyData *)palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 											plannode->sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 129,139 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+ 	PlanState  *outerNode;
+ 	TupleDesc	tupDesc;
+ 	int			skipCols = plannode->skipCols;
+ 	int64		nTuples = 0;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,132 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
  
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols,
! 											  plannode->sortColIdx,
! 											  plannode->sortOperators,
! 											  plannode->collations,
! 											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
! 		{
! 			slot = ExecProcNode(outerNode);
  
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
  	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
--- 146,300 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
! 	if (node->tuplesortstate != NULL)
! 	{
! 		tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
! 		node->groupsCount++;
! 	}
! 	else
! 	{
! 		/* Support structures for cmpSortSkipCols - already sorted columns */
! 		if (skipCols)
! 			prepareSkipCols(plannode, node);
  
+ 		/*
+ 		 * Only pass on remaining columns that are unsorted.  Skip abbreviated
+ 		 * keys usage for partial sort.  We are unlikely to have huge groups
+ 		 * with partial sort, so using abbreviated keys would likely be a
+ 		 * waste of time.
+ 		 */
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
! 											  plannode->numCols - skipCols,
! 											  &(plannode->sortColIdx[skipCols]),
! 											  &(plannode->sortOperators[skipCols]),
! 											  &(plannode->collations[skipCols]),
! 											  &(plannode->nullsFirst[skipCols]),
  											  work_mem,
! 											  node->randomAccess,
! 											  skipCols > 0 ? true : false);
  		node->tuplesortstate = (void *) tuplesortstate;
+ 		node->groupsCount++;
+ 	}
  
! 	if (node->bounded)
! 		tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
  
! 	/*
! 	 * Put next group of tuples where skipCols sort values are equal to
! 	 * tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
+ 		if (skipCols == 0)
+ 		{
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
+ 			nTuples++;
  		}
+ 		else if (node->prev)
+ 		{
+ 			/* Put previous tuple into tuplesort */
+ 			ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 			tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
+ 			nTuples++;
  
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				bool cmp;
! 				cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
! 				node->prev = ExecCopySlotTuple(slot);
! 				if (!cmp)
! 					break;
! 			}
! 		}
! 		else
! 		{
! 			if (TupIsNull(slot))
! 			{
! 				node->finished = true;
! 				break;
! 			}
! 			else
! 			{
! 				node->prev = ExecCopySlotTuple(slot);
! 			}
! 		}
! 	}
  
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
  
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 
! 	/*
! 	 * Adjust bound_Done with number of tuples we've actually sorted.
! 	 */
! 	if (node->bounded)
! 	{
! 		if (node->finished)
! 			node->bound_Done = node->bound;
! 		else
! 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
  	}
  
+ 	SO1_printf("ExecSort: %s\n", "sorting done");
+ 
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
*************** ExecInitSort(Sort *node, EState *estate,
*** 157,162 ****
--- 325,339 ----
  			   "initializing sort node");
  
  	/*
+ 	 * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+ 	 * or EXEC_FLAG_MARK, because we hold only current bucket in
+ 	 * tuplesortstate.
+ 	 */
+ 	Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+ 											 EXEC_FLAG_BACKWARD |
+ 											 EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
  	 * create state structure
  	 */
  	sortstate = makeNode(SortState);
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 351,362 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
+ 	sortstate->bound_Done = 0;
+ 	sortstate->groupsCount = 0;
+ 	sortstate->skipKeys = NULL;
  
  	/*
  	 * Miscellaneous initialization
*************** ExecReScanSort(SortState *node)
*** 318,323 ****
--- 500,506 ----
  		node->sort_Done = false;
  		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
  		node->tuplesortstate = NULL;
+ 		node->bound_Done = 0;
  
  		/*
  		 * if chgParam of subnode is not null then plan will be re-scanned by
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index df7c2fa..054d117
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 830,835 ****
--- 830,836 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index eb0fc1e..7374046
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outSort(StringInfo str, const Sort *nod
*** 796,801 ****
--- 796,802 ----
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
+ 	WRITE_INT_FIELD(skipCols);
  
  	appendStringInfoString(str, " :sortColIdx");
  	for (i = 0; i < node->numCols; i++)
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index a2c2243..f188893
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readSort(void)
*** 1961,1966 ****
--- 1961,1967 ----
  	ReadCommonPlan(&local_node->plan);
  
  	READ_INT_FIELD(numCols);
+ 	READ_INT_FIELD(skipCols);
  	READ_ATTRNUMBER_ARRAY(sortColIdx, local_node->numCols);
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 5350329..2e13492
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Path *runion, Path 
*** 1412,1417 ****
--- 1412,1424 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort can be either a full sort of the relation or a partial sort when
+  * the data is already presorted by some of the required pathkeys.  In the
+  * second case we estimate the number of groups into which the presorted
+  * pathkeys divide the source data, then estimate the cost of sorting each
+  * group assuming the data is divided into groups uniformly.  Also, if LIMIT
+  * is specified, we have to pull from the source and sort only some groups.
+  *
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1438,1444 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1445,1452 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1454,1468 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1462,1483 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1492,1504 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1507,1556 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups into which presorted keys divide the dataset.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate average cost of sorting of one group where presorted keys are
! 	 * equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1508,1514 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1560,1566 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1519,1528 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1571,1580 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1530,1543 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1582,1607 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
+ 	/* Add per group cost of fetching tuples from input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group to start output from the node.  Sorting
+ 	 * the rest of the groups is required to return all the other tuples.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2282,2287 ****
--- 2346,2353 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2308,2313 ****
--- 2374,2381 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
*************** cost_subplan(PlannerInfo *root, SubPlan 
*** 3042,3048 ****
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan)))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
--- 3110,3116 ----
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan), plan))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
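
Note on the cost_sort() hunk above: the split between startup and run cost can be sketched in isolation. This is only an illustration under stated assumptions, not code from the patch. The struct and function names are hypothetical, and group_cost is taken as an input rather than derived from comparison_cost and the input run cost as the patch does.

```c
#include <assert.h>

/* Hypothetical holder for the two cost components; not a PostgreSQL type. */
typedef struct PartialSortCost
{
	double		startup;	/* paid before the first tuple is returned */
	double		run;		/* paid while returning the remaining tuples */
} PartialSortCost;

/*
 * Mirror the arithmetic of the hunk: one group is sorted up front, and only
 * the groups needed to produce output_tuples are charged to the run cost.
 */
static PartialSortCost
split_partial_sort_cost(double group_cost, double num_groups,
						double output_tuples, double tuples)
{
	PartialSortCost c = {0.0, 0.0};
	double		rest_cost;

	assert(tuples > 0.0 && num_groups > 0.0);

	c.startup = group_cost;
	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
	if (rest_cost > 0.0)
		c.run = rest_cost;
	return c;
}
```

With a small LIMIT the expected number of groups to read falls below one, rest_cost goes negative, and nothing is added to the run cost. That matches the cheap LIMIT 10 case described at the start of the thread.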
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
new file mode 100644
index 3b898da..6cdd6ea
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** match_unsorted_outer(PlannerInfo *root,
*** 889,895 ****
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
--- 889,895 ----
  		 * output anyway.
  		 */
  		if (enable_material && inner_cheapest_total != NULL &&
! 			!ExecMaterializesOutput(inner_cheapest_total->pathtype, NULL))
  			matpath = (Path *)
  				create_material_path(innerrel, inner_cheapest_total);
  	}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 4436ac1..d60c421
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 26,31 ****
--- 26,32 ----
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 309,314 ****
--- 310,341 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 368,375 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 395,406 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Return NULL if no such path.
!  *	  If the pathkeys are only partially satisfied, a partial sort is needed
!  *	  to satisfy them completely.  Since partial sort consumes its input one
!  *	  presorted group at a time, it has to read more data than it would from
!  *	  a fully presorted path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 378,409 ****
   * 'pathkeys' represents a required ordering (in canonical form!)
   * 'required_outer' denotes allowable outer relations for parameterized paths
   * 'fraction' is the fraction of the total tuples expected to be retrieved
   */
  Path *
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction)
  {
  	Path	   *matched_path = NULL;
  	ListCell   *l;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
  
  		/*
! 		 * Since cost comparison is a lot cheaper than pathkey comparison, do
! 		 * that first.  (XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
! 			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
  
--- 409,480 ----
   * 'pathkeys' represents a required ordering (in canonical form!)
   * 'required_outer' denotes allowable outer relations for parameterized paths
   * 'fraction' is the fraction of the total tuples expected to be retrieved
+  * 'num_groups' is an array of group counts, one per pathkeys prefix.  It
+  *	  should be estimated using estimate_pathkeys_groups().
   */
  Path *
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  double *num_groups)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
+ 	double		matched_fraction;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
+ 		double		current_fraction;
+ 
+ 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+ 
+ 		if (n_pathkeys != 0 && n_common_pathkeys == 0)
+ 			continue;
  
  		/*
! 		 * Partial sort consumes data not per tuple but per presorted group.
! 		 * Increase fraction of tuples we have to read from source path by
! 		 * one presorted group.
  		 */
! 		current_fraction = fraction;
! 		if (n_common_pathkeys < n_pathkeys)
! 		{
! 			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
! 			current_fraction = Min(current_fraction, 1.0);
! 		}
  
! 		/*
! 		 * Do cost comparison assuming paths could have different number
! 		 * of required pathkeys and therefore different fraction of tuples
! 		 * to fetch.
! 		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_bifractional_path_costs(matched_path, path,
! 					matched_fraction, current_fraction);
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
! 
! 		/*
! 		 * Cheaper path with matching outer becomes a new leader.
! 		 */
! 		if (costs_cmp > 0 &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 			matched_fraction = current_fraction;
+ 		}
  	}
+ 
  	return matched_path;
  }
  
*************** right_merge_direction(PlannerInfo *root,
*** 1448,1456 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
--- 1519,1526 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of leading pathkeys that match the requested ordering.
!  * The remaining ones can be satisfied by a partial sort.
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
*************** pathkeys_useful_for_ordering(PlannerInfo
*** 1461,1473 ****
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
! 	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
! 	}
! 
! 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1531,1542 ----
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	/*
! 	 * Return the number of leading pathkeys in common, or 0 if there are
! 	 * none.  Any common prefix is useful for ordering, because a partial
! 	 * sort can supply the rest.
! 	 */
! 	return pathkeys_common(root->query_pathkeys, pathkeys);
  }
  
  /*
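
Note on the pathkeys.c changes above: the two ideas they combine are the length of the common pathkeys prefix and the extra presorted group a partial sort has to read. A minimal standalone sketch follows, assuming keys can be compared by pointer identity as the patch does for canonical PathKey pointers. The function names are hypothetical. Unlike the patch, which skips paths with no common prefix, the sketch returns the fraction unchanged for them.

```c
#include <assert.h>

/* Length of the longest common prefix of two arrays of key pointers. */
static int
common_prefix_len(const void *const *keys1, int n1,
				  const void *const *keys2, int n2)
{
	int			n = 0;

	while (n < n1 && n < n2 && keys1[n] == keys2[n])
		n++;
	return n;
}

/*
 * Fraction of the input a path must deliver.  num_groups[i] is assumed to
 * hold the estimated number of groups formed by the first i + 1 keys.
 */
static double
adjusted_fraction(double fraction, int n_common, int n_keys,
				  const double *num_groups)
{
	assert(n_common >= 0 && n_common <= n_keys);

	/* Fully presorted input (or no usable prefix): nothing to add. */
	if (n_common == 0 || n_common == n_keys)
		return fraction;

	/* Partial sort reads whole groups, so add one group and clamp to 1. */
	fraction += 1.0 / num_groups[n_common - 1];
	return fraction < 1.0 ? fraction : 1.0;
}
```

For example, with four groups on the first key, a request for a quarter of the tuples becomes a request for half of the input once the extra group is included.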
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 913ac84..94b01e8
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 226,232 ****
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 226,232 ----
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 241,250 ****
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 241,252 ----
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_merge_append_plan(PlannerInfo *ro
*** 1031,1036 ****
--- 1033,1039 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1065,1073 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1068,1078 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1465,1470 ****
--- 1470,1476 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1474,1480 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1480,1490 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1721,1727 ****
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
--- 1731,1738 ----
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan,
! 										 0);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3571,3578 ****
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3582,3595 ----
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3583,3590 ****
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3600,3613 ----
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4602,4608 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4625,4632 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5123,5129 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
--- 5147,5153 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
*************** make_sort(Plan *lefttree, int numCols,
*** 5135,5140 ****
--- 5159,5165 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5461,5467 ****
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5486,5492 ----
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5481,5487 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5506,5512 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5524,5530 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5549,5555 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5545,5551 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5570,5577 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5578,5584 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5604,5610 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index cefec7b..75f3f29
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
*************** build_minmax_path(PlannerInfo *root, Min
*** 341,346 ****
--- 342,348 ----
  	Path	   *sorted_path;
  	Cost		path_cost;
  	double		path_fraction;
+ 	double	   *psort_num_groups;
  
  	/*
  	 * We are going to construct what is effectively a sub-SELECT query, so
*************** build_minmax_path(PlannerInfo *root, Min
*** 451,461 ****
  	else
  		path_fraction = 1.0;
  
  	sorted_path =
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction);
  	if (!sorted_path)
  		return false;
  
--- 453,467 ----
  	else
  		path_fraction = 1.0;
  
+ 	psort_num_groups = estimate_pathkeys_groups(subroot->query_pathkeys,
+ 												subroot,
+ 												final_rel->rows);
  	sorted_path =
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction,
! 												  psort_num_groups);
  	if (!sorted_path)
  		return false;
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 8afac0b..13e9737
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3243,3256 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3243,3256 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_common_pathkeys;
  
! 			n_common_pathkeys = pathkeys_common(root->group_pathkeys,
! 												path->pathkeys);
! 			if (path == cheapest_path || n_common_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_common_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 3751,3763 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 3751,3763 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_common_pathkeys;
  
! 		n_common_pathkeys = pathkeys_common(root->sort_pathkeys,
! 											path->pathkeys);
! 		if (path == cheapest_input_path || n_common_pathkeys > 0)
  		{
! 			if (n_common_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 4549,4556 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget.width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 4549,4557 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget.width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 1ff4302..d7542e9
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** build_subplan(PlannerInfo *root, Plan *p
*** 837,843 ****
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan)))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
--- 837,843 ----
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan), plan))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index 6ea3319..456e8df
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 954,960 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 954,961 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 6e79800..48b23c4
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** compare_fractional_path_costs(Path *path
*** 124,129 ****
--- 124,170 ----
  }
  
  /*
+  * compare_bifractional_path_costs
+  *	  Return -1, 0, or +1 according as fetching the fraction1 tuples of path1 is
+  *	  cheaper, the same cost, or more expensive than fetching fraction2 tuples
+  *	  of path2.
+  *
+  * fraction1 and fraction2 are fractions of total tuples between 0 and 1.
+  * If fraction is <= 0 or > 1, we interpret it as 1, ie, we select the
+  * path with the cheaper total_cost.
+  */
+ 
+ /*
+  * Compare cost of two paths assuming different fractions of tuples be returned
+  * from each paths.
+  */
+ int
+ compare_bifractional_path_costs(Path *path1, Path *path2,
+ 								double fraction1, double fraction2)
+ {
+ 	Cost		cost1,
+ 				cost2;
+ 
+ 	if (fraction1 <= 0.0 || fraction1 >= 1.0)
+ 		fraction1 = 1.0;
+ 	if (fraction2 <= 0.0 || fraction2 >= 1.0)
+ 		fraction2 = 1.0;
+ 
+ 	if (fraction1 == 1.0 && fraction2 == 1.0)
+ 		return compare_path_costs(path1, path2, TOTAL_COST);
+ 
+ 	cost1 = path1->startup_cost +
+ 		fraction1 * (path1->total_cost - path1->startup_cost);
+ 	cost2 = path2->startup_cost +
+ 		fraction2 * (path2->total_cost - path2->startup_cost);
+ 	if (cost1 < cost2)
+ 		return -1;
+ 	if (cost1 > cost2)
+ 		return +1;
+ 	return 0;
+ }
+ 
+ /*
   * compare_path_costs_fuzzily
   *	  Compare the costs of two paths to see if either can be said to
   *	  dominate the other.
*************** create_merge_append_path(PlannerInfo *ro
*** 1278,1289 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1319,1331 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1297,1302 ****
--- 1339,1346 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1533,1539 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1577,1584 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2240,2245 ****
--- 2285,2295 ----
  				 double limit_tuples)
  {
  	SortPath   *pathnode = makeNode(SortPath);
+ 	int			n_common_pathkeys;
+ 
+ 	n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+ 
+ 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
*************** create_sort_path(PlannerInfo *root,
*** 2252,2261 ****
  		subpath->parallel_safe;
  	pathnode->path.parallel_degree = subpath->parallel_degree;
  	pathnode->path.pathkeys = pathkeys;
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2302,2314 ----
  		subpath->parallel_safe;
  	pathnode->path.parallel_degree = subpath->parallel_degree;
  	pathnode->path.pathkeys = pathkeys;
+ 	pathnode->skipCols = n_common_pathkeys;
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2524,2530 ****
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
--- 2577,2584 ----
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL, 0,
! 					  0.0,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
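
Note on compare_bifractional_path_costs() above: the comparison interpolates each path's cost between its startup and total cost, using that path's own fraction. A standalone sketch with a hypothetical SketchPath type is given here as an illustration only. One simplification is assumed: when both fractions are 1 the patch defers to compare_path_costs() with TOTAL_COST, while the sketch simply compares the interpolated costs.

```c
#include <assert.h>

/* Hypothetical stand-in for the two Path fields the comparison uses. */
typedef struct SketchPath
{
	double		startup_cost;
	double		total_cost;
} SketchPath;

/* Cost of fetching the given fraction of a path's tuples. */
static double
fractional_cost(const SketchPath *p, double fraction)
{
	assert(p->total_cost >= p->startup_cost);

	/* Out-of-range fractions mean "fetch everything". */
	if (fraction <= 0.0 || fraction >= 1.0)
		fraction = 1.0;
	return p->startup_cost + fraction * (p->total_cost - p->startup_cost);
}

/* Return -1, 0 or +1 as path1 is cheaper, equal, or more expensive. */
static int
compare_bifractional(const SketchPath *path1, const SketchPath *path2,
					 double fraction1, double fraction2)
{
	double		cost1 = fractional_cost(path1, fraction1);
	double		cost2 = fractional_cost(path2, fraction2);

	return (cost1 > cost2) - (cost1 < cost2);
}
```

A fully presorted path that is expensive overall can therefore lose to a cheaper, partially sorted path even though the latter has to deliver a larger fraction of its tuples.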
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index fe44d56..8d1717c
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 276,282 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 276,282 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index d396ef1..465d2f0
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3464,3469 ****
--- 3464,3505 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups into which the
+  * 							  dataset is divided by pathkeys.
+  *
+  * Returns an array of group counts.  Element i (zero-based) is the number of
+  * groups into which the first i + 1 pathkeys divide the dataset.  This is a
+  * convenience wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucketsize fraction (ie, number of entries in a bucket
   * divided by total tuples in relation) if the specified expression is used
   * as a hash key.
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 67d86ed..6260e44
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 614,620 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
--- 614,621 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 662,668 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 663,669 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_end(Tuplesortstate *state)
*** 1076,1081 ****
--- 1077,1102 ----
  	MemoryContextDelete(state->sortcontext);
  }
  
+ void
+ tuplesort_reset(Tuplesortstate *state)
+ {
+ 	int i;
+ 
+ 	if (state->tapeset)
+ 		LogicalTapeSetClose(state->tapeset);
+ 
+ 	for (i = 0; i < state->memtupcount; i++)
+ 		free_sort_tuple(state, state->memtuples + i);
+ 
+ 	state->status = TSS_INITIAL;
+ 	state->memtupcount = 0;
+ 	state->boundUsed = false;
+ 	state->tapeset = NULL;
+ 	state->currentRun = 0;
+ 	state->result_tape = -1;
+ 	state->bounded = false;
+ }
+ 
  /*
   * Grow the memtuples[] array, if possible within our memory constraint.  We
   * must not exceed INT_MAX tuples in memory or the caller-provided memory
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
new file mode 100644
index 44fac27..5bc4d08
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
*************** extern void ExecMarkPos(PlanState *node)
*** 106,112 ****
  extern void ExecRestrPos(PlanState *node);
  extern bool ExecSupportsMarkRestore(struct Path *pathnode);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype);
  
  /*
   * prototypes from functions in execCurrent.c
--- 106,112 ----
  extern void ExecRestrPos(PlanState *node);
  extern bool ExecSupportsMarkRestore(struct Path *pathnode);
  extern bool ExecSupportsBackwardScan(Plan *node);
! extern bool ExecMaterializesOutput(NodeTag plantype, Plan *node);
  
  /*
   * prototypes from functions in execCurrent.c
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index d35ec81..0a7ba55
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1801,1806 ****
--- 1801,1813 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ typedef struct SkipKeyData
+ {
+ 	FunctionCallInfoData	fcinfo;
+ 	FmgrInfo				flinfo;
+ 	OffsetNumber			attno;
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1812,1820 ****
--- 1819,1832 ----
  	bool		bounded;		/* is the result set bounded? */
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* fetching tuples from outer node
+ 								   is finished ? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
  	int64		bound_Done;		/* value of bound we did the sort with */
+ 	long		groupsCount;
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;
+ 	HeapTuple	prev;			/* previous tuple from outer node */
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index 5961f2c..e640b73
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 680,685 ****
--- 680,686 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 641728b..e7245aa
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1253,1258 ****
--- 1253,1259 ----
  {
  	Path		path;
  	Path	   *subpath;		/* path representing input source */
+ 	int			skipCols;
  } SortPath;
  
  /*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index fea2bb7..f7c0d8b
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 95,102 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 95,103 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
new file mode 100644
index 3007adb..690c566
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
*************** extern int compare_path_costs(Path *path
*** 24,29 ****
--- 24,31 ----
  				   CostSelector criterion);
  extern int compare_fractional_path_costs(Path *path1, Path *path2,
  							  double fraction);
+ extern int compare_bifractional_path_costs(Path *path1, Path *path2,
+ 							  double fraction1, double fraction2);
  extern void set_cheapest(RelOptInfo *parent_rel);
  extern void add_path(RelOptInfo *parent_rel, Path *new_path);
  extern bool add_path_precheck(RelOptInfo *parent_rel,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 2fccc3a..71b2b84
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 166,178 ****
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--- 166,180 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  double *num_groups);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 06fbca7..3ee58ed
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 188,193 ****
--- 188,196 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
  						 double nbuckets);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5cecd6d..6476504
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 62,69 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index 601bdb4..6f3b86b
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t
*** 898,912 ****
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                       QUERY PLAN                       
! -------------------------------------------------------
!  HashAggregate
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Merge Join
!          Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!          ->  Index Scan using t1_pkey on t1
!          ->  Index Scan using t2_pkey on t2
! (6 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
--- 898,915 ----
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                          QUERY PLAN                          
! -------------------------------------------------------------
!  Group
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Sort
!          Sort Key: t1.a, t1.b, t2.z
!          Presorted Key: t1.a, t1.b
!          ->  Merge Join
!                Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!                ->  Index Scan using t1_pkey on t1
!                ->  Index Scan using t2_pkey on t2
! (9 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index 89b6c1c..25ef3cd
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** ORDER BY thousand, tenthous;
*** 1359,1366 ****
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1
     ->  Sort
           Sort Key: tenk1_1.thousand, tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (6 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
--- 1359,1367 ----
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1
     ->  Sort
           Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (7 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
*************** ORDER BY x, y;
*** 1443,1450 ****
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
     ->  Sort
           Sort Key: b.unique2, b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (6 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
--- 1444,1452 ----
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
     ->  Sort
           Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (7 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
#80Peter Geoghegan
pg@heroku.com
In reply to: Alexander Korotkov (#78)
Re: PoC: Partial sort

Hi,

On Tue, Mar 1, 2016 at 7:06 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

I finally went over your review.

I'll respond to your points here. Note that I'm reviewing
"partial-sort-basic-7.patch", which you sent on March 13. I respond
here because this is where you answered my questions (I had no
feedback on "partial-sort-basic-6.patch", which didn't use the new
upper planner pathification stuff, unlike this latest version).

On Wed, Nov 4, 2015 at 4:44 AM, Peter Geoghegan <pg@heroku.com> wrote:

Explain output
-------------------

I think it might be a good idea to also have a "Sort Groups: 2" field
above. That illustrates that you are in fact performing 2 small sorts
per group, which is the reality. As you said, it's good to have this
be high, because then the sort operations don't need to do too many
comparisons, which could be expensive.

I agree with your notes. In the attached version of the patch, the explain
output was revised as you proposed.

Cool.

Sort Method
----------------

Even though the explain analyze above shows "top-N heapsort" as its
sort method, that isn't really true. I actually ran this through a
debugger, which is why the above plan took so long to execute, in case
you wondered. I saw that in practice the first sort executed for the
first group uses a quicksort, while only the second sort (needed for
the second and last group in this example) used a top-N heapsort.

With partial sort we run multiple sorts in the same node. Ideally, we need
to provide some aggregated information over runs.
This situation looks very similar to subplan which is called multiple times.
I checked how it works for now.

Noticed this in nodeSort.c:

+       if (node->tuplesortstate != NULL)
+       {
+               tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+               node->groupsCount++;
+       }
+       else
+       {
+               /* Support structures for cmpSortSkipCols - already sorted columns */
+               if (skipCols)
+                       prepareSkipCols(plannode, node);
+               /*
+                * Only pass on remaining columns that are unsorted.  Skip abbreviated
+                * keys usage for partial sort.  We unlikely will have huge groups
+                * with partial sort.  Therefore usage of abbreviated keys would be
+                * likely a waste of time.
+                */
                tuplesortstate = tuplesort_begin_heap(tupDesc,

You should comment on which case is which, and put the common case (no
skip cols) first. Similarly, the ExecSort() for(;;) should put the
common (non-partial) case first, which it does, followed by the "first
tuple in partial sort" case, with the "second or subsequent partial
sort" case last.

More comments here, please:

+typedef struct SkipKeyData
+{
+ FunctionCallInfoData fcinfo;
+ FmgrInfo flinfo;
+ OffsetNumber attno;
+} SkipKeyData;

(What's SkipKeyData?)

Also want comments for new SortState fields. SortState.prev is a
palloc()'d copy of tuple, which should be directly noted, as it is for
similar aggregate cases, etc.

Should you be more aggressive about freeing memory allocated for
SortState.prev tuples?

The new function cmpSortSkipCols() should say "Special case for
NULL-vs-NULL, else use standard comparison", or something. "Lets
pretend NULL is a value for implementation convenience" cases are
considered the exception, and are always noted as the exception.
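
To illustrate the kind of comparison and comment being asked for, here is a minimal sketch. It is not the patch's cmpSortSkipCols(); SketchDatum and sketch_same_group() are hypothetical stand-ins that use plain ints instead of Datums and operator function calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one column value of a tuple. */
typedef struct SketchDatum
{
    bool    isnull;
    int     value;
} SketchDatum;

/*
 * Return true if two rows agree on their first nkeys (presorted) columns,
 * i.e. they belong to the same partial sort group.
 *
 * Special case for NULL-vs-NULL: two NULLs are treated as equal here, purely
 * for implementation convenience.  Otherwise use the standard comparison.
 */
bool
sketch_same_group(const SketchDatum *a, const SketchDatum *b, size_t nkeys)
{
    size_t  i;

    assert(a != NULL && b != NULL);

    for (i = 0; i < nkeys; i++)
    {
        if (a[i].isnull || b[i].isnull)
        {
            if (a[i].isnull && b[i].isnull)
                continue;       /* NULL vs NULL: same group */
            return false;       /* NULL vs non-NULL: group boundary */
        }
        if (a[i].value != b[i].value)
            return false;       /* standard comparison */
    }
    return true;
}
```

The point of the comment is that the NULL-equals-NULL rule is flagged as the exception, so a reader does not mistake it for ordinary SQL equality.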

In the case of subplan explain analyze gives us just information about last
subplan run. This makes me uneasy. From one side, it's probably OK that
partial sort behaves like subplan while showing information just about last
sort run. From the other side, we need some better solution for that in
general case.

I see what you mean, but I wasn't so much complaining about that, as
complaining about the simple fact that we use a top-N heap sort *at
all*. This feels like the "limit" case is playing with partial sort
sub-sorts in a way that it shouldn't.

I see you have code like this to make this work:

+       /*
+        * Adjust bound_Done with number of tuples we've actually sorted.
+        */
+       if (node->bounded)
+       {
+               if (node->finished)
+                       node->bound_Done = node->bound;
+               else
+                       node->bound_Done = Min(node->bound, node->bound_Done + nTuples);

But, why bother? Why not simply prevent tuplesort.c from ever using
the top-N heapsort method when it is called from nodeSort.c for a
partial sort (probably in the planner)?

Why, at a high level, does it make sense to pass down a limit to *any*
sort operation that makes up a partial sort, even the last? This seems
to be adding complexity without a benefit. A big advantage of top-N
heapsorts is that much less memory could be used, but this patch
already has the memory allocated that belonged to previous performsort
calls (mostly -- certainly has all those tuplesort.c memtuples
throughout, a major user of memory overall). It's not going to be
very good at preventing work, except almost by accident because we
happen to have a limit up to just past the beginning of a smaller
partial sort group. I'd rather use quicksort, which is very versatile.
Top-N sorts make sense when sorting itself is the bottleneck, which it
probably won't be for a partial sort (that's the whole point).
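
As a rough illustration of the quicksort-only approach, the sketch below sorts each presorted group with the C library's qsort() and stops fetching groups once the limit is covered. It is a simplified model over an in-memory array, not nodeSort.c or tuplesort.c; SketchRow and sketch_partial_sort() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct SketchRow
{
    int     v1;     /* presorted column: input arrives ordered by v1 */
    double  v2;     /* column that still needs sorting within a v1 group */
} SketchRow;

/* qsort() comparator on v2 only; v1 is equal within a group. */
static int
sketch_cmp_v2(const void *a, const void *b)
{
    double  x = ((const SketchRow *) a)->v2;
    double  y = ((const SketchRow *) b)->v2;

    return (x > y) - (x < y);
}

/*
 * Bring rows[] into (v1, v2) order, given that it is already ordered by v1.
 * Each group of equal v1 gets its own plain qsort(); no bounded (top-N)
 * sort is used.  The limit only decides when to stop starting new groups.
 * Returns the number of leading rows that are now in final order.
 */
size_t
sketch_partial_sort(SketchRow *rows, size_t nrows, size_t limit)
{
    size_t  start = 0;

    assert(rows != NULL || nrows == 0);

    while (start < nrows && start < limit)
    {
        size_t  end = start + 1;

        /* Find the end of the current group of equal v1 values. */
        while (end < nrows && rows[end].v1 == rows[start].v1)
            end++;

        qsort(rows + start, end - start, sizeof(SketchRow), sketch_cmp_v2);
        start = end;
    }
    return start;
}
```

With a limit of 2 and a first group of three rows, the whole first group is sorted and the later groups are never touched, which is where the saving comes from.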

If the sort method was very likely the same for every performsort
(quicksort), which it otherwise would be, then I'd care way way less
that that could be a bit misleading in EXPLAIN ANALYZE output, because
typically the last one would be "close enough". Although, this isn't
quite like your SubPlan example, because the Sort node isn't reported
as e.g. "SubPlan 1" by EXPLAIN.

I think that this has bugs for external sorts:

+void
+tuplesort_reset(Tuplesortstate *state)
+{
+       int i;
+
+       if (state->tapeset)
+               LogicalTapeSetClose(state->tapeset);
+
+       for (i = 0; i < state->memtupcount; i++)
+               free_sort_tuple(state, state->memtuples + i);
+
+       state->status = TSS_INITIAL;
+       state->memtupcount = 0;
+       state->boundUsed = false;
+       state->tapeset = NULL;
+       state->currentRun = 0;
+       state->result_tape = -1;
+       state->bounded = false;
+}

It's not okay to reset like this, especially not after the recent
commit 0011c0091, which could make this code unacceptably leak memory.
I realize that we really should never use an external sort here, but,
as you know, this is not the point.

So, I want to suggest that you use the regular code to destroy and
recreate a tuplesort in this case. Now, obviously that has some
significant disadvantages -- you want to reuse everything in the
common case when each sort is tiny. But you can still do that for that
very common case.

I think you need to use sortcontext memory context here on general
principle, even if current usage isn't broken by that.

Even if you get this right for external sorts once, it will break
again without anyone noticing until it's too late. Better to not rely
on it staying in sync, and find a way of having the standard
tuplesort.c initialization begin again.

Even though these free_sort_tuple() calls are still needed, you might
also call "MemoryContextReset(state->tuplecontext)" at the end. That
might prevent palloc() fragmentation when groups are of wildly
different sizes. Just an idea.

I don't like that you've added a Plan node argument to
ExecMaterializesOutput() in this function, too.

I don't like this either, but I didn't find a better solution without a
significant rework of the planner.
However, "Upper planner pathification" by Tom Lane seems to include such a
rework. The sort will likely become a separate path node there.
Then ExecMaterializesOutput could read the parameters of the path node.

A tuplesort may be randomAccess, or !randomAccess, as the caller
wishes. It's good for performance if the caller does not want
randomAccess, because then we can do our final merge on-the-fly if
it's an external sort, which helps a lot.

How is this different? ExecMaterializesOutput() seems to be about
whether or not the plan *could* materialize its output in principle,
even though you might well want to make it not do so in specific
cases. So, it's not so much that the new argument is ugly; rather, I
worry that it's wrong to make ExecMaterializesOutput() give a more
specific answer than anticipated by current callers.

Is the difference basically just that a partial sort could be
enormously faster, whereas a !randomAccess conventional sort is nice
to have, but not worth e.g. changing cost_sort() to account for?

You might just make a new function, ExecPlanMaterializesOutput(),
instead. That would call ExecMaterializesOutput() for non-Sort cases.
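
A minimal sketch of that wrapper idea follows. The types are cut-down, hypothetical stand-ins for NodeTag and Plan, and the assumption that a Sort with skipCols > 0 should report "does not materialize" is mine, taken from how the patch appears to use the extra argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for PostgreSQL's NodeTag and Plan. */
typedef enum SketchNodeTag
{
    SKETCH_T_SeqScan,
    SKETCH_T_Material,
    SKETCH_T_Sort
} SketchNodeTag;

typedef struct SketchPlan
{
    SketchNodeTag type;
    int         skipCols;   /* Sort only: > 0 means partial sort */
} SketchPlan;

/* Type-level question: could this node type materialize its output? */
bool
sketch_materializes_output(SketchNodeTag plantype)
{
    switch (plantype)
    {
        case SKETCH_T_Material:
        case SKETCH_T_Sort:
            return true;
        default:
            break;
    }
    return false;
}

/*
 * Plan-level question: does this particular plan node materialize its
 * output?  Only Sort needs a closer look; everything else defers to the
 * type-level function, whose signature stays unchanged.
 */
bool
sketch_plan_materializes_output(const SketchPlan *plan)
{
    assert(plan != NULL);

    if (plan->type == SKETCH_T_Sort)
        return plan->skipCols == 0;     /* assumed: partial sort does not */
    return sketch_materializes_output(plan->type);
}
```

Existing callers that only have a node tag keep getting the "in principle" answer, while callers that hold a plan node can ask the more specific question.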

Optimizer
-------------

* cost_sort() needs way way more comments. Doesn't even mention
indexes. Not worth commenting further on until I know what it's
*supposed* to do, though.

I've added some comments.

Looking at cost_sort() now, it's a bit clearer. I think that you
should make sure that everything is costed as a quicksort, though, if
you accept that we should try and make every small sort done by the
partial sort a quicksort. Which I think is a good idea. The common
case is that groups are small, but the qsort() insertion sort will be
very very fast for that case.

* New loop within get_cheapest_fractional_path_for_pathkeys() requires
far more explanation.

Explain theory behind derivation of compare_bifractional_path_costs()
fraction arguments, please. I think there might be simple heuristics
like this elsewhere in the optimizer or selfuncs.c, but you need to
share why you did things that way in the code.

The idea is that since partial sort fetches data group by group, it would
require fetching more data than a fully presorted path.
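
A back-of-the-envelope version of that argument, as a sketch only: it assumes evenly sized groups and ignores the extra tuple read to detect a group boundary. sketch_tuples_fetched() is a hypothetical helper, not the patch's costing code.

```c
#include <assert.h>

/*
 * Estimate how many input tuples a partial sort must fetch to return
 * limit_tuples rows, assuming the input splits into num_groups groups of
 * equal size.  A fully presorted path would fetch just limit_tuples.
 */
long
sketch_tuples_fetched(long tuples, long num_groups, long limit_tuples)
{
    long    group_size;
    long    groups_needed;
    long    fetched;

    assert(tuples > 0 && num_groups > 0 && num_groups <= tuples);
    assert(limit_tuples >= 0);

    /* Both divisions round up: a partly needed group is read in full. */
    group_size = (tuples + num_groups - 1) / num_groups;
    groups_needed = (limit_tuples + group_size - 1) / group_size;

    fetched = groups_needed * group_size;
    return fetched < tuples ? fetched : tuples;
}
```

For the example table from the start of the thread (1,000,000 rows, roughly 10,000 distinct v1 values, LIMIT 10), this gives about 100 fetched tuples for partial sort versus 10 for a path already ordered by both columns, so the two paths have to be compared at different fractions of their total cost.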

I think I get it.

* Within planner.c, "partial_sort_path" variable name in
grouping_planner() might be called something else.

Its purpose isn't clear. Also, the way that you mix path costs from
the new get_cheapest_fractional_path_for_pathkeys() into the new
cost_sort() needs to be explained in detail (as I already said,
cost_sort() is currently very under-documented).

I've tried to make it more clear. partial_sort_path is renamed to
presorted_path.

Unique paths occasionally can use this optimization.

But it depends on attribute order. I could work out this case, but I would
prefer to get a simpler case committed first. I already threw away the merge
join optimization for the sake of simplicity.

I think that was the right decision under our time constraints.
However, I suggest noting that this should happen for unique paths in
the future, say within create_unique_path().

Other notes:

This looks like an old change you missed:

- * compare_path_fractional_costs
+ * compare_fractional_path_costs

All in all, this looks significantly better. Thanks for your work on
this. Sorry for the delay in my response, and that my review was
relatively rushed, but I'm rather busy at the moment with fighting
fires.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#81David Steele
david@pgmasters.net
In reply to: Peter Geoghegan (#80)
Re: PoC: Partial sort

Hi Alexander,

On 3/23/16 8:39 PM, Peter Geoghegan wrote:

This looks like an old change you missed:

- * compare_path_fractional_costs
+ * compare_fractional_path_costs

All in all, this looks significantly better. Thanks for your work on
this. Sorry for the delay in my response, and that my review was
relatively rushed, but I'm rather busy at the moment with fighting
fires.

It looks like a new patch is required before this can be marked "ready
for committer". Will you have that ready soon?

Thanks,
--
-David
david@pgmasters.net


#82Alexander Korotkov
aekorotkov@gmail.com
In reply to: David Steele (#81)
Re: PoC: Partial sort

On Tue, Mar 29, 2016 at 4:56 PM, David Steele <david@pgmasters.net> wrote:

On 3/23/16 8:39 PM, Peter Geoghegan wrote:

This looks like an old change you missed:

- * compare_path_fractional_costs
+ * compare_fractional_path_costs

All in all, this looks significantly better. Thanks for your work on
this. Sorry for the delay in my response, and that my review was
relatively rushed, but I'm rather busy at the moment with fighting
fires.

It looks like a new patch is required before this can be marked "ready for
committer". Will you have that ready soon?

Yes, that's right. I'm working on it now and will post it by tomorrow.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#83Alexander Korotkov
aekorotkov@gmail.com
In reply to: Peter Geoghegan (#80)
1 attachment(s)
Re: PoC: Partial sort

Hi, Peter!

Thank you for the review!

On Thu, Mar 24, 2016 at 3:39 AM, Peter Geoghegan <pg@heroku.com> wrote:

Sort Method
----------------

Even though the explain analyze above shows "top-N heapsort" as its
sort method, that isn't really true. I actually ran this through a
debugger, which is why the above plan took so long to execute, in case
you wondered. I saw that in practice the first sort executed for the
first group uses a quicksort, while only the second sort (needed for
the second and last group in this example) used a top-N heapsort.

With partial sort we run multiple sorts in the same node. Ideally, we need
to provide some aggregated information over runs.
This situation looks very similar to subplan which is called multiple times.
I checked how it works for now.

Noticed this in nodeSort.c:

+       if (node->tuplesortstate != NULL)
+       {
+               tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+               node->groupsCount++;
+       }
+       else
+       {
+               /* Support structures for cmpSortSkipCols - already sorted columns */
+               if (skipCols)
+                       prepareSkipCols(plannode, node);
+               /*
+                * Only pass on remaining columns that are unsorted.  Skip abbreviated
+                * keys usage for partial sort.  We unlikely will have huge groups
+                * with partial sort.  Therefore usage of abbreviated keys would be
+                * likely a waste of time.
+                */
tuplesortstate = tuplesort_begin_heap(tupDesc,

You should comment on which case is which, and put the common case (no
skip cols) first. Similarly, the ExecSort() for(;;) should put the
common (non-partial) case first, which it does, followed by the "first
tuple in partial sort" case, with the "second or subsequent partial
sort" case last.

Done.

More comments here, please:

+typedef struct SkipKeyData
+{
+ FunctionCallInfoData fcinfo;
+ FmgrInfo flinfo;
+ OffsetNumber attno;
+} SkipKeyData;

(What's SkipKeyData?)

Also want comments for new SortState fields.

Done.

SortState.prev is a
palloc()'d copy of tuple, which should be directly noted, as it is for
similar aggregate cases, etc.

Should you be more aggressive about freeing memory allocated for
SortState.prev tuples?

Fixed.

The new function cmpSortSkipCols() should say "Special case for
NULL-vs-NULL, else use standard comparison", or something. "Lets
pretend NULL is a value for implementation convenience" cases are
considered the exception, and are always noted as the exception.

Comment is added.

In the case of subplan explain analyze gives us just information about last
subplan run. This makes me uneasy. From one side, it's probably OK that
partial sort behaves like subplan while showing information just about last
sort run. From the other side, we need some better solution for that in
general case.

I see what you mean, but I wasn't so much complaining about that, as
complaining about the simple fact that we use a top-N heap sort *at
all*. This feels like the "limit" case is playing with partial sort
sub-sorts in a way that it shouldn't.

I see you have code like this to make this work:

+       /*
+        * Adjust bound_Done with number of tuples we've actually sorted.
+        */
+       if (node->bounded)
+       {
+               if (node->finished)
+                       node->bound_Done = node->bound;
+               else
+                       node->bound_Done = Min(node->bound, node->bound_Done + nTuples);

But, why bother? Why not simply prevent tuplesort.c from ever using
the top-N heapsort method when it is called from nodeSort.c for a
partial sort (probably in the planner)?

Why, at a high level, does it make sense to pass down a limit to *any*
sort operation that makes up a partial sort, even the last? This seems
to be adding complexity without a benefit. A big advantage of top-N
heapsorts is that much less memory could be used, but this patch
already has the memory allocated that belonged to previous performsort
calls (mostly -- certainly has all those tuplesort.c memtuples
throughout, a major user of memory overall). It's not going to be
very good at preventing work, except almost by accident because we
happen to have a limit up to just past the beginning of a smaller
partial sort group. I'd rather use quicksort, which is very versatile.
Top-N sorts make sense when sorting itself is the bottleneck, which it
probably won't be for a partial sort (that's the whole point).

Hmm... I don't completely agree with that. In typical usage, partial sort
should definitely use quicksort. However, falling back to other sort methods
is very useful. The decision to use partial sort is made by the planner, but
the planner makes mistakes. For example, our HashAggregate is purely
in-memory, so a planner mistake there causes OOM. I have run into that
situation in production more than once. This is why I'd like partial sort to
degrade gracefully in such cases.

If the sort method was very likely the same for every performsort
(quicksort), which it otherwise would be, then I'd care way way less
that that could be a bit misleading in EXPLAIN ANALYZE output, because
typically the last one would be "close enough". Although, this isn't
quite like your SubPlan example, because the Sort node isn't reported
as e.g. "SubPlan 1" by EXPLAIN.

I've adjusted EXPLAIN ANALYZE to show maximum resource consumption.

I think that this has bugs for external sorts:

+void
+tuplesort_reset(Tuplesortstate *state)
+{
+       int i;
+
+       if (state->tapeset)
+               LogicalTapeSetClose(state->tapeset);
+
+       for (i = 0; i < state->memtupcount; i++)
+               free_sort_tuple(state, state->memtuples + i);
+
+       state->status = TSS_INITIAL;
+       state->memtupcount = 0;
+       state->boundUsed = false;
+       state->tapeset = NULL;
+       state->currentRun = 0;
+       state->result_tape = -1;
+       state->bounded = false;
+}

It's not okay to reset like this, especially not after the recent
commit 0011c0091, which could make this code unacceptably leak memory.
I realize that we really should never use an external sort here, but,
as you know, this is not the point.

So, I want to suggest that you use the regular code to destroy and
recreate a tuplesort in this case. Now, obviously that has some
significant disadvantages -- you want to reuse everything in the
common case when each sort is tiny. But you can still do that for that
very common case.

I think you need to use sortcontext memory context here on general
principle, even if current usage isn't broken by that.

Even if you get this right for external sorts once, it will break
again without anyone noticing until it's too late. Better to not rely
on it staying in sync, and find a way of having the standard
tuplesort.c initialization begin again.

Even though these free_sort_tuple() calls are still needed, you might
also call "MemoryContextReset(state->tuplecontext)" at the end. That
might prevent palloc() fragmentation when groups are of wildly
different sizes. Just an idea.

I tried to manage this by introducing another memory context which is
persistent between partial sort batches. Other memory contexts are reset.
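
A rough illustration of that split, assuming a toy bump allocator in place of real MemoryContexts: one arena survives for the whole sort, the other is reset between groups. Everything here (Arena, ToySort, the toy_sort_* functions, the 64-byte setup allocation) is hypothetical and only models the lifetime rule, not tuplesort.c itself.

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_SIZE 1024

/* Toy bump allocator standing in for a PostgreSQL MemoryContext. */
typedef struct Arena
{
	unsigned char buf[ARENA_SIZE];
	size_t		used;
} Arena;

/* Hand out "size" bytes, or NULL when the arena is full (no alignment care). */
static void *
arena_alloc(Arena *a, size_t size)
{
	void	   *p;

	if (size > ARENA_SIZE - a->used)
		return NULL;
	p = a->buf + a->used;
	a->used += size;
	return p;
}

/* Release everything in the arena at once, like a context reset. */
static void
arena_reset(Arena *a)
{
	a->used = 0;
}

typedef struct ToySort
{
	Arena		persistent;		/* lives across groups: sort setup state */
	Arena		batch;			/* per-group tuples, reset between groups */
	int			groups;			/* number of groups finished so far */
} ToySort;

static void
toy_sort_init(ToySort *s)
{
	s->persistent.used = 0;
	s->batch.used = 0;
	s->groups = 0;
	/* One-time setup that must survive every reset (size is arbitrary). */
	(void) arena_alloc(&s->persistent, 64);
}

/* Load one group's tuples into the batch arena; returns how many fit. */
static size_t
toy_sort_load_group(ToySort *s, size_t ntuples, size_t tuple_size)
{
	size_t		stored = 0;

	while (stored < ntuples && arena_alloc(&s->batch, tuple_size) != NULL)
		stored++;
	return stored;
}

/* Finish the current group: drop its tuples, keep the persistent state. */
static void
toy_sort_next_group(ToySort *s)
{
	arena_reset(&s->batch);
	s->groups++;
}
```

The point of the split is that moving to the next group costs one cheap reset rather than a full teardown and re-initialization, while nothing allocated for a previous group can leak into the next one.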

I don't like that you've added a Plan node argument to
ExecMaterializesOutput() in this function, too.

I don't like it either, but I didn't find a better solution without
significant rework of the planner.
However, "Upper planner pathification" by Tom Lane seems to include such a
rework. It's likely that sort becomes a separate path node there.
Then ExecMaterializesOutput could read the parameters of the path node.

A tuplesort may be randomAccess, or !randomAccess, as the caller
wishes. It's good for performance if the caller does not want
randomAccess, because then we can do our final merge on-the-fly if
it's an external sort, which helps a lot.

How is this different? ExecMaterializesOutput() seems to be about
whether or not the plan *could* materialize its output in principle,
even though you might well want to make it not do so in specific
cases. So, it's not so much that the new argument is ugly; rather, I
worry that it's wrong to make ExecMaterializesOutput() give a more
specific answer than anticipated by current callers.

Is the difference basically just that a partial sort could be
enormously faster, whereas a !randomAccess conventional sort is nice
to have, but not worth e.g. changing cost_sort() to account for?

You might just make a new function, ExecPlanMaterializesOutput(),
instead. That would call ExecMaterializesOutput() for non-Sort cases.

I've added an ExecPlanMaterializesOutput() function.
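
To show the distinction between the type-level and plan-level question in isolation, here is a reduced sketch. The enum, struct and function names are hypothetical; only the rule "a Sort with skipCols > 0 does not materialize its whole output" is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced node tags; the real NodeTag enum is much larger. */
typedef enum ToyNodeTag
{
	TOY_SEQSCAN,
	TOY_MATERIAL,
	TOY_SORT
} ToyNodeTag;

/* Hypothetical, reduced plan node carrying only what the check needs. */
typedef struct ToyPlan
{
	ToyNodeTag	type;
	int			skipCols;		/* number of presorted leading sort columns */
} ToyPlan;

/* Type-level question: could a node of this type materialize its output? */
static bool
toy_type_can_materialize(ToyNodeTag type)
{
	return type == TOY_MATERIAL || type == TOY_SORT;
}

/* Plan-level question: does this particular node materialize its output? */
static bool
toy_plan_materializes(const ToyPlan *plan)
{
	/* A partial sort only ever holds the current group. */
	if (plan->type == TOY_SORT)
		return plan->skipCols == 0;
	return toy_type_can_materialize(plan->type);
}
```

Existing callers that only know the node type keep getting the "in principle" answer, while callers holding an actual plan node can ask the stricter question.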

Optimizer
-------------

* cost_sort() needs way way more comments. Doesn't even mention
indexes. Not worth commenting further on until I know what it's
*supposed* to do, though.

I've added some comments.

Looking at cost_sort() now, it's a bit clearer. I think that you
should make sure that everything is costed as a quicksort, though, if
you accept that we should try and make every small sort done by the
partial sort a quicksort. Which I think is a good idea. The common
case is that groups are small, but the qsort() insertion sort will be
very very fast for that case.
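
As a self-contained illustration of "many small quicksorts", the sketch below sorts rows that already arrive ordered by v1 into (v1, v2) order by calling the C library's qsort() once per run of equal v1. The Row type and partial_sort_rows() are hypothetical; the executor works on tuples and slots, not arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row: v1 is the presorted column, v2 still needs sorting. */
typedef struct Row
{
	int			v1;
	double		v2;
} Row;

/* Compare two rows by v2 only; v1 is equal within a group. */
static int
cmp_v2(const void *a, const void *b)
{
	double		x = ((const Row *) a)->v2;
	double		y = ((const Row *) b)->v2;

	return (x > y) - (x < y);
}

/*
 * Sort rows that are already ordered by v1 into (v1, v2) order.
 * Returns the number of groups that were sorted.
 */
static size_t
partial_sort_rows(Row *rows, size_t n)
{
	size_t		start = 0;
	size_t		groups = 0;

	while (start < n)
	{
		size_t		end = start + 1;

		/* Extend the group while the presorted column stays equal. */
		while (end < n && rows[end].v1 == rows[start].v1)
			end++;

		/* Each group is sorted on its own, on the remaining column only. */
		qsort(rows + start, end - start, sizeof(Row), cmp_v2);
		groups++;
		start = end;
	}
	return groups;
}
```

Only one group needs to be held and sorted at a time, which is why a consumer with a small LIMIT can stop after the first group or two.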

I'm not sure that partial sort should estimate the cost of sorting one group
any differently than a regular sort does. I see the following reasons:
1) I'd like partial sort to be able to use other sorting methods, to provide
graceful degradation in the case of planner mistakes, as I pointed out above.
2) The planner should not choose partial sort plans in some cases. That
should happen because of the higher cost of these plans, including the high
cost of the partial sort itself. Estimating the cost of the other sort
methods looks like a good way of reporting such high costs.
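
To make the in-memory part of that estimate concrete, here is a simplified sketch of the per-group comparison cost, N * log2(N) for a group of N tuples, multiplied by the number of groups. It assumes uniform groups and no LIMIT, and ignores disk costs entirely. The function names are hypothetical, toy_log2() is a dependency-free stand-in for the planner's LOG2 macro, and the clamp to at least 2 tuples per group is an assumption of this sketch rather than something shown in the patch.

```c
#include <assert.h>

/* Dependency-free base-2 logarithm for x >= 1. */
static double
toy_log2(double x)
{
	double		result = 0.0;
	double		frac = 0.5;
	int			i;

	assert(x >= 1.0);

	/* Integer part: halve until x falls into [1, 2). */
	while (x >= 2.0)
	{
		x /= 2.0;
		result += 1.0;
	}

	/* Fractional part: one binary digit per squaring step. */
	for (i = 0; i < 50; i++)
	{
		x *= x;
		if (x >= 2.0)
		{
			x /= 2.0;
			result += frac;
		}
		frac /= 2.0;
	}
	return result;
}

/*
 * Comparison cost of sorting "tuples" rows split uniformly into
 * "num_groups" groups, with every group sorted (no LIMIT).
 */
static double
toy_sort_comparison_cost(double tuples, double num_groups,
						 double comparison_cost)
{
	double		group_tuples = tuples / num_groups;
	double		group_cost;

	/* Assumption of this sketch: avoid log2 of values below 2. */
	if (group_tuples < 2.0)
		group_tuples = 2.0;

	group_cost = comparison_cost * group_tuples * toy_log2(group_tuples);
	return group_cost * num_groups;
}
```

For 1024 tuples at unit comparison cost this gives 10240 for a single group (a full sort) and 2048 when the presorted keys split the input into 256 groups of 4, which is the gap the planner is supposed to see.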

This looks like an old change you missed:

- * compare_path_fractional_costs
+ * compare_fractional_path_costs

I think this is rather a typo fix, because currently the comment doesn't
match the function name.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

partial-sort-basic-8.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 09c2304..7c04321
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_grouping_set_keys(PlanS
*** 90,96 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 90,96 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** show_sort_keys(SortState *sortstate, Lis
*** 1786,1792 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1786,1792 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_merge_append_keys(MergeAppendState 
*** 1802,1808 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1802,1808 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1826,1832 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1826,1832 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 1882,1888 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 1882,1888 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 1939,1945 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 1939,1945 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 1952,1964 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 1952,1965 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 1998,2006 ****
--- 1999,2011 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2148,2159 ****
--- 2153,2173 ----
  			appendStringInfoSpaces(es->str, es->indent * 2);
  			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
  							 sortMethod, spaceType, spaceUsed);
+ 			if (sortstate->skipKeys)
+ 			{
+ 				appendStringInfoSpaces(es->str, es->indent * 2);
+ 				appendStringInfo(es->str, "Sort groups: %ld\n",
+ 								 sortstate->groupsCount);
+ 			}
  		}
  		else
  		{
  			ExplainPropertyText("Sort Method", sortMethod, es);
  			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
  			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			if (sortstate->skipKeys)
+ 				ExplainPropertyLong("Sort Groups",
+ 									sortstate->groupsCount, es);
  		}
  	}
  }
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 0c8e939..59bd3b4
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
*************** ExecSupportsMarkRestore(Path *pathnode)
*** 395,403 ****
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
- 		case T_Sort:
  			return true;
  
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
--- 395,409 ----
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (((SortPath *)pathnode)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
*************** ExecSupportsBackwardScan(Plan *node)
*** 511,520 ****
  			return false;
  
  		case T_Material:
- 		case T_Sort:
  			/* these don't evaluate tlist */
  			return true;
  
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
--- 517,532 ----
  			return false;
  
  		case T_Material:
  			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
*************** IndexSupportsBackwardScan(Oid indexid)
*** 567,578 ****
  }
  
  /*
!  * ExecMaterializesOutput - does a plan type materialize its output?
   *
!  * Returns true if the plan node type is one that automatically materializes
!  * its output (typically by keeping it in a tuplestore).  For such plans,
!  * a rescan without any parameter change will have zero startup cost and
!  * very low per-tuple cost.
   */
  bool
  ExecMaterializesOutput(NodeTag plantype)
--- 579,590 ----
  }
  
  /*
!  * ExecMaterializesOutput - can a plan type materialize its output?
   *
!  * Returns true if the plan node type can materialize its output.  When this
!  * function returns true, the result should be rechecked for the Plan node
!  * itself using ExecPlanMaterializesOutput().  Even though the plan type can
!  * materialize its output, a particular plan might not.
   */
  bool
  ExecMaterializesOutput(NodeTag plantype)
*************** ExecMaterializesOutput(NodeTag plantype)
*** 583,588 ****
--- 595,602 ----
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
+ 			return true;
+ 
  		case T_Sort:
  			return true;
  
*************** ExecMaterializesOutput(NodeTag plantype)
*** 592,594 ****
--- 606,631 ----
  
  	return false;
  }
+ 
+ /*
+  * ExecPlanMaterializesOutput - does a plan materialize its output?
+  *
+  * Returns true if the plan node automatically materializes its output
+  * (typically by keeping it in a tuplestore).  For such plans, a rescan without
+  * any parameter change will have zero startup cost and very low per-tuple cost.
+  */
+ bool
+ ExecPlanMaterializesOutput(Plan *node)
+ {
+ 	if (node->type == T_Sort)
+ 	{
+ 		if (((Sort *)node)->skipCols == 0)
+ 			return true;
+ 		else
+ 			return false;
+ 	}
+ 	else
+ 	{
+ 		return ExecMaterializesOutput(node->type);
+ 	}
+ }
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index 614b26b..1a092a4
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 586,591 ****
--- 586,592 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 664,670 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 665,671 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index a34dcc5..2369980
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,113 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
+ #include "utils/lsyscache.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(Sort *plannode, SortState *node)
+ {
+ 	int skipCols = plannode->skipCols, i;
+ 
+ 	node->skipKeys = (SkipKeyData *)palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 											plannode->sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 130,140 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+ 	PlanState  *outerNode;
+ 	TupleDesc	tupDesc;
+ 	int			skipCols = plannode->skipCols;
+ 	int64		nTuples = 0;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,87 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
! 
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
  											  plannode->numCols,
  											  plannode->sortColIdx,
--- 147,189 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
+ 	if (skipCols == 0)
+ 	{
+ 		/* Regular case: no skip cols */
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
  											  plannode->numCols,
  											  plannode->sortColIdx,
*************** ExecSort(SortState *node)
*** 89,132 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
  		{
! 			slot = ExecProcNode(outerNode);
  
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
  	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
--- 191,342 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
! 	}
! 	else
! 	{
! 		/* Partial sort case */
! 		if (node->tuplesortstate == NULL)
! 		{
! 			/*
! 			 * We are going to process the first group of presorted data.
! 			 * Initialize support structures for cmpSortSkipCols - already
! 			 * sorted columns.
! 			 */
! 			prepareSkipCols(plannode, node);
  
! 			/*
! 			 * Only pass on remaining columns that are unsorted.  Skip
! 			 * abbreviated keys usage for partial sort.  We unlikely will have
! 			 * abbreviated keys usage for partial sort.  We are unlikely to have
! 			 * huge groups with partial sort.  Therefore usage of abbreviated
! 			 * keys would likely be a waste of time.
! 			tuplesortstate = tuplesort_begin_heap(
! 										tupDesc,
! 										plannode->numCols - skipCols,
! 										&(plannode->sortColIdx[skipCols]),
! 										&(plannode->sortOperators[skipCols]),
! 										&(plannode->collations[skipCols]),
! 										&(plannode->nullsFirst[skipCols]),
! 										work_mem,
! 										false,
! 										true);
! 			node->tuplesortstate = (void *) tuplesortstate;
! 			node->groupsCount++;
! 		}
! 		else
  		{
! 			/* Next group of presorted data */
! 			tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
! 			node->groupsCount++;
! 		}
! 
! 		/* Calculate remaining bound for bounded sort */
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
! 	}
! 
! 	/*
! 	 * Put next group of tuples where skipCols sort values are equal to
! 	 * tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
+ 		if (skipCols == 0)
+ 		{
+ 			/* Regular sort case: put all tuples to the tuplesort */
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
+ 			nTuples++;
  		}
+ 		else
+ 		{
+ 			/* Partial sort case: put group of presorted data to the tuplesort */
+ 			if (!node->prev)
+ 			{
+ 				/* First tuple */
+ 				if (TupIsNull(slot))
+ 				{
+ 					node->finished = true;
+ 					break;
+ 				}
+ 				else
+ 				{
+ 					node->prev = ExecCopySlotTuple(slot);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				/* Put previous tuple into tuplesort */
+ 				ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 				tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
+ 				nTuples++;
  
! 				if (TupIsNull(slot))
! 				{
! 					node->finished = true;
! 					break;
! 				}
! 				else
! 				{
! 					bool	cmp;
! 					cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
  
! 					/* Replace previous tuple with current one */
! 					heap_freetuple(node->prev);
! 					node->prev = ExecCopySlotTuple(slot);
  
! 					/*
! 					 * When skipCols are not equal then group of presorted data
! 					 * is finished
! 					 */
! 					if (!cmp)
! 						break;
! 				}
! 			}
! 		}
! 	}
! 
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
! 
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 
! 	/*
! 	 * Adjust bound_Done with number of tuples we've actually sorted.
! 	 */
! 	if (node->bounded)
! 	{
! 		if (node->finished)
! 			node->bound_Done = node->bound;
! 		else
! 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
  	}
  
+ 	SO1_printf("ExecSort: %s\n", "sorting done");
+ 
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
*************** ExecInitSort(Sort *node, EState *estate,
*** 157,162 ****
--- 367,381 ----
  			   "initializing sort node");
  
  	/*
+ 	 * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+ 	 * or EXEC_FLAG_MARK, because we hold only current bucket in
+ 	 * tuplesortstate.
+ 	 */
+ 	Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+ 											 EXEC_FLAG_BACKWARD |
+ 											 EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
  	 * create state structure
  	 */
  	sortstate = makeNode(SortState);
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 393,404 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
+ 	sortstate->bound_Done = 0;
+ 	sortstate->groupsCount = 0;
+ 	sortstate->skipKeys = NULL;
  
  	/*
  	 * Miscellaneous initialization
*************** ExecReScanSort(SortState *node)
*** 318,323 ****
--- 542,548 ----
  		node->sort_Done = false;
  		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
  		node->tuplesortstate = NULL;
+ 		node->bound_Done = 0;
  
  		/*
  		 * if chgParam of subnode is not null then plan will be re-scanned by
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index f4e4a91..d890198
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 832,837 ****
--- 832,838 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index 5b71c95..d9ce9e4
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outSort(StringInfo str, const Sort *nod
*** 797,802 ****
--- 797,803 ----
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
+ 	WRITE_INT_FIELD(skipCols);
  
  	appendStringInfoString(str, " :sortColIdx");
  	for (i = 0; i < node->numCols; i++)
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 202e90a..fc319fe
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readSort(void)
*** 1955,1960 ****
--- 1955,1961 ----
  	ReadCommonPlan(&local_node->plan);
  
  	READ_INT_FIELD(numCols);
+ 	READ_INT_FIELD(skipCols);
  	READ_ATTRNUMBER_ARRAY(sortColIdx, local_node->numCols);
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index b86fc5e..dc9f240
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Path *runion, Path 
*** 1419,1424 ****
--- 1419,1431 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * The sort can be either a full sort of the relation or a partial sort, used
+  * when the data is already presorted by some of the required pathkeys.  In the
+  * second case we estimate the number of groups that the presorted pathkeys
+  * divide the source data into, and then estimate the cost of sorting each
+  * individual group, assuming the data is divided into groups uniformly.  Also,
+  * if LIMIT is specified, we only have to pull and sort some of the groups.
+  *
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1445,1451 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1452,1459 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1461,1475 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1469,1490 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1499,1511 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1514,1563 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate how many groups the presorted keys divide the dataset into.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate the average cost of sorting one group in which the presorted
! 	 * keys are equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = (group_input_bytes / sort_mem_bytes) * 0.5;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1515,1521 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1567,1573 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1526,1535 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1578,1587 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1537,1550 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1589,1614 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
+ 	/* Add the per-group cost of fetching tuples from the input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * The first group must be sorted before the node can return any output;
+ 	 * the remaining groups are sorted only to return the other tuples.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2297,2302 ****
--- 2361,2368 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2323,2328 ****
--- 2389,2396 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
*************** cost_subplan(PlannerInfo *root, SubPlan 
*** 3057,3063 ****
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan)))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
--- 3125,3131 ----
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecPlanMaterializesOutput(plan))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 4436ac1..d60c421
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 26,31 ****
--- 26,32 ----
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 309,314 ****
--- 310,341 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 368,375 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 395,406 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given parameterization and at least
!  *	  partially satisfies the given pathkeys.  Return NULL if no such path.
!  *	  If the pathkeys are only partially satisfied, a partial sort is needed
!  *	  to satisfy them completely.  Since a partial sort consumes its input
!  *	  one presorted group at a time, it has to read more data than a fully
!  *	  presorted path would.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 378,409 ****
   * 'pathkeys' represents a required ordering (in canonical form!)
   * 'required_outer' denotes allowable outer relations for parameterized paths
   * 'fraction' is the fraction of the total tuples expected to be retrieved
   */
  Path *
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction)
  {
  	Path	   *matched_path = NULL;
  	ListCell   *l;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
  
  		/*
! 		 * Since cost comparison is a lot cheaper than pathkey comparison, do
! 		 * that first.  (XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
! 			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
  
--- 409,480 ----
   * 'pathkeys' represents a required ordering (in canonical form!)
   * 'required_outer' denotes allowable outer relations for parameterized paths
   * 'fraction' is the fraction of the total tuples expected to be retrieved
+  * 'num_groups' is an array of group counts into which pathkey prefixes
+  *	  divide the data, as estimated by estimate_pathkeys_groups().
   */
  Path *
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  double *num_groups)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
+ 	double		matched_fraction;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
+ 		double		current_fraction;
+ 
+ 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+ 
+ 		if (n_pathkeys != 0 && n_common_pathkeys == 0)
+ 			continue;
  
  		/*
! 		 * Partial sort consumes data not per tuple but per presorted group.
! 		 * Increase fraction of tuples we have to read from source path by
! 		 * one presorted group.
  		 */
! 		current_fraction = fraction;
! 		if (n_common_pathkeys < n_pathkeys)
! 		{
! 			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
! 			current_fraction = Min(current_fraction, 1.0);
! 		}
  
! 		/*
! 		 * Compare costs, taking into account that the paths may match a
! 		 * different number of pathkeys and therefore need a different
! 		 * fraction of tuples to be fetched.
! 		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_bifractional_path_costs(matched_path, path,
! 					matched_fraction, current_fraction);
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
! 
! 		/*
! 		 * Cheaper path with matching outer becomes a new leader.
! 		 */
! 		if (costs_cmp > 0 &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 			matched_fraction = current_fraction;
+ 		}
  	}
+ 
  	return matched_path;
  }
  
*************** right_merge_direction(PlannerInfo *root,
*** 1448,1456 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
--- 1519,1526 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of leading pathkeys that match the requested ordering.
!  * The remaining keys can be satisfied by a partial sort.
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
*************** pathkeys_useful_for_ordering(PlannerInfo
*** 1461,1473 ****
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
! 	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
! 	}
! 
! 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1531,1542 ----
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	/*
! 	 * Return the number of pathkeys in common, or 0 if there are none.  Any
! 	 * common leading pathkeys can be useful for ordering because the rest
! 	 * can be handled by a partial sort.
! 	 */
! 	return pathkeys_common(root->query_pathkeys, pathkeys);
  }
  
  /*
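
The longest-common-prefix test in pathkeys_common() above relies on pathkeys
being canonical, so two keys are equal exactly when their pointers are equal.
The same idea can be shown without the planner's List machinery. This is only
an illustrative sketch: common_prefix_length is a hypothetical name, and plain
pointer arrays stand in for the pathkey lists.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the length of the longest common prefix of two key arrays.
 * Keys are compared by pointer identity, as with canonical pathkeys.
 */
size_t
common_prefix_length(const void *const *keys1, size_t n1,
					 const void *const *keys2, size_t n2)
{
	size_t		n = 0;

	/* An array pointer may be NULL only when its length is zero */
	assert(keys1 != NULL || n1 == 0);
	assert(keys2 != NULL || n2 == 0);

	/* Stop at the first mismatch or when either array runs out */
	while (n < n1 && n < n2 && keys1[n] == keys2[n])
		n++;

	return n;
}
```

If the first array is the requested ordering, a result of 0 means the other
ordering is of no use. A result equal to the length of the requested ordering
means no sort is needed. Anything in between is the case where a partial sort
would skip that many leading columns.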
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 994983b..6f38a6d
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 227,233 ****
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 227,233 ----
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 242,251 ****
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 242,253 ----
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_merge_append_plan(PlannerInfo *ro
*** 1032,1037 ****
--- 1034,1040 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1066,1074 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1069,1079 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1470,1475 ****
--- 1475,1481 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1479,1485 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1485,1495 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1727,1733 ****
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
--- 1737,1744 ----
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan,
! 										 0);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3579,3586 ****
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3590,3603 ----
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3591,3598 ****
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3608,3621 ----
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4610,4616 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4633,4640 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5132,5138 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
--- 5156,5162 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
*************** make_sort(Plan *lefttree, int numCols,
*** 5144,5149 ****
--- 5168,5174 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5470,5476 ****
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5495,5501 ----
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5490,5496 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5515,5521 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5533,5539 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5558,5564 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5554,5560 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5579,5586 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5587,5593 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5613,5619 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index cefec7b..75f3f29
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
*************** build_minmax_path(PlannerInfo *root, Min
*** 341,346 ****
--- 342,348 ----
  	Path	   *sorted_path;
  	Cost		path_cost;
  	double		path_fraction;
+ 	double	   *psort_num_groups;
  
  	/*
  	 * We are going to construct what is effectively a sub-SELECT query, so
*************** build_minmax_path(PlannerInfo *root, Min
*** 451,461 ****
  	else
  		path_fraction = 1.0;
  
  	sorted_path =
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction);
  	if (!sorted_path)
  		return false;
  
--- 453,467 ----
  	else
  		path_fraction = 1.0;
  
+ 	psort_num_groups = estimate_pathkeys_groups(subroot->query_pathkeys,
+ 												subroot,
+ 												final_rel->rows);
  	sorted_path =
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction,
! 												  psort_num_groups);
  	if (!sorted_path)
  		return false;
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index b2a9a80..0b7453e
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3513,3526 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3513,3526 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_common_pathkeys;
  
! 			n_common_pathkeys = pathkeys_common(root->group_pathkeys,
! 												path->pathkeys);
! 			if (path == cheapest_path || n_common_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_common_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4092,4104 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 4092,4104 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_common_pathkeys;
  
! 		n_common_pathkeys = pathkeys_common(root->sort_pathkeys,
! 											path->pathkeys);
! 		if (path == cheapest_input_path || n_common_pathkeys > 0)
  		{
! 			if (n_common_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 4998,5005 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 4998,5006 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 1ff4302..2bebc65
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** build_subplan(PlannerInfo *root, Plan *p
*** 837,843 ****
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan)))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
--- 837,843 ----
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecPlanMaterializesOutput(plan))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index a1ab4da..f570cb8
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 957,963 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 957,964 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0,
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index 89cae79..05ae03d
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** compare_fractional_path_costs(Path *path
*** 124,129 ****
--- 124,170 ----
  }
  
  /*
+  * compare_bifractional_path_costs
+  *	  Return -1, 0, or +1 according as fetching the fraction1 tuples of path1 is
+  *	  cheaper, the same cost, or more expensive than fetching fraction2 tuples
+  *	  of path2.
+  *
+  * fraction1 and fraction2 are fractions of total tuples between 0 and 1.
+  * A fraction that is <= 0 or >= 1 is interpreted as 1; when both are 1,
+  * we select the path with the cheaper total_cost.
+  */
+ 
+ /*
+  * Compare the costs of two paths, assuming that a different fraction of
+  * tuples is returned from each path.
+  */
+ int
+ compare_bifractional_path_costs(Path *path1, Path *path2,
+ 								double fraction1, double fraction2)
+ {
+ 	Cost		cost1,
+ 				cost2;
+ 
+ 	if (fraction1 <= 0.0 || fraction1 >= 1.0)
+ 		fraction1 = 1.0;
+ 	if (fraction2 <= 0.0 || fraction2 >= 1.0)
+ 		fraction2 = 1.0;
+ 
+ 	if (fraction1 == 1.0 && fraction2 == 1.0)
+ 		return compare_path_costs(path1, path2, TOTAL_COST);
+ 
+ 	cost1 = path1->startup_cost +
+ 		fraction1 * (path1->total_cost - path1->startup_cost);
+ 	cost2 = path2->startup_cost +
+ 		fraction2 * (path2->total_cost - path2->startup_cost);
+ 	if (cost1 < cost2)
+ 		return -1;
+ 	if (cost1 > cost2)
+ 		return +1;
+ 	return 0;
+ }
+ 
+ /*
   * compare_path_costs_fuzzily
   *	  Compare the costs of two paths to see if either can be said to
   *	  dominate the other.
*************** create_merge_append_path(PlannerInfo *ro
*** 1278,1289 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1319,1331 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1297,1302 ****
--- 1339,1346 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1533,1539 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1577,1584 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2275,2280 ****
--- 2320,2330 ----
  				 double limit_tuples)
  {
  	SortPath   *pathnode = makeNode(SortPath);
+ 	int			n_common_pathkeys;
+ 
+ 	n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+ 
+ 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
*************** create_sort_path(PlannerInfo *root,
*** 2287,2296 ****
  		subpath->parallel_safe;
  	pathnode->path.parallel_degree = subpath->parallel_degree;
  	pathnode->path.pathkeys = pathkeys;
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2337,2349 ----
  		subpath->parallel_safe;
  	pathnode->path.parallel_degree = subpath->parallel_degree;
  	pathnode->path.pathkeys = pathkeys;
+ 	pathnode->skipCols = n_common_pathkeys;
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2567,2573 ****
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
--- 2620,2627 ----
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL, 0,
! 					  0.0,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index fe44d56..8d1717c
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 276,282 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 276,282 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index a6555e9..60b8e9e
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3464,3469 ****
--- 3464,3505 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups into which the
+  * 							  dataset is divided by pathkeys.
+  *
+  * Returns an array of group counts.  The i'th element (counting from zero) is
+  * the number of groups into which the first i+1 pathkeys divide the dataset.
+  * This is a convenience wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucketsize fraction (ie, number of entries in a bucket
   * divided by total tuples in relation) if the specified expression is used
   * as a hash key.
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index d033c95..92ad83f
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 226,231 ****
--- 226,236 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	TupSortStatus maxStatus;	/* maximum status reached between sort groups */
+ 	int64		maxMem;			/* maximum amount of memory used between
+ 								   sort groups */
+ 	bool		maxMemOnDisk;	/* does maxMem refer to on-disk space? */
+ 	MemoryContext maincontext;	/* memory context surviving tuplesort_reset */
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext;	/* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void writetup_datum(Tuplesortstat
*** 548,553 ****
--- 553,561 ----
  static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
  			  int tapenum, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 582,602 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_MINSIZE,
  										ALLOCSET_DEFAULT_INITSIZE,
  										ALLOCSET_DEFAULT_MAXSIZE);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child content used exclusively for caller passed tuples
--- 590,621 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_MINSIZE,
  										ALLOCSET_DEFAULT_INITSIZE,
  										ALLOCSET_DEFAULT_MAXSIZE);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The content of
+ 	 * this context is deleted by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_MINSIZE,
+ 										ALLOCSET_DEFAULT_INITSIZE,
+ 										ALLOCSET_DEFAULT_MAXSIZE);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child content used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 615,621 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 634,640 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 633,638 ****
--- 652,658 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 673,685 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 693,706 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 721,727 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 742,748 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 752,758 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 773,779 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 843,849 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 864,870 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 916,922 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 937,943 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 953,959 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 974,980 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1064,1079 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1085,1096 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing resources of tuplesort.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1132,1138 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1149,1242 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax 
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	memUsed;
! 	bool	memUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		memUsedOnDisk = true;
! 		memUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		memUsedOnDisk = false;
! 		memUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	state->maxStatus = Max(state->maxStatus, state->status);
! 	if (memUsed > state->maxMem)
! 	{
! 		state->maxMem = memUsed;
! 		state->maxMemOnDisk = memUsedOnDisk;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Discard all the data in the tuplesort, but keep the
!  *	meta-information.  After tuplesort_reset, the tuplesort is ready to start
!  *	a new sort.  This avoids recreating the tuplesort (and saves resources)
!  *	when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->batchUsed = false;
! 	state->availMem = state->allowedMem;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3269,3295 ****
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
  		*spaceType = "Disk";
- 		*spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		*spaceType = "Memory";
! 		*spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3373,3387 ----
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxMemOnDisk)
  		*spaceType = "Disk";
  	else
  		*spaceType = "Memory";
! 	*spaceUsed = (state->maxMem + 1023) / 1024;
  
! 	switch (state->maxStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
new file mode 100644
index 44fac27..998726f
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
*************** extern void ExecRestrPos(PlanState *node
*** 107,112 ****
--- 107,113 ----
  extern bool ExecSupportsMarkRestore(struct Path *pathnode);
  extern bool ExecSupportsBackwardScan(Plan *node);
  extern bool ExecMaterializesOutput(NodeTag plantype);
+ extern bool ExecPlanMaterializesOutput(Plan *node);
  
  /*
   * prototypes from functions in execCurrent.c
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index dbec07e..a43a058
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1772,1777 ****
--- 1772,1791 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset may already be
+  *	 presorted by some prefix of these keys.  We call them "skip keys".
+  *	 SkipKeyData represents information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1783,1791 ****
--- 1797,1810 ----
  	bool		bounded;		/* is the result set bounded? */
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* has fetching tuples from the outer node
+ 								   finished? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	long		groupsCount;	/* number of groups with equal skip keys */
+ 	HeapTuple	prev;			/* previous tuple from outer node */
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index ea8554f..10aecd1
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 672,677 ****
--- 672,678 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;		/* number of presorted leading columns */
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 641446c..d240f8b
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1248,1253 ****
--- 1248,1254 ----
  {
  	Path		path;
  	Path	   *subpath;		/* path representing input source */
+ 	int			skipCols;		/* number of presorted leading pathkeys */
  } SortPath;
  
  /*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index d4adca6..a2ffd5f
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 95,102 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 95,103 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
new file mode 100644
index acc827d..ad86882
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
*************** extern int compare_path_costs(Path *path
*** 24,29 ****
--- 24,31 ----
  				   CostSelector criterion);
  extern int compare_fractional_path_costs(Path *path1, Path *path2,
  							  double fraction);
+ extern int compare_bifractional_path_costs(Path *path1, Path *path2,
+ 							  double fraction1, double fraction2);
  extern void set_cheapest(RelOptInfo *parent_rel);
  extern void add_path(RelOptInfo *parent_rel, Path *new_path);
  extern bool add_path_precheck(RelOptInfo *parent_rel,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 2fccc3a..71b2b84
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 166,178 ****
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--- 166,180 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  double *num_groups);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 8e0d317..06c0d7d
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 230,235 ****
--- 230,238 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
  						 double nbuckets);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5cecd6d..6476504
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 62,69 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index 601bdb4..6f3b86b
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t
*** 898,912 ****
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                       QUERY PLAN                       
! -------------------------------------------------------
!  HashAggregate
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Merge Join
!          Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!          ->  Index Scan using t1_pkey on t1
!          ->  Index Scan using t2_pkey on t2
! (6 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
--- 898,915 ----
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                          QUERY PLAN                          
! -------------------------------------------------------------
!  Group
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Sort
!          Sort Key: t1.a, t1.b, t2.z
!          Presorted Key: t1.a, t1.b
!          ->  Merge Join
!                Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!                ->  Index Scan using t1_pkey on t1
!                ->  Index Scan using t2_pkey on t2
! (9 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index d8b5b1d..752f1b6
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** ORDER BY thousand, tenthous;
*** 1359,1366 ****
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1
     ->  Sort
           Sort Key: tenk1_1.thousand, tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (6 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
--- 1359,1367 ----
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1
     ->  Sort
           Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (7 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
*************** ORDER BY x, y;
*** 1443,1450 ****
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
     ->  Sort
           Sort Key: b.unique2, b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (6 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
--- 1444,1452 ----
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
     ->  Sort
           Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (7 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
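
The compare_bifractional_path_costs() function in the patch above compares two paths by interpolating between startup and total cost, with a separate fraction for each path. A minimal standalone sketch of that arithmetic follows. PathCost, fractional_cost() and compare_bifractional() are hypothetical names used only for this illustration, and the tie-breaking that compare_path_costs() performs in the real planner is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two cost fields of a planner Path. */
typedef struct PathCost
{
    double startup_cost;    /* cost before the first tuple is returned */
    double total_cost;      /* cost to return all tuples */
} PathCost;

/*
 * Cost of fetching the given fraction of a path's output.  As in the patch,
 * a fraction outside the open interval (0, 1) means "fetch everything".
 */
static double
fractional_cost(const PathCost *p, double fraction)
{
    assert(p != NULL);

    if (fraction <= 0.0 || fraction >= 1.0)
        fraction = 1.0;

    return p->startup_cost + fraction * (p->total_cost - p->startup_cost);
}

/*
 * Return -1, 0 or +1 depending on whether p1 is cheaper than, equal to, or
 * more expensive than p2, each evaluated at its own fraction.
 * (Illustration only: no fuzz factor and no startup-cost tie-break.)
 */
int
compare_bifractional(const PathCost *p1, const PathCost *p2,
                     double fraction1, double fraction2)
{
    double cost1 = fractional_cost(p1, fraction1);
    double cost2 = fractional_cost(p2, fraction2);

    if (cost1 < cost2)
        return -1;
    if (cost1 > cost2)
        return +1;
    return 0;
}
```

With made-up numbers, a path with a high startup cost (a full sort) loses to a low-startup path (an ordered index scan) when only 1% of the rows is needed, but wins when the whole result is fetched.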
#84Peter Geoghegan
pg@heroku.com
In reply to: Alexander Korotkov (#83)
Re: PoC: Partial sort

On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Hmm... I don't completely agree with that. In typical usage, partial sort
should definitely use quicksort. However, falling back to other sort methods
is very useful. The decision to use partial sort is made by the planner, but
the planner makes mistakes. For example, our HashAggregate is purely
in-memory; when the planner is mistaken, it causes OOM. I have met this
situation in production more than once. This is why I'd like partial sort to
degrade gracefully in such cases.

I think that this should be moved to the next CF, unless a committer
wants to pick it up today.

--
Peter Geoghegan


#85Alexander Korotkov
aekorotkov@gmail.com
In reply to: Peter Geoghegan (#84)
1 attachment(s)
Re: PoC: Partial sort

On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Hmm... I don't completely agree with that. In typical usage, partial sort
should definitely use quicksort. However, falling back to other sort methods
is very useful. The decision to use partial sort is made by the planner, but
the planner makes mistakes. For example, our HashAggregate is purely
in-memory; when the planner is mistaken, it causes OOM. I have met this
situation in production more than once. This is why I'd like partial sort to
degrade gracefully in such cases.

I think that this should be moved to the next CF, unless a committer
wants to pick it up today.

Patch was rebased to current master.
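
For readers joining the thread here, the core idea of the patch can be illustrated outside PostgreSQL. The following is a minimal standalone sketch, not code from the patch: the Row type and the partial_sort() function are hypothetical names made up for this illustration. It assumes the input is already ordered by v1 (as an index scan on v1 would deliver it) and sorts each run of equal v1 values by v2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row type: v1 is the presorted key, v2 still needs sorting. */
typedef struct Row
{
    int    v1;
    double v2;
} Row;

/* Compare two rows on v2 only; v1 is equal within a group. */
static int
cmp_v2(const void *a, const void *b)
{
    const Row *ra = (const Row *) a;
    const Row *rb = (const Row *) b;

    if (ra->v2 < rb->v2)
        return -1;
    if (ra->v2 > rb->v2)
        return 1;
    return 0;
}

/*
 * Order rows by (v1, v2), assuming the input is already ordered by v1.
 * Each group of consecutive rows with equal v1 is sorted independently.
 * Returns the number of groups that were sorted.
 */
long
partial_sort(Row *rows, size_t nrows)
{
    size_t start = 0;
    long   ngroups = 0;

    assert(rows != NULL || nrows == 0);

    while (start < nrows)
    {
        size_t end = start + 1;

        /* Find the end of the current group of equal v1 values. */
        while (end < nrows && rows[end].v1 == rows[start].v1)
            end++;

        qsort(rows + start, end - start, sizeof(Row), cmp_v2);
        ngroups++;
        start = end;
    }

    return ngroups;
}
```

Because each group is sorted independently, the first rows of the result are available as soon as the first group has been sorted, which is why the approach is attractive under LIMIT.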

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

partial-sort-basic-9.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
new file mode 100644
index 1247433..d2955b7
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** static void show_grouping_set_keys(PlanS
*** 91,97 ****
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
--- 91,97 ----
  static void show_group_keys(GroupState *gstate, List *ancestors,
  				ExplainState *es);
  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es);
  static void show_sortorder_options(StringInfo buf, Node *sortexpr,
*************** show_sort_keys(SortState *sortstate, Lis
*** 1810,1816 ****
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1810,1816 ----
  	Sort	   *plan = (Sort *) sortstate->ss.ps.plan;
  
  	show_sort_group_keys((PlanState *) sortstate, "Sort Key",
! 						 plan->numCols, plan->skipCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_merge_append_keys(MergeAppendState 
*** 1826,1832 ****
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
--- 1826,1832 ----
  	MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
  
  	show_sort_group_keys((PlanState *) mstate, "Sort Key",
! 						 plan->numCols, 0, plan->sortColIdx,
  						 plan->sortOperators, plan->collations,
  						 plan->nullsFirst,
  						 ancestors, es);
*************** show_agg_keys(AggState *astate, List *an
*** 1850,1856 ****
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
--- 1850,1856 ----
  			show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
  		else
  			show_sort_group_keys(outerPlanState(astate), "Group Key",
! 								 plan->numCols, 0, plan->grpColIdx,
  								 NULL, NULL, NULL,
  								 ancestors, es);
  
*************** show_grouping_set_keys(PlanState *planst
*** 1906,1912 ****
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
--- 1906,1912 ----
  	if (sortnode)
  	{
  		show_sort_group_keys(planstate, "Sort Key",
! 							 sortnode->numCols, 0, sortnode->sortColIdx,
  							 sortnode->sortOperators, sortnode->collations,
  							 sortnode->nullsFirst,
  							 ancestors, es);
*************** show_group_keys(GroupState *gstate, List
*** 1963,1969 ****
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
--- 1963,1969 ----
  	/* The key columns refer to the tlist of the child plan */
  	ancestors = lcons(gstate, ancestors);
  	show_sort_group_keys(outerPlanState(gstate), "Group Key",
! 						 plan->numCols, 0, plan->grpColIdx,
  						 NULL, NULL, NULL,
  						 ancestors, es);
  	ancestors = list_delete_first(ancestors);
*************** show_group_keys(GroupState *gstate, List
*** 1976,1988 ****
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
--- 1976,1989 ----
   */
  static void
  show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 					 int nkeys, int nPresortedKeys, AttrNumber *keycols,
  					 Oid *sortOperators, Oid *collations, bool *nullsFirst,
  					 List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate->plan;
  	List	   *context;
  	List	   *result = NIL;
+ 	List	   *resultPresorted = NIL;
  	StringInfoData sortkeybuf;
  	bool		useprefix;
  	int			keyno;
*************** show_sort_group_keys(PlanState *planstat
*** 2022,2030 ****
--- 2023,2035 ----
  								   nullsFirst[keyno]);
  		/* Emit one property-list item per sort key */
  		result = lappend(result, pstrdup(sortkeybuf.data));
+ 		if (keyno < nPresortedKeys)
+ 			resultPresorted = lappend(resultPresorted, exprstr);
  	}
  
  	ExplainPropertyList(qlabel, result, es);
+ 	if (nPresortedKeys > 0)
+ 		ExplainPropertyList("Presorted Key", resultPresorted, es);
  }
  
  /*
*************** show_sort_info(SortState *sortstate, Exp
*** 2172,2183 ****
--- 2177,2197 ----
  			appendStringInfoSpaces(es->str, es->indent * 2);
  			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
  							 sortMethod, spaceType, spaceUsed);
+ 			if (sortstate->skipKeys)
+ 			{
+ 				appendStringInfoSpaces(es->str, es->indent * 2);
+ 				appendStringInfo(es->str, "Sort groups: %ld\n",
+ 								 sortstate->groupsCount);
+ 			}
  		}
  		else
  		{
  			ExplainPropertyText("Sort Method", sortMethod, es);
  			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
  			ExplainPropertyText("Sort Space Type", spaceType, es);
+ 			if (sortstate->skipKeys)
+ 				ExplainPropertyLong("Sort groups",
+ 									sortstate->groupsCount, es);
  		}
  	}
  }
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
new file mode 100644
index 2587ef7..4e7c222
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
*************** ExecSupportsMarkRestore(Path *pathnode)
*** 395,403 ****
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
- 		case T_Sort:
  			return true;
  
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
--- 395,409 ----
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_Material:
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (((SortPath *)pathnode)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_CustomScan:
  			Assert(IsA(pathnode, CustomPath));
  			if (((CustomPath *) pathnode)->flags & CUSTOMPATH_SUPPORT_MARK_RESTORE)
*************** ExecSupportsBackwardScan(Plan *node)
*** 510,519 ****
  			return false;
  
  		case T_Material:
- 		case T_Sort:
  			/* these don't evaluate tlist */
  			return true;
  
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
--- 516,531 ----
  			return false;
  
  		case T_Material:
  			/* these don't evaluate tlist */
  			return true;
  
+ 		case T_Sort:
+ 			/* With skipCols sort node holds only last bucket */
+ 			if (((Sort *)node)->skipCols == 0)
+ 				return true;
+ 			else
+ 				return false;
+ 
  		case T_LockRows:
  		case T_Limit:
  			/* these don't evaluate tlist */
*************** IndexSupportsBackwardScan(Oid indexid)
*** 566,577 ****
  }
  
  /*
!  * ExecMaterializesOutput - does a plan type materialize its output?
   *
!  * Returns true if the plan node type is one that automatically materializes
!  * its output (typically by keeping it in a tuplestore).  For such plans,
!  * a rescan without any parameter change will have zero startup cost and
!  * very low per-tuple cost.
   */
  bool
  ExecMaterializesOutput(NodeTag plantype)
--- 578,589 ----
  }
  
  /*
!  * ExecMaterializesOutput - can a plan type materialize its output?
   *
!  * Returns true if the plan node type can materialize its output.  When this
!  * function returns true, it should be rechecked for the Plan node itself
!  * using ExecPlanMaterializesOutput, because a particular plan may not
!  * materialize its output even though its plan type can.
   */
  bool
  ExecMaterializesOutput(NodeTag plantype)
*************** ExecMaterializesOutput(NodeTag plantype)
*** 582,587 ****
--- 594,601 ----
  		case T_FunctionScan:
  		case T_CteScan:
  		case T_WorkTableScan:
+ 			return true;
+ 
  		case T_Sort:
  			return true;
  
*************** ExecMaterializesOutput(NodeTag plantype)
*** 591,593 ****
--- 605,630 ----
  
  	return false;
  }
+ 
+ /*
+  * ExecPlanMaterializesOutput - does a plan materialize its output?
+  *
+  * Returns true if the plan node automatically materializes its output
+  * (typically by keeping it in a tuplestore).  For such plans, a rescan without
+  * any parameter change will have zero startup cost and very low per-tuple cost.
+  */
+ bool
+ ExecPlanMaterializesOutput(Plan *node)
+ {
+ 	if (node->type == T_Sort)
+ 	{
+ 		if (((Sort *)node)->skipCols == 0)
+ 			return true;
+ 		else
+ 			return false;
+ 	}
+ 	else
+ 	{
+ 		return ExecMaterializesOutput(node->type);
+ 	}
+ }
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
new file mode 100644
index ce2fc28..d2c04e5
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** initialize_phase(AggState *aggstate, int
*** 567,572 ****
--- 567,573 ----
  												  sortnode->collations,
  												  sortnode->nullsFirst,
  												  work_mem,
+ 												  false,
  												  false);
  	}
  
*************** initialize_aggregate(AggState *aggstate,
*** 645,651 ****
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false);
  	}
  
  	/*
--- 646,652 ----
  									 pertrans->sortOperators,
  									 pertrans->sortCollations,
  									 pertrans->sortNullsFirst,
! 									 work_mem, false, false);
  	}
  
  	/*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
new file mode 100644
index a34dcc5..2369980
*** a/src/backend/executor/nodeSort.c
--- b/src/backend/executor/nodeSort.c
***************
*** 15,25 ****
--- 15,113 ----
  
  #include "postgres.h"
  
+ #include "access/htup_details.h"
  #include "executor/execdebug.h"
  #include "executor/nodeSort.h"
  #include "miscadmin.h"
+ #include "utils/lsyscache.h"
  #include "utils/tuplesort.h"
  
+ /*
+  * Check if first "skipCols" sort values are equal.
+  */
+ static bool
+ cmpSortSkipCols(SortState *node, TupleDesc tupDesc, HeapTuple a, TupleTableSlot *b)
+ {
+ 	int n = ((Sort *)node->ss.ps.plan)->skipCols, i;
+ 
+ 	for (i = 0; i < n; i++)
+ 	{
+ 		Datum datumA, datumB, result;
+ 		bool isnullA, isnullB;
+ 		AttrNumber attno = node->skipKeys[i].attno;
+ 		SkipKeyData *key;
+ 
+ 		datumA = heap_getattr(a, attno, tupDesc, &isnullA);
+ 		datumB = slot_getattr(b, attno, &isnullB);
+ 
+ 		/* Special case for NULL-vs-NULL, else use standard comparison */
+ 		if (isnullA || isnullB)
+ 		{
+ 			if (isnullA == isnullB)
+ 				continue;
+ 			else
+ 				return false;
+ 		}
+ 
+ 		key = &node->skipKeys[i];
+ 
+ 		key->fcinfo.arg[0] = datumA;
+ 		key->fcinfo.arg[1] = datumB;
+ 
+ 		/* just for paranoia's sake, we reset isnull each time */
+ 		key->fcinfo.isnull = false;
+ 
+ 		result = FunctionCallInvoke(&key->fcinfo);
+ 
+ 		/* Check for null result, since caller is clearly not expecting one */
+ 		if (key->fcinfo.isnull)
+ 			elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+ 
+ 		if (!DatumGetBool(result))
+ 			return false;
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Prepare information for skipKeys comparison.
+  */
+ static void
+ prepareSkipCols(Sort *plannode, SortState *node)
+ {
+ 	int skipCols = plannode->skipCols, i;
+ 
+ 	node->skipKeys = (SkipKeyData *)palloc(skipCols * sizeof(SkipKeyData));
+ 
+ 	for (i = 0; i < skipCols; i++)
+ 	{
+ 		Oid equalityOp, equalityFunc;
+ 		SkipKeyData *key;
+ 
+ 		key = &node->skipKeys[i];
+ 		key->attno = plannode->sortColIdx[i];
+ 
+ 		equalityOp = get_equality_op_for_ordering_op(
+ 											plannode->sortOperators[i], NULL);
+ 		if (!OidIsValid(equalityOp))
+ 			elog(ERROR, "missing equality operator for ordering operator %u",
+ 					plannode->sortOperators[i]);
+ 
+ 		equalityFunc = get_opcode(equalityOp);
+ 		if (!OidIsValid(equalityFunc))
+ 			elog(ERROR, "missing function for operator %u", equalityOp);
+ 
+ 		/* Lookup the comparison function */
+ 		fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+ 
+ 		/* We can initialize the callinfo just once and re-use it */
+ 		InitFunctionCallInfoData(key->fcinfo, &key->flinfo, 2,
+ 								plannode->collations[i], NULL, NULL);
+ 		key->fcinfo.argnull[0] = false;
+ 		key->fcinfo.argnull[1] = false;
+ 	}
+ }
+ 
  
  /* ----------------------------------------------------------------
   *		ExecSort
*************** ExecSort(SortState *node)
*** 42,47 ****
--- 130,140 ----
  	ScanDirection dir;
  	Tuplesortstate *tuplesortstate;
  	TupleTableSlot *slot;
+ 	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+ 	PlanState  *outerNode;
+ 	TupleDesc	tupDesc;
+ 	int			skipCols = plannode->skipCols;
+ 	int64		nTuples = 0;
  
  	/*
  	 * get state info from node
*************** ExecSort(SortState *node)
*** 54,87 ****
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	if (!node->sort_Done)
! 	{
! 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
! 		PlanState  *outerNode;
! 		TupleDesc	tupDesc;
! 
! 		SO1_printf("ExecSort: %s\n",
! 				   "sorting subplan");
  
! 		/*
! 		 * Want to scan subplan in the forward direction while creating the
! 		 * sorted data.
! 		 */
! 		estate->es_direction = ForwardScanDirection;
  
! 		/*
! 		 * Initialize tuplesort module.
! 		 */
! 		SO1_printf("ExecSort: %s\n",
! 				   "calling tuplesort_begin");
  
! 		outerNode = outerPlanState(node);
! 		tupDesc = ExecGetResultType(outerNode);
  
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
  											  plannode->numCols,
  											  plannode->sortColIdx,
--- 147,189 ----
  	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
  
  	/*
+ 	 * Return next tuple from sorted set if any.
+ 	 */
+ 	if (node->sort_Done)
+ 	{
+ 		slot = node->ss.ps.ps_ResultTupleSlot;
+ 		if (tuplesort_gettupleslot(tuplesortstate,
+ 									  ScanDirectionIsForward(dir),
+ 									  slot, NULL) || node->finished)
+ 			return slot;
+ 	}
+ 
+ 	/*
  	 * If first time through, read all tuples from outer plan and pass them to
  	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
  	 */
  
! 	SO1_printf("ExecSort: %s\n",
! 			   "sorting subplan");
  
! 	/*
! 	 * Want to scan subplan in the forward direction while creating the
! 	 * sorted data.
! 	 */
! 	estate->es_direction = ForwardScanDirection;
  
! 	/*
! 	 * Initialize tuplesort module.
! 	 */
! 	SO1_printf("ExecSort: %s\n",
! 			   "calling tuplesort_begin");
  
! 	outerNode = outerPlanState(node);
! 	tupDesc = ExecGetResultType(outerNode);
  
+ 	if (skipCols == 0)
+ 	{
+ 		/* Regular case: no skip cols */
  		tuplesortstate = tuplesort_begin_heap(tupDesc,
  											  plannode->numCols,
  											  plannode->sortColIdx,
*************** ExecSort(SortState *node)
*** 89,132 ****
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess);
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		/*
! 		 * Scan the subplan and feed all the tuples to tuplesort.
! 		 */
  
! 		for (;;)
  		{
! 			slot = ExecProcNode(outerNode);
  
  			if (TupIsNull(slot))
  				break;
! 
  			tuplesort_puttupleslot(tuplesortstate, slot);
  		}
  
! 		/*
! 		 * Complete the sort.
! 		 */
! 		tuplesort_performsort(tuplesortstate);
  
! 		/*
! 		 * restore to user specified direction
! 		 */
! 		estate->es_direction = dir;
  
! 		/*
! 		 * finally set the sorted flag to true
! 		 */
! 		node->sort_Done = true;
! 		node->bounded_Done = node->bounded;
! 		node->bound_Done = node->bound;
! 		SO1_printf("ExecSort: %s\n", "sorting done");
  	}
  
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
--- 191,342 ----
  											  plannode->collations,
  											  plannode->nullsFirst,
  											  work_mem,
! 											  node->randomAccess,
! 											  false);
  		node->tuplesortstate = (void *) tuplesortstate;
  
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound);
! 	}
! 	else
! 	{
! 		/* Partial sort case */
! 		if (node->tuplesortstate == NULL)
! 		{
! 			/*
! 			 * We are going to process the first group of presorted data.
! 			 * Initialize support structures for cmpSortSkipCols - already
! 			 * sorted columns.
! 			 */
! 			prepareSkipCols(plannode, node);
  
! 			/*
! 			 * Only pass on remaining columns that are unsorted.  Skip
! 			 * abbreviated keys usage for partial sort.  We unlikely will have
! 			 * huge groups with partial sort.  Therefore usage of abbreviated
! 			 * keys would be likely a waste of time.
! 			 */
! 			tuplesortstate = tuplesort_begin_heap(
! 										tupDesc,
! 										plannode->numCols - skipCols,
! 										&(plannode->sortColIdx[skipCols]),
! 										&(plannode->sortOperators[skipCols]),
! 										&(plannode->collations[skipCols]),
! 										&(plannode->nullsFirst[skipCols]),
! 										work_mem,
! 										false,
! 										true);
! 			node->tuplesortstate = (void *) tuplesortstate;
! 			node->groupsCount++;
! 		}
! 		else
  		{
! 			/* Next group of presorted data */
! 			tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
! 			node->groupsCount++;
! 		}
! 
! 		/* Calculate remaining bound for bounded sort */
! 		if (node->bounded)
! 			tuplesort_set_bound(tuplesortstate, node->bound - node->bound_Done);
! 	}
! 
! 	/*
! 	 * Put next group of tuples where skipCols sort values are equal to
! 	 * tuplesort.
! 	 */
! 	for (;;)
! 	{
! 		slot = ExecProcNode(outerNode);
  
+ 		if (skipCols == 0)
+ 		{
+ 			/* Regular sort case: put all tuples to the tuplesort */
  			if (TupIsNull(slot))
+ 			{
+ 				node->finished = true;
  				break;
! 			}
  			tuplesort_puttupleslot(tuplesortstate, slot);
+ 			nTuples++;
  		}
+ 		else
+ 		{
+ 			/* Partial sort case: put group of presorted data to the tuplesort */
+ 			if (!node->prev)
+ 			{
+ 				/* First tuple */
+ 				if (TupIsNull(slot))
+ 				{
+ 					node->finished = true;
+ 					break;
+ 				}
+ 				else
+ 				{
+ 					node->prev = ExecCopySlotTuple(slot);
+ 				}
+ 			}
+ 			else
+ 			{
+ 				/* Put previous tuple into tuplesort */
+ 				ExecStoreTuple(node->prev, node->ss.ps.ps_ResultTupleSlot, InvalidBuffer, false);
+ 				tuplesort_puttupleslot(tuplesortstate, node->ss.ps.ps_ResultTupleSlot);
+ 				nTuples++;
  
! 				if (TupIsNull(slot))
! 				{
! 					node->finished = true;
! 					break;
! 				}
! 				else
! 				{
! 					bool	cmp;
! 					cmp = cmpSortSkipCols(node, tupDesc, node->prev, slot);
  
! 					/* Replace previous tuple with current one */
! 					heap_freetuple(node->prev);
! 					node->prev = ExecCopySlotTuple(slot);
  
! 					/*
! 					 * When skipCols are not equal then group of presorted data
! 					 * is finished
! 					 */
! 					if (!cmp)
! 						break;
! 				}
! 			}
! 		}
! 	}
! 
! 	/*
! 	 * Complete the sort.
! 	 */
! 	tuplesort_performsort(tuplesortstate);
! 
! 	/*
! 	 * restore to user specified direction
! 	 */
! 	estate->es_direction = dir;
! 
! 	/*
! 	 * finally set the sorted flag to true
! 	 */
! 	node->sort_Done = true;
! 	node->bounded_Done = node->bounded;
! 
! 	/*
! 	 * Adjust bound_Done with number of tuples we've actually sorted.
! 	 */
! 	if (node->bounded)
! 	{
! 		if (node->finished)
! 			node->bound_Done = node->bound;
! 		else
! 			node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
  	}
  
+ 	SO1_printf("ExecSort: %s\n", "sorting done");
+ 
  	SO1_printf("ExecSort: %s\n",
  			   "retrieving tuple from tuplesort");
  
*************** ExecInitSort(Sort *node, EState *estate,
*** 157,162 ****
--- 367,381 ----
  			   "initializing sort node");
  
  	/*
+ 	 * skipCols can't be used with either EXEC_FLAG_REWIND, EXEC_FLAG_BACKWARD
+ 	 * or EXEC_FLAG_MARK, because we hold only current bucket in
+ 	 * tuplesortstate.
+ 	 */
+ 	Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+ 											 EXEC_FLAG_BACKWARD |
+ 											 EXEC_FLAG_MARK)) == 0);
+ 
+ 	/*
  	 * create state structure
  	 */
  	sortstate = makeNode(SortState);
*************** ExecInitSort(Sort *node, EState *estate,
*** 174,180 ****
--- 393,404 ----
  
  	sortstate->bounded = false;
  	sortstate->sort_Done = false;
+ 	sortstate->finished = false;
  	sortstate->tuplesortstate = NULL;
+ 	sortstate->prev = NULL;
+ 	sortstate->bound_Done = 0;
+ 	sortstate->groupsCount = 0;
+ 	sortstate->skipKeys = NULL;
  
  	/*
  	 * Miscellaneous initialization
*************** ExecReScanSort(SortState *node)
*** 318,323 ****
--- 542,548 ----
  		node->sort_Done = false;
  		tuplesort_end((Tuplesortstate *) node->tuplesortstate);
  		node->tuplesortstate = NULL;
+ 		node->bound_Done = 0;
  
  		/*
  		 * if chgParam of subnode is not null then plan will be re-scanned by
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
new file mode 100644
index 4f39dad..f8e1596
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*************** _copySort(const Sort *from)
*** 832,837 ****
--- 832,838 ----
  	CopyPlanFields((const Plan *) from, (Plan *) newnode);
  
  	COPY_SCALAR_FIELD(numCols);
+ 	COPY_SCALAR_FIELD(skipCols);
  	COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
  	COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
  	COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
new file mode 100644
index 90fecb1..24b5993
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*************** _outSort(StringInfo str, const Sort *nod
*** 794,799 ****
--- 794,800 ----
  	_outPlanInfo(str, (const Plan *) node);
  
  	WRITE_INT_FIELD(numCols);
+ 	WRITE_INT_FIELD(skipCols);
  
  	appendStringInfoString(str, " :sortColIdx");
  	for (i = 0; i < node->numCols; i++)
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
new file mode 100644
index 894a48f..e90a7d4
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*************** _readSort(void)
*** 1967,1972 ****
--- 1967,1973 ----
  	ReadCommonPlan(&local_node->plan);
  
  	READ_INT_FIELD(numCols);
+ 	READ_INT_FIELD(skipCols);
  	READ_ATTRNUMBER_ARRAY(sortColIdx, local_node->numCols);
  	READ_OID_ARRAY(sortOperators, local_node->numCols);
  	READ_OID_ARRAY(collations, local_node->numCols);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
new file mode 100644
index 2a49639..30d7327
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*************** cost_recursive_union(Path *runion, Path 
*** 1429,1434 ****
--- 1429,1441 ----
   *	  Determines and returns the cost of sorting a relation, including
   *	  the cost of reading the input data.
   *
+  * Sort could be either a full sort of the relation or a partial sort when the
+  * data is already presorted by some of the required pathkeys.  In the second
+  * case we estimate the number of groups the source data is divided into by
+  * the presorted pathkeys, and then estimate the cost of sorting each group,
+  * assuming the data is divided into groups uniformly.  Also, if LIMIT is
+  * specified then we have to pull from source and sort only some of the groups.
+  *
   * If the total volume of data to sort is less than sort_mem, we will do
   * an in-memory sort, which requires no I/O and about t*log2(t) tuple
   * comparisons for t tuples.
*************** cost_recursive_union(Path *runion, Path 
*** 1455,1461 ****
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
--- 1462,1469 ----
   * work that has to be done to prepare the inputs to the comparison operators.
   *
   * 'pathkeys' is a list of sort keys
!  * 'input_startup_cost' is the startup cost for reading the input data
!  * 'input_total_cost' is the total cost for reading the input data
   * 'tuples' is the number of tuples in the relation
   * 'width' is the average tuple width in bytes
   * 'comparison_cost' is the extra cost per comparison, if any
*************** cost_recursive_union(Path *runion, Path 
*** 1471,1485 ****
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_cost;
! 	Cost		run_cost = 0;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
--- 1479,1500 ----
   */
  void
  cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples)
  {
! 	Cost		startup_cost = input_startup_cost;
! 	Cost		run_cost = 0,
! 				rest_cost,
! 				group_cost,
! 				input_run_cost = input_total_cost - input_startup_cost;
  	double		input_bytes = relation_byte_size(tuples, width);
  	double		output_bytes;
  	double		output_tuples;
+ 	double		num_groups,
+ 				group_input_bytes,
+ 				group_tuples;
  	long		sort_mem_bytes = sort_mem * 1024L;
  
  	if (!enable_sort)
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1509,1521 ****
  		output_bytes = input_bytes;
  	}
  
! 	if (output_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(input_bytes / BLCKSZ);
! 		double		nruns = input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
--- 1524,1573 ----
  		output_bytes = input_bytes;
  	}
  
! 	/*
! 	 * Estimate the number of groups the dataset is divided into by presorted keys.
! 	 */
! 	if (presorted_keys > 0)
! 	{
! 		List	   *presortedExprs = NIL;
! 		ListCell   *l;
! 		int			i = 0;
! 
! 		/* Extract presorted keys as list of expressions */
! 		foreach(l, pathkeys)
! 		{
! 			PathKey *key = (PathKey *)lfirst(l);
! 			EquivalenceMember *member = (EquivalenceMember *)
! 								lfirst(list_head(key->pk_eclass->ec_members));
! 
! 			presortedExprs = lappend(presortedExprs, member->em_expr);
! 
! 			i++;
! 			if (i >= presorted_keys)
! 				break;
! 		}
! 
! 		/* Estimate number of groups with equal presorted keys */
! 		num_groups = estimate_num_groups(root, presortedExprs, tuples, NULL);
! 	}
! 	else
! 	{
! 		num_groups = 1.0;
! 	}
! 
! 	/*
! 	 * Estimate average cost of sorting of one group where presorted keys are
! 	 * equal.
! 	 */
! 	group_input_bytes = input_bytes / num_groups;
! 	group_tuples = tuples / num_groups;
! 	if (output_bytes > sort_mem_bytes && group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll have to use a disk-based sort of all the tuples
  		 */
! 		double		npages = ceil(group_input_bytes / BLCKSZ);
! 		double		nruns = group_input_bytes / sort_mem_bytes;
  		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
  		double		log_runs;
  		double		npageaccesses;
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1525,1531 ****
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  
  		/* Disk costs */
  
--- 1577,1583 ----
  		 *
  		 * Assume about N log2 N comparisons
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  
  		/* Disk costs */
  
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1536,1545 ****
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		startup_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
--- 1588,1597 ----
  			log_runs = 1.0;
  		npageaccesses = 2.0 * npages * log_runs;
  		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
! 		group_cost += npageaccesses *
  			(seq_page_cost * 0.75 + random_page_cost * 0.25);
  	}
! 	else if (group_tuples > 2 * output_tuples || group_input_bytes > sort_mem_bytes)
  	{
  		/*
  		 * We'll use a bounded heap-sort keeping just K tuples in memory, for
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1547,1560 ****
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		startup_cost += comparison_cost * tuples * LOG2(tuples);
  	}
  
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
--- 1599,1624 ----
  		 * factor is a bit higher than for quicksort.  Tweak it so that the
  		 * cost curve is continuous at the crossover point.
  		 */
! 		group_cost = comparison_cost * group_tuples * LOG2(2.0 * output_tuples);
  	}
  	else
  	{
  		/* We'll use plain quicksort on all the input tuples */
! 		group_cost = comparison_cost * group_tuples * LOG2(group_tuples);
  	}
  
+ 	/* Add per group cost of fetching tuples from input */
+ 	group_cost += input_run_cost / num_groups;
+ 
+ 	/*
+ 	 * We have to sort the first group to start output from the node.  Sorting
+ 	 * the rest of the groups is required to return all the other tuples.
+ 	 */
+ 	startup_cost += group_cost;
+ 	rest_cost = (num_groups * (output_tuples / tuples) - 1.0) * group_cost;
+ 	if (rest_cost > 0.0)
+ 		run_cost += rest_cost;
+ 
  	/*
  	 * Also charge a small amount (arbitrarily set equal to operator cost) per
  	 * extracted tuple.  We don't charge cpu_tuple_cost because a Sort node
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2307,2312 ****
--- 2371,2378 ----
  		cost_sort(&sort_path,
  				  root,
  				  outersortkeys,
+ 				  pathkeys_common(outer_path->pathkeys, outersortkeys),
+ 				  outer_path->startup_cost,
  				  outer_path->total_cost,
  				  outer_path_rows,
  				  outer_path->pathtarget->width,
*************** initial_cost_mergejoin(PlannerInfo *root
*** 2333,2338 ****
--- 2399,2406 ----
  		cost_sort(&sort_path,
  				  root,
  				  innersortkeys,
+ 				  pathkeys_common(inner_path->pathkeys, innersortkeys),
+ 				  inner_path->startup_cost,
  				  inner_path->total_cost,
  				  inner_path_rows,
  				  inner_path->pathtarget->width,
*************** cost_subplan(PlannerInfo *root, SubPlan 
*** 3067,3073 ****
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecMaterializesOutput(nodeTag(plan)))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
--- 3135,3141 ----
  		 * every time.
  		 */
  		if (subplan->parParam == NIL &&
! 			ExecPlanMaterializesOutput(plan))
  			sp_cost.startup += plan->startup_cost;
  		else
  			sp_cost.per_tuple += plan->startup_cost;
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
new file mode 100644
index 4436ac1..d60c421
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
***************
*** 26,31 ****
--- 26,32 ----
  #include "optimizer/paths.h"
  #include "optimizer/tlist.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  
  
  static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
*************** compare_pathkeys(List *keys1, List *keys
*** 309,314 ****
--- 310,341 ----
  }
  
  /*
+  * pathkeys_common
+  *    Returns length of longest common prefix of keys1 and keys2.
+  */
+ int
+ pathkeys_common(List *keys1, List *keys2)
+ {
+ 	int n;
+ 	ListCell   *key1,
+ 			   *key2;
+ 	n = 0;
+ 
+ 	forboth(key1, keys1, key2, keys2)
+ 	{
+ 		PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+ 		PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+ 
+ 		if (pathkey1 != pathkey2)
+ 			return n;
+ 		n++;
+ 	}
+ 
+ 	return n;
+ }
+ 
+ 
+ /*
   * pathkeys_contained_in
   *	  Common special case of compare_pathkeys: we just want to know
   *	  if keys2 are at least as well sorted as keys1.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 368,375 ****
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies the given pathkeys and parameterization.
!  *	  Return NULL if no such path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
--- 395,406 ----
  /*
   * get_cheapest_fractional_path_for_pathkeys
   *	  Find the cheapest path (for retrieving a specified fraction of all
!  *	  the tuples) that satisfies given parameterization and at least partially
!  *	  satisfies the given pathkeys.  Return NULL if no path found.
!  *	  If pathkeys are satisfied partially then we would have to do partial
!  *	  sort in order to satisfy pathkeys completely.  Since partial sort
!  *	  consumes data by presorted groups, we would have to consume more data
!  *	  than in the case of fully presorted path.
   *
   * See compare_fractional_path_costs() for the interpretation of the fraction
   * parameter.
*************** get_cheapest_path_for_pathkeys(List *pat
*** 378,409 ****
   * 'pathkeys' represents a required ordering (in canonical form!)
   * 'required_outer' denotes allowable outer relations for parameterized paths
   * 'fraction' is the fraction of the total tuples expected to be retrieved
   */
  Path *
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction)
  {
  	Path	   *matched_path = NULL;
  	ListCell   *l;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
  
  		/*
! 		 * Since cost comparison is a lot cheaper than pathkey comparison, do
! 		 * that first.  (XXX is that still true?)
  		 */
! 		if (matched_path != NULL &&
! 			compare_fractional_path_costs(matched_path, path, fraction) <= 0)
! 			continue;
  
! 		if (pathkeys_contained_in(pathkeys, path->pathkeys) &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
  			matched_path = path;
  	}
  	return matched_path;
  }
  
--- 409,480 ----
   * 'pathkeys' represents a required ordering (in canonical form!)
   * 'required_outer' denotes allowable outer relations for parameterized paths
   * 'fraction' is the fraction of the total tuples expected to be retrieved
+  * 'num_groups' is an array of estimated group counts, one per pathkeys prefix.
+  *	  It should be estimated using estimate_pathkeys_groups().
   */
  Path *
  get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  double *num_groups)
  {
  	Path	   *matched_path = NULL;
+ 	int			matched_n_common_pathkeys = 0,
+ 				costs_cmp, n_common_pathkeys,
+ 				n_pathkeys = list_length(pathkeys);
  	ListCell   *l;
+ 	double		matched_fraction;
  
  	foreach(l, paths)
  	{
  		Path	   *path = (Path *) lfirst(l);
+ 		double		current_fraction;
+ 
+ 		n_common_pathkeys = pathkeys_common(pathkeys, path->pathkeys);
+ 
+ 		if (n_pathkeys != 0 && n_common_pathkeys == 0)
+ 			continue;
  
  		/*
! 		 * Partial sort consumes data not per tuple but per presorted group.
! 		 * Increase fraction of tuples we have to read from source path by
! 		 * one presorted group.
  		 */
! 		current_fraction = fraction;
! 		if (n_common_pathkeys < n_pathkeys)
! 		{
! 			current_fraction += 1.0 / num_groups[n_common_pathkeys - 1];
! 			current_fraction = Min(current_fraction, 1.0);
! 		}
  
! 		/*
! 		 * Do cost comparison assuming paths could have different number
! 		 * of required pathkeys and therefore different fraction of tuples
! 		 * to fetch.
! 		 */
! 		if (matched_path != NULL)
! 		{
! 			costs_cmp = compare_bifractional_path_costs(matched_path, path,
! 					matched_fraction, current_fraction);
! 		}
! 		else
! 		{
! 			costs_cmp = 1;
! 		}
! 
! 		/*
! 		 * Cheaper path with matching outer becomes a new leader.
! 		 */
! 		if (costs_cmp > 0 &&
  			bms_is_subset(PATH_REQ_OUTER(path), required_outer))
+ 		{
  			matched_path = path;
+ 			matched_n_common_pathkeys = n_common_pathkeys;
+ 			matched_fraction = current_fraction;
+ 		}
  	}
+ 
  	return matched_path;
  }
  
*************** right_merge_direction(PlannerInfo *root,
*** 1448,1456 ****
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Unlike merge pathkeys, this is an all-or-nothing affair: it does us
!  * no good to order by just the first key(s) of the requested ordering.
!  * So the result is always either 0 or list_length(root->query_pathkeys).
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
--- 1519,1526 ----
   *		Count the number of pathkeys that are useful for meeting the
   *		query's requested output ordering.
   *
!  * Returns the number of leading pathkeys that match query_pathkeys.  The
!  * remaining ones can be satisfied by a partial sort.
   */
  static int
  pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
*************** pathkeys_useful_for_ordering(PlannerInfo
*** 1461,1473 ****
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
! 	{
! 		/* It's useful ... or at least the first N keys are */
! 		return list_length(root->query_pathkeys);
! 	}
! 
! 	return 0;					/* path ordering not useful */
  }
  
  /*
--- 1531,1542 ----
  	if (pathkeys == NIL)
  		return 0;				/* unordered path */
  
! 	/*
! 	 * Return the number of path keys in common, or 0 if there are none. Any
! 	 * first common pathkeys could be useful for ordering because we can use
! 	 * partial sort.
! 	 */
! 	return pathkeys_common(root->query_pathkeys, pathkeys);
  }
  
  /*
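The prefix matching and fraction adjustment in the pathkeys.c changes above can be sketched in isolation. This is only a standalone illustration under stated assumptions, not the patch's code: plain int arrays stand in for canonical pathkey lists, and common_prefix_len() and partial_sort_fraction() are hypothetical names introduced here.

```c
#include <assert.h>

/*
 * Length of the longest common prefix of two key arrays.  The int arrays are
 * a stand-in for pathkey lists (an assumption of this sketch); equality of
 * ints plays the role of pointer equality of canonical PathKeys.
 */
int
common_prefix_len(const int *keys1, int n1, const int *keys2, int n2)
{
	int			n = 0;

	assert(n1 >= 0 && n2 >= 0);
	while (n < n1 && n < n2 && keys1[n] == keys2[n])
		n++;
	return n;
}

/*
 * Fraction of the input a partial sort has to read.  A fully presorted input
 * reads exactly 'fraction'; a partially presorted one reads one extra
 * presorted group, capped at the whole input.  num_groups[i] is assumed to be
 * the group count for the first i + 1 keys.
 */
double
partial_sort_fraction(double fraction, int n_common, int n_required,
					  const double *num_groups)
{
	if (n_common >= n_required)
		return fraction;		/* fully presorted, nothing to add */

	/* callers are expected to skip inputs with no common prefix at all */
	assert(n_common >= 1);

	fraction += 1.0 / num_groups[n_common - 1];
	return (fraction < 1.0) ? fraction : 1.0;
}
```

With 4 groups on the first key and a requested fraction of 0.25, a path presorted on only the first of two keys would be charged for half of its input under this sketch.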
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 47158f6..4421615
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*************** static MergeJoin *make_mergejoin(List *t
*** 226,232 ****
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
--- 226,232 ----
  			   bool *mergenullsfirst,
  			   Plan *lefttree, Plan *righttree,
  			   JoinType jointype);
! static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst);
  static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
*************** static Plan *prepare_sort_from_pathkeys(
*** 241,250 ****
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
--- 241,252 ----
  static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
  					   TargetEntry *tle,
  					   Relids relids);
! static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
! 						 int skipCols);
  static Sort *make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols);
  static Material *make_material(Plan *lefttree);
  static WindowAgg *make_windowagg(List *tlist, Index winref,
  			   int partNumCols, AttrNumber *partColIdx, Oid *partOperators,
*************** create_merge_append_plan(PlannerInfo *ro
*** 1062,1067 ****
--- 1064,1070 ----
  		Oid		   *sortOperators;
  		Oid		   *collations;
  		bool	   *nullsFirst;
+ 		int			n_common_pathkeys;
  
  		/* Build the child plan */
  		/* Must insist that all children return the same tlist */
*************** create_merge_append_plan(PlannerInfo *ro
*** 1096,1104 ****
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
--- 1099,1109 ----
  					  numsortkeys * sizeof(bool)) == 0);
  
  		/* Now, insert a Sort node if subplan isn't sufficiently ordered */
! 		n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
! 		if (n_common_pathkeys < list_length(pathkeys))
  		{
  			Sort	   *sort = make_sort(subplan, numsortkeys,
+ 										 n_common_pathkeys,
  										 sortColIdx, sortOperators,
  										 collations, nullsFirst);
  
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1504,1509 ****
--- 1509,1515 ----
  {
  	Sort	   *plan;
  	Plan	   *subplan;
+ 	int			n_common_pathkeys;
  
  	/*
  	 * We don't want any excess columns in the sorted tuples, so request a
*************** create_sort_plan(PlannerInfo *root, Sort
*** 1513,1519 ****
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
--- 1519,1529 ----
  	subplan = create_plan_recurse(root, best_path->subpath,
  								  flags | CP_SMALL_TLIST);
  
! 	n_common_pathkeys = pathkeys_common(best_path->path.pathkeys,
! 										best_path->subpath->pathkeys);
! 
! 	plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
! 								   n_common_pathkeys);
  
  	copy_generic_path_info(&plan->plan, (Path *) best_path);
  
*************** create_groupingsets_plan(PlannerInfo *ro
*** 1759,1765 ****
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
--- 1769,1776 ----
  			sort_plan = (Plan *)
  				make_sort_from_groupcols(groupClause,
  										 new_grpColIdx,
! 										 subplan,
! 										 0);
  
  			agg_plan = (Plan *) make_agg(NIL,
  										 NIL,
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3586,3593 ****
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(outer_plan,
! 												   best_path->outersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
--- 3597,3610 ----
  	 */
  	if (best_path->outersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->outersortkeys,
! 									best_path->jpath.outerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(outer_plan, best_path->outersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		outer_plan = (Plan *) sort;
*************** create_mergejoin_plan(PlannerInfo *root,
*** 3598,3605 ****
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort = make_sort_from_pathkeys(inner_plan,
! 												   best_path->innersortkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
--- 3615,3628 ----
  
  	if (best_path->innersortkeys)
  	{
! 		Sort	   *sort;
! 		int			n_common_pathkeys;
! 
! 		n_common_pathkeys = pathkeys_common(best_path->innersortkeys,
! 									best_path->jpath.innerjoinpath->pathkeys);
! 
! 		sort = make_sort_from_pathkeys(inner_plan, best_path->innersortkeys,
! 									   n_common_pathkeys);
  
  		label_sort_with_costsize(root, sort, -1.0);
  		inner_plan = (Plan *) sort;
*************** label_sort_with_costsize(PlannerInfo *ro
*** 4617,4623 ****
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
--- 4640,4647 ----
  	Plan	   *lefttree = plan->plan.lefttree;
  	Path		sort_path;		/* dummy for result of cost_sort */
  
! 	cost_sort(&sort_path, root, NIL, 0,
! 			  lefttree->startup_cost,
  			  lefttree->total_cost,
  			  lefttree->plan_rows,
  			  lefttree->plan_width,
*************** make_mergejoin(List *tlist,
*** 5139,5145 ****
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
--- 5163,5169 ----
   * nullsFirst arrays already.
   */
  static Sort *
! make_sort(Plan *lefttree, int numCols, int skipCols,
  		  AttrNumber *sortColIdx, Oid *sortOperators,
  		  Oid *collations, bool *nullsFirst)
  {
*************** make_sort(Plan *lefttree, int numCols,
*** 5151,5156 ****
--- 5175,5181 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
  	node->numCols = numCols;
+ 	node->skipCols = skipCols;
  	node->sortColIdx = sortColIdx;
  	node->sortOperators = sortOperators;
  	node->collations = collations;
*************** find_ec_member_for_tle(EquivalenceClass 
*** 5477,5483 ****
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
--- 5502,5508 ----
   *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
   */
  static Sort *
! make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, int skipCols)
  {
  	int			numsortkeys;
  	AttrNumber *sortColIdx;
*************** make_sort_from_pathkeys(Plan *lefttree, 
*** 5497,5503 ****
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5522,5528 ----
  										  &nullsFirst);
  
  	/* Now build the Sort node */
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5540,5546 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5565,5571 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, 0,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
*************** make_sort_from_sortclauses(List *sortcls
*** 5561,5567 ****
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
--- 5586,5593 ----
  static Sort *
  make_sort_from_groupcols(List *groupcls,
  						 AttrNumber *grpColIdx,
! 						 Plan *lefttree,
! 						 int skipCols)
  {
  	List	   *sub_tlist = lefttree->targetlist;
  	ListCell   *l;
*************** make_sort_from_groupcols(List *groupcls,
*** 5594,5600 ****
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
--- 5620,5626 ----
  		numsortkeys++;
  	}
  
! 	return make_sort(lefttree, numsortkeys, skipCols,
  					 sortColIdx, sortOperators,
  					 collations, nullsFirst);
  }
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
new file mode 100644
index 805aae7..094d28f
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "parser/parse_clause.h"
  #include "rewrite/rewriteManip.h"
  #include "utils/lsyscache.h"
+ #include "utils/selfuncs.h"
  #include "utils/syscache.h"
  
  
*************** build_minmax_path(PlannerInfo *root, Min
*** 344,349 ****
--- 345,351 ----
  	Path	   *sorted_path;
  	Cost		path_cost;
  	double		path_fraction;
+ 	double	   *psort_num_groups;
  
  	/*
  	 * We are going to construct what is effectively a sub-SELECT query, so
*************** build_minmax_path(PlannerInfo *root, Min
*** 454,464 ****
  	else
  		path_fraction = 1.0;
  
  	sorted_path =
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction);
  	if (!sorted_path)
  		return false;
  
--- 456,470 ----
  	else
  		path_fraction = 1.0;
  
+ 	psort_num_groups = estimate_pathkeys_groups(subroot->query_pathkeys,
+ 												subroot,
+ 												final_rel->rows);
  	sorted_path =
  		get_cheapest_fractional_path_for_pathkeys(final_rel->pathlist,
  												  subroot->query_pathkeys,
  												  NULL,
! 												  path_fraction,
! 												  psort_num_groups);
  	if (!sorted_path)
  		return false;
  
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index 174210b..2a4398c
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** create_grouping_paths(PlannerInfo *root,
*** 3638,3651 ****
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			bool		is_sorted;
  
! 			is_sorted = pathkeys_contained_in(root->group_pathkeys,
! 											  path->pathkeys);
! 			if (path == cheapest_path || is_sorted)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (!is_sorted)
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
--- 3638,3651 ----
  		foreach(lc, input_rel->pathlist)
  		{
  			Path	   *path = (Path *) lfirst(lc);
! 			int			n_common_pathkeys;
  
! 			n_common_pathkeys = pathkeys_common(root->group_pathkeys,
! 												path->pathkeys);
! 			if (path == cheapest_path || n_common_pathkeys > 0)
  			{
  				/* Sort the cheapest-total path if it isn't already sorted */
! 				if (n_common_pathkeys < list_length(root->group_pathkeys))
  					path = (Path *) create_sort_path(root,
  													 grouped_rel,
  													 path,
*************** create_ordered_paths(PlannerInfo *root,
*** 4301,4313 ****
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		bool		is_sorted;
  
! 		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
! 										  path->pathkeys);
! 		if (path == cheapest_input_path || is_sorted)
  		{
! 			if (!is_sorted)
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
--- 4301,4313 ----
  	foreach(lc, input_rel->pathlist)
  	{
  		Path	   *path = (Path *) lfirst(lc);
! 		int			n_common_pathkeys;
  
! 		n_common_pathkeys = pathkeys_common(root->sort_pathkeys,
! 											path->pathkeys);
! 		if (path == cheapest_input_path || n_common_pathkeys > 0)
  		{
! 			if (n_common_pathkeys < list_length(root->sort_pathkeys))
  			{
  				/* An explicit sort here can take advantage of LIMIT */
  				path = (Path *) create_sort_path(root,
*************** plan_cluster_use_sort(Oid tableOid, Oid 
*** 5281,5288 ****
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL,
! 			  seqScanPath->total_cost, rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
--- 5281,5289 ----
  
  	/* Estimate the cost of seq scan + sort */
  	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
! 	cost_sort(&seqScanAndSortPath, root, NIL, 0,
! 			  seqScanPath->startup_cost, seqScanPath->total_cost,
! 			  rel->tuples, rel->reltarget->width,
  			  comparisonCost, maintenance_work_mem, -1.0);
  
  	/* Estimate the cost of index scan */
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
new file mode 100644
index 6edefb1..cc3788f
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** build_subplan(PlannerInfo *root, Plan *p
*** 838,844 ****
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecMaterializesOutput(nodeTag(plan)))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
--- 838,844 ----
  		 * unnecessarily, so we don't.
  		 */
  		else if (splan->parParam == NIL && enable_material &&
! 				 !ExecPlanMaterializesOutput(plan))
  			plan = materialize_finished_plan(plan);
  
  		result = (Node *) splan;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
new file mode 100644
index b714783..4b36822
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
*************** choose_hashed_setop(PlannerInfo *root, L
*** 964,970 ****
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
--- 964,971 ----
  	sorted_p.startup_cost = input_path->startup_cost;
  	sorted_p.total_cost = input_path->total_cost;
  	/* XXX cost_sort doesn't actually look at pathkeys, so just pass NIL */
! 	cost_sort(&sorted_p, root, NIL, 0, 
! 			  sorted_p.startup_cost, sorted_p.total_cost,
  			  input_path->rows, input_path->pathtarget->width,
  			  0.0, work_mem, -1.0);
  	cost_group(&sorted_p, root, numGroupCols, dNumGroups,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
new file mode 100644
index abb7507..52fbecc
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** compare_path_costs(Path *path1, Path *pa
*** 95,101 ****
  }
  
  /*
!  * compare_path_fractional_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
--- 95,101 ----
  }
  
  /*
!  * compare_fractional_path_costs
   *	  Return -1, 0, or +1 according as path1 is cheaper, the same cost,
   *	  or more expensive than path2 for fetching the specified fraction
   *	  of the total tuples.
*************** compare_fractional_path_costs(Path *path
*** 124,129 ****
--- 124,170 ----
  }
  
  /*
+  * compare_bifractional_path_costs
+  *	  Return -1, 0, or +1 according as fetching fraction1 of path1's tuples is
+  *	  cheaper, the same cost, or more expensive than fetching fraction2 of
+  *	  path2's tuples.
+  *
+  * fraction1 and fraction2 are fractions of total tuples between 0 and 1.
+  * If a fraction is <= 0 or >= 1, we interpret it as 1; if both are 1, we
+  * select the path with the cheaper total_cost.
+  */
+ 
+ /*
+  * Compare the costs of two paths, assuming a different fraction of tuples is
+  * returned from each path.
+  */
+ int
+ compare_bifractional_path_costs(Path *path1, Path *path2,
+ 								double fraction1, double fraction2)
+ {
+ 	Cost		cost1,
+ 				cost2;
+ 
+ 	if (fraction1 <= 0.0 || fraction1 >= 1.0)
+ 		fraction1 = 1.0;
+ 	if (fraction2 <= 0.0 || fraction2 >= 1.0)
+ 		fraction2 = 1.0;
+ 
+ 	if (fraction1 == 1.0 && fraction2 == 1.0)
+ 		return compare_path_costs(path1, path2, TOTAL_COST);
+ 
+ 	cost1 = path1->startup_cost +
+ 		fraction1 * (path1->total_cost - path1->startup_cost);
+ 	cost2 = path2->startup_cost +
+ 		fraction2 * (path2->total_cost - path2->startup_cost);
+ 	if (cost1 < cost2)
+ 		return -1;
+ 	if (cost1 > cost2)
+ 		return +1;
+ 	return 0;
+ }
+ 
+ /*
   * compare_path_costs_fuzzily
   *	  Compare the costs of two paths to see if either can be said to
   *	  dominate the other.
*************** create_merge_append_path(PlannerInfo *ro
*** 1296,1307 ****
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
--- 1337,1349 ----
  	foreach(l, subpaths)
  	{
  		Path	   *subpath = (Path *) lfirst(l);
+ 		int			n_common_pathkeys = pathkeys_common(pathkeys, subpath->pathkeys);
  
  		pathnode->path.rows += subpath->rows;
  		pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
  			subpath->parallel_safe;
  
! 		if (n_common_pathkeys == list_length(pathkeys))
  		{
  			/* Subpath is adequately ordered, we won't need to sort it */
  			input_startup_cost += subpath->startup_cost;
*************** create_merge_append_path(PlannerInfo *ro
*** 1315,1320 ****
--- 1357,1364 ----
  			cost_sort(&sort_path,
  					  root,
  					  pathkeys,
+ 					  n_common_pathkeys,
+ 					  subpath->startup_cost,
  					  subpath->total_cost,
  					  subpath->parent->tuples,
  					  subpath->pathtarget->width,
*************** create_unique_path(PlannerInfo *root, Re
*** 1551,1557 ****
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
--- 1595,1602 ----
  		/*
  		 * Estimate cost for sort+unique implementation
  		 */
! 		cost_sort(&sort_path, root, NIL, 0,
! 				  subpath->startup_cost,
  				  subpath->total_cost,
  				  rel->rows,
  				  subpath->pathtarget->width,
*************** create_sort_path(PlannerInfo *root,
*** 2337,2342 ****
--- 2382,2392 ----
  				 double limit_tuples)
  {
  	SortPath   *pathnode = makeNode(SortPath);
+ 	int			n_common_pathkeys;
+ 
+ 	n_common_pathkeys = pathkeys_common(subpath->pathkeys, pathkeys);
+ 
+ 	Assert(n_common_pathkeys < list_length(pathkeys));
  
  	pathnode->path.pathtype = T_Sort;
  	pathnode->path.parent = rel;
*************** create_sort_path(PlannerInfo *root,
*** 2349,2358 ****
  		subpath->parallel_safe;
  	pathnode->path.parallel_workers = subpath->parallel_workers;
  	pathnode->path.pathkeys = pathkeys;
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root, pathkeys,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
--- 2399,2411 ----
  		subpath->parallel_safe;
  	pathnode->path.parallel_workers = subpath->parallel_workers;
  	pathnode->path.pathkeys = pathkeys;
+ 	pathnode->skipCols = n_common_pathkeys;
  
  	pathnode->subpath = subpath;
  
! 	cost_sort(&pathnode->path, root,
! 			  pathkeys, n_common_pathkeys,
! 			  subpath->startup_cost,
  			  subpath->total_cost,
  			  subpath->rows,
  			  subpath->pathtarget->width,
*************** create_groupingsets_path(PlannerInfo *ro
*** 2624,2630 ****
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
--- 2677,2684 ----
  				break;
  
  			/* Account for cost of sort, but don't charge input cost again */
! 			cost_sort(&sort_path, root, NIL, 0,
! 					  0.0,
  					  0.0,
  					  subpath->rows,
  					  subpath->pathtarget->width,
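The cost interpolation used by compare_bifractional_path_costs() in the pathnode.c changes above can be tried outside the planner. The following is a minimal sketch under assumptions: SketchPath, fractional_cost() and compare_bifractional() are hypothetical stand-ins, and the tie-breaking that compare_path_costs() performs when both fractions are 1 is deliberately left out.

```c
/* Hypothetical stand-in for the two cost fields of a planner Path. */
typedef struct SketchPath
{
	double		startup_cost;
	double		total_cost;
} SketchPath;

/*
 * Cost of fetching 'fraction' of a path's tuples: the startup cost plus a
 * linear share of the run cost.  Fractions outside (0, 1) mean "everything".
 */
double
fractional_cost(const SketchPath *path, double fraction)
{
	if (fraction <= 0.0 || fraction >= 1.0)
		fraction = 1.0;
	return path->startup_cost +
		fraction * (path->total_cost - path->startup_cost);
}

/*
 * Return -1, 0, or +1 according as fetching fraction1 of path1 is cheaper,
 * the same cost, or more expensive than fetching fraction2 of path2.
 */
int
compare_bifractional(const SketchPath *path1, const SketchPath *path2,
					 double fraction1, double fraction2)
{
	double		cost1 = fractional_cost(path1, fraction1);
	double		cost2 = fractional_cost(path2, fraction2);

	if (cost1 < cost2)
		return -1;
	if (cost1 > cost2)
		return +1;
	return 0;
}
```

The point of taking two fractions is that a partially presorted path must read one presorted group more than a fully presorted one, so the two candidates are no longer compared at the same fraction.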
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
new file mode 100644
index fe44d56..8d1717c
*** a/src/backend/utils/adt/orderedsetaggs.c
--- b/src/backend/utils/adt/orderedsetaggs.c
*************** ordered_set_startup(FunctionCallInfo fci
*** 276,282 ****
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
--- 276,282 ----
  												   qstate->sortOperators,
  												   qstate->sortCollations,
  												   qstate->sortNullsFirsts,
! 												   work_mem, false, false);
  	else
  		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
  													qstate->sortOperator,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index 56943f2..213d045
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** estimate_num_groups(PlannerInfo *root, L
*** 3506,3511 ****
--- 3506,3547 ----
  }
  
  /*
+  * estimate_pathkeys_groups	- Estimate the number of groups into which the
+  * 							  dataset is divided by pathkeys.
+  *
+  * Returns an array of group counts: element i (zero-based) is the number of
+  * groups into which the first i + 1 pathkeys divide the dataset.  This is a
+  * convenience wrapper over estimate_num_groups().
+  */
+ double *
+ estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root, double tuples)
+ {
+ 	ListCell   *l;
+ 	List	   *groupExprs = NIL;
+ 	double	   *result;
+ 	int			i;
+ 
+ 	/*
+ 	 * Get number of groups for each prefix of pathkeys.
+ 	 */
+ 	i = 0;
+ 	result = (double *) palloc(sizeof(double) * list_length(pathkeys));
+ 	foreach(l, pathkeys)
+ 	{
+ 		PathKey *key = (PathKey *)lfirst(l);
+ 		EquivalenceMember *member = (EquivalenceMember *)
+ 							linitial(key->pk_eclass->ec_members);
+ 
+ 		groupExprs = lappend(groupExprs, member->em_expr);
+ 
+ 		result[i] = estimate_num_groups(root, groupExprs, tuples, NULL);
+ 		i++;
+ 	}
+ 
+ 	return result;
+ }
+ 
+ /*
   * Estimate hash bucketsize fraction (ie, number of entries in a bucket
   * divided by total tuples in relation) if the specified expression is used
   * as a hash key.
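The shape of the array returned by estimate_pathkeys_groups() in the selfuncs.c changes above (one group estimate per pathkeys prefix) can be illustrated without the planner. This sketch does NOT call estimate_num_groups(). As a loudly labeled assumption it multiplies per-column distinct counts as if the columns were independent and clamps the product to the row count. prefix_group_estimates() is a hypothetical name.

```c
#include <stdlib.h>

/*
 * Return a malloc'd array in which element i is the estimated number of
 * groups formed by the first i + 1 sort keys.
 *
 * Assumption of this sketch: the estimate is the product of per-key distinct
 * counts (independent columns), clamped to the total number of tuples.  The
 * real estimator is more careful than this.
 *
 * Returns NULL if nkeys <= 0 or allocation fails; the caller frees the result.
 */
double *
prefix_group_estimates(const double *ndistinct, int nkeys, double tuples)
{
	double	   *result;
	double		groups = 1.0;
	int			i;

	if (nkeys <= 0)
		return NULL;

	result = malloc(sizeof(double) * nkeys);
	if (result == NULL)
		return NULL;

	for (i = 0; i < nkeys; i++)
	{
		groups *= ndistinct[i];
		if (groups > tuples)
			groups = tuples;	/* cannot have more groups than rows */
		result[i] = groups;
	}
	return result;
}
```

An array in this shape is what get_cheapest_fractional_path_for_pathkeys() indexes with n_common_pathkeys - 1 to find how many presorted groups a partially sorted path produces.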
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index d600670..1de488e
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
*************** struct Tuplesortstate
*** 262,267 ****
--- 262,272 ----
  	int64		allowedMem;		/* total memory allowed, in bytes */
  	int			maxTapes;		/* number of tapes (Knuth's T) */
  	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+ 	TupSortStatus maxStatus;	/* maximum status reached between sort groups */
+ 	int64		maxMem;			/* maximum amount of memory used between
+ 								   sort groups */
+ 	bool		maxMemOnDisk;	/* is maxMem value for on-disk memory */
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;	/* memory context holding most sort data */
  	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
  	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
*************** static void readtup_datum(Tuplesortstate
*** 612,617 ****
--- 617,625 ----
  			  int tapenum, unsigned int len);
  static void movetup_datum(void *dest, void *src, unsigned int len);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ static void tuplesort_free(Tuplesortstate *state, bool delete);
+ static void tuplesort_updatemax(Tuplesortstate *state);
+ 
  
  /*
   * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
*************** static Tuplesortstate *
*** 646,664 ****
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Create a working memory context for this sort operation. All data
! 	 * needed by the sort will live inside this context.
  	 */
! 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
--- 654,683 ----
  tuplesort_begin_common(int workMem, bool randomAccess)
  {
  	Tuplesortstate *state;
+ 	MemoryContext maincontext;
  	MemoryContext sortcontext;
  	MemoryContext tuplecontext;
  	MemoryContext oldcontext;
  
  	/*
! 	 * Memory context surviving tuplesort_reset.  This memory context holds
! 	 * data which is useful to keep while sorting multiple similar batches.
  	 */
! 	maincontext = AllocSetContextCreate(CurrentMemoryContext,
  										"TupleSort main",
  										ALLOCSET_DEFAULT_SIZES);
  
  	/*
+ 	 * Create a working memory context for one sort operation.  The content of
+ 	 * this context is deleted by tuplesort_reset.
+ 	 */
+ 	sortcontext = AllocSetContextCreate(maincontext,
+ 										"TupleSort sort",
+ 										ALLOCSET_DEFAULT_MINSIZE,
+ 										ALLOCSET_DEFAULT_INITSIZE,
+ 										ALLOCSET_DEFAULT_MAXSIZE);
+ 
+ 	/*
  	 * Caller tuple (e.g. IndexTuple) memory context.
  	 *
  	 * A dedicated child context used exclusively for caller passed tuples
*************** tuplesort_begin_common(int workMem, bool
*** 675,681 ****
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(sortcontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
--- 694,700 ----
  	 * Make the Tuplesortstate within the per-sort context.  This way, we
  	 * don't need a separate pfree() operation for it at shutdown.
  	 */
! 	oldcontext = MemoryContextSwitchTo(maincontext);
  
  	state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
  
*************** tuplesort_begin_common(int workMem, bool
*** 693,698 ****
--- 712,718 ----
  	state->availMem = state->allowedMem;
  	state->sortcontext = sortcontext;
  	state->tuplecontext = tuplecontext;
+ 	state->maincontext = maincontext;
  	state->tapeset = NULL;
  
  	state->memtupcount = 0;
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 733,745 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  	AssertArg(nkeys > 0);
  
--- 753,766 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev)
  {
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  	AssertArg(nkeys > 0);
  
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 782,788 ****
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0);
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
--- 803,809 ----
  		sortKey->ssup_nulls_first = nullsFirstFlags[i];
  		sortKey->ssup_attno = attNums[i];
  		/* Convey if abbreviation optimization is applicable in principle */
! 		sortKey->abbreviate = (i == 0) && !skipAbbrev;
  
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
*************** tuplesort_begin_cluster(TupleDesc tupDes
*** 813,819 ****
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 834,840 ----
  
  	Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_btree(Relation hea
*** 905,911 ****
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 926,932 ----
  	MemoryContext oldcontext;
  	int			i;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_index_hash(Relation heap
*** 979,985 ****
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1000,1006 ----
  	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
  	MemoryContext oldcontext;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_begin_datum(Oid datumType, Oid
*** 1017,1023 ****
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
--- 1038,1044 ----
  	int16		typlen;
  	bool		typbyval;
  
! 	oldcontext = MemoryContextSwitchTo(state->maincontext);
  
  #ifdef TRACE_SORT
  	if (trace_sort)
*************** tuplesort_set_bound(Tuplesortstate *stat
*** 1129,1144 ****
  }
  
  /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
   *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
   */
! void
! tuplesort_end(Tuplesortstate *state)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
--- 1150,1161 ----
  }
  
  /*
!  * tuplesort_free
   *
!  *	Internal routine for freeing resources of tuplesort.
   */
! static void
! tuplesort_free(Tuplesortstate *state, bool delete)
  {
  	/* context swap probably not needed, but let's be safe */
  	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
*************** tuplesort_end(Tuplesortstate *state)
*** 1197,1203 ****
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	MemoryContextDelete(state->sortcontext);
  }
  
  /*
--- 1214,1307 ----
  	 * Free the per-sort memory context, thereby releasing all working memory,
  	 * including the Tuplesortstate struct itself.
  	 */
! 	if (delete)
! 	{
! 		MemoryContextDelete(state->maincontext);
! 	}
! 	else
! 	{
! 		MemoryContextResetOnly(state->sortcontext);
! 		MemoryContextResetOnly(state->tuplecontext);
! 	}
! }
! 
! /*
!  * tuplesort_end
!  *
!  *	Release resources and clean up.
!  *
!  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
!  * pointing to garbage.  Be careful not to attempt to use or free such
!  * pointers afterwards!
!  */
! void
! tuplesort_end(Tuplesortstate *state)
! {
! 	tuplesort_free(state, true);
! }
! 
! /*
!  * tuplesort_updatemax 
!  *
!  *	Update maximum resource usage statistics.
!  */
! static void
! tuplesort_updatemax(Tuplesortstate *state)
! {
! 	int64	memUsed;
! 	bool	memUsedOnDisk;
! 
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
! 		memUsedOnDisk = true;
! 		memUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
! 	}
! 	else
! 	{
! 		memUsedOnDisk = false;
! 		memUsed = state->allowedMem - state->availMem;
! 	}
! 
! 	state->maxStatus = Max(state->maxStatus, state->status);
! 	if (memUsed > state->maxMem)
! 	{
! 		state->maxMem = memUsed;
! 		state->maxMemOnDisk = memUsedOnDisk;
! 	}
! }
! 
! /*
!  * tuplesort_reset
!  *
!  *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
!  *	meta-information in.  After tuplesort_reset, tuplesort is ready to start
!  *	a new sort.  This avoids recreating the tuplesort (and saves resources)
!  *	when sorting multiple small batches.
!  */
! void
! tuplesort_reset(Tuplesortstate *state)
! {
! 	tuplesort_updatemax(state);
! 	tuplesort_free(state, false);
! 	state->status = TSS_INITIAL;
! 	state->memtupcount = 0;
! 	state->boundUsed = false;
! 	state->tapeset = NULL;
! 	state->currentRun = 0;
! 	state->result_tape = -1;
! 	state->bounded = false;
! 	state->batchUsed = false;
! 	state->availMem = state->allowedMem;
! 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
  }
  
  /*
*************** tuplesort_get_stats(Tuplesortstate *stat
*** 3574,3600 ****
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	/*
! 	 * Note: it might seem we should provide both memory and disk usage for a
! 	 * disk-based sort.  However, the current code doesn't track memory space
! 	 * accurately once we have begun to return tuples to the caller (since we
! 	 * don't account for pfree's the caller is expected to do), so we cannot
! 	 * rely on availMem in a disk sort.  This does not seem worth the overhead
! 	 * to fix.  Is it worth creating an API for the memory context code to
! 	 * tell us how much is actually used in sortcontext?
! 	 */
! 	if (state->tapeset)
! 	{
  		*spaceType = "Disk";
- 		*spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
- 	}
  	else
- 	{
  		*spaceType = "Memory";
! 		*spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
! 	}
  
! 	switch (state->status)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
--- 3678,3692 ----
  					const char **spaceType,
  					long *spaceUsed)
  {
! 	tuplesort_updatemax(state);
! 
! 	if (state->maxMemOnDisk)
  		*spaceType = "Disk";
  	else
  		*spaceType = "Memory";
! 	*spaceUsed = (state->maxMem + 1023) / 1024;
  
! 	switch (state->maxStatus)
  	{
  		case TSS_SORTEDINMEM:
  			if (state->boundUsed)
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
new file mode 100644
index 39521ed..f172330
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
*************** extern void ExecRestrPos(PlanState *node
*** 107,112 ****
--- 107,113 ----
  extern bool ExecSupportsMarkRestore(struct Path *pathnode);
  extern bool ExecSupportsBackwardScan(Plan *node);
  extern bool ExecMaterializesOutput(NodeTag plantype);
+ extern bool ExecPlanMaterializesOutput(Plan *node);
  
  /*
   * prototypes from functions in execCurrent.c
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
new file mode 100644
index e28477d..0bd2add
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
*************** typedef struct MaterialState
*** 1773,1778 ****
--- 1773,1792 ----
  	Tuplestorestate *tuplestorestate;
  } MaterialState;
  
+ 
+ /* ----------------
+  *	 When sorting by multiple keys, the input dataset may already be
+  *	 presorted by some prefix of these keys.  We call them "skip keys".
+  *	 SkipKeyData represents information about one such key.
+  * ----------------
+  */
+ typedef struct SkipKeyData
+ {
+ 	FmgrInfo				flinfo;	/* comparison function info */
+ 	FunctionCallInfoData	fcinfo;	/* comparison function call info */
+ 	OffsetNumber			attno;	/* attribute number in tuple */
+ } SkipKeyData;
+ 
  /* ----------------
   *	 SortState information
   * ----------------
*************** typedef struct SortState
*** 1784,1792 ****
--- 1798,1811 ----
  	bool		bounded;		/* is the result set bounded? */
  	int64		bound;			/* if bounded, how many tuples are needed */
  	bool		sort_Done;		/* sort completed yet? */
+ 	bool		finished;		/* fetching tuples from outer node
+ 								   is finished ? */
  	bool		bounded_Done;	/* value of bounded we did the sort with */
  	int64		bound_Done;		/* value of bound we did the sort with */
  	void	   *tuplesortstate; /* private state of tuplesort.c */
+ 	SkipKeyData *skipKeys;		/* keys the dataset is presorted by */
+ 	long		groupsCount;	/* number of groups with equal skip keys */
+ 	HeapTuple	prev;			/* previous tuple from outer node */
  } SortState;
  
  /* ---------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
new file mode 100644
index e2fbc7d..ffab23f
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*************** typedef struct Sort
*** 673,678 ****
--- 673,679 ----
  {
  	Plan		plan;
  	int			numCols;		/* number of sort-key columns */
+ 	int			skipCols;		/* number of presorted leading sort-key columns */
  	AttrNumber *sortColIdx;		/* their indexes in the target list */
  	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
  	Oid		   *collations;		/* OIDs of collations */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
new file mode 100644
index 2709cc7..76b4365
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
*************** typedef struct SortPath
*** 1307,1312 ****
--- 1307,1313 ----
  {
  	Path		path;
  	Path	   *subpath;		/* path representing input source */
+ 	int			skipCols;		/* number of presorted leading sort-key columns */
  } SortPath;
  
  /*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
new file mode 100644
index 2a4df2f..516357f
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_ctescan(Path *path, Pla
*** 95,102 ****
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, Cost input_cost, double tuples, int width,
! 		  Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
--- 95,103 ----
  			 RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_recursive_union(Path *runion, Path *nrterm, Path *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
! 		  List *pathkeys, int presorted_keys,
! 		  Cost input_startup_cost, Cost input_total_cost,
! 		  double tuples, int width, Cost comparison_cost, int sort_mem,
  		  double limit_tuples);
  extern void cost_merge_append(Path *path, PlannerInfo *root,
  				  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
new file mode 100644
index 71d9154..1b19dd0
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
*************** extern int compare_path_costs(Path *path
*** 24,29 ****
--- 24,31 ----
  				   CostSelector criterion);
  extern int compare_fractional_path_costs(Path *path1, Path *path2,
  							  double fraction);
+ extern int compare_bifractional_path_costs(Path *path1, Path *path2,
+ 							  double fraction1, double fraction2);
  extern void set_cheapest(RelOptInfo *parent_rel);
  extern void add_path(RelOptInfo *parent_rel, Path *new_path);
  extern bool add_path_precheck(RelOptInfo *parent_rel,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
new file mode 100644
index 44abe83..2695080
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** typedef enum
*** 175,187 ****
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--- 175,189 ----
  
  extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
  extern bool pathkeys_contained_in(List *keys1, List *keys2);
+ extern int pathkeys_common(List *keys1, List *keys2);
  extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
  							   Relids required_outer,
  							   CostSelector cost_criterion);
  extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
  										  List *pathkeys,
  										  Relids required_outer,
! 										  double fraction,
! 										  double *num_groups);
  extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
  					 ScanDirection scandir);
  extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
new file mode 100644
index 8e0d317..06c0d7d
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
*************** extern void mergejoinscansel(PlannerInfo
*** 230,235 ****
--- 230,238 ----
  extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
  					double input_rows, List **pgset);
  
+ extern double *estimate_pathkeys_groups(List *pathkeys, PlannerInfo *root,
+ 										double tuples);
+ 
  extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
  						 double nbuckets);
  
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
new file mode 100644
index 5cecd6d..6476504
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_h
*** 62,68 ****
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
--- 62,69 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, Oid *sortCollations,
  					 bool *nullsFirstFlags,
! 					 int workMem, bool randomAccess,
! 					 bool skipAbbrev);
  extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
  						Relation indexRel,
  						int workMem, bool randomAccess);
*************** extern bool tuplesort_skiptuples(Tupleso
*** 106,111 ****
--- 107,114 ----
  
  extern void tuplesort_end(Tuplesortstate *state);
  
+ extern void tuplesort_reset(Tuplesortstate *state);
+ 
  extern void tuplesort_get_stats(Tuplesortstate *state,
  					const char **sortMethod,
  					const char **spaceType,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
new file mode 100644
index 45208a6..85cb8f3
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t
*** 995,1009 ****
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                       QUERY PLAN                       
! -------------------------------------------------------
!  HashAggregate
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Merge Join
!          Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!          ->  Index Scan using t1_pkey on t1
!          ->  Index Scan using t2_pkey on t2
! (6 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
--- 995,1012 ----
  explain (costs off) select t1.*,t2.x,t2.z
  from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
  group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
!                          QUERY PLAN                          
! -------------------------------------------------------------
!  Group
     Group Key: t1.a, t1.b, t2.x, t2.z
!    ->  Sort
!          Sort Key: t1.a, t1.b, t2.z
!          Presorted Key: t1.a, t1.b
!          ->  Merge Join
!                Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
!                ->  Index Scan using t1_pkey on t1
!                ->  Index Scan using t2_pkey on t2
! (9 rows)
  
  -- Cannot optimize when PK is deferrable
  explain (costs off) select * from t3 group by a,b,c;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
new file mode 100644
index d8b5b1d..752f1b6
*** a/src/test/regress/expected/inherit.out
--- b/src/test/regress/expected/inherit.out
*************** ORDER BY thousand, tenthous;
*** 1359,1366 ****
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1
     ->  Sort
           Sort Key: tenk1_1.thousand, tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (6 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
--- 1359,1367 ----
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1
     ->  Sort
           Sort Key: tenk1_1.thousand, tenk1_1.thousand
+          Presorted Key: tenk1_1.thousand
           ->  Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
! (7 rows)
  
  explain (costs off)
  SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
*************** ORDER BY x, y;
*** 1443,1450 ****
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
     ->  Sort
           Sort Key: b.unique2, b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (6 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
--- 1444,1452 ----
     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 a
     ->  Sort
           Sort Key: b.unique2, b.unique2
+          Presorted Key: b.unique2
           ->  Index Only Scan using tenk1_unique2 on tenk1 b
! (7 rows)
  
  -- exercise rescan code path via a repeatedly-evaluated subquery
  explain (costs off)
#86Michael Paquier
michael.paquier@gmail.com
In reply to: Alexander Korotkov (#85)
Re: PoC: Partial sort

On Tue, Sep 13, 2016 at 5:32 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Hmm... I don't completely agree with that. In typical usage, partial sort
should definitely use quicksort. However, fallback to other sort methods is
very useful. The decision to use partial sort is made by the planner, but the
planner makes mistakes. For example, our HashAggregate is purely in-memory;
in the case of a planner mistake it causes OOM. I have met such situations in
production more than once. This is why I'd like partial sort to have graceful
degradation for such cases.

I think that this should be moved to the next CF, unless a committer
wants to pick it up today.

Patch was rebased to current master.

Applies on HEAD at e8bdee27 and passes make check. However, I see no
documentation, so it is a bit hard to tell what this patch is achieving
without reading the thread.

$ git diff master --check
src/backend/optimizer/prep/prepunion.c:967: trailing whitespace.
+ cost_sort(&sorted_p, root, NIL, 0,
src/backend/utils/sort/tuplesort.c:1244: trailing whitespace.
+ * tuplesort_updatemax

+ * Returns true if the plan node isautomatically materializes its output
Typo here.

Still, this has received no reviews, so I moved it to the next CF.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#87Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Michael Paquier (#86)
Re: PoC: Partial sort

Hi Peter,

This is a gentle reminder.

You are assigned as reviewer of the current patch in the 11-2016 commitfest,
but you haven't yet shared your review of the latest patch posted by the
author in this commitfest. If you don't have any comments on the patch,
please move it into the "ready for committer" state to get a committer's
attention. This will help the commitfest run more smoothly.

Please ignore this if you have already shared your review.

Regards,
Hari Babu
Fujitsu Australia

#88Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#85)
Re: PoC: Partial sort

On Tue, Sep 13, 2016 at 4:32 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Hmm... I don't completely agree with that. In typical usage, partial sort
should definitely use quicksort. However, fallback to other sort methods is
very useful. The decision to use partial sort is made by the planner, but the
planner makes mistakes. For example, our HashAggregate is purely in-memory;
in the case of a planner mistake it causes OOM. I have met such situations in
production more than once. This is why I'd like partial sort to have graceful
degradation for such cases.

I think that this should be moved to the next CF, unless a committer
wants to pick it up today.

Patch was rebased to current master.

Just a few quick observations on this...

It strikes me that the API contract change in ExecMaterializesOutput
is pretty undesirable. I think it would be better to have a new
executor node for this node rather than overloading the existing
"Sort" node, sharing code where possible of course. The fact that
this would distinguish them more clearly in an EXPLAIN plan seems
good, too. "Partial Sort" is the obvious thing, but there might be
even better alternatives -- maybe "Incremental Sort" or something like
that? Because it's not partially sorting the data, it's making data
that already has some sort order have a more rigorous sort order.
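
As an illustration of that distinction, here is a minimal standalone sketch
(not code from the patch). It assumes a hypothetical Row type whose input is
already ordered by k1, and uses plain qsort() in place of tuplesort. Each run
of equal k1 values is sorted by k2 on its own, so the output ends up ordered
by (k1, k2) without ever sorting the whole input at once.

```c
#include <stdlib.h>

/* Hypothetical row type: the input arrives already ordered by k1 only. */
typedef struct Row
{
    int k1;     /* presorted key, e.g. delivered in order by an index scan */
    int k2;     /* key that still has to be sorted within each k1 group */
} Row;

/* Compare two rows on k2 only; k1 is equal within a group. */
static int
cmp_k2(const void *a, const void *b)
{
    const Row *ra = (const Row *) a;
    const Row *rb = (const Row *) b;

    return (ra->k2 > rb->k2) - (ra->k2 < rb->k2);
}

/*
 * Order rows[0..n) by (k1, k2), given that they are already ordered by k1.
 * Returns the number of k1 groups that were found.
 */
size_t
incremental_sort(Row *rows, size_t n)
{
    size_t start = 0;
    size_t groups = 0;

    while (start < n)
    {
        size_t end = start + 1;

        /* Extend the group while the presorted key stays the same. */
        while (end < n && rows[end].k1 == rows[start].k1)
            end++;

        /* Only this group is sorted; earlier groups are already final. */
        qsort(rows + start, end - start, sizeof(Row), cmp_k2);
        groups++;
        start = end;
    }

    return groups;
}
```

Because each group is final as soon as it has been sorted, a consumer that
needs only the first few rows (as with LIMIT) can stop after the first groups.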

I think that it will probably be pretty common to have queries where
the data is sorted by (mostly_unique_col) and we want to get it sorted
by (mostly_unique_col, disambiguation_col). In such cases I wonder if
we'll incur a lot of overhead by feeding single tuples to the
tuplesort stuff and performing lots of 1-item sorts. Not sure if that
case needs any special optimization.
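
One possible shape for such an optimization, purely as a standalone sketch:
groups that contain a single item are passed through without calling the sort
routine at all. The Item type and the function name are hypothetical, qsort()
stands in for tuplesort, and whether the patch needs anything like this is
exactly the open question above.

```c
#include <stdlib.h>

/* Hypothetical item: "key" is presorted and mostly unique. */
typedef struct Item
{
    int key;        /* presorted, mostly unique column */
    int tiebreak;   /* disambiguation column */
} Item;

/* Compare two items on the disambiguation column only. */
static int
cmp_tiebreak(const void *a, const void *b)
{
    const Item *ia = (const Item *) a;
    const Item *ib = (const Item *) b;

    return (ia->tiebreak > ib->tiebreak) - (ia->tiebreak < ib->tiebreak);
}

/*
 * Order items[0..n) by (key, tiebreak), given that they are already ordered
 * by key.  Single-item groups are already in final order, so they are
 * skipped.  Returns the number of groups that actually needed a sort call.
 */
size_t
sort_groups_skip_singletons(Item *items, size_t n)
{
    size_t start = 0;
    size_t sort_calls = 0;

    while (start < n)
    {
        size_t end = start + 1;

        /* Find the end of the run of equal presorted keys. */
        while (end < n && items[end].key == items[start].key)
            end++;

        /* A 1-item group needs no sorting at all. */
        if (end - start > 1)
        {
            qsort(items + start, end - start, sizeof(Item), cmp_tiebreak);
            sort_calls++;
        }
        start = end;
    }

    return sort_calls;
}
```

Comparing the return value with the total number of groups gives a rough
measure of how many 1-item sorts such a fast path would avoid for a given
input.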

I also think that the "HeapTuple prev" bit in SortState is probably
not the right way of doing things. I think that should use an
additional TupleTableSlot rather than a HeapTuple.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#89Peter Geoghegan
pg@heroku.com
In reply to: Haribabu Kommi (#87)
Re: PoC: Partial sort

On Mon, Nov 21, 2016 at 11:04 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

You are assigned as reviewer of the current patch in the 11-2016 commitfest,
but you haven't yet shared your review of the latest patch posted by the
author in this commitfest. If you don't have any comments on the patch,
please move it into the "ready for committer" state to get a committer's
attention. This will help the commitfest run more smoothly.

Sorry for the delay on this.

I agree with Robert's remarks today on TupleTableSlot, and would like
to see a revision that does that.

--
Peter Geoghegan


#90Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Robert Haas (#88)
Re: PoC: Partial sort

On Fri, Dec 2, 2016 at 4:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 13, 2016 at 4:32 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Hmm... I don't completely agree with that. In typical usage, partial sort
should definitely use quicksort. However, fallback to other sort methods is
very useful. The decision to use partial sort is made by the planner, but the
planner makes mistakes. For example, our HashAggregate is purely in-memory;
in the case of a planner mistake it causes OOM. I have met such situations in
production more than once. This is why I'd like partial sort to have graceful
degradation for such cases.

I think that this should be moved to the next CF, unless a committer
wants to pick it up today.

Patch was rebased to current master.

Just a few quick observations on this...

It strikes me that the API contract change in ExecMaterializesOutput
is pretty undesirable. I think it would be better to have a new
executor node for this node rather than overloading the existing
"Sort" node, sharing code where possible of course. The fact that
this would distinguish them more clearly in an EXPLAIN plan seems
good, too. "Partial Sort" is the obvious thing, but there might be
even better alternatives -- maybe "Incremental Sort" or something like
that? Because it's not partially sorting the data, it's making data
that already has some sort order have a more rigorous sort order.

I think that it will probably be pretty common to have queries where
the data is sorted by (mostly_unique_col) and we want to get it sorted
by (mostly_unique_col, disambiguation_col). In such cases I wonder if
we'll incur a lot of overhead by feeding single tuples to the
tuplesort stuff and performing lots of 1-item sorts. Not sure if that
case needs any special optimization.

I also think that the "HeapTuple prev" bit in SortState is probably
not the right way of doing things. I think that should use an
additional TupleTableSlot rather than a HeapTuple.

The reviewer's feedback was received at the end of the commitfest, so I
moved the patch to the next CF with "waiting on author" status.

Regards,
Hari Babu
Fujitsu Australia

#91Michael Paquier
michael.paquier@gmail.com
In reply to: Haribabu Kommi (#90)
Re: PoC: Partial sort

On Mon, Dec 5, 2016 at 2:04 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Fri, Dec 2, 2016 at 4:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 13, 2016 at 4:32 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

Hmm... I don't completely agree with that. In typical usage, partial sort
should definitely use quicksort. However, fallback to other sort methods is
very useful. The decision to use partial sort is made by the planner, but the
planner makes mistakes. For example, our HashAggregate is purely in-memory;
in the case of a planner mistake it causes OOM. I have met such situations in
production more than once. This is why I'd like partial sort to have graceful
degradation for such cases.

I think that this should be moved to the next CF, unless a committer
wants to pick it up today.

Patch was rebased to current master.

Just a few quick observations on this...

It strikes me that the API contract change in ExecMaterializesOutput
is pretty undesirable. I think it would be better to have a new
executor node for this node rather than overloading the existing
"Sort" node, sharing code where possible of course. The fact that
this would distinguish them more clearly in an EXPLAIN plan seems
good, too. "Partial Sort" is the obvious thing, but there might be
even better alternatives -- maybe "Incremental Sort" or something like
that? Because it's not partially sorting the data, it's making data
that already has some sort order have a more rigorous sort order.

I think that it will probably be pretty common to have queries where
the data is sorted by (mostly_unique_col) and we want to get it sorted
by (mostly_unique_col, disambiguation_col). In such cases I wonder if
we'll incur a lot of overhead by feeding single tuples to the
tuplesort stuff and performing lots of 1-item sorts. Not sure if that
case needs any special optimization.

I also think that the "HeapTuple prev" bit in SortState is probably
not the right way of doing things. I think that should use an
additional TupleTableSlot rather than a HeapTuple.

The reviewer's feedback was received at the end of the commitfest, so I
moved the patch to the next CF with "waiting on author" status.

This patch is on its 6th commit fest now. As the thread has died and
as feedback has been provided but not answered I am marking it as
returned with feedback.
--
Michael
