Ordered Partitioned Table Scans

Started by David Rowley about 7 years ago, 84 messages

david.rowley@2ndquadrant.com

about 7 years ago

1 attachment(s)

RANGE partitioning of time-series data is quite a common reason to use
partitioning, and such tables tend to grow fairly large. Since we always
store the partitions of a RANGE partitioned table in the PartitionDesc in
ascending range order, I thought it might be useful to take advantage of
this: when the required pathkeys match the order of the ranges, we could
use an Append node instead of uselessly using a MergeAppend, since the
MergeAppend will just exhaust each subplan one at a time, in order.
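
To illustrate the claim above outside of the planner, here is a minimal standalone sketch in C. It assumes each partition's ordered subplan can be modelled as a sorted array whose value range lies entirely below the next array's range. The names SortedRun, plain_append and merge_append are made up for this illustration and are not PostgreSQL functions.

```c
#include <assert.h>
#include <stddef.h>

/* One sorted run of ids, standing in for a partition's ordered subplan. */
typedef struct
{
	const long *vals;
	size_t		len;
} SortedRun;

/* Plain Append: drain each run completely, in the order the runs are listed. */
size_t
plain_append(const SortedRun *runs, size_t nruns, long *out)
{
	size_t		n = 0;

	assert(runs != NULL && out != NULL);

	for (size_t r = 0; r < nruns; r++)
		for (size_t i = 0; i < runs[r].len; i++)
			out[n++] = runs[r].vals[i];
	return n;
}

/*
 * Merge-style append: repeatedly emit the smallest head among all runs.
 * A linear scan over the heads keeps the sketch short; a real merge would
 * normally use a heap.  'pos' is caller-supplied scratch space with one
 * slot per run.
 */
size_t
merge_append(const SortedRun *runs, size_t nruns, size_t *pos, long *out)
{
	size_t		n = 0;

	assert(runs != NULL && pos != NULL && out != NULL);

	for (size_t r = 0; r < nruns; r++)
		pos[r] = 0;

	for (;;)
	{
		size_t		best = nruns;	/* nruns means "no run has rows left" */

		for (size_t r = 0; r < nruns; r++)
		{
			if (pos[r] == runs[r].len)
				continue;		/* this run is exhausted */
			if (best == nruns ||
				runs[r].vals[pos[r]] < runs[best].vals[pos[best]])
				best = r;
		}
		if (best == nruns)
			break;
		out[n++] = runs[best].vals[pos[best]++];
	}
	return n;
}
```

When the runs are listed in ascending range order and do not overlap, both functions fill 'out' with the same sequence, but merge_append pays for a comparison across the run heads on every row. That per-row comparison is the overhead the proposal aims to avoid.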

It does not seem very hard to implement this and it does not add much
in the way of additional processing to the planner.

Performance-wise, it seems to give a good boost to getting sorted
results from a partitioned table. I performed a quick test on my
laptop with:

Setup:
CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT
NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION
BY RANGE (id);
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (' || (x*100000)::text || ') TO (' ||
((x+1)*100000)::text || ');' from generate_Series(0,299) x;
\gexec
\o
INSERT INTO partbench SELECT x,1,2,3,4,5 from generate_Series(0,29999999) x;
create index on partbench (id);
vacuum analyze;

Test:
select * from partbench order by id limit 1 offset 29999999;

Results Patched:

Time: 4234.807 ms (00:04.235)
Time: 4237.928 ms (00:04.238)
Time: 4241.289 ms (00:04.241)
Time: 4234.030 ms (00:04.234)
Time: 4244.197 ms (00:04.244)
Time: 4266.000 ms (00:04.266)

Unpatched:

Time: 5917.288 ms (00:05.917)
Time: 5937.775 ms (00:05.938)
Time: 5911.146 ms (00:05.911)
Time: 5906.881 ms (00:05.907)
Time: 5918.309 ms (00:05.918)

(about 39% faster: the unpatched mean of roughly 5918 ms is about 1.39
times the patched mean of roughly 4243 ms)

The implementation is fairly simple. One thing I don't like about it is
that I'd rather build_partition_pathkeys() performed all the checks
needed to know whether the partitioned table supports a natural pathkey,
but as of now, I have the calling code ensuring that there are no
sub-partitioned tables. These could cause tuples to be output in the
wrong order.
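
As a rough summary of the eligibility checks described here and in the attached patch, the conditions can be sketched as a single predicate. This is only a simplified model: the PartTable struct, the Strategy enum and can_use_ordered_append() are hypothetical names for this illustration, and the patch itself splits these checks between the calling code and build_partition_pathkeys(), and additionally requires the rel to be a base relation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the partitioning strategy of a table. */
typedef enum
{
	STRATEGY_RANGE,
	STRATEGY_LIST,
	STRATEGY_HASH
} Strategy;

/* Hypothetical, heavily simplified description of a partitioned table. */
typedef struct
{
	Strategy	strategy;		/* how the table is partitioned */
	bool		has_default;	/* does a DEFAULT partition exist? */
	const bool *child_is_partitioned;	/* one flag per live (unpruned) child */
	size_t		nchildren;
} PartTable;

/*
 * Return true when scanning the live children in bound order is expected
 * to yield rows in partition key order, under the rules described above.
 */
bool
can_use_ordered_append(const PartTable *t)
{
	assert(t != NULL);

	/* Only RANGE partitioning keeps partitions in ascending key order. */
	if (t->strategy != STRATEGY_RANGE)
		return false;

	/* A DEFAULT partition may hold rows from anywhere in the key space. */
	if (t->has_default)
		return false;

	/* Any live sub-partitioned child disables the optimization. */
	for (size_t i = 0; i < t->nchildren; i++)
		if (t->child_is_partitioned[i])
			return false;

	return true;
}
```

Because only live children are listed, a sub-partitioned child that is pruned during planning does not block the optimization, which matches the regression test in the patch that filters on a < 20.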

Does this idea seem like something we'd want?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v1-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (application/octet-stream)

From 5f7c5b4a73175e9a7cb7115083e88e8ed3c540a6 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v1] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no sub-partitioned tables and no default
partition the subpaths of a MergeAppend are always arranged in range
order. This means that MergeAppend, when sorting by the partition key or a
superset of the partition key, will always output tuples from earlier
subpaths before later subpaths.  This guarantee means we can just use a
non-parallel Append node instead since this is exactly what Append does.
This gives a nice performance improvement, even for CPU native types which
can be sorted without much effort.
---
 contrib/postgres_fdw/expected/postgres_fdw.out |   6 +-
 src/backend/nodes/list.c                       |  38 ++++++
 src/backend/optimizer/path/allpaths.c          | 158 ++++++++++++++++++++++---
 src/backend/optimizer/path/joinrels.c          |   2 +-
 src/backend/optimizer/path/pathkeys.c          |  62 ++++++++++
 src/backend/optimizer/plan/planner.c           |   3 +-
 src/backend/optimizer/prep/prepunion.c         |   6 +-
 src/backend/optimizer/util/pathnode.c          |  12 +-
 src/include/nodes/pg_list.h                    |   1 +
 src/include/optimizer/pathnode.h               |   2 +-
 src/include/optimizer/paths.h                  |   2 +
 src/test/regress/expected/inherit.out          |  88 +++++++++++++-
 src/test/regress/expected/partition_prune.out  |  64 +++++-----
 src/test/regress/sql/inherit.sql               |  28 +++++
 src/test/regress/sql/partition_prune.sql       |  10 +-
 15 files changed, 408 insertions(+), 74 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 21a2ef5ad3..4888bb7bea 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -8397,12 +8397,12 @@ SELECT t1.wr, t2.wr FROM (SELECT t1 wr, a FROM fprt1 t1 WHERE t1.a % 25 = 0) t1
 --------------------------------------------------------
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
-   ->  Hash Full Join
-         Hash Cond: (t1.a = t2.b)
+   ->  Merge Full Join
+         Merge Cond: (t1.a = t2.b)
          ->  Append
                ->  Foreign Scan on ftprt1_p1 t1
                ->  Foreign Scan on ftprt1_p2 t1_1
-         ->  Hash
+         ->  Materialize
                ->  Append
                      ->  Foreign Scan on ftprt2_p1 t2
                      ->  Foreign Scan on ftprt2_p2 t2_1
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 55fd4c359b..139fae8216 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1314,6 +1314,44 @@ list_qsort(const List *list, list_qsort_comparator cmp)
 	return newlist;
 }
 
+/*
+ * list_reverse
+ *		Create and return a new shallow copy of 'oldlist', but in reverse order.
+ */
+List *
+list_reverse(const List *oldlist)
+{
+	List	   *newlist;
+	ListCell   *oldlist_cur;
+
+	if (oldlist == NIL)
+		return NIL;
+
+	newlist = new_list(oldlist->type);
+	newlist->length = oldlist->length;
+
+	/*
+	 * Copy over the data in the first cell to the tail of the new list;
+	 * new_list() has already allocated the tail cell itself
+	 */
+	newlist->tail->data = oldlist->head->data;
+
+	for_each_cell(oldlist_cur, oldlist->head->next)
+	{
+		ListCell   *newlist_cur;
+
+		newlist_cur = (ListCell *) palloc(sizeof(*newlist_cur));
+		newlist_cur->data = oldlist_cur->data;
+
+		/* push the new cell onto the head of the list */
+		newlist_cur->next = newlist->head;
+		newlist->head = newlist_cur;
+	}
+
+	check_list_invariants(newlist);
+	return newlist;
+}
+
 /*
  * Temporary compatibility functions
  *
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5f74d3b36d..b7219d53ae 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -98,7 +98,8 @@ static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 						   List *live_childrels,
 						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+						   List *partitioned_rels,
+						   bool try_ordered_append);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
@@ -1381,6 +1382,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	ListCell   *l;
 	List	   *partitioned_rels = NIL;
 	double		partial_rows = -1;
+	bool		hassubparts = false;
 
 	/* If appropriate, consider parallel append */
 	pa_subpaths_valid = enable_parallel_append && rel->consider_parallel;
@@ -1444,6 +1446,10 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		ListCell   *lcp;
 		Path	   *cheapest_partial_path = NULL;
 
+		/* Record if there are any children which are partitioned tables. */
+		if (childrel->part_scheme)
+			hassubparts = true;
+
 		/*
 		 * For UNION ALLs with non-empty partitioned_child_rels, accumulate
 		 * the Lists of child relations.
@@ -1597,7 +1603,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1639,7 +1645,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1689,7 +1695,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
@@ -1699,9 +1705,24 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
+	{
+		bool	try_ordered_append;
+
+		/*
+		 * We'll attempt to substitute simple Appends for MergeAppends when
+		 * the partitioned table guarantees that an earlier partition contains
+		 * earlier tuples.  We only do this for base tables, as sub-partition
+		 * paths are flattened into the base table's append paths.
+		 */
+		try_ordered_append = !hassubparts &&
+							 rel->reloptkind == RELOPT_BASEREL &&
+							 rel->part_scheme != NULL;
+
 		generate_mergeappend_paths(root, rel, live_childrels,
 								   all_child_pathkeys,
-								   partitioned_rels);
+								   partitioned_rels,
+								   try_ordered_append);
+	}
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1751,7 +1772,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
@@ -1778,14 +1799,48 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
  * parameterized mergejoin plans, it might be worth adding support for
  * parameterized MergeAppends to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
+ *
+ * 'try_ordered_append' can be passed as true to have the function attempt
+ * to use an Append node in place of a MergeAppend node. Callers must ensure
+ * that 'rel' is a partitioned table which contains no live sub-partitioned
+ * tables.
  */
 static void
 generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 						   List *live_childrels,
 						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+						   List *partitioned_rels,
+						   bool try_ordered_append)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.  Ideally, all of the
+	 * logic to determine when this is possible would be defined in
+	 * build_partition_pathkeys(), but one reason where we must disable this
+	 * is when there are sub-partitioned tables.  We can't enable the
+	 * optimization in this case due to how we flatten MergeAppend subnodes.
+	 * It would be possible to work around this if we disabled the flattening
+	 * for this case, but it currently seems like more trouble than it's
+	 * worth.  The check for sub-partitioned tables is cheaper to implement in
+	 * the calling function since it's likely to have just processed the
+	 * live_children list and could have checked for sub-partitioned tables
+	 * along the way.  We let build_partition_pathkeys() handle the remaining
+	 * checks.
+	 */
+	if (try_ordered_append)
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection);
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection);
+	}
+
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1842,20 +1897,89 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 									  &total_subpaths, NULL);
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
-														rel,
-														startup_subpaths,
-														pathkeys,
-														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
+		/*
+		 * When the partitioned table's pathkeys are a prefix of the required
+		 * pathkeys, then there's no need to perform a MergeAppend. We're
+		 * already scanning the partitions in order so a simple Append will
+		 * suffice.  This has performance benefits during query execution.
+		 */
+		if (pathkeys_contained_in(pathkeys, partition_pathkeys))
+		{
+			add_path(rel, (Path *) create_append_path(root,
+													  rel,
+													  startup_subpaths,
+													  NIL,
+													  pathkeys,
+													  NULL,
+													  0,
+													  false,
+													  partitioned_rels,
+													  -1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
+														  rel,
+														  total_subpaths,
+														  NIL,
+														  pathkeys,
+														  NULL,
+														  0,
+														  false,
+														  partitioned_rels,
+														  -1));
+
+		}
+
+		/*
+		 * Perhaps a pathkeys match if we were to scan the partitions in
+		 * reverse order?
+		 */
+		else if (pathkeys_contained_in(pathkeys, partition_pathkeys_desc))
+		{
+			/*
+			 * XXX worth caching the reverse Lists? Perhaps it's unlikely that
+			 * there's more than 1 matching path.
+			 */
+			add_path(rel, (Path *) create_append_path(root,
+													  rel,
+											list_reverse(startup_subpaths),
+													  NIL,
+													  pathkeys,
+													  NULL,
+													  0,
+													  false,
+													  partitioned_rels,
+													  -1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
+														  rel,
+												list_reverse(total_subpaths),
+														  NIL,
+														  pathkeys,
+														  NULL,
+														  0,
+														  false,
+														  partitioned_rels,
+														  -1));
+
+		}
+
+		else
+		{
+			/* ... and build the MergeAppend paths */
 			add_path(rel, (Path *) create_merge_append_path(root,
 															rel,
-															total_subpaths,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -2016,7 +2140,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->pathlist = NIL;
 	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index d3d21fed5d..2f9fc50bf2 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1231,7 +1231,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/* Set or update cheapest_total_path and related fields */
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ec66cb9c3c..cdee4a6ec2 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -25,6 +25,7 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
@@ -547,6 +548,67 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Note that for partitions that don't have a
+ *	  natural ordering, we return NIL.)
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir)
+{
+	PartitionScheme partscheme;
+	List	   *retval;
+	int			i;
+
+	/*
+	 * Only RANGE type partitions guarantee that the partitions can be scanned
+	 * in the order that they're defined in the PartitionDesc to provide
+	 * non-overlapping ranges of tuples.
+	 */
+	if (partrel->boundinfo->strategy != PARTITION_STRATEGY_RANGE ||
+		partition_bound_has_default(partrel->boundinfo))
+		return NIL;
+
+	retval = NIL;
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 *
+		 * Currently pass nulls_first according to the scan direction.  This
+		 * will cause the order not to match when NULLS LAST is specified.
+		 * We're missing an optimization opportunity here since no NULLs can
+		 * exist due to us requiring above that no DEFAULT partition exists,
+		 * which is the only place NULLs could be stored. Likely this is not
+		 * worth worrying about since we'd miss the same opportunity for a
+		 * table with a NOT NULL constraint.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											 ScanDirectionIsBackward(scandir),
+											 ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		retval = lappend(retval, cpathkey);
+	}
+
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c729a99f8b..78b834032d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3899,6 +3899,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
@@ -6878,7 +6879,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * node, which would cause this relation to stop appearing to be a
 		 * dummy rel.)
 		 */
-		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL,
+		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL, NIL,
 													  NULL, 0, false, NIL,
 													  -1));
 		rel->partial_pathlist = NIL;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..6d4657a4c1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -656,7 +656,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -711,7 +711,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -822,7 +822,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d50d86b252..dbbf81f0ac 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1219,7 +1219,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1253,7 +1253,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1263,10 +1263,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -3587,7 +3591,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/include/nodes/pg_list.h b/src/include/nodes/pg_list.h
index e6cd2cdfba..83e8b62a55 100644
--- a/src/include/nodes/pg_list.h
+++ b/src/include/nodes/pg_list.h
@@ -271,6 +271,7 @@ extern List *list_copy_tail(const List *list, int nskip);
 
 typedef int (*list_qsort_comparator) (const void *a, const void *b);
 extern List *list_qsort(const List *list, list_qsort_comparator cmp);
+extern List *list_reverse(const List *list);
 
 /*
  * To ease migration to the new list API, a set of compatibility
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 81abcf53a8..5a790cf6be 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -65,7 +65,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index cafde307ad..ee958a0f07 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -201,6 +201,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 4f29d9f891..0583d60d5c 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2032,6 +2032,86 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index on mcrparted (a, abs(b), c);
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append is not used when there are live subpartitioned tables
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                             QUERY PLAN                              
+---------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+   ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(9 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
 drop table mcrparted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
@@ -2045,17 +2125,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 24313e8c78..d7c268c5af 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3013,14 +3013,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3067,17 +3067,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3090,13 +3088,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3109,12 +3106,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3123,23 +3119,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index a6e541d4da..889e907c2e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -721,6 +721,34 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append is not used when there are live subpartitioned tables
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
 drop table mcrparted;
 
 -- check that partitioned table Appends cope with being referenced in
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index eca1a7c5ac..a834afd572 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -740,15 +740,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -769,7 +769,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

about 7 years ago

In reply to: David Rowley (#1)

Re: Ordered Partitioned Table Scans

On 2018/10/26 11:50, David Rowley wrote:

RANGE partitioning of time-series data is quite a common reason to use
partitioning, and such tables tend to grow fairly large. I thought
since we always store RANGE partitioned tables in the PartitionDesc in
ascending range order that it might be useful to make use of this and
when the required pathkeys match the order of the range, then we could
make use of an Append node instead of uselessly using a MergeAppend,
since the MergeAppend will just exhaust each subplan one at a time, in
order.

It does not seem very hard to implement this and it does not add much
in the way of additional processing to the planner.

Performance wise it seems to give a good boost to getting sorted
results from a partitioned table. I performed a quick test just on my
laptop with:

Setup:
CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT
NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION
BY RANGE (id);
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (' || (x*100000)::text || ') TO (' ||
((x+1)*100000)::text || ');' from generate_Series(0,299) x;
\gexec
\o
INSERT INTO partbench SELECT x,1,2,3,4,5 from generate_Series(0,29999999) x;
create index on partbench (id);
vacuum analyze;

Test:
select * from partbench order by id limit 1 offset 29999999;

Results Patched:

Time: 4234.807 ms (00:04.235)
Time: 4237.928 ms (00:04.238)
Time: 4241.289 ms (00:04.241)
Time: 4234.030 ms (00:04.234)
Time: 4244.197 ms (00:04.244)
Time: 4266.000 ms (00:04.266)

Unpatched:

Time: 5917.288 ms (00:05.917)
Time: 5937.775 ms (00:05.938)
Time: 5911.146 ms (00:05.911)
Time: 5906.881 ms (00:05.907)
Time: 5918.309 ms (00:05.918)

(about 39% faster)

The implementation is fairly simple. One thing I don't like about it is
that I'd rather build_partition_pathkeys() performed all the checks to
know if the partition should support a natural pathkey, but as of now, I
have the calling code ensuring that there are no sub-partitioned
tables. These could cause tuples to be output in the wrong order.

Does this idea seem like something we'd want?

Definitely! Thanks for creating the patch.

I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
last year, but the partitioning-related planning code hadn't advanced then
as much as it has today, so they sort of postponed working on it.
Eventually their patch was returned with feedback last November. Here's
the link to their email in case you wanted to read some comments their
proposal and patch got, although some of them might be obsolete.

/messages/by-id/2401607.SfZhPQhbS4@ronan_laptop

Thanks,
Amit

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: Amit Langote (#2)

Re: Ordered Partitioned Table Scans

On 26 October 2018 at 16:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
last year, but the partitioning-related planning code hadn't advanced then
as much as it has today, so they sort of postponed working on it.
Eventually their patch was returned with feedback last November. Here's
the link to their email in case you wanted to read some comments their
proposal and patch got, although some of them might be obsolete.

/messages/by-id/2401607.SfZhPQhbS4@ronan_laptop

Thanks. I wasn't aware, or ... forgot. It looks like that was a tricky
time to be doing this. Hopefully, the dust has settled a little bit
now.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Julien Rouhaud

rjuju123@gmail.com

about 7 years ago

In reply to: David Rowley (#3)

Re: Ordered Partitioned Table Scans

Hi,

On Fri, Oct 26, 2018 at 6:40 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On 26 October 2018 at 16:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
last year, but the partitioning-related planning code hadn't advanced then
as much as it has today, so they sort of postponed working on it.
Eventually their patch was returned with feedback last November. Here's
the link to their email in case you wanted to read some comments their
proposal and patch got, although some of them might be obsolete.

/messages/by-id/2401607.SfZhPQhbS4@ronan_laptop

Thanks. I wasn't aware, or ... forgot. It looks like that was a tricky
time to be doing this. Hopefully, the dust has settled a little bit
now.

Yes, back then I unfortunately had a limited time to work on that, and
I had to spend all of it rebasing the patch instead of working on the
various issues :(

Sadly, I have even less time now, but I'll try to look at your patch
this weekend! As far as I remember, the biggest problems we had were
handling multi-level partitioning, when the query is ordered by all
or a subset of the partition keys, and/or with a mix of ASC/DESC
clauses. It also required some extra processing on the cost part for
queries that can be naturally ordered and contain a LIMIT clause,
since we can estimate how many partitions will have to be scanned.

Julien Rouhaud

rjuju123@gmail.com

about 7 years ago

In reply to: Julien Rouhaud (#4)

Re: Ordered Partitioned Table Scans

Hi,

On Fri, Oct 26, 2018 at 1:01 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Fri, Oct 26, 2018 at 6:40 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On 26 October 2018 at 16:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

I recall Ronan Dunklau and Julien Rouhaud had proposed a patch for this
last year, but the partitioning-related planning code hadn't advanced then
as much as it has today, so they sort of postponed working on it.
Eventually their patch was returned with feedback last November. Here's
the link to their email in case you wanted to read some comments their
proposal and patch got, although some of them might be obsolete.

/messages/by-id/2401607.SfZhPQhbS4@ronan_laptop

Thanks. I wasn't aware, or ... forgot. It looks like that was a tricky
time to be doing this. Hopefully, the dust has settled a little bit
now.

As far as I remember, the biggest problems we had were
handling multi-level partitioning, when the query is ordered by all
or a subset of the partition keys, and/or with a mix of ASC/DESC
clauses. It also required some extra processing on the cost part for
queries that can be naturally ordered and contain a LIMIT clause,
since we can estimate how many partitions will have to be scanned.

I just had a look at your patch. I see that you implemented only a
subset of the possible optimizations (only the case for range
partitioning without subpartitions). This has been previously
discussed, but we should be able to do a similar optimization for list
partitioning if there are no interleaved values, and also for some cases
of multi-level partitioning.

Concerning the implementation, there's at least one issue: it assumes
that each subpath of a range-partitioned table will be ordered, which
is not guaranteed. You need to generate explicit Sort nodes
(in the same order as the query_pathkey) for partitions that don't
have an ordered path and make sure that this path is used in the
Append. Here's a simplistic case showing the issue (sorry, the
partition names are poorly chosen):

CREATE TABLE simple (id integer, val text) PARTITION BY RANGE (id);
CREATE TABLE simple_1_2 PARTITION OF simple FOR VALUES FROM (1) TO (100000);
CREATE TABLE simple_2_3 PARTITION OF simple FOR VALUES FROM (100000)
TO (200000);
CREATE TABLE simple_0_1 PARTITION OF simple FOR VALUES FROM (-100000) TO (1);

INSERT INTO simple SELECT id, 'line ' || id FROM
generate_series(-19999, 199999) id;

CREATE INDEX ON simple_1_2 (id);
CREATE INDEX ON simple_2_3 (id);

EXPLAIN SELECT * FROM simple ORDER BY id ;
QUERY PLAN
---------------------------------------------------------------------------------------------------
 Append  (cost=0.00..7705.56 rows=219999 width=15)
   ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
   ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
   ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
(4 rows)

Also, if a LIMIT is specified, it should be able to give better
estimates, at least if there's no qual. For instance:

EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.35 rows=10 width=15)
   ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
         ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
         ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
         ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
(5 rows)

In this case, we should estimate that the SeqScan (or in a corrected
version the Sort) node should not return more than 10 rows, and each
following partition should not be scanned at all, and cost each path
accordingly. I think that this is quite important, for instance to
make sure that natively sorted Append is chosen over a MergeAppend
when there are some subpaths with explicit sorts, because with the
Append we probably won't have to execute all the sorts if the previous
partition scans returned enough rows.
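
As a rough illustration of the early-termination argument above, here is a
minimal, self-contained C sketch. It is not PostgreSQL code: ToyPartition and
toy_ordered_append() are hypothetical names, partitions are assumed to be
given in bound order with non-overlapping ranges, and a plain qsort() stands
in for an explicit Sort node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one partition's subplan; not a PostgreSQL struct. */
typedef struct
{
	int		   *rows;		/* tuples stored in this partition */
	size_t		nrows;
	int			presorted;	/* nonzero if an ordered path (e.g. index scan) exists */
} ToyPartition;

static int
cmp_int(const void *a, const void *b)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	return (x > y) - (x < y);
}

/*
 * Emit up to 'limit' rows in sorted order by visiting partitions in bound
 * order.  A partition is sorted only when the scan actually reaches it.
 * Returns the number of rows written to 'out'; '*sorts_done' counts the
 * explicit sorts that had to be executed.
 */
static size_t
toy_ordered_append(ToyPartition *parts, size_t nparts, size_t limit,
				   int *out, size_t *sorts_done)
{
	size_t		emitted = 0;

	assert(out != NULL && sorts_done != NULL);
	*sorts_done = 0;

	for (size_t p = 0; p < nparts && emitted < limit; p++)
	{
		if (!parts[p].presorted)
		{
			/* stands in for an explicit Sort above an unordered scan */
			qsort(parts[p].rows, parts[p].nrows, sizeof(int), cmp_int);
			(*sorts_done)++;
		}

		for (size_t i = 0; i < parts[p].nrows && emitted < limit; i++)
			out[emitted++] = parts[p].rows[i];
	}

	return emitted;
}
```

With a small limit, only the first partition is ever sorted in this toy model,
whereas a merge of all children would need every child to produce its first
row (and so complete its sort) before returning anything.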

FWIW, both those cases were handled (probably with some bugs though)
in the previous patches Ronan and I sent some time ago. Also, I did
not forget about this feature, I planned to work on it in the hope of
having it included in pg12. However, I won't have a lot of time to work
on it before December.

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: Julien Rouhaud (#5)

1 attachment(s)

Re: Ordered Partitioned Table Scans

Thanks for looking at this.

On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:

I just had a look at your patch. I see that you implemented only a
subset of the possible optimizations (only the case for range
partitioning without subpartitions). This has been previously
discussed, but we should be able to do a similar optimization for list
partitioning if there are no interleaved values, and also for some cases
of multi-level partitioning.

I had thought about these cases but originally had thought they would
be more complex to implement than I could justify. On review, I've
found some pretty cheap ways to handle both sub-partitions and
LIST partitioned tables. Currently, with LIST partitioned tables I've
coded it to only allow the optimisation if there's no DEFAULT
partition and all partitions are defined with exactly 1 Datum. This
guarantees that there are no interleaved values, but it'll just fail
to optimise cases like FOR VALUES IN (1,2) + FOR VALUES IN (3,4). The
reason that I didn't go to the trouble of the additional checks was
that I don't really want to add any per-partition overhead to this.
If RelOptInfo had a Bitmapset of live partitions then we could just
check the partitions that survived pruning. Amit Langote has a
pending patch which does that and some other useful stuff, so maybe we
can delay fixing that until the dust settles a bit in that area. Amit
and I are both working hard to remove all these per-partition
overheads. I imagine he'd also not be in favour of adding code that
does something for all partitions when we've pruned down to just 1.
I've personally no objection to doing the required additional
processing for the non-pruned partitions only. We could also then fix
the case where we disable the optimisation if there's a DEFAULT
partition, regardless of whether it's been pruned or not.
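
To make the difference between the cheap check and a more thorough one
concrete, here is a small standalone C sketch. It is only an illustration
under stated assumptions and does not reflect the patch's code:
ToyListPartition, toy_all_single_value() and toy_no_interleaving() are
hypothetical names, values are plain ints, each partition's values are
assumed sorted ascending, and the partitions are assumed to be ordered by
their lowest value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one LIST partition's bound; not a PostgreSQL struct. */
typedef struct
{
	const int  *values;		/* accepted values, sorted ascending */
	size_t		nvalues;	/* must be at least 1 */
} ToyListPartition;

/* The cheap test: every partition accepts exactly one value. */
static bool
toy_all_single_value(const ToyListPartition *parts, size_t nparts)
{
	for (size_t i = 0; i < nparts; i++)
	{
		if (parts[i].nvalues != 1)
			return false;
	}
	return true;
}

/*
 * A more thorough test.  Assumes 'parts' is ordered by each partition's
 * lowest value.  Partition order implies sort order when every partition's
 * highest value is below the next partition's lowest value.
 */
static bool
toy_no_interleaving(const ToyListPartition *parts, size_t nparts)
{
	for (size_t i = 0; i < nparts; i++)
		assert(parts[i].nvalues >= 1);

	for (size_t i = 1; i < nparts; i++)
	{
		int			prev_max = parts[i - 1].values[parts[i - 1].nvalues - 1];
		int			next_min = parts[i].values[0];

		if (prev_max >= next_min)
			return false;
	}
	return true;
}
```

For IN (1,2) + IN (3,4) the cheap test rejects the table while the thorough
test accepts it; for IN (1,3) + IN (2,4) both reject it. The thorough test
has to visit every partition, which is the per-partition overhead mentioned
above.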

Concerning the implementation, there's at least one issue: it assumes
that each subpath of a range-partitioned table will be ordered, which
is not guaranteed. You need to generate explicit Sort nodes
(in the same order as the query_pathkey) for partitions that don't
have an ordered path and make sure that this path is used in the
Append. Here's a simplistic case showing the issue (sorry, the
partition names are poorly chosen):

Thanks for noticing this. I had been thrown off due to the fact that
Paths are never actually created for these sorts. On looking further I
see that we do checks during createplan to see if the path is
suitably sorted and just create a sort node if it's not. This seems
to go against the whole point of paths, but I'm not going to fight for
changing it, so I've just done the Append the same way as MergeAppend
handles it.

Also, if a LIMIT is specified, it should be able to give better
estimates, at least if there's no qual. For instance:

EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.35 rows=10 width=15)
   ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
         ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
         ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
         ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
(5 rows)

In this case, we should estimate that the SeqScan (or in a corrected
version the Sort) node should not return more than 10 rows, and each
following partition should not be scanned at all, and cost each path
accordingly. I think that this is quite important, for instance to
make sure that natively sorted Append is chosen over a MergeAppend
when there are some subpaths with explicit sorts, because with the
Append we probably won't have to execute all the sorts if the previous
partition scans returned enough rows.

In my patch, I'm not adding any additional paths. I'm just adding an
Append instead of a MergeAppend. For what you're talking about the
limit only needs to be passed into any underlying Sort so that it can
become a top-N sort. This is handled already in create_limit_path().
Notice in the plan you pasted above that the limit has a lower total
cost than its Append subnode. That's because create_limit_path()
weighted the Limit total cost based on the row count of the limit and
its subpath. ... 7705.56 / 219999 * 10 = ~0.35.
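
The arithmetic above can be reproduced with a few lines of standalone C. This
is only a sketch of the proportional scaling described here, not the planner's
actual function: toy_limit_total_cost() is a hypothetical name, and it assumes
the subpath's run cost is spread evenly over its rows.

```c
#include <assert.h>

/*
 * Scale a subpath's cost down to the fraction of rows a Limit will fetch.
 * The startup cost is always paid in full; only the run cost
 * (total - startup) is charged in proportion to the rows fetched.
 */
static double
toy_limit_total_cost(double sub_startup, double sub_total,
					 double sub_rows, double limit_rows)
{
	assert(sub_total >= sub_startup);

	if (sub_rows <= 0)
		return sub_total;		/* nothing to scale against */
	if (limit_rows > sub_rows)
		limit_rows = sub_rows;	/* cannot fetch more rows than exist */

	return sub_startup + (sub_total - sub_startup) * limit_rows / sub_rows;
}
```

Calling toy_limit_total_cost(0.00, 7705.56, 219999, 10) gives roughly 0.35,
matching the Limit node's total cost in the plan quoted above.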

FWIW, both those cases were handled (probably with some bugs though)
in the previous patches Ronan and I sent some time ago. Also, I did
not forget about this feature, I planned to work on it in the hope of
having it included in pg12. However, I won't have a lot of time to work
on it before December.

I apologise for not noticing your patch. I only went as far as
checking the November commitfest to see if anything existed already
and I found nothing there. I have time to work on this now, so likely
it's better if I continue, just in case your time in December does not
materialise.

v2 of the patch is attached. I've not had time yet to give it a
thorough post-write review, but on first look it seems okay.

The known limitations are:

* Disables the optimisation even if the DEFAULT partition is pruned.
* Disables the optimisation if LIST partitioned tables have any
partitions allowing > 1 value.
* Fails to optimise UNION ALLs with partitioned tables.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v2-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (application/octet-stream)

From d4b48ff7c44f832ecbe282a93bdb9c2e53bc5a97 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v2] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no default partition the subpaths of a
MergeAppend are always arranged in range order. This means that
MergeAppend, when sorting by the partition key or a superset of the
partition key, will always output tuples from earlier subpaths before later
subpaths.  LIST partitioned tables provide the same guarantee if they also
don't have a default partition, providing that none of the partitions are
defined to allow Datums with values which are interleaved with other
partitions.  For simplicity and speed of checking we currently just
disallow the optimization if any partition allows more than one Datum.
We may want to expand this later, but for now, it's a very cheap check to
implement.  A more thorough check would require performing analysis on the
partition bound.
---
 src/backend/nodes/outfuncs.c                  |   1 +
 src/backend/optimizer/path/allpaths.c         | 142 ++++++++++++++++++++---
 src/backend/optimizer/path/costsize.c         |  51 ++++++--
 src/backend/optimizer/path/joinrels.c         |   2 +-
 src/backend/optimizer/path/pathkeys.c         |  86 ++++++++++++++
 src/backend/optimizer/plan/createplan.c       |  91 +++++++++++----
 src/backend/optimizer/plan/planner.c          |   3 +-
 src/backend/optimizer/prep/prepunion.c        |   6 +-
 src/backend/optimizer/util/pathnode.c         |  24 +++-
 src/backend/utils/cache/partcache.c           |  10 +-
 src/include/nodes/relation.h                  |   1 +
 src/include/optimizer/cost.h                  |   2 +-
 src/include/optimizer/pathnode.h              |   2 +-
 src/include/optimizer/paths.h                 |   2 +
 src/test/regress/expected/inherit.out         | 160 +++++++++++++++++++++++++-
 src/test/regress/expected/partition_prune.out |  64 +++++------
 src/test/regress/sql/inherit.sql              |  64 +++++++++++
 src/test/regress/sql/partition_prune.sql      |  10 +-
 18 files changed, 620 insertions(+), 101 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 69731ccdea..5597dc6154 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1939,6 +1939,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5f74d3b36d..1ae1bf7b9d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -104,6 +104,7 @@ static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
 static void set_function_pathlist(PlannerInfo *root, RelOptInfo *rel,
@@ -1597,7 +1598,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1639,7 +1640,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1689,7 +1690,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
@@ -1751,7 +1752,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
@@ -1786,6 +1787,22 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 						   List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection);
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection);
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1794,6 +1811,20 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys match the partition order, or reverse
+		 * partition order.  It can't match both, so only go to the trouble of
+		 * checking the reverse order when it's not in ascending partition
+		 * order.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys);
+		partition_order_desc = !partition_order &&
+								pathkeys_contained_in(pathkeys,
+													partition_pathkeys_desc);
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1836,26 +1867,81 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * When in partition order or decending partition order don't
+			 * flatten any sub-partition's paths unless they're an Append or
+			 * MergeAppend with a single subpath.  For the desceding order
+			 * case we build the path list in reverse so that the Append scan
+			 * correctly scans the partitions in reverse order.
+			 */
+			if (partition_order)
+			{
+				/* Do the Append/MergeAppend flattening, when possible */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build a simple Append path if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1996,6 +2082,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append or MergeAppend, or
+ *		returns 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -2016,7 +2130,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->pathlist = NIL;
 	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/*
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7bf67a0529..e616bc91a4 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1837,7 +1837,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1849,21 +1849,56 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
-
+		Path	   *isubpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 		/*
 		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
+		 * first subpath. This may be overwritten below if the initial path
+		 * requires a sort.
 		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		apath->path.startup_cost = isubpath->startup_cost;
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
+		/*
+		 * Compute rows and costs as sums of subplan rows and costs, taking
+		 * into account the cost of any sorts which may be required on
+		 * subplans.
+		 */
 		foreach(l, apath->subpaths)
 		{
 			Path	   *subpath = (Path *) lfirst(l);
 
 			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+
+			if (pathkeys != NIL &&
+				!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				/*
+				 * We'll need to insert a Sort node, so include the cost for that
+				 */
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  subpath->total_cost,
+						  subpath->parent->tuples,
+						  subpath->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  apath->limit_tuples);
+				apath->path.total_cost += sort_path.total_cost;
+
+				/*
+				 * When the first subpath needs sorting, set the startup cost
+				 * of the sort as the startup cost of the Append
+				 */
+				if (subpath == isubpath)
+					apath->path.startup_cost = sort_path.startup_cost;
+			}
+			else
+			{
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1871,6 +1906,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index d3d21fed5d..2f9fc50bf2 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1231,7 +1231,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/* Set or update cheapest_total_path and related fields */
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ec66cb9c3c..d8268d6e7d 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -25,6 +25,7 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
@@ -547,6 +548,91 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Note that for partitions that don't have a
+ *	  natural ordering, or whose ordering is too hard to prove, we return NIL.)
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+	PartitionScheme		partscheme;
+	List	   *retval;
+	int			i;
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow this when a
+		 * DEFAULT partition exists, as it could contain tuples from either
+		 * below or above the defined range, or tuples belonging to gaps in the
+		 * defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return NIL;
+			break;
+
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's greater than one previously already seen, then
+		 * partition that's greater than one already seen, then
+		 * optimization.  For now, let's just keep it simple and just accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return NIL;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return NIL;
+			break;
+
+		default:
+			return NIL;
+	}
+
+	retval = NIL;
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											  ScanDirectionIsBackward(scandir),
+											  ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		retval = lappend(retval, cpathkey);
+	}
+
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ae46b0140e..a276b7f3b1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -201,8 +201,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1025,12 +1023,24 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	int			nodenumsortkeys;
+	AttrNumber *nodeSortColIdx;
+	Oid		   *nodeSortOperators;
+	Oid		   *nodeCollations;
+	bool	   *nodeNullsFirst;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1056,6 +1066,23 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1065,6 +1092,40 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1107,10 +1168,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5340,23 +5402,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c729a99f8b..78b834032d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3899,6 +3899,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
@@ -6878,7 +6879,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * node, which would cause this relation to stop appearing to be a
 		 * dummy rel.)
 		 */
-		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL,
+		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL, NIL,
 													  NULL, 0, false, NIL,
 													  -1));
 		rel->partial_pathlist = NIL;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..6d4657a4c1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -656,7 +656,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -711,7 +711,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -822,7 +822,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d50d86b252..df26297169 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1219,7 +1219,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1253,7 +1253,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1263,10 +1263,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1274,6 +1278,16 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && pathkeys != NULL &&
+		bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1287,7 +1301,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3587,7 +3601,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..deb205c44f 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -937,6 +937,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -950,7 +952,13 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
 										   val1, val2));
 }
 
-/* Used when sorting range bounds across all range partitions */
+/*
+ * qsort_partition_rbound_cmp
+ *
+ * Used when sorting range bounds across all range partitions
+ *
+ * Note: If changing this, see build_partition_pathkeys()
+ */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
 {
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 88d37236f7..5a60fb860d 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1321,6 +1321,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_PATH(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..7cb5644dd3 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -110,7 +110,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 81abcf53a8..5a790cf6be 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -65,7 +65,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index cafde307ad..ee958a0f07 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -201,6 +201,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 4f29d9f891..6edef9509e 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2032,6 +2032,158 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Append
+   ->  Sort
+         Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+         ->  Seq Scan on mcrparted0
+               Filter: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(11 rows)
+
+reset enable_seqscan;
 drop table mcrparted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
@@ -2045,17 +2197,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 24313e8c78..d7c268c5af 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3013,14 +3013,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3067,17 +3067,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3090,13 +3088,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3109,12 +3106,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3123,23 +3119,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index a6e541d4da..0b9c784798 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -721,6 +721,70 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+reset enable_seqscan;
 drop table mcrparted;
 
 -- check that partitioned table Appends cope with being referenced in
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index eca1a7c5ac..a834afd572 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -740,15 +740,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -769,7 +769,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: David Rowley (#6)

Re: Ordered Partitioned Table Scans

On 29 October 2018 at 13:44, David Rowley <david.rowley@2ndquadrant.com> wrote:

v2 of the patch is attached. I've not had time yet to give it a
thorough post-write review, but on first look it seems okay.

Added to the November 'fest.

https://commitfest.postgresql.org/20/1850/

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Julien Rouhaud

rjuju123@gmail.com

about 7 years ago

In reply to: David Rowley (#6)

Re: Ordered Partitioned Table Scans

On Mon, Oct 29, 2018 at 1:44 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:

I just had a look at your patch. I see that you implemented only a
subset of the possible optimizations (only the case for range
partitioning without subpartitions). This has been previously
discussed, but we should be able to do similar optimization for list
partitioning if there are no interleaved values, and also for some cases
of multi-level partitioning.

I had thought about these cases but originally had thought they would
be more complex to implement than I could justify. On review, I've
found some pretty cheap ways to handle both sub-partitions and
LIST partitioned tables. Currently, with LIST partitioned tables I've
coded it to only allow the optimisation if there's no DEFAULT
partition and all partitions are defined with exactly 1 Datum. This
guarantees that there are no interleaved values, but it'll just fail
to optimise cases like FOR VALUES IN(1,2) + FOR VALUES IN(3,4). The
reason that I didn't go to the trouble of the additional checks was
that I don't really want to add any per-partition overhead to this.

I see, but the overhead you mention is because you're doing that check
during the planning in build_partition_pathkeys(). As advised by
Robert quite some time ago
(/messages/by-id/CA+TgmobOWgT1=zyjx-q=7s8akXNODix46qG0_-YX7K369P6ADA@mail.gmail.com),
we can store that information when the PartitionDesc is built, so
that it wouldn't be problematic. Since checking for overlapping
values is straightforward with the BoundInfoData infrastructure, it'd
be a pity to miss this optimization in such cases, which I believe
would not be rare.
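
To make the "no interleaved values" condition concrete, here is a minimal
sketch in C. It assumes a simplified, hypothetical view of the list bounds:
an array part_index[] in which entry i names the partition that accepts the
i-th accepted value when all values are taken in ascending order. The array
layout and the function name are inventions for this illustration and are
not the BoundInfoData structures or any PostgreSQL function. The bounds are
non-interleaved exactly when each partition owns one contiguous run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: part_index[i] is the partition that accepts the i-th
 * list value, with values taken in ascending order.  Returns true when no
 * partition's values are interleaved with another partition's values.
 * Sketch limit (an assumption of this example): at most 64 partitions.
 */
bool
list_bounds_are_ordered(const int *part_index, size_t ndatums, int nparts)
{
	bool		seen[64] = {false};
	size_t		i;

	assert(nparts > 0 && nparts <= 64);

	for (i = 0; i < ndatums; i++)
	{
		int			p = part_index[i];

		assert(p >= 0 && p < nparts);

		/* Still inside the same partition's run of values: fine. */
		if (i > 0 && part_index[i - 1] == p)
			continue;

		/* A partition that reappears after another one is interleaved. */
		if (seen[p])
			return false;
		seen[p] = true;
	}

	return true;
}
```

With FOR VALUES IN(1,2) + FOR VALUES IN(3,4) the array would be {0, 0, 1, 1}
and the check passes; with IN(3,5) + IN(4) it would be {0, 1, 0} and the
check fails. The loop is linear in the number of list values, which is why
running it once when the bounds are built, rather than on every plan, is
attractive.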

If RelOptInfo had a Bitmapset of live partitions then we could just
check the partitions that survived pruning. Amit Langote has a
pending patch which does that and some other useful stuff, so maybe we
can delay fixing that until the dust settles a bit in that area. Amit
and I are both working hard to remove all these per-partition
overheads. I imagine he'd also not be in favour of adding code that
does something for all partitions when we've pruned down to just 1.
I've personally no objection to doing the required additional
processing for the non-pruned partitions only. We could also then fix
the case where we disable the optimisation if there's a DEFAULT
partition, regardless of whether it's been pruned or not.

Those are quite worthwhile enhancements, and being able to avoid a
MergeAppend if the problematic partitions have been pruned would be
great! I didn't follow thoroughly all the discussions about the
various optimizations Amit and you are working on, but I don't think it
would be incompatible with a new flag and the possibility to have the
sorted append with multi-valued list partitions?

Concerning the implementation, there's at least one issue: it assumes
that each subpath of a range-partitioned table will be ordered, which
is not guaranteed. You need to generate explicit Sort nodes
(in the same order as the query_pathkey) for partitions that don't
have an ordered path and make sure that this path is used in the
Append. Here's a simplistic case showing the issue (sorry, the
partition names are poorly chosen):

Thanks for noticing this. I had been thrown off due to the fact that
Paths are never actually created for these sorts. On looking further I
see that we do checks during createplan to see if the path is
suitably sorted and just create a sort node if it's not. This seems
to go against the whole point of paths, but I'm not going to fight for
changing it, so I've just done the Append the same way as MergeAppend
handles it.

Yes, I had quite the same reaction when I saw how MergeAppend handles it.
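
As an illustration of the approach described above, here is a rough C sketch
of how an ordered Append could be costed when some children need an explicit
Sort. It loosely follows the accounting in the patch's cost_append() change,
but the SubPath and AppendCost structs, the sort_cost field, and the function
name are all hypothetical simplifications. In particular, the real code calls
cost_sort(); here the cost of sorting a child's output is just a number
supplied by the caller, and the sort's per-tuple output cost is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified child path (not the PostgreSQL Path struct). */
typedef struct SubPath
{
	double		startup_cost;
	double		total_cost;
	double		rows;
	bool		sorted;		/* already delivers the required order? */
	double		sort_cost;	/* assumed cost of sorting this child's output */
} SubPath;

typedef struct AppendCost
{
	double		startup_cost;
	double		total_cost;
	double		rows;
} AppendCost;

AppendCost
cost_ordered_append(const SubPath *subpaths, size_t nsubpaths)
{
	AppendCost	result = {0.0, 0.0, 0.0};
	size_t		i;

	assert(subpaths != NULL && nsubpaths > 0);

	for (i = 0; i < nsubpaths; i++)
	{
		const SubPath *sp = &subpaths[i];
		double		startup = sp->startup_cost;
		double		total = sp->total_cost;

		if (!sp->sorted)
		{
			/* A Sort must read all of its input before emitting a row. */
			startup = sp->total_cost + sp->sort_cost;
			total = startup;
		}

		/* The Append can start as soon as its first child can. */
		if (i == 0)
			result.startup_cost = startup;

		result.total_cost += total;
		result.rows += sp->rows;
	}

	return result;
}
```

The point of the sketch is the startup cost: when the first child needs a
Sort, the Append inherits that Sort's startup cost, so an unsorted first
partition makes the whole ordered Append slow to return its first row.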

Also, if a LIMIT is specified, it should be able to give better
estimates, at least if there's no qual. For instance:

EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
                                   QUERY PLAN
------------------------------------------------------------------------------------------------------->
 Limit  (cost=0.00..0.35 rows=10 width=15)
   ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
         ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
         ->  Index Scan using simple_1_2_id_idx on simple_1_2  (cost=0.29..3148.28 rows=99999 width=14)
         ->  Index Scan using simple_2_3_id_idx on simple_2_3  (cost=0.29..3148.29 rows=100000 width=16)
(5 rows)

In this case, we should estimate that the SeqScan (or in a corrected
version the Sort) node should not return more than 10 rows, and each
following partition should not be scanned at all, and cost each path
accordingly. I think that this is quite important, for instance to
make sure that natively sorted Append is chosen over a MergeAppend
when there are some subpaths with explicit sorts, because with the
Append we probably won't have to execute all the sorts if the previous
partition scans returned enough rows.

In my patch, I'm not adding any additional paths. I'm just adding an
Append instead of a MergeAppend. For what you're talking about the
limit only needs to be passed into any underlying Sort so that it can
become a top-N sort. This is handled already in create_limit_path().
Notice in the plan you pasted above that the limit has a lower total
cost than its Append subnode. That's because create_limit_path()
weighted the Limit total cost based on the row count of the limit and
its subpath. ... 7705.56 / 219999 * 10 = ~0.35.
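
The arithmetic above can be reproduced with a small C sketch. The function
name and signature are hypothetical and this is not the code in
create_limit_path(); it only shows the linear scaling of the subpath's run
cost by the fraction of rows the Limit is expected to fetch.

```c
#include <assert.h>

/*
 * Hypothetical helper: scale a subpath's run cost by the fraction of its
 * rows that a LIMIT is expected to fetch.
 */
double
limit_total_cost(double sub_startup, double sub_total,
				 double sub_rows, double limit_rows)
{
	assert(limit_rows >= 0.0);

	/* Nothing to scale if the limit does not cut the output short. */
	if (sub_rows <= 0.0 || limit_rows >= sub_rows)
		return sub_total;

	return sub_startup +
		(sub_total - sub_startup) * (limit_rows / sub_rows);
}
```

Calling limit_total_cost(0.00, 7705.56, 219999, 10) gives roughly 0.35,
matching the Limit node in the plan. Note that this scaling treats every row
of the Append as equally expensive to produce, which is why the cost of the
first child matters so much for the estimate.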

Yes. But the cost of the first partition in this example is wrong
since there was no additional sort on top of the seq scan.

However, I now realize that, as you said, what your patch does is to
generate an Append *instead* of a MergeAppend if the optimization was
possible. So there can't be the problem of a MergeAppend chosen over
a cheaper Append in some cases, sorry for the noise. I totally missed
that because when I worked on the same topic last year we had to
generate both Append and MergeAppend. At that time Append was not
parallel-aware yet, so there could be faster parallel MergeAppend in
some cases.

FWIW, both those cases were handled (probably with some bugs though)
in the previous patches Ronan and I sent some time ago. Also, I did
not forget about this feature, I planned to work on it in hope to have
it included in pg12. However, I won't have a lot of time to work on
it before December.

I apologise for not noticing your patch. I only went as far as
checking the November commitfest to see if anything existed already
and I found nothing there.

No worries, it's more than a year old now (I'm quite ashamed I didn't
come back to this sooner).

I have time to work on this now, so likely
it's better if I continue, just in case your time in December does not
materialise.

I entirely agree.

v2 of the patch is attached. I've not had time yet to give it a
thorough post-write review, but on first look it seems okay.

I've registered as a reviewer. I haven't had a deep look at
the patch yet, but thanks a lot for working on it!

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: Julien Rouhaud (#8)

Re: Ordered Partitioned Table Scans

On 31 October 2018 at 12:24, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Mon, Oct 29, 2018 at 1:44 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:

I just had a look at your patch. I see that you implemented only a
subset of the possible optimizations (only the case for range
partitioning without subpartitions). This has been previously
discussed, but we should be able to do similar optimization for list
partitioning if there are no interleaved values, and also for some cases
of multi-level partitioning.

I had thought about these cases but originally had thought they would
be more complex to implement than I could justify. On review, I've
found some pretty cheap ways to handle both sub-partitions and
LIST partitioned tables. Currently, with LIST partitioned tables I've
coded it to only allow the optimisation if there's no DEFAULT
partition and all partitions are defined with exactly 1 Datum. This
guarantees that there are no interleaved values, but it'll just fail
to optimise cases like FOR VALUES IN(1,2) + FOR VALUES IN(3,4). The
reason that I didn't go to the trouble of the additional checks was
that I don't really want to add any per-partition overhead to this.

I see, but the overhead you mention is because you're doing that check
during the planning in build_partition_pathkeys(). As advised by
Robert quite some time ago
(/messages/by-id/CA+TgmobOWgT1=zyjx-q=7s8akXNODix46qG0_-YX7K369P6ADA@mail.gmail.com),
we can store that information when the PartitionDesc is built, so
that it wouldn't be problematic. Since checking for overlapping
values is straightforward with the BoundInfoData infrastructure, it'd
be a pity to miss this optimization in such cases, which I believe
would not be rare.

Thanks for looking at this again.

I retrospectively read that thread after Amit mentioned your
patch. I just disagree with Robert about caching this flag. The
reason is that if the flag is false due to some problematic partitions
and we go on to prune those, then we needlessly fail to optimise that
case. I propose we come back and do the remaining optimisations with
interleaved LIST partitions and partitioned tables with DEFAULT
partitions later, once we have a new "live_parts" field in
RelOptInfo. That way we can just check the live parts to ensure
they're compatible with the optimization. If we get what's done
already in then we're already a big step forward.

[...]

v2 of the patch is attached. I've not had time yet to give it a
thorough post-write review, but on first look it seems okay.

I've registered as a reviewer. I haven't had a deep look at
the patch yet, but thanks a lot for working on it!

Thanks for signing up to review. I need to send another revision of
the patch to add a missing call to truncate_useless_pathkeys(). Will
try to do that today.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: David Rowley (#9)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On 31 October 2018 at 13:05, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:

I've registered as a reviewer. I haven't had a deep look at
the patch yet, but thanks a lot for working on it!

Thanks for signing up to review. I need to send another revision of
the patch to add a missing call to truncate_useless_pathkeys(). Will
try to do that today.

I've attached a patch that removes the redundant pathkeys. This allows
cases like the following to work:

explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
                         QUERY PLAN
-------------------------------------------------------------
 Append
   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
         Index Cond: (a = 10)
   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
         Index Cond: (a = 10)
(5 rows)

One thing that could work but currently does not is the case where
LIST partitions allow just a single value: there we could allow the
Append to have pathkeys even if there are no indexes. One way to do
this would be to add PathKeys to the seqscan path on the partition for
supporting partitions. However, that's adding code in another area so
it likely should be another patch.

This could allow cases like:

create table bool_rp (b bool) partition by list(b);
create table bool_rp_true partition of bool_rp for values in(true);
create table bool_rp_false partition of bool_rp for values in(false);
explain (costs off) select * from bool_rp order by b;
                            QUERY PLAN
------------------------------------------------------------------
 Append
   ->  Seq Scan on bool_rp_false
   ->  Seq Scan on bool_rp_true
(3 rows)

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v3-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (application/octet-stream)

From 0581029e7ac5baa7a6122d3664e2da36b76bf060 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v3] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no default partition the subpaths of a
MergeAppend are always arranged in range order. This means that
MergeAppend, when sorting by the partition key or a superset of the
partition key, will always output tuples from earlier subpaths before later
subpaths.  LIST partitioned tables provide the same guarantee if they also
don't have a default partition, providing that none of the partitions are
defined to allow Datums with values which are interleaved with other
partitions.  For simplicity and speed of checking we currently just
disallow the optimization if any partition allows more than one Datum.
We may want to expand this later, but for now, it's a very cheap check to
implement.  A more thorough check would require performing analysis on the
partition bound.
---
 src/backend/nodes/outfuncs.c                  |   1 +
 src/backend/optimizer/path/allpaths.c         | 143 ++++++++++++++++++---
 src/backend/optimizer/path/costsize.c         |  51 ++++++--
 src/backend/optimizer/path/joinrels.c         |   2 +-
 src/backend/optimizer/path/pathkeys.c         |  87 +++++++++++++
 src/backend/optimizer/plan/createplan.c       |  91 ++++++++++----
 src/backend/optimizer/plan/planner.c          |   3 +-
 src/backend/optimizer/prep/prepunion.c        |   6 +-
 src/backend/optimizer/util/pathnode.c         |  24 +++-
 src/backend/utils/cache/partcache.c           |  10 +-
 src/include/nodes/relation.h                  |   1 +
 src/include/optimizer/cost.h                  |   2 +-
 src/include/optimizer/pathnode.h              |   2 +-
 src/include/optimizer/paths.h                 |   2 +
 src/test/regress/expected/inherit.out         | 174 +++++++++++++++++++++++++-
 src/test/regress/expected/partition_prune.out |  64 +++++-----
 src/test/regress/sql/inherit.sql              |  71 +++++++++++
 src/test/regress/sql/partition_prune.sql      |  10 +-
 18 files changed, 643 insertions(+), 101 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 69731ccdea..5597dc6154 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1939,6 +1939,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5f74d3b36d..29fdd4b190 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -104,6 +104,7 @@ static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
 static void set_function_pathlist(PlannerInfo *root, RelOptInfo *rel,
@@ -1597,7 +1598,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1639,7 +1640,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1689,7 +1690,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
@@ -1751,7 +1752,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
@@ -1786,6 +1787,23 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 						   List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection);
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1794,6 +1812,20 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys match the partition order, or reverse
+		 * partition order.  It can't match both, so only go to the trouble of
+		 * checking the reverse order when it's not in ascending partition
+		 * order.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys);
+		partition_order_desc = !partition_order &&
+								pathkeys_contained_in(pathkeys,
+													partition_pathkeys_desc);
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1836,26 +1868,81 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * When in partition order or descending partition order don't
+			 * flatten any sub-partition's paths unless they're an Append or
+			 * MergeAppend with a single subpath.  For the descending order
+			 * case we build the path list in reverse so that the Append scan
+			 * correctly scans the partitions in reverse order.
+			 */
+			if (partition_order)
+			{
+				/* Do the Append/MergeAppend flattening, when possible */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build a simple Append path if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1996,6 +2083,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of a Append or MergeAppend or
+ *		return 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *)path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *)path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -2016,7 +2131,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->pathlist = NIL;
 	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/*
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7bf67a0529..e616bc91a4 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1837,7 +1837,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1849,21 +1849,56 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
-
+		Path	   *isubpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 		/*
 		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
+		 * first subpath. This may be overwritten below if the initial path
+		 * requires a sort.
 		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		apath->path.startup_cost = isubpath->startup_cost;
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
+		/*
+		 * Compute rows and costs as sums of subplan rows and costs taking
+		 * into account the cost of any sorts which may be required on
+		 * subplans.
+		 */
 		foreach(l, apath->subpaths)
 		{
 			Path	   *subpath = (Path *) lfirst(l);
 
 			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+
+			if (pathkeys != NIL &&
+				!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				/*
+				 * We'll need to insert a Sort node, so include cost for that.
+				 */
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  subpath->total_cost,
+						  subpath->parent->tuples,
+						  subpath->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  apath->limit_tuples);
+				apath->path.total_cost += sort_path.total_cost;
+
+				/*
+				 * When the first subpath needs to be sorted, use the startup
+				 * cost of the sort as the startup cost of the Append.
+				 */
+				if (subpath == isubpath)
+					apath->path.startup_cost = sort_path.startup_cost;
+			}
+			else
+			{
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1871,6 +1906,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index d3d21fed5d..2f9fc50bf2 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1231,7 +1231,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/* Set or update cheapest_total_path and related fields */
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ec66cb9c3c..8c2a5b4f48 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -25,6 +25,7 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
@@ -547,6 +548,92 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Note that if the partitions have no natural
+ *	  ordering, or the ordering is too hard to prove, we return NIL.)
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+	PartitionScheme		partscheme;
+	List	   *retval;
+	int			i;
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow this when a
+		 * DEFAULT partition exists, as it could contain tuples from either
+		 * below or above the defined range, or tuples belonging to gaps in
+		 * the defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return NIL;
+			break;
+
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's greater than one previously already seen, then
+		 * values could become out of order and we'd have to disable the
+		 * optimization.  For now, let's just keep it simple and just accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return NIL;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return NIL;
+			break;
+
+		default:
+			return NIL;
+	}
+
+	retval = NIL;
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											  ScanDirectionIsBackward(scandir),
+											  ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		if (cpathkey != NULL && !pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ae46b0140e..a276b7f3b1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -201,8 +201,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1025,12 +1023,24 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	int			nodenumsortkeys;
+	AttrNumber *nodeSortColIdx;
+	Oid		   *nodeSortOperators;
+	Oid		   *nodeCollations;
+	bool	   *nodeNullsFirst;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1056,6 +1066,23 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1065,6 +1092,40 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1107,10 +1168,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5340,23 +5402,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c729a99f8b..78b834032d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3899,6 +3899,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
@@ -6878,7 +6879,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * node, which would cause this relation to stop appearing to be a
 		 * dummy rel.)
 		 */
-		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL,
+		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL, NIL,
 													  NULL, 0, false, NIL,
 													  -1));
 		rel->partial_pathlist = NIL;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..6d4657a4c1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -656,7 +656,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -711,7 +711,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -822,7 +822,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d50d86b252..df26297169 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1219,7 +1219,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1253,7 +1253,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1263,10 +1263,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1274,6 +1278,16 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && pathkeys != NIL &&
+		bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1287,7 +1301,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3587,7 +3601,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..deb205c44f 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -937,6 +937,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -950,7 +952,13 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
 										   val1, val2));
 }
 
-/* Used when sorting range bounds across all range partitions */
+/*
+ * qsort_partition_rbound_cmp
+ *
+ * Used when sorting range bounds across all range partitions
+ *
+ * Note: If changing this, see build_partition_pathkeys()
+ */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
 {
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 88d37236f7..5a60fb860d 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1321,6 +1321,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_PATH(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..7cb5644dd3 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -110,7 +110,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 81abcf53a8..5a790cf6be 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -65,7 +65,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index cafde307ad..ee958a0f07 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -201,6 +201,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 4f29d9f891..ec2948bee7 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2032,6 +2032,172 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Append
+   ->  Sort
+         Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+         ->  Seq Scan on mcrparted0
+               Filter: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(11 rows)
+
+reset enable_seqscan;
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
@@ -2045,17 +2211,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 24313e8c78..d7c268c5af 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3013,14 +3013,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3067,17 +3067,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3090,13 +3088,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3109,12 +3106,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3123,23 +3119,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index a6e541d4da..0ef312fa29 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -721,6 +721,77 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+reset enable_seqscan;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
 -- check that partitioned table Appends cope with being referenced in
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index eca1a7c5ac..a834afd572 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -740,15 +740,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -769,7 +769,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1

#11

Antonin Houska

ah@cybertec.at

about 7 years ago

In reply to: David Rowley (#10)

Re: Ordered Partitioned Table Scans

David Rowley <david.rowley@2ndquadrant.com> wrote:

On 31 October 2018 at 13:05, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 28 October 2018 at 03:49, Julien Rouhaud <rjuju123@gmail.com> wrote:

I've registered as a reviewer. I haven't had a deep look at
the patch yet, but thanks a lot for working on it!

Thanks for signing up to review. I need to send another revision of
the patch to add a missing call to truncate_useless_pathkeys(). Will
try to do that today.

I've attached a patch that ...

I picked this one while looking around for something I could review.

* As for the logic, I found generate_mergeappend_paths() to be the most
interesting part:

Imagine a table partitioned by "i", so "partition_pathkeys" is {i}.

partition 1:

i | j
--+--
0 | 0
1 | 1
0 | 1
1 | 0

partition 2:

i | j
--+--
3 | 0
2 | 0
2 | 1
3 | 1

Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the
ordering of the subpaths should not change the way tuples are split into
partitions.

Obviously there is a problem if the "partition_pathkeys" and "pathkeys" lists
start with different items. To propose a more generic rule, I used this example
of a range-partitioned table, where "i" and "j" are the partitioning keys:

partition 1:

i | j | k
---+---+---
0 | 0 | 1
0 | 0 | 0

partition 2:

i | j | k
---+---+---
0 | 1 | 0
0 | 1 | 1

If the output "pathkey" is {i, k}, then the Append path makes rows of both
partitions interleave:

i | j | k
---+---+---
0 | 0 | 0
0 | 1 | 0
0 | 0 | 1
0 | 1 | 1

So in general I think the restriction is that no valid position of "pathkeys"
and "partition_pathkeys" may differ. Or in other words: the shorter of the 2
pathkey lists must be contained in the longer one. Does it make sense to you?
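
A minimal sketch of the rule proposed above, assuming pathkey lists are
modelled as plain arrays of column names (a simplification: real pathkeys also
carry operator family, direction and NULLS ordering). The names
keys_share_prefix() and sorted_by_i_k() are hypothetical and do not exist in
PostgreSQL. The second function checks whether rows read partition by
partition come out ordered by (i, k), which is the interleaving case described
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical row type matching the i | j | k example tables. */
typedef struct Row
{
	int			i;
	int			j;
	int			k;
} Row;

/*
 * Return true when the two key lists agree at every position that exists in
 * both of them, i.e. the shorter list is a prefix of the longer one.
 */
static bool
keys_share_prefix(const char *const *a, size_t na,
				  const char *const *b, size_t nb)
{
	size_t		n = (na < nb) ? na : nb;

	for (size_t pos = 0; pos < n; pos++)
	{
		if (strcmp(a[pos], b[pos]) != 0)
			return false;
	}
	return true;
}

/* Return true when the rows are in ascending (i, k) order. */
static bool
sorted_by_i_k(const Row *rows, size_t nrows)
{
	assert(rows != NULL || nrows == 0);

	for (size_t r = 1; r < nrows; r++)
	{
		const Row  *prev = &rows[r - 1];
		const Row  *cur = &rows[r];

		if (prev->i > cur->i ||
			(prev->i == cur->i && prev->k > cur->k))
			return false;
	}
	return true;
}
```

With partition keys {i, j}, an ordering of {i} passes the prefix check while
{i, k} does not. Feeding sorted_by_i_k() the rows of partition 1 followed by
the rows of partition 2, each sorted by (i, k), reports that the concatenated
result is not in (i, k) order.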

Another problem I see is that build_partition_pathkeys() continues even if it
fails to create a pathkey for some partitioning column. In the example above
it would mean that the table can have "partition_pathkeys" equal to {j}
instead of {i, j} on some EC-related conditions. However such a key does not
correspond to reality - this is easier to imagine if another partition is
considered.

partition 3:

i | j | k
---+---+---
1 | 0 | 1
1 | 0 | 0

So I think no "partition_pathkeys" should be generated in that case. On the
other hand, if the function returned the part of the list it could construct
so far, it'd be wrong because such incomplete pathkeys could pass the checks I
proposed above for reasons unrelated to the partitioning scheme.

The following comments are mostly on coding:

* Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have
this sentence in the header comment:

Note: If changing this, see build_partition_pathkeys()

However, I could not find any use other than the one in
RelationBuildPartitionDesc().

* create_append_path():

/*
* Apply query-wide LIMIT if known and path is for sole base relation.
* (Handling this at this low level is a bit klugy.)
*/
if (root != NULL && pathkeys != NULL &&
bms_equal(rel->relids, root->all_baserels))
pathnode->limit_tuples = root->limit_tuples;
else
pathnode->limit_tuples = -1.0;

I think this optimization is not specific to AppendPath / MergeAppendPath,
so it could be moved elsewhere (as a separate patch of course). But
specifically for AppendPath, why do we have to test pathkeys? The pathkeys
of the AppendPath do not necessarily describe the order of the set to which
LIMIT is applied, so their existence should not be important here.

* If pathkeys is passed, shouldn't create_append_path() include the
cost_sort() of subpaths just like create_merge_append_path() does? And if
so, then create_append_path() and create_merge_append_path() might
eventually have some common code (at least for the subpath processing) to be
put into a separate function.

* Likewise, create_append_plan() / create_merge_append_plan() are going to be
more similar than before, so some refactoring could also make sense.

Although it's not too much code, I admit the patch is not trivial, so I'm
curious about your opinion.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com

#12

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: Antonin Houska (#11)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On 1 November 2018 at 04:01, Antonin Houska <ah@cybertec.at> wrote:

* As for the logic, I found generate_mergeappend_paths() to be the most
interesting part:

Imagine a table partitioned by "i", so "partition_pathkeys" is {i}.

partition 1:

i | j
--+--
0 | 0
1 | 1
0 | 1
1 | 0

partition 2:

i | j
--+--
3 | 0
2 | 0
2 | 1
3 | 1

Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the
ordering of the subpaths should not change the way tuples are split into
partitions.

Obviously there is a problem if the "partition_pathkeys" and "pathkeys" lists
start with different items. To propose a more generic rule, I used this example
of a range-partitioned table, where "i" and "j" are the partitioning keys:

partition 1:

i | j | k
---+---+---
0 | 0 | 1
0 | 0 | 0

partition 2:

i | j | k
---+---+---
0 | 1 | 0
0 | 1 | 1

If the output "pathkey" is {i, k}, then the Append path makes rows of both
partitions interleave:

i | j | k
---+---+---
0 | 0 | 0
0 | 1 | 0
0 | 0 | 1
0 | 1 | 1

So in general I think the restriction is that no valid position of "pathkeys"
and "partition_pathkeys" may differ. Or in other words: the shorter of the 2
pathkey lists must be contained in the longer one. Does it make sense to you?

I understand what you're saying. I just don't understand what you
think is wrong with the patch in this area.

Another problem I see is that build_partition_pathkeys() continues even if it
fails to create a pathkey for some partitioning column. In the example above
it would mean that the table can have "partition_pathkeys" equal to {j}
instead of {i, j} on some EC-related conditions. However such a key does not
correspond to reality - this is easier to imagine if another partition is
considered.

partition 3:

i | j | k
---+---+---
1 | 0 | 1
1 | 0 | 0

So I think no "partition_pathkeys" should be generated in that case. On the
other hand, if the function returned the part of the list it could construct
so far, it'd be wrong because such incomplete pathkeys could pass the checks I
proposed above for reasons unrelated to the partitioning scheme.

Oops. That's a mistake. We should return what we have so far if we
can't make one of the pathkeys. Will fix.
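
A minimal sketch of the "return what we have so far" behaviour, assuming the
outcome of building each partition key's pathkey is already known as a
boolean. The function usable_key_prefix() and its can_build array are
hypothetical, used only for illustration; they are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return how many leading partition keys can be turned into pathkeys.
 * Scanning stops at the first key that cannot be built, so a later key is
 * never kept on its own (with keys {i, j}, failing on "i" yields an empty
 * list rather than {j}).
 */
static size_t
usable_key_prefix(const bool *can_build, size_t nkeys)
{
	size_t		n = 0;

	assert(can_build != NULL || nkeys == 0);

	while (n < nkeys && can_build[n])
		n++;

	return n;
}
```

For keys {i, j} where only "j" can be built, the result is 0 usable keys; where
only "i" can be built, the result is 1, giving the partial list {i}.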

The following comments are mostly on coding:

* Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have
this sentence in the header comment:

Note: If changing this, see build_partition_pathkeys()

However I could not find other use besides that in
RelationBuildPartitionDesc().

While the new code does not call those directly, the new code does
depend on the sort order of the partitions inside the PartitionDesc,
which those functions are responsible for. Perhaps there's a better
way to communicate that.

Actually, I think the partition checking code I added to pathkeys.c
does not belong there. Those checks should probably live with the other
partitioning code in the form of a bool-returning function. I'll
change that now. It also means we don't have to work that out twice, as
I'm currently running it once for the forward and once for the backward
scan case. Currently the code is very simple, but if we start analysing
list partition bounds then it will become slower.

* create_append_path():

/*
* Apply query-wide LIMIT if known and path is for sole base relation.
* (Handling this at this low level is a bit klugy.)
*/
if (root != NULL && pathkeys != NULL &&
bms_equal(rel->relids, root->all_baserels))
pathnode->limit_tuples = root->limit_tuples;
else
pathnode->limit_tuples = -1.0;

I think this optimization is not specific to AppendPath / MergeAppendPath,
so it could be moved elsewhere (as a separate patch of course). But
specifically for AppendPath, why do we have to test pathkeys? The pathkeys
of the AppendPath do not necessarily describe the order of the set to which
LIMIT is applied, so their existence should not be important here.

The pathkeys != NULL could be removed. I was just trying to maintain
the status quo for Appends without pathkeys. In reality it currently
does not matter since that's only used as a parameter for cost_sort().
There'd be no reason previously to have a Sort path as a subpath in an
Append node since the order would be destroyed after the Append.
Perhaps we should just pass it through as one day it might be useful.
I just can't currently imagine why.

* If pathkeys is passed, shouldn't create_append_path() include the
cost_sort() of subpaths just like create_merge_append_path() does? And if
so, then create_append_path() and create_merge_append_path() might
eventually have some common code (at least for the subpath processing) to be
put into a separate function.

It does. It's just done via the call to cost_append().

* Likewise, create_append_plan() / create_merge_append_plan() are going to be
more similar than before, so some refactoring could also make sense.

Although it's not too much code, I admit the patch is not trivial, so I'm
curious about your opinion.

I think the costing code is sufficiently different to warrant not
sharing more. For example, the startup costing is completely
different. Append can start on the startup cost of the first subpath,
but MergeAppend takes the sum of the startup costs of all subpaths.
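
The startup-cost difference can be sketched as follows. This is a simplified
illustration, not the planner's code: the SubpathCost struct and both function
names are hypothetical, and any extra per-tuple comparison overhead the real
cost functions add is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for a subpath's cost fields. */
typedef struct SubpathCost
{
	double		startup_cost;
	double		total_cost;
} SubpathCost;

/* An ordered Append can return rows once its first subpath has started. */
static double
append_startup_cost(const SubpathCost *subpaths, size_t n)
{
	assert(subpaths != NULL && n > 0);

	return subpaths[0].startup_cost;
}

/*
 * A MergeAppend must fetch the first tuple from every subpath before it can
 * emit anything, so the startup costs of all subpaths are summed.
 */
static double
merge_append_startup_cost(const SubpathCost *subpaths, size_t n)
{
	double		sum = 0.0;

	assert(subpaths != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		sum += subpaths[i].startup_cost;

	return sum;
}
```

With three subpaths whose startup costs are 0.5, 0.25 and 0.25, the Append
figure is 0.5 while the MergeAppend figure is 1.0. The gap grows with the
number of partitions, which matters most for queries with a small LIMIT.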

I've attached v4 of the patch. I think this addresses all that you
mentioned apart from the first one, due to not understanding the
problem.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v4-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (application/octet-stream)

From c44c8a94a8f91767e97ca90d073be939e33adcd2 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v4] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no default partition the subpaths of a
MergeAppend are always arranged in range order. This means that
MergeAppend, when sorting by the partition key or a superset of the
partition key, will always output tuples from earlier subpaths before later
subpaths.  LIST partitioned tables provide the same guarantee if they also
don't have a default partition, providing that none of the partitions are
defined to allow Datums with values which are interleaved with other
partitions.  For simplicity and speed of checking we currently just
disallow the optimization if any partition allows more than one Datum.
We may want to expand this later, but for now, it's a very cheap check to
implement.  A more thorough check would require performing analysis on the
partition bound.
---
 src/backend/nodes/outfuncs.c                  |   1 +
 src/backend/optimizer/path/allpaths.c         | 144 ++++++++++++++++++--
 src/backend/optimizer/path/costsize.c         |  51 ++++++-
 src/backend/optimizer/path/joinrels.c         |   2 +-
 src/backend/optimizer/path/pathkeys.c         |  54 ++++++++
 src/backend/optimizer/plan/createplan.c       |  91 +++++++++----
 src/backend/optimizer/plan/planner.c          |   3 +-
 src/backend/optimizer/prep/prepunion.c        |   6 +-
 src/backend/optimizer/util/pathnode.c         |  23 +++-
 src/backend/partitioning/partprune.c          |  59 ++++++++
 src/backend/utils/cache/partcache.c           |  10 +-
 src/include/nodes/relation.h                  |   1 +
 src/include/optimizer/cost.h                  |   2 +-
 src/include/optimizer/pathnode.h              |   2 +-
 src/include/optimizer/paths.h                 |   2 +
 src/include/partitioning/partprune.h          |   1 +
 src/test/regress/expected/inherit.out         | 188 +++++++++++++++++++++++++-
 src/test/regress/expected/partition_prune.out |  64 ++++-----
 src/test/regress/sql/inherit.sql              |  81 +++++++++++
 src/test/regress/sql/partition_prune.sql      |  10 +-
 20 files changed, 694 insertions(+), 101 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 69731ccdea..5597dc6154 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1939,6 +1939,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5f74d3b36d..6205828656 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -104,6 +104,7 @@ static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
 static void set_function_pathlist(PlannerInfo *root, RelOptInfo *rel,
@@ -1597,7 +1598,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1639,7 +1640,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1689,7 +1690,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
@@ -1751,7 +1752,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
@@ -1786,6 +1787,24 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 						   List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection);
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1794,6 +1813,20 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys match the partition order, or reverse
+		 * partition order.  It can't match both, so only go to the trouble of
+		 * checking the reverse order when it's not in ascending partition
+		 * order.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys);
+		partition_order_desc = !partition_order &&
+								pathkeys_contained_in(pathkeys,
+													partition_pathkeys_desc);
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1836,26 +1869,81 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * When in partition order or descending partition order don't
+			 * flatten any sub-partition's paths unless they're an Append or
+			 * MergeAppend with a single subpath.  For the descending order
+			 * case we build the path list in reverse so that the Append scan
+			 * correctly scans the partitions in reverse order.
+			 */
+			if (partition_order)
+			{
+				/* Do the Append/MergeAppend flattening, when possible */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build a simple Append path if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1996,6 +2084,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append or MergeAppend or
+ *		return 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *)path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *)path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -2016,7 +2132,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->pathlist = NIL;
 	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/*
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7bf67a0529..e616bc91a4 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1837,7 +1837,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1849,21 +1849,56 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
-
+		Path	   *isubpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 		/*
 		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
+		 * first subpath. This may be overwritten below if the initial path
+		 * requires a sort.
 		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		apath->path.startup_cost = isubpath->startup_cost;
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
+		/*
+		 * Compute rows and costs as sums of subplan rows and costs taking
+		 * into account the cost of any sorts which may be required on
+		 * subplans.
+		 */
 		foreach(l, apath->subpaths)
 		{
 			Path	   *subpath = (Path *) lfirst(l);
 
 			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+
+			if (pathkeys != NIL &&
+				!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				/*
+				 * We'll need to insert a Sort node, so include cost for that
+				 */
+				cost_sort(&sort_path,
+					root,
+					pathkeys,
+					subpath->total_cost,
+					subpath->parent->tuples,
+					subpath->pathtarget->width,
+					0.0,
+					work_mem,
+					apath->limit_tuples);
+				apath->path.total_cost += sort_path.total_cost;
+
+				/*
+				 * When the first subpath needs sorted, set the startup cost
+				 * of the sort as the startup cost of the Append
+				 */
+				if (subpath == isubpath)
+					apath->path.startup_cost = sort_path.startup_cost;
+			}
+			else
+			{
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1871,6 +1906,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index d3d21fed5d..2f9fc50bf2 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1231,7 +1231,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/* Set or update cheapest_total_path and related fields */
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ec66cb9c3c..f225541751 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -25,6 +25,7 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
@@ -547,6 +548,59 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples never will be found in a
+ *	  later partition.)
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir)
+{
+	PartitionScheme		partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											  ScanDirectionIsBackward(scandir),
+											  ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+			break;
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ae46b0140e..a276b7f3b1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -201,8 +201,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1025,12 +1023,24 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	int			nodenumsortkeys;
+	AttrNumber *nodeSortColIdx;
+	Oid		   *nodeSortOperators;
+	Oid		   *nodeCollations;
+	bool	   *nodeNullsFirst;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1056,6 +1066,23 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1065,6 +1092,40 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1107,10 +1168,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5340,23 +5402,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c729a99f8b..78b834032d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3899,6 +3899,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
@@ -6878,7 +6879,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * node, which would cause this relation to stop appearing to be a
 		 * dummy rel.)
 		 */
-		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL,
+		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL, NIL,
 													  NULL, 0, false, NIL,
 													  -1));
 		rel->partial_pathlist = NIL;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..6d4657a4c1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -656,7 +656,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -711,7 +711,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -822,7 +822,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d50d86b252..ca021fca8d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1219,7 +1219,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1253,7 +1253,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1263,10 +1263,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1274,6 +1278,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1287,7 +1300,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3587,7 +3600,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index d6ca03de4a..d21f3ebdb5 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -178,7 +178,66 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in an earlier partition.  Returns
+ *		false if this is not possible, or if we have insufficient means to
+ *		prove it.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow this when a
+		 * DEFAULT partition exists, as it could contain tuples from either
+		 * below or above the defined range, or tuples belonging to gaps in
+		 * the defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
+
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this by looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's greater than one previously seen, then values
+		 * could become out of order and we'd have to disable the
+		 * optimization.  For now, let's keep it simple and only accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
 
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return false;
+			break;
+
+		default:
+		return false;
+	}
+
+	return true;
+}
 /*
  * make_partition_pruneinfo
  *		Builds a PartitionPruneInfo which can be used in the executor to allow
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..deb205c44f 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -937,6 +937,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -950,7 +952,13 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
 										   val1, val2));
 }
 
-/* Used when sorting range bounds across all range partitions */
+/*
+ * qsort_partition_rbound_cmp
+ *
+ * Used when sorting range bounds across all range partitions
+ *
+ * Note: If changing this, see build_partition_pathkeys()
+ */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
 {
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 88d37236f7..5a60fb860d 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1321,6 +1321,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_PATH(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..7cb5644dd3 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -110,7 +110,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 81abcf53a8..5a790cf6be 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -65,7 +65,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index cafde307ad..ee958a0f07 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -201,6 +201,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index e07aaaf798..bc02b1bacb 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -71,6 +71,7 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(PlannerInfo *root,
 						 RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 4f29d9f891..a0ef0e18b3 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2032,7 +2032,187 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Append
+   ->  Sort
+         Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+         ->  Seq Scan on mcrparted0
+               Filter: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(11 rows)
+
+reset enable_seqscan;
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+explain (costs off) select * from bool_rp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_b_idx on bool_rp_false
+   ->  Index Only Scan using bool_rp_true_b_idx on bool_rp_true
+(3 rows)
+
+drop table bool_rp;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
@@ -2045,17 +2225,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 24313e8c78..d7c268c5af 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3013,14 +3013,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3067,17 +3067,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3090,13 +3088,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3109,12 +3106,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3123,23 +3119,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index a6e541d4da..a1416b2240 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -721,8 +721,89 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+reset enable_seqscan;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+
+explain (costs off) select * from bool_rp order by b;
+
+drop table bool_rp;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index eca1a7c5ac..a834afd572 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -740,15 +740,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -769,7 +769,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1

#13

Antonin Houska

ah@cybertec.at

about 7 years ago

In reply to: David Rowley (#12)

Re: Ordered Partitioned Table Scans

David Rowley <david.rowley@2ndquadrant.com> wrote:

On 1 November 2018 at 04:01, Antonin Houska <ah@cybertec.at> wrote:

* As for the logic, I found generate_mergeappend_paths() to be the most
interesting part:

Imagine table partitioned by "i", so "partition_pathkeys" is {i}.

partition 1:

i | j
--+--
0 | 0
1 | 1
0 | 1
1 | 0

partition 2:

i | j
--+--
3 | 0
2 | 0
2 | 1
3 | 1

Even if "pathkeys" is {i, j}, i.e. not contained in "partition_pathkeys", the
ordering of the subpaths should not change the way tuples are split into
partitions.

...

I understand what you're saying. I just don't understand what you
think is wrong with the patch in this area.

I think these conditions are too restrictive:

/*
* Determine if these pathkeys match the partition order, or reverse
* partition order. It can't match both, so only go to the trouble of
* checking the reverse order when it's not in ascending partition
* order.
*/
partition_order = pathkeys_contained_in(pathkeys,
partition_pathkeys);
partition_order_desc = !partition_order &&
pathkeys_contained_in(pathkeys,
partition_pathkeys_desc);

In the example above I wanted to show that your new feature should work even
if "pathkeys" is not contained in "partition_pathkeys".
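
The two example partitions can be used to check this claim outside the planner. The following is only a minimal sketch, not planner code: the `Row` type and the helper names are made up for illustration. It sorts each partition by (i, j) on its own and then concatenates the results in partition order, which is what an Append over sorted subpaths would produce.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical row type for the example table partitioned by "i". */
typedef struct Row
{
	int			i;
	int			j;
} Row;

/* qsort comparator that orders rows by (i, j), i.e. pathkeys {i, j}. */
static int
row_cmp(const void *a, const void *b)
{
	const Row  *ra = (const Row *) a;
	const Row  *rb = (const Row *) b;

	if (ra->i != rb->i)
		return (ra->i < rb->i) ? -1 : 1;
	if (ra->j != rb->j)
		return (ra->j < rb->j) ? -1 : 1;
	return 0;
}

/*
 * Sort each partition independently, then emit partition 1 followed by
 * partition 2.  No merge step is performed.  Returns the number of rows
 * written to 'out', which must have room for n1 + n2 rows.
 */
static size_t
append_sorted_partitions(Row *p1, size_t n1, Row *p2, size_t n2, Row *out)
{
	size_t		k;

	qsort(p1, n1, sizeof(Row), row_cmp);
	qsort(p2, n2, sizeof(Row), row_cmp);

	for (k = 0; k < n1; k++)
		out[k] = p1[k];
	for (k = 0; k < n2; k++)
		out[n1 + k] = p2[k];

	return n1 + n2;
}

/* Returns true if 'rows' is in non-decreasing (i, j) order. */
static bool
is_sorted_by_i_j(const Row *rows, size_t n)
{
	size_t		k;

	for (k = 1; k < n; k++)
	{
		if (row_cmp(&rows[k - 1], &rows[k]) > 0)
			return false;
	}
	return true;
}
```

With the rows of partition 1 and partition 2 from above, the concatenated output is ordered by (i, j), because every "i" in partition 1 is lower than every "i" in partition 2.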

Another problem I see is that build_partition_pathkeys() continues even if it
fails to create a pathkey for some partitioning column. In the example above
it would mean that the table can have "partition_pathkeys" equal to {j}
instead of {i, j} on some EC-related conditions. However such a key does not
correspond to reality - this is easier to imagine if another partition is
considered.

partition 3:

i | j | k
---+---+---
1 | 0 | 1
1 | 0 | 0

So I think no "partition_pathkeys" should be generated in that case. On the
other hand, if the function returned the part of the list it could construct
so far, it'd be wrong because such incomplete pathkeys could pass the checks I
proposed above for reasons unrelated to the partitioning scheme.

Oops. That's a mistake. We should return what we have so far if we
can't make one of the pathkeys. Will fix.

I think no "partition_pathkeys" should be created in this case, but before we
can discuss this in detail there needs to be an agreement on the evaluation of
the relationship between "pathkeys" and "partition_pathkeys", see above.

The following comments are mostly on coding:

* Both qsort_partition_list_value_cmp() and qsort_partition_rbound_cmp() have
this sentence in the header comment:

Note: If changing this, see build_partition_pathkeys()

However, I could not find any use of them other than in
RelationBuildPartitionDesc().

While the new code does not call those directly, the new code does
depend on the sort order of the partitions inside the PartitionDesc,
which those functions are responsible for. Perhaps there's a better
way to communicate that.

I pointed this out because I suspect that changes of these functions would
affect more features, not only the one you're trying to implement.

* If pathkeys is passed, shouldn't create_append_path() include the
cost_sort() of subpaths just like create_merge_append_path() does? And if
so, then create_append_path() and create_merge_append_path() might
eventually have some common code (at least for the subpath processing) to be
put into a separate function.

It does. It's just done via the call to cost_append().

ok, I missed that.

* Likewise, create_append_plan() / create_merge_append_plan() are going to be
more similar than before, so some refactoring could also make sense.

Although it's not too much code, I admit the patch is not trivial, so I'm
curious about your opinion.

I think the costing code is sufficiently different to warrant not
sharing more. For example, the startup costing is completely
different. Append can start on the startup cost of the first subpath,
but MergeAppend takes the sum of the startup cost of all subpaths.
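
As a rough illustration of that startup-cost difference, here is a simplified sketch. The `SubpathCost` struct and both function names are hypothetical, and the real costing in costsize.c charges more than this (MergeAppend, for example, also pays for setting up its heap), so this only shows the shape of the two rules.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for the cost fields of a Path. */
typedef struct SubpathCost
{
	double		startup_cost;
	double		total_cost;
} SubpathCost;

/*
 * Append can return its first tuple as soon as the first subpath can,
 * so only that subpath's startup cost counts.
 */
static double
append_startup_cost(const SubpathCost *subpaths, size_t n)
{
	return (n > 0) ? subpaths[0].startup_cost : 0.0;
}

/*
 * MergeAppend must fetch the first tuple from every subpath before it
 * can return anything, so the startup costs add up.
 */
static double
merge_append_startup_cost(const SubpathCost *subpaths, size_t n)
{
	double		sum = 0.0;
	size_t		k;

	for (k = 0; k < n; k++)
		sum += subpaths[k].startup_cost;

	return sum;
}
```

For three subpaths with startup costs 1.0, 2.0 and 4.0, the Append rule gives 1.0 while the MergeAppend rule gives 7.0, which is why a LIMIT query over many partitions can favour the Append.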

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com

#14

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: Antonin Houska (#13)

Re: Ordered Partitioned Table Scans

On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote:

I think these conditions are too restrictive:

/*
* Determine if these pathkeys match the partition order, or reverse
* partition order. It can't match both, so only go to the trouble of
* checking the reverse order when it's not in ascending partition
* order.
*/
partition_order = pathkeys_contained_in(pathkeys,
partition_pathkeys);
partition_order_desc = !partition_order &&
pathkeys_contained_in(pathkeys,
partition_pathkeys_desc);

In the example above I wanted to show that your new feature should work even
if "pathkeys" is not contained in "partition_pathkeys".

Okay, after a bit more time looking at this I see what you're saying
now, and I agree, but see below.

Another problem I see is that build_partition_pathkeys() continues even if it
fails to create a pathkey for some partitioning column. In the example above
it would mean that the table can have "partition_pathkeys" equal to {j}
instead of {i, j} on some EC-related conditions. However such a key does not
correspond to reality - this is easier to imagine if another partition is
considered.

partition 3:

i | j | k
---+---+---
1 | 0 | 1
1 | 0 | 0

So I think no "partition_pathkeys" should be generated in that case. On the
other hand, if the function returned the part of the list it could construct
so far, it'd be wrong because such incomplete pathkeys could pass the checks I
proposed above for reasons unrelated to the partitioning scheme.

Oops. That's a mistake. We should return what we have so far if we
can't make one of the pathkeys. Will fix.

I think no "partition_pathkeys" should be created in this case, but before we
can discuss this in detail there needs to be an agreement on the evaluation of
the relationship between "pathkeys" and "partition_pathkeys", see above.

The problem with doing that is that if the partition keys are better
than the pathkeys then we'll most likely fail to generate any
partition keys at all due to lack of any existing eclass to use for
the pathkeys. It's unsafe to use just the prefix in this case as the
eclass may not have been found due to, for example one of the
partition keys having a different collation than the required sort
order of the query. In other words, we can't rely on a failure to
create the pathkey meaning that a more strict sort order is not
required.

I'm a bit unsure on how safe it would be to pass "create_it" as true
to make_pathkey_from_sortinfo(). We might be building partition path
keys for some sub-partitioned table. In this case the eclass should
likely have its member added with em_is_child = true. The existing
code always sets em_is_child to false. It's not that clear to me that
setting up a new eclass with a single em_is_child = true member is
correct.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: David Rowley (#14)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On Mon, 5 Nov 2018 at 10:46, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote:

I think these conditions are too restrictive:

/*
* Determine if these pathkeys match the partition order, or reverse
* partition order. It can't match both, so only go to the trouble of
* checking the reverse order when it's not in ascending partition
* order.
*/
partition_order = pathkeys_contained_in(pathkeys,
partition_pathkeys);
partition_order_desc = !partition_order &&
pathkeys_contained_in(pathkeys,
partition_pathkeys_desc);

The problem with doing that is that if the partition keys are better
than the pathkeys then we'll most likely fail to generate any
partition keys at all due to lack of any existing eclass to use for
the pathkeys. It's unsafe to use just the prefix in this case as the
eclass may not have been found due to, for example one of the
partition keys having a different collation than the required sort
order of the query. In other words, we can't rely on a failure to
create the pathkey meaning that a more strict sort order is not
required.

I had another look at this patch and it seems okay just to add a new
flag to build_partition_pathkeys() to indicate if the pathkey List was
truncated or not. In generate_mergeappend_paths() we can then just
check that flag before checking if the partition pathkeys are
contained in pathkeys. It's fine if the partition keys were truncated
for the reverse of that check.
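
The check described above can be sketched in isolation as follows. This is a simplified model, not the patch itself: pathkeys are reduced to arrays of integer column ids, `KeyList`, `keys_contained_in()` and `matches_partition_order()` are hypothetical names, and `keys_contained_in(a, b)` is assumed to mirror pathkeys_contained_in() by returning true when "a" is a leading prefix of "b".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pathkeys List: an array of column ids. */
typedef struct KeyList
{
	const int  *keys;
	size_t		nkeys;
} KeyList;

/* True if 'a' is a leading prefix of (or equal to) 'b'. */
static bool
keys_contained_in(KeyList a, KeyList b)
{
	size_t		k;

	if (a.nkeys > b.nkeys)
		return false;

	for (k = 0; k < a.nkeys; k++)
	{
		if (a.keys[k] != b.keys[k])
			return false;
	}
	return true;
}

/*
 * The required pathkeys match the partition order when they are a prefix
 * of the partition pathkeys, or when the partition pathkeys are a prefix
 * of the required pathkeys and were not truncated while being built.
 */
static bool
matches_partition_order(KeyList pathkeys, KeyList partition_pathkeys,
						bool partition_pathkeys_partial)
{
	return keys_contained_in(pathkeys, partition_pathkeys) ||
		(!partition_pathkeys_partial &&
		 keys_contained_in(partition_pathkeys, pathkeys));
}
```

With a table partitioned by {i}, ORDER BY i, j passes only when the partition pathkeys are complete. If they were truncated, the second branch is skipped, since the missing partition key might have required a stricter ordering than the query asks for.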

I've done this in the attached and added additional regression tests
for this case.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v5-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (application/octet-stream)

From f961579d98ed11ceaecccf107ff3f781bf4ebed9 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v5] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no default partition the subpaths of a
MergeAppend are always arranged in range order. This means that
MergeAppend, when sorting by the partition key or a superset of the
partition key, will always output tuples from earlier subpaths before later
subpaths.  LIST partitioned tables provide the same guarantee if they also
don't have a default partition, providing that none of the partitions are
defined to allow Datums with values which are interleaved with other
partitions.  For simplicity and speed of checking we currently just
disallow the optimization if any partition allows more than one Datum.
We may want to expand this later, but for now, it's a very cheap check to
implement.  A more thorough check would require performing analysis on the
partition bound.
---
 src/backend/nodes/outfuncs.c                  |   1 +
 src/backend/optimizer/path/allpaths.c         | 157 +++++++++++++++++--
 src/backend/optimizer/path/costsize.c         |  51 ++++++-
 src/backend/optimizer/path/joinrels.c         |   2 +-
 src/backend/optimizer/path/pathkeys.c         |  59 +++++++
 src/backend/optimizer/plan/createplan.c       |  90 ++++++++---
 src/backend/optimizer/plan/planner.c          |   3 +-
 src/backend/optimizer/prep/prepunion.c        |   6 +-
 src/backend/optimizer/util/pathnode.c         |  23 ++-
 src/backend/partitioning/partbounds.c         |   4 +
 src/backend/partitioning/partprune.c          |  59 +++++++
 src/include/nodes/relation.h                  |   1 +
 src/include/optimizer/cost.h                  |   2 +-
 src/include/optimizer/pathnode.h              |   2 +-
 src/include/optimizer/paths.h                 |   2 +
 src/include/partitioning/partprune.h          |   1 +
 src/test/regress/expected/inherit.out         | 211 +++++++++++++++++++++++++-
 src/test/regress/expected/partition_prune.out |  64 ++++----
 src/test/regress/sql/inherit.sql              |  93 ++++++++++++
 src/test/regress/sql/partition_prune.sql      |  10 +-
 20 files changed, 741 insertions(+), 100 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f0c396530d..6e8c488c78 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1939,6 +1939,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 738bb30848..280b207cb4 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -104,6 +104,7 @@ static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
 static void set_function_pathlist(PlannerInfo *root, RelOptInfo *rel,
@@ -1638,7 +1639,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1680,7 +1681,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1730,7 +1731,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
@@ -1792,7 +1793,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
@@ -1827,6 +1828,28 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 						   List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+												&partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection,
+											&partition_pathkeys_desc_partial);
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1835,6 +1858,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+						(!partition_pathkeys_partial &&
+						 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+								(pathkeys_contained_in(pathkeys,
+												partition_pathkeys_desc) ||
+						(!partition_pathkeys_desc_partial &&
+							pathkeys_contained_in(partition_pathkeys_desc,
+												  pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1877,26 +1923,81 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * When in partition order or descending partition order don't
+			 * flatten any sub-partition's paths unless they're an Append or
+			 * MergeAppend with a single subpath.  For the descending order
+			 * case we build the path list in reverse so that the Append scan
+			 * correctly scans the partitions in reverse order.
+			 */
+			if (partition_order)
+			{
+				/* Do the Append/MergeAppend flattening, when possible */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -2037,6 +2138,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		return 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -2057,7 +2186,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->pathlist = NIL;
 	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/*
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7bf67a0529..b941f79b80 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1837,7 +1837,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1849,21 +1849,56 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
-
+		Path	   *isubpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 		/*
 		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
+		 * the first subpath. This may be overwritten below if the initial
+		 * path requires a sort.
 		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		apath->path.startup_cost = isubpath->startup_cost;
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
+		/*
+		 * Compute rows and costs as sums of subplan rows and costs taking
+		 * into account the cost of any sorts which may be required on
+		 * subplans.
+		 */
 		foreach(l, apath->subpaths)
 		{
 			Path	   *subpath = (Path *) lfirst(l);
 
 			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+
+			if (pathkeys != NIL &&
+				!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				/*
+				 * We'll need to insert a Sort node, so include cost for that.
+				 */
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  subpath->total_cost,
+						  subpath->parent->tuples,
+						  subpath->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  apath->limit_tuples);
+				apath->path.total_cost += sort_path.total_cost;
+
+				/*
+				 * When the first subpath needs sorting, set the startup cost
+				 * of the sort as the startup cost of the Append
+				 */
+				if (subpath == isubpath)
+					apath->path.startup_cost = sort_path.startup_cost;
+			}
+			else
+			{
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1871,6 +1906,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index d3d21fed5d..2f9fc50bf2 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1231,7 +1231,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/* Set or update cheapest_total_path and related fields */
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ec66cb9c3c..6b59dae755 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -25,6 +25,7 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
@@ -547,6 +548,64 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples never will be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme		partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											 ScanDirectionIsBackward(scandir),
+											 ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index da7a92081a..96bfae9266 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -201,8 +201,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1025,12 +1023,24 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	int			nodenumsortkeys;
+	AttrNumber *nodeSortColIdx;
+	Oid		   *nodeSortOperators;
+	Oid		   *nodeCollations;
+	bool	   *nodeNullsFirst;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1056,6 +1066,23 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1065,6 +1092,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1107,10 +1167,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5340,23 +5401,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c729a99f8b..78b834032d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3899,6 +3899,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
@@ -6878,7 +6879,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * node, which would cause this relation to stop appearing to be a
 		 * dummy rel.)
 		 */
-		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL,
+		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL, NIL,
 													  NULL, 0, false, NIL,
 													  -1));
 		rel->partial_pathlist = NIL;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 2a1c1cb2e1..45101a3822 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -656,7 +656,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -711,7 +711,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -822,7 +822,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d50d86b252..ca021fca8d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1219,7 +1219,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1253,7 +1253,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1263,10 +1263,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1274,6 +1278,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1287,7 +1300,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3587,7 +3600,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index eeaab2f4c9..bb65304628 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1680,6 +1680,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1699,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 35c87535d3..8b51e9245c 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -178,7 +178,66 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in an earlier partition.  Returns
+ *		false if this is not possible, or if we have insufficient means to
+ *		prove it.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow when a DEFAULT
+		 * partition exists as this could contain tuples from either below or
+		 * above the defined range, or contain tuples belonging to gaps in the
+		 * defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
 
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's lower than one seen in an earlier partition, then
+		 * values could become out of order and we'd have to disable the
+		 * optimization.  For now, let's just keep it simple and just accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
 /*
  * make_partition_pruneinfo
  *		Builds a PartitionPruneInfo which can be used in the executor to allow
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6fd24203dd..b8c91490c8 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1321,6 +1321,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_PATH(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..7cb5644dd3 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -110,7 +110,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 81abcf53a8..5a790cf6be 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -65,7 +65,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index cafde307ad..c03a98c8c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -201,6 +201,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index e07aaaf798..bc02b1bacb 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -71,6 +71,7 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(PlannerInfo *root,
 						 RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f259d07535..8e0b0d2e1f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2001,7 +2001,210 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Append
+   ->  Sort
+         Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+         ->  Seq Scan on mcrparted0
+               Filter: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(11 rows)
+
+reset enable_seqscan;
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+explain (costs off) select * from bool_rp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_b_idx on bool_rp_false
+   ->  Index Only Scan using bool_rp_true_b_idx on bool_rp_true
+(3 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
@@ -2014,17 +2217,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 120b651bf5..32a7d7b93e 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3063,14 +3063,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3117,17 +3117,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3140,13 +3138,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3159,12 +3156,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3173,23 +3169,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 425052c1f4..58b9a055d5 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -700,8 +700,101 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+reset enable_seqscan;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+
+explain (costs off) select * from bool_rp order by b;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index dc327caffd..102893b6f6 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1

#16

Julien Rouhaud

rjuju123@gmail.com

about 7 years ago

In reply to: David Rowley (#15)

Re: Ordered Partitioned Table Scans

Hi,

On Thu, Nov 22, 2018 at 11:27 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Mon, 5 Nov 2018 at 10:46, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote:

I think these conditions are too restrictive:

/*
* Determine if these pathkeys match the partition order, or reverse
* partition order. It can't match both, so only go to the trouble of
* checking the reverse order when it's not in ascending partition
* order.
*/
partition_order = pathkeys_contained_in(pathkeys,
partition_pathkeys);
partition_order_desc = !partition_order &&
pathkeys_contained_in(pathkeys,
partition_pathkeys_desc);

The problem with doing that is that if the partition keys are better
than the pathkeys then we'll most likely fail to generate any
partition keys at all due to lack of any existing eclass to use for
the pathkeys. It's unsafe to use just the prefix in this case as the
eclass may not have been found due to, for example one of the
partition keys having a different collation than the required sort
order of the query. In other words, we can't rely on a failure to
create the pathkey meaning that a more strict sort order is not
required.

I had another look at this patch and it seems okay just to add a new
flag to build_partition_pathkeys() to indicate if the pathkey List was
truncated or not. In generate_mergeappend_paths() we can then just
check that flag before checking if the partition pathkeys are
contained in pathkeys. It's fine if the partition keys were truncated
for the reverse of that check.

I've done this in the attached and added additional regression tests
for this case.
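
The check described above can be sketched in a small, self-contained toy model. This is only an illustration of the logic, not the patch's code: pathkeys are modelled as plain integer arrays, and the names ToyPathKeys, toy_pathkeys_contained_in() and toy_append_order_ok() are hypothetical. The point it shows is that "query pathkeys are a prefix of the partition pathkeys" is always safe, while the reverse direction is only safe when the partition pathkeys were not truncated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model only: a pathkey list is an array of small integers, one per
 * sort column.  This is NOT PostgreSQL's PathKey representation.
 */
typedef struct ToyPathKeys
{
	const int  *keys;
	size_t		nkeys;
} ToyPathKeys;

/* True if 'a' is a leading prefix of (or equal to) 'b'. */
bool
toy_pathkeys_contained_in(ToyPathKeys a, ToyPathKeys b)
{
	size_t		i;

	assert(a.keys != NULL || a.nkeys == 0);
	assert(b.keys != NULL || b.nkeys == 0);

	if (a.nkeys > b.nkeys)
		return false;

	for (i = 0; i < a.nkeys; i++)
	{
		if (a.keys[i] != b.keys[i])
			return false;
	}
	return true;
}

/*
 * Decide whether a plain Append in partition order can satisfy 'query'.
 *
 * 'part' holds the pathkeys built from the partition key, and
 * 'part_truncated' says whether building them stopped early, so that
 * 'part' is only a prefix of the real partition order.
 */
bool
toy_append_order_ok(ToyPathKeys query, ToyPathKeys part, bool part_truncated)
{
	/* Query order is a prefix of the partition order: always fine. */
	if (toy_pathkeys_contained_in(query, part))
		return true;

	/*
	 * Partition order is a prefix of the query order: only fine when the
	 * partition pathkeys are complete.
	 */
	return !part_truncated && toy_pathkeys_contained_in(part, query);
}
```

With a query ordering of {1, 2, 3} and partition pathkeys {1, 2}, the function accepts the Append only when the truncated flag is false.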

I started to look at v5.

If I understand correctly, the new behavior is controlled by
partitions_are_ordered(), but it only checks for declared partitions,
not partitions that survived pruning. Did I miss something or is it
the intended behavior? Also, generate_mergeappend_paths should
probably be renamed to something like generate_sortedappend_paths
since it can now generate either Append or MergeAppend paths.

I'm also wondering if that's ok to only generate either a (sorted)
Append or a MergeAppend. Is it possible that in some cases it's
better to have a MergeAppend rather than a sorted Append, given that
MergeAppend is parallel-aware and the sorted Append isn't?

#17

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: Julien Rouhaud (#16)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On Wed, 19 Dec 2018 at 20:40, Julien Rouhaud <rjuju123@gmail.com> wrote:

I started to look at v5.

Thanks for giving this a look over.

If I understand correctly, the new behavior is controlled by
partitions_are_ordered(), but it only checks for declared partitions,
not partitions that survived pruning. Did I miss something or is it
the intended behavior?

Yeah, it was mentioned up-thread a bit.

I wrote:

I retrospectively read that thread after Amit mentioned your
patch. I just disagree with Robert about caching this flag. The
reason is, if the flag is false due to some problematic partitions, if
we go and prune those, then we needlessly fail to optimise that case.
I propose we come back and do the remaining optimisations with
interleaved LIST partitions and partitioned tables with DEFAULT
partitions later, once we have a new "live_parts" field in
RelOptInfo. That way we can just check the live parts to ensure
they're compatible with the optimization. If we get what's done
already in then we're already a big step forward.

The reason I'm keen to leave this alone today is that determining
which partitions are pruned requires looking at each partition's
RelOptInfo and checking if it's marked as a dummy rel. I'm trying to
minimise the overhead of this patch by avoiding doing any
per-partition processing. If we get the "live_parts" Bitmapset, then
this becomes cheaper as Bitmapsets are fairly efficient at finding the
next set member, even when they're large and sparsely populated.
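
To illustrate why walking a bitmap of live partitions is cheap even when it is large and sparse, here is a minimal sketch of a "next set member" lookup. It is a toy written for this note, only loosely in the spirit of PostgreSQL's bms_next_member(); the ToyBitmap type, the toy_next_member() name and the -1 end-of-set return value are assumptions of the sketch, not the real Bitmapset API. Empty 64-bit words are skipped with a single comparison, so pruned partitions cost almost nothing to step over.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BITS_PER_WORD 64

/* Toy bitmap: NOT PostgreSQL's Bitmapset, just an array of 64-bit words. */
typedef struct ToyBitmap
{
	int			nwords;
	const uint64_t *words;
} ToyBitmap;

/*
 * Return the smallest member greater than 'prevbit', or -1 if there is
 * none.  Pass prevbit = -1 to fetch the first member.
 */
int
toy_next_member(ToyBitmap bm, int prevbit)
{
	int			bit = prevbit + 1;
	int			firstword = bit / TOY_BITS_PER_WORD;
	int			wordnum;

	assert(prevbit >= -1);

	for (wordnum = firstword; wordnum < bm.nwords; wordnum++)
	{
		uint64_t	w = bm.words[wordnum];

		/* In the first word only, ignore bits at or below 'prevbit'. */
		if (wordnum == firstword)
			w &= ~UINT64_C(0) << (bit % TOY_BITS_PER_WORD);

		/* A whole empty word is skipped with this one test. */
		if (w != 0)
		{
			int			pos = 0;

			while ((w & 1) == 0)
			{
				w >>= 1;
				pos++;
			}
			return wordnum * TOY_BITS_PER_WORD + pos;
		}
	}

	return -1;
}
```

A caller would loop with something like "for (i = toy_next_member(bm, -1); i >= 0; i = toy_next_member(bm, i))", visiting only the partitions that survived pruning.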

Also, generate_mergeappend_paths should
probably be renamed to something like generate_sortedappend_paths
since it can now generate either Append or MergeAppend paths.

You might be right about naming this something else, but
"sortedappend" sounds like an Append node with a Sort node above it.
"orderedappend" feels slightly better, although my personal vote would
be not to rename it at all. Sometimes generating an Append seems like
an easy enough corner case to mention in the function body.

I'm also wondering if that's ok to only generate either a (sorted)
Append or a MergeAppend. Is it possible that in some cases it's
better to have a MergeAppend rather than a sorted Append, given that
MergeAppend is parallel-aware and the sorted Append isn't?

That might have been worth a thought if we had parallel MergeAppends,
but we don't. You might be thinking of GatherMerge.

I've attached a v6 patch. The only change is renaming the
generate_mergeappend_paths() function to
generate_orderedappend_paths(), along with the required comment
updates.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v6-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (application/octet-stream)

From 327c46c9b50dc73a737ff14428fce89c0108401b Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v6] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no default partition the subpaths of a
MergeAppend are always arranged in range order. This means that
MergeAppend, when sorting by the partition key or a superset of the
partition key, will always output tuples from earlier subpaths before later
subpaths.  LIST partitioned tables provide the same guarantee if they also
don't have a default partition, providing that none of the partitions are
defined to allow Datums with values which are interleaved with other
partitions.  For simplicity and speed of checking we currently just
disallow the optimization if any partition allows more than one Datum.
We may want to expand this later, but for now, it's a very cheap check to
implement.  A more thorough check would require performing analysis on the
partition bound.
---
 src/backend/nodes/outfuncs.c                  |   1 +
 src/backend/optimizer/path/allpaths.c         | 209 ++++++++++++++++++++-----
 src/backend/optimizer/path/costsize.c         |  51 ++++++-
 src/backend/optimizer/path/joinrels.c         |   2 +-
 src/backend/optimizer/path/pathkeys.c         |  59 +++++++
 src/backend/optimizer/plan/createplan.c       |  90 ++++++++---
 src/backend/optimizer/plan/planner.c          |   3 +-
 src/backend/optimizer/prep/prepunion.c        |   6 +-
 src/backend/optimizer/util/pathnode.c         |  23 ++-
 src/backend/partitioning/partbounds.c         |   4 +
 src/backend/partitioning/partprune.c          |  59 +++++++
 src/include/nodes/relation.h                  |   1 +
 src/include/optimizer/cost.h                  |   2 +-
 src/include/optimizer/pathnode.h              |   2 +-
 src/include/optimizer/paths.h                 |   2 +
 src/include/partitioning/partprune.h          |   1 +
 src/test/regress/expected/inherit.out         | 211 +++++++++++++++++++++++++-
 src/test/regress/expected/partition_prune.out |  64 ++++----
 src/test/regress/sql/inherit.sql              |  93 ++++++++++++
 src/test/regress/sql/partition_prune.sql      |  10 +-
 20 files changed, 769 insertions(+), 124 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 6edc7f2359..c2cb708e33 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1940,6 +1940,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 738bb30848..e28c06343a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -95,15 +95,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
 static void set_function_pathlist(PlannerInfo *root, RelOptInfo *rel,
@@ -1638,7 +1639,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1680,7 +1681,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1730,19 +1731,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1792,41 +1793,67 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provide tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
+ * cheapest total paths, and build a suitable path for each case.
  *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+												&partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection,
+											&partition_pathkeys_desc_partial);
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1835,6 +1862,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+						(!partition_pathkeys_partial &&
+						 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+								(pathkeys_contained_in(pathkeys,
+												partition_pathkeys_desc) ||
+						(!partition_pathkeys_desc_partial &&
+							pathkeys_contained_in(partition_pathkeys_desc,
+												  pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1877,26 +1927,81 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * When in partition order or descending partition order don't
+			 * flatten any sub-partition's paths unless they're an Append or
+			 * MergeAppend with a single subpath.  For the descending order
+			 * case we build the path list in reverse so that the Append scan
+			 * correctly scans the partitions in reverse order.
+			 */
+			if (partition_order)
+			{
+				/* Do the Append/MergeAppend flattening, when possible */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -2037,6 +2142,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		return 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -2057,7 +2190,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->pathlist = NIL;
 	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/*
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 480fd250e9..bd80765b3c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1837,7 +1837,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1849,21 +1849,56 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
-
+		Path	   *isubpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 		/*
 		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
+		 * the first subpath. This may be overwritten below if the initial
+		 * path requires a sort.
 		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		apath->path.startup_cost = isubpath->startup_cost;
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
+		/*
+		 * Compute rows and costs as sums of subplan rows and costs taking
+		 * into account the cost of any sorts which may be required on
+		 * subplans.
+		 */
 		foreach(l, apath->subpaths)
 		{
 			Path	   *subpath = (Path *) lfirst(l);
 
 			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+
+			if (pathkeys != NIL &&
+				!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				/*
+				 * We'll need to insert a Sort node, so include cost for that.
+				 */
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  subpath->total_cost,
+						  subpath->parent->tuples,
+						  subpath->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  apath->limit_tuples);
+				apath->path.total_cost += sort_path.total_cost;
+
+				/*
+				 * When the first subpath needs sorting, set the startup cost
+				 * of the sort as the startup cost of the Append.
+				 */
+				if (subpath == isubpath)
+					apath->path.startup_cost = sort_path.startup_cost;
+			}
+			else
+			{
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1871,6 +1906,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index d3d21fed5d..2f9fc50bf2 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1231,7 +1231,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/* Set or update cheapest_total_path and related fields */
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ec66cb9c3c..6b59dae755 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -25,6 +25,7 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/tlist.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
@@ -547,6 +548,64 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower-ordered tuples will never be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme		partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											 ScanDirectionIsBackward(scandir),
+											 ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
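The guarantee that build_partition_pathkeys() relies on can be shown with a small standalone sketch: if each partition is individually sorted and the partitions hold non-overlapping ranges listed in bound order, plain concatenation already yields sorted output. A backward scan is the same thing with the partitions visited in reverse. All names below are hypothetical and nothing here is planner code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one partition whose rows are sorted ascending. */
typedef struct SketchPartition
{
	const int  *rows;
	size_t		nrows;
} SketchPartition;

/*
 * Concatenate partitions the way an ordered Append would.  'parts' must be
 * listed in ascending bound order with non-overlapping ranges.  For a
 * backward scan, visit partitions last-to-first and read each one in reverse.
 * Returns the number of rows written to 'out'.
 */
size_t
sketch_ordered_append(const SketchPartition *parts, size_t nparts,
					  bool backward, int *out)
{
	size_t		n = 0;

	assert(parts != NULL && out != NULL);

	for (size_t i = 0; i < nparts; i++)
	{
		const SketchPartition *p = backward ? &parts[nparts - 1 - i] : &parts[i];

		for (size_t j = 0; j < p->nrows; j++)
			out[n++] = backward ? p->rows[p->nrows - 1 - j] : p->rows[j];
	}
	return n;
}

/* Check that 'v' is sorted in the requested direction. */
bool
sketch_is_sorted(const int *v, size_t n, bool descending)
{
	for (size_t i = 1; i < n; i++)
	{
		if (descending ? (v[i - 1] < v[i]) : (v[i - 1] > v[i]))
			return false;
	}
	return true;
}
```

No comparison between partitions happens at run time, which is where the saving over MergeAppend comes from. The sketch also shows why the caller has to establish the ordering guarantee first: the concatenation never checks it.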
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 91cf78233d..76a73cf5e9 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -201,8 +201,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1025,12 +1023,24 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	int			nodenumsortkeys;
+	AttrNumber *nodeSortColIdx;
+	Oid		   *nodeSortOperators;
+	Oid		   *nodeCollations;
+	bool	   *nodeNullsFirst;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1056,6 +1066,23 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1065,6 +1092,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1107,10 +1167,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5364,23 +5425,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b645648559..f87fb68a90 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3899,6 +3899,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
@@ -6935,7 +6936,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * node, which would cause this relation to stop appearing to be a
 		 * dummy rel.)
 		 */
-		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL,
+		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL, NIL,
 													  NULL, 0, false, NIL,
 													  -1));
 		rel->partial_pathlist = NIL;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index da278f785e..fb66911b99 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -656,7 +656,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -711,7 +711,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -822,7 +822,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d50d86b252..ca021fca8d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1219,7 +1219,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1253,7 +1253,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1263,10 +1263,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1274,6 +1278,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1287,7 +1300,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3587,7 +3600,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index eeaab2f4c9..bb65304628 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1680,6 +1680,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1699,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 35c87535d3..8b51e9245c 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -178,7 +178,66 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in an earlier partition.  Returns
+ *		false if this is not possible, or if we have insufficient means to prove
+ *		it.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow when a DEFAULT
+		 * partition exists as this could contain tuples from either below or
+		 * above the defined range, or contain tuples belonging to gaps in the
+		 * defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
 
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's lower than one seen in an earlier partition, then
+		 * values could become out of order and we'd have to disable the
+		 * optimization.  For now, let's just keep it simple and just accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
 /*
  * make_partition_pruneinfo
  *		Builds a PartitionPruneInfo which can be used in the executor to allow
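The LIST branch of partitions_are_ordered() is just a counting argument: with no DEFAULT partition, every partition accepts exactly one value (a datum or NULL) only when the number of datums plus the NULL-accepting partition, if any, equals the number of partitions. A minimal standalone sketch of that test follows. The struct and its field names are hypothetical simplifications and not PostgreSQL's PartitionBoundInfo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified bound description for illustration only. */
typedef enum
{
	SKETCH_STRATEGY_RANGE,
	SKETCH_STRATEGY_LIST,
	SKETCH_STRATEGY_HASH
} SketchStrategy;

typedef struct SketchBounds
{
	SketchStrategy strategy;
	int			ndatums;		/* distinct bound datums over all partitions */
	int			nparts;			/* number of partitions */
	bool		has_default;	/* a DEFAULT partition exists */
	bool		accepts_nulls;	/* some partition accepts NULL */
} SketchBounds;

/* Mirror of the patch's decision logic over the simplified struct. */
bool
sketch_partitions_are_ordered(const SketchBounds *b)
{
	assert(b != NULL);

	switch (b->strategy)
	{
		case SKETCH_STRATEGY_RANGE:
			/* a DEFAULT partition may hold rows from anywhere in the range */
			return !b->has_default;

		case SKETCH_STRATEGY_LIST:
			if (b->has_default)
				return false;
			/* exactly one accepted value per partition, counting NULL */
			return b->ndatums + (b->accepts_nulls ? 1 : 0) == b->nparts;

		default:
			return false;
	}
}
```

Applied to the regression tests in this patch: mclparted with partitions for 1 and 2 has two datums and two partitions, so it qualifies. After adding a partition for (3,5) and one for 4 it has five datums across four partitions, so the planner falls back to MergeAppend.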
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6fd24203dd..b8c91490c8 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -1321,6 +1321,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_PATH(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 77ca7ff837..7cb5644dd3 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -110,7 +110,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 81abcf53a8..5a790cf6be 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -65,7 +65,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index cafde307ad..c03a98c8c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -201,6 +201,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index e07aaaf798..bc02b1bacb 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -71,6 +71,7 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(PlannerInfo *root,
 						 RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f259d07535..8e0b0d2e1f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2001,7 +2001,210 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Append
+   ->  Sort
+         Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+         ->  Seq Scan on mcrparted0
+               Filter: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(11 rows)
+
+reset enable_seqscan;
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+explain (costs off) select * from bool_rp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_b_idx on bool_rp_false
+   ->  Index Only Scan using bool_rp_true_b_idx on bool_rp_true
+(3 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
@@ -2014,17 +2217,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 120b651bf5..32a7d7b93e 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3063,14 +3063,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3117,17 +3117,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3140,13 +3138,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3159,12 +3156,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3173,23 +3169,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 425052c1f4..58b9a055d5 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -700,8 +700,101 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+reset enable_seqscan;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+
+explain (costs off) select * from bool_rp order by b;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index dc327caffd..102893b6f6 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1

#18

Julien Rouhaud

rjuju123@gmail.com

about 7 years ago

In reply to: David Rowley (#17)

Re: Ordered Partitioned Table Scans

On Wed, Dec 19, 2018 at 10:51 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Wed, 19 Dec 2018 at 20:40, Julien Rouhaud <rjuju123@gmail.com> wrote:

If I understand correctly, the new behavior is controlled by
partitions_are_ordered(), but it only checks for declared partitions,
not partitions that survived pruning. Did I miss something or is it
the intended behavior?

Yeah, it was mentioned up-thread a bit.

I wrote:

I retrospectively read that thread after Amit mentioned your
patch. I just disagree with Robert about caching this flag. The
reason is, if the flag is false due to some problematic partitions, if
we go and prune those, then we needlessly fail to optimise that case.
I propose we come back and do the remaining optimisations with
interleaved LIST partitions and partitioned tables with DEFAULT
partitions later, once we have a new "live_parts" field in
RelOptInfo. That way we can just check the live parts to ensure
they're compatible with the optimization. If we get what's done
already in then we're already a big step forward.

Ah, sorry I did read this but I misunderstood it. I really need to
catch up more thoroughly on what has changed in partitioning since pg11.

The reason I'm keen to leave this alone today is that determining
which partitions are pruned requires looking at each partition's
RelOptInfo and checking if it's marked as a dummy rel. I'm trying to
minimise the overhead of this patch by avoiding doing any
per-partition processing. If we get the "live_parts" Bitmapset, then
this becomes cheaper as Bitmapsets are fairly efficient at finding the
next set member, even when they're large and sparsely populated.
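
To illustrate why walking a sparse set of live partitions can be cheap, here is a minimal standalone sketch. It is not PostgreSQL's Bitmapset; the LiveParts type and the live_parts_add() and live_parts_next() helpers are hypothetical names made up for this example. The point is only that empty words are skipped in a single comparison, so the cost follows the number of words rather than the number of partitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD 64

/* Toy stand-in for a set of live partition indexes (hypothetical type). */
typedef struct LiveParts
{
    size_t   nwords;        /* number of words in use */
    uint64_t words[4];      /* room for 256 partition indexes */
} LiveParts;

/* Mark partition index idx as live. */
static void
live_parts_add(LiveParts *lp, int idx)
{
    assert(idx >= 0 && (size_t) idx < lp->nwords * BITS_PER_WORD);
    lp->words[idx / BITS_PER_WORD] |= UINT64_C(1) << (idx % BITS_PER_WORD);
}

/*
 * Return the smallest member greater than prev, or -1 if there is none.
 * Pass prev = -1 to start the iteration.
 */
static int
live_parts_next(const LiveParts *lp, int prev)
{
    int      start;
    size_t   w;
    uint64_t mask;

    assert(prev >= -1);
    start = prev + 1;
    w = (size_t) start / BITS_PER_WORD;
    if (w >= lp->nwords)
        return -1;

    /* In the first word, ignore the bits below start. */
    mask = ~UINT64_C(0) << (start % BITS_PER_WORD);

    for (; w < lp->nwords; w++)
    {
        uint64_t word = lp->words[w] & mask;

        if (word != 0)
        {
            int bit = 0;

            /* Find the lowest set bit of a word known to be non-zero. */
            while ((word & 1) == 0)
            {
                word >>= 1;
                bit++;
            }
            return (int) (w * BITS_PER_WORD) + bit;
        }
        mask = ~UINT64_C(0);    /* later words are considered in full */
    }
    return -1;
}
```

A caller would loop with something like "for (i = live_parts_next(&lp, -1); i >= 0; i = live_parts_next(&lp, i))", touching only the partitions that survived pruning.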

I see. But since for now the optimisation will only be done
considering all partitions, I still think that it's better to store a
bool flag in the PartitionDesc to describe if it's natively ordered or
not, and therefore also handle the case for
non-interleaved multi-datum list partitioning. It won't add much
overhead and will benefit way more cases.

We can still revisit that when a live_parts Bitmapset is available in
RelOptInfo (and maybe another flag that says if partitions were pruned or
not, and/or if the default partition was pruned).

Also, generate_mergeappend_paths should
probably be renamed to something like generate_sortedappend_paths
since it can now generate either Append or MergeAppend paths.

You might be right about naming this something else, but
"sortedappend" sounds like an Append node with a Sort node above it.
"orderedappend" feels slightly better, although my personal vote would
be not to rename it at all. Sometimes generating an Append seems like
an easy enough corner case to mention in the function body.

Ok, I don't have a very strong opinion on it and orderedappend sounds
less ambiguous.

I'm also wondering if that's ok to only generate either a (sorted)
Append or a MergeAppend. Is it possible that in some cases it's
better to have a MergeAppend rather than a sorted Append, given that
MergeAppend is parallel-aware and the sorted Append isn't?

That might have been worth a thought if we had parallel MergeAppends,
but we don't. You might be thinking of GatherMerge.

Ah, oops, indeed :)

#19

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: Julien Rouhaud (#18)

Re: Ordered Partitioned Table Scans

On Wed, 19 Dec 2018 at 23:25, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Wed, Dec 19, 2018 at 10:51 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

The reason I'm keen to leave this alone today is that determining
which partitions are pruned requires looking at each partition's
RelOptInfo and checking if it's marked as a dummy rel. I'm trying to
minimise the overhead of this patch by avoiding doing any
per-partition processing. If we get the "live_parts" Bitmapset, then
this becomes cheaper as Bitmapsets are fairly efficient at finding the
next set member, even when they're large and sparsely populated.

I see. But since for now the optimisation will only be done
considering all partitions, I still think that it's better to store a
bool flag in the PartitionDesc to describe if it's natively ordered or
not, and therefore also handle the case for
non-interleaved multi-datum list partitioning. It won't add much
overhead and will benefit way more cases.

I'm not really in favour of adding a flag there only to remove it
again once we can more easily determine the pruned partitions.
Remember the flag, because it's stored in the relation cache, must be
set accounting for all partitions. As soon as we want to add smarts
for pruned partitions, then the flag becomes completely useless for
everything. If covering all cases in the first hit is your aim then
the way to go is to add the live_parts field to RelOptInfo in this
patch rather than in Amit's patch in [1]. I'd much rather add the
pruned partitions smarts as part of another effort. The most likely
cases to benefit from this are already covered by the current patch;
range partitioned tables.

[1]: https://commitfest.postgresql.org/21/1778/

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#20

Julien Rouhaud

rjuju123@gmail.com

about 7 years ago

In reply to: David Rowley (#19)

Re: Ordered Partitioned Table Scans

On Wed, Dec 19, 2018 at 1:18 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Wed, 19 Dec 2018 at 23:25, Julien Rouhaud <rjuju123@gmail.com> wrote:

I see. But since for now the optimisation will only be done
considering all partitions, I still think that it's better to store a
bool flag in the PartitionDesc to describe if it's natively ordered or
not, and therefore also handle the case for
non-interleaved multi-datum list partitioning. It won't add much
overhead and will benefit way more cases.

I'm not really in favour of adding a flag there only to remove it
again once we can more easily determine the pruned partitions.
Remember the flag, because it's stored in the relation cache, must be
set accounting for all partitions. As soon as we want to add smarts
for pruned partitions, then the flag becomes completely useless for
everything.

I don't see why we should drop this flag. If we know that the
partitions are naturally ordered, they'll still be ordered after some
partitions have been pruned, so we can skip later checks if we already
have the information. The only remaining cases this flag doesn't
cover are:

- partitions are naturally ordered but there's a default partition.
We could store this information and later check if the default
partition has been pruned or not
- partitions are not naturally ordered, but become naturally ordered
if enough partitions are pruned. I may be wrong but that doesn't seem
like a very frequent use case to me. I'd imagine that in a lot of
cases either almost no partitions are pruned (or at least not enough
that the remaining ones are ordered), or all but one partition is
pruned. So keeping a low overhead for the
almost-no-pruned-partition with naturally ordered partitions case
still seems like a good idea to me.

If covering all cases in the first hit is your aim then
the way to go is to add the live_parts field to RelOptInfo in this
patch rather than in Amit's patch in [1]. I'd much rather add the
pruned partitions smarts as part of another effort. The most likely
cases to benefit from this are already covered by the current patch;
range partitioned tables.

Covering all cases is definitely not my goal here, just grabbing the
low hanging fruits. The multi-level partitioning case is another
thing that would need to be handled for instance (and that's the main
reason I couldn't submit a new patch when I was working on it), and
I'm definitely not arguing to cover it in this patch. That being
said, I'll try to have a look at this patch too, but as I said I have
a lot of catch-up to do in this part of the code, so I'm afraid that
I'll not be super efficient.

#21

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: Julien Rouhaud (#20)

Re: Ordered Partitioned Table Scans

On Thu, 20 Dec 2018 at 01:58, Julien Rouhaud <rjuju123@gmail.com> wrote:

I don't see why we should drop this flag. If we know that the
partitions are naturally ordered, they'll still be ordered after some
partitions have been pruned, so we can skip later checks if we already
have the information. The only remaining cases this flag doesn't
cover are:

- partitions are naturally ordered but there's a default partition.
We could store this information and later check if the default
partition has been pruned or not
- partitions are not naturally ordered, but become naturally ordered
if enough partitions are pruned. I may be wrong but that doesn't seem
like a very frequent use case to me. I'd imagine that in a lot of
cases either almost no partitions are pruned (or at least not enough
that the remaining ones are ordered), or all but one partition is
pruned. So keeping a low overhead for the
almost-no-pruned-partition with naturally ordered partitions case
still seems like a good idea to me.

I'm objecting to processing for all partitions, but processing for
just non-pruned partitions seems fine to me. If there are 10k
partitions and we pruned none of them, then planning will be slow
anyway. I'm not too worried about slowing it down a further
microsecond or two. It'll be a drop in the ocean. When we have the
live_parts flag in RelOptInfo then we can allow all of the cases
you've mentioned above, we'll just need to look at the non-pruned
partitions, and in partition order, determine if the lowest LIST
partitioned value sorts earlier than some earlier partition's highest
LIST value and disable the optimisation for such cases.

The flag you've mentioned will become redundant when support is added
for the cases you've mentioned above. I don't see any reason not to
support all these cases, once the live_parts flag makes it into
RelOptInfo. I'm also a bit confused at why you think it's so
important to make multi-valued LIST partitions work when no values are
interleaved, but you suddenly don't care about the optimisation when
the interleaved value partitions get pruned. Can you share your
reasoning for that?

If you're really so keen on this flag, can you share the design you
have in mind? If it's just a single bool flag like "parts_ordered",
and that's set to false, then how would you know there is some natural
order when the DEFAULT partition gets pruned? Or are you proposing
multiple flags, maybe two flags, one for when the default is pruned
and one when it's not? If so, I'd question why the default partition
is so special? Pruning of any of the other partitions could turn a
naturally unordered LIST partitioned table into a naturally ordered
partitioned table if the pruned partition happened to be the only one
with interleaved values. Handling only the DEFAULT partition in a
special way seems to violate the principle of least astonishment.

But in short, I just really don't like the flags idea and I'm not
really willing to work on it or put my name on it. I'd much rather
wait then build a proper solution that works in all cases. I feel the
current patch is worthwhile as it stands.

The multi-level partitioning case is another
thing that would need to be handled for instance (and that's the main
reason I couldn't submit a new patch when I was working on it), and
I'm definitely not arguing to cover it in this patch.

As far as I'm aware, the multi-level partitioning should work just
fine with the current patch. I added code for that a while ago. There
are regression tests to exercise it. I'm not aware of any cases where
it does not work.

That being
said, I'll try to have a look at this patch too, but as I said I have
a lot of catch-up to do in this part of the code, so I'm afraid that
I'll not be super efficient.

Thanks for your time on this so far.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#22

Julien Rouhaud

rjuju123@gmail.com

about 7 years ago

In reply to: David Rowley (#21)

Re: Ordered Partitioned Table Scans

On Wed, Dec 19, 2018 at 3:01 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Thu, 20 Dec 2018 at 01:58, Julien Rouhaud <rjuju123@gmail.com> wrote:

I'm objecting to processing for all partitions, but processing for
just non-pruned partitions seems fine to me. If there are 10k
partitions and we pruned none of them, then planning will be slow
anyway. I'm not too worried about slowing it down a further
microsecond or two. It'll be a drop in the ocean. When we have the
live_parts flag in RelOptInfo then we can allow all of the cases
you've mentioned above, we'll just need to look at the non-pruned
partitions, and in partition order, determine if the lowest LIST
partitioned value sorts earlier than some earlier partition's highest
LIST value and disable the optimisation for such cases.

My concern is more for a more moderate number of partitions (a few
hundred?). I don't know how expensive that'll be, but it just seems
sad to recompute their ordering each time and waste cycles if we can
do it only once in non-corner cases.

The flag you've mentioned will become redundant when support is added
for the cases you've mentioned above. I don't see any reason not to
support all these cases, once the live_parts flag makes it into
RelOptInfo. I'm also a bit confused at why you think it's so
important to make multi-valued LIST partitions work when no values are
interleaved, but you suddenly don't care about the optimisation when
the interleaved value partitions get pruned. Can you share your
reasoning for that?

I never said that I don't care about interleaved partitions being
pruned. I do think it might not be a super frequent thing, but I
certainly want us to handle it. I just agree with your argument that the
pruned partitions problem will be better handled with the live_parts
that should be added in another patch.

If you're really so keen on this flag, can you share the design you
have in mind? If it's just a single bool flag like "parts_ordered",
and that's set to false, then how would you know there is some natural
order when the DEFAULT partition gets pruned? Or are you proposing
multiple flags, maybe two flags, one for when the default is pruned
and one when it's not?

I don't think that the design is a big problem here. You can either
have a flag that says whether the partitions are ordered ignoring any
default partition, so callers will have to check if the default
partition is still there, or just store an enum to distinguish the
different cases.
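
As a rough illustration of the enum variant, here is a minimal sketch. The PartOrderState type, its values and ordered_scan_allowed() are hypothetical names invented for this example; nothing like them exists in the patch or in PartitionDesc.

```c
#include <stdbool.h>

/* Hypothetical cached description of a partitioned table's bound order. */
typedef enum PartOrderState
{
    PART_ORDER_NONE,            /* bounds interleave: not ordered as a whole */
    PART_ORDER_EXCEPT_DEFAULT,  /* ordered, but a default partition exists */
    PART_ORDER_ALWAYS           /* ordered and there is no default partition */
} PartOrderState;

/*
 * Decide from the cached state alone whether an ordered Append is safe.
 * default_pruned tells us whether the default partition was pruned for
 * this query; it only matters in the PART_ORDER_EXCEPT_DEFAULT case.
 */
static bool
ordered_scan_allowed(PartOrderState state, bool default_pruned)
{
    switch (state)
    {
        case PART_ORDER_ALWAYS:
            return true;
        case PART_ORDER_EXCEPT_DEFAULT:
            return default_pruned;
        case PART_ORDER_NONE:
            return false;
    }
    return false;               /* keep the compiler quiet */
}
```

This sketch deliberately says nothing about the case where pruning removes the interleaved partitions; the cached state alone cannot answer that.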

If so, I'd question why the default partition
is so special? Pruning of any of the other partitions could turn a
naturally unordered LIST partitioned table into a naturally ordered
partitioned table if the pruned partition happened to be the only one
with interleaved values. Handling only the DEFAULT partition in a
special way seems to violate the principle of least astonishment.

I'm not sure I'm following you. The default partition is by nature a
special partition, and its mere presence prevents this optimisation.
We can't possibly store all the sets of subsets of partitions that
would make the partitioned table naturally ordered if they were
pruned, so it seems like a different problem.

But in short, I just really don't like the flags idea and I'm not
really willing to work on it or put my name on it. I'd much rather
wait then build a proper solution that works in all cases. I feel the
current patch is worthwhile as it stands.

Ok, fine.

The multi-level partitioning case is another
thing that would need to be handled for instance (and that's the main
reason I couldn't submit a new patch when I was working on it), and
I'm definitely not arguing to cover it in this patch.

As far as I'm aware, the multi-level partitioning should work just
fine with the current patch. I added code for that a while ago. There
are regression tests to exercise it. I'm not aware of any cases where
it does not work.

Ok.

#23

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: Julien Rouhaud (#22)

Re: Ordered Partitioned Table Scans

On Thu, 20 Dec 2018 at 09:48, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Wed, Dec 19, 2018 at 3:01 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

If so, I'd question why the default partition
is so special? Pruning of any of the other partitions could turn a
naturally unordered LIST partitioned table into a naturally ordered
partitioned table if the pruned partition happened to be the only one
with interleaved values. Handling only the DEFAULT partition in a
special way seems to violate the principle of least astonishment.

I'm not sure I'm following you. The default partition is by nature a
special partition, and its mere presence prevents this optimisation.
We can't possibly store all the sets of subsets of partitions that
would make the partitioned table naturally ordered if they were
pruned, so it seems like a different problem.

For example:

create table listp (a int) partition by list (a);
create table listp12 partition of listp for values in(1,2);
create table listp03 partition of listp for values in(0,3);
create table listp45 partition of listp for values in(4,5);
create table listpd partition of listp default;

select * from listp where a in(1,2,4,5);

Here we prune all but listp12 and listp45. Since the default is pruned
and listp03 is pruned then there are no interleaved values. By your
proposed design the natural ordering is not detected since we're
storing a flag that says the partitions are unordered due to listp03.
With my idea for using live_parts, we'll process the partitions
looking for interleaved values on each query, after pruning takes
place. In this case, we'll see the partitions are naturally ordered. I
don't really foresee any issues with that additional processing since
it will only be a big effort when there are a large number of
partitions, and in those cases the planner already has lots of work to
do. Such processing is just a drop in the ocean when compared to path
generation for all those partitions.
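
The per-query check can be sketched roughly as below. This is a standalone illustration and not code from the patch: the ListPart struct and live_parts_are_ordered() are hypothetical, and the sketch assumes the partitions are handed over sorted by their lowest bound value with the default partition last.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one LIST partition's bounds. */
typedef struct ListPart
{
    const char *name;
    int         lowest;       /* smallest value in the partition's list */
    int         highest;      /* largest value in the partition's list */
    bool        is_default;   /* true for the DEFAULT partition */
    bool        pruned;       /* true if pruning removed it for this query */
} ListPart;

/*
 * Return true if the unpruned partitions are naturally ordered, i.e. no
 * live partition's lowest value sorts before the highest value of an
 * earlier live partition, and no live default partition exists.
 *
 * Assumes parts[] is sorted by lowest value, default partition last.
 */
static bool
live_parts_are_ordered(const ListPart *parts, size_t nparts)
{
    bool   have_prev = false;
    int    prev_highest = 0;
    size_t i;

    assert(parts != NULL || nparts == 0);

    for (i = 0; i < nparts; i++)
    {
        if (parts[i].pruned)
            continue;           /* pruned partitions cannot interleave */
        if (parts[i].is_default)
            return false;       /* a live default may hold any value */
        if (have_prev && parts[i].lowest < prev_highest)
            return false;       /* interleaved with an earlier partition */
        if (!have_prev || parts[i].highest > prev_highest)
            prev_highest = parts[i].highest;
        have_prev = true;
    }
    return true;
}
```

For the listp example, the input order would be listp03 (0..3), listp12 (1..2), listp45 (4..5), then listpd. With listp03 and listpd marked as pruned the function returns true; with listp03 live it returns false because listp12's lowest value 1 sorts before 3.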

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#24

Julien Rouhaud

rjuju123@gmail.com

about 7 years ago

In reply to: David Rowley (#23)

Re: Ordered Partitioned Table Scans

On Wed, Dec 19, 2018 at 11:08 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Thu, 20 Dec 2018 at 09:48, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Wed, Dec 19, 2018 at 3:01 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

If so, I'd question why the default partition
is so special? Pruning of any of the other partitions could turn a
naturally unordered LIST partitioned table into a naturally ordered
partitioned table if the pruned partition happened to be the only one
with interleaved values. Handling only the DEFAULT partition in a
special way seems to violate the principle of least astonishment.

I'm not sure I'm following you. The default partition is by nature a
special partition, and its mere presence prevents this optimisation.
We can't possibly store all the sets of subsets of partitions that
would make the partitioned table naturally ordered if they were
pruned, so it seems like a different problem.

For example:

create table listp (a int) partition by list (a);
create table listp12 partition of listp for values in(1,2);
create table listp03 partition of listp for values in(0,3);
create table listp45 partition of listp for values in(4,5);
create table listpd partition of listp default;

select * from listp where a in(1,2,4,5);

Here we prune all but listp12 and listp45. Since the default is pruned
and listp03 is pruned then there are no interleaved values. By your
proposed design the natural ordering is not detected since we're
storing a flag that says the partitions are unordered due to listp03.

No, what I'm proposing is to store if the partitions are naturally
ordered or not, *and* recheck after pruning if that property could
have changed (so if some partitions have been pruned). So we avoid
extra processing if we already knew that the partitions were ordered
(possibly with the default partition pruning information), or if we
know that the partitions are not ordered and no partitions have been
pruned.

#25

David Rowley

david.rowley@2ndquadrant.com

about 7 years ago

In reply to: Julien Rouhaud (#24)

Re: Ordered Partitioned Table Scans

On Thu, 20 Dec 2018 at 18:20, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Wed, Dec 19, 2018 at 11:08 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

create table listp (a int) partition by list (a);
create table listp12 partition of listp for values in(1,2);
create table listp03 partition of listp for values in(0,3);
create table listp45 partition of listp for values in(4,5);
create table listpd partition of listp default;

select * from listp where a in(1,2,4,5);

Here we prune all but listp12 and listp45. Since the default is pruned
and listp03 is pruned then there are no interleaved values. By your
proposed design the natural ordering is not detected since we're
storing a flag that says the partitions are unordered due to listp03.

No, what I'm proposing is to store if the partitions are naturally
ordered or not, *and* recheck after pruning if that property could
have changed (so if some partitions have been pruned). So we avoid
extra processing if we already knew that the partitions were ordered
(possibly with the default partition pruning information), or if we
know that the partitions are not ordered and no partitions have been
pruned.

I see. So if the flag says "Yes", then we can skip the plan-time
check, if it says "No" and partitions were pruned, then we need to
re-check as the reason the flag says "No" might have been pruned away.
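
Put as code, my understanding of that decision is roughly the sketch below. The OrderCheck enum and decide_order_check() are hypothetical names for illustration only; cached_ordered stands for the proposed relcache flag and any_pruned for "at least one partition was pruned for this query".

```c
#include <stdbool.h>

/* Hypothetical outcome of consulting the cached flag at plan time. */
typedef enum OrderCheck
{
    ORDER_KNOWN_YES,        /* flag says ordered: skip the plan-time check */
    ORDER_KNOWN_NO,         /* flag says unordered and nothing was pruned */
    ORDER_NEEDS_RECHECK     /* flag says unordered but pruning happened */
} OrderCheck;

static OrderCheck
decide_order_check(bool cached_ordered, bool any_pruned)
{
    /* Pruning can never make an ordered set of partitions unordered. */
    if (cached_ordered)
        return ORDER_KNOWN_YES;

    /* The partition that made the flag "No" may have been pruned away. */
    if (any_pruned)
        return ORDER_NEEDS_RECHECK;

    return ORDER_KNOWN_NO;
}
```

Only the ORDER_NEEDS_RECHECK outcome would pay for a per-partition scan of the surviving partitions.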

I guess that works, but I had imagined that the test wouldn't be
particularly expensive. I thought that any extra cost from testing
more unpruned partitions would be unlikely to be measurable, but then
I've not written the code yet.

Are you saying that you think this patch should have this? Or are you
happy to leave it until the next round?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#26

Julien Rouhaud

rjuju123@gmail.com

about 7 years ago

In reply to: David Rowley (#25)

Re: Ordered Partitioned Table Scans

On Sun, Jan 6, 2019 at 4:24 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Thu, 20 Dec 2018 at 18:20, Julien Rouhaud <rjuju123@gmail.com> wrote:

No, what I'm proposing is to store if the partitions are naturally
ordered or not, *and* recheck after pruning if that property could
have changed (so if some partitions have been pruned). So we avoid
extra processing if we already knew that the partitions were ordered
(possibly with the default partition pruning information), or if we
know that the partitions are not ordered and no partitions have been
pruned.

I see. So if the flag says "Yes", then we can skip the plan-time
check, if it says "No" and partitions were pruned, then we need to
re-check as the reason the flag says "No" might have been pruned away.

Exactly.
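
The three outcomes agreed on above can be sketched in C as follows.
This is only an illustration of the proposal, not code from the patch;
OrderCheck and classify_order_check() are hypothetical names, and the
cached flag is assumed to have been computed once for the unpruned
partition set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result of consulting the cached "naturally ordered" flag. */
typedef enum OrderCheck
{
	ORDER_KNOWN_YES,		/* flag says ordered: skip the plan-time check */
	ORDER_KNOWN_NO,			/* flag says unordered and nothing was pruned */
	ORDER_NEEDS_RECHECK		/* flag says unordered, but pruning may have
							 * removed the partition that caused it */
} OrderCheck;

/*
 * Decide whether the plan-time ordering test has to run at all.
 * Pruning can only remove partitions, so it can never turn an ordered
 * set into an unordered one; only the "No" answer can become stale.
 */
OrderCheck
classify_order_check(bool cached_ordered, bool any_pruned)
{
	if (cached_ordered)
		return ORDER_KNOWN_YES;
	if (any_pruned)
		return ORDER_NEEDS_RECHECK;
	return ORDER_KNOWN_NO;
}
```

Only the ORDER_NEEDS_RECHECK case pays for the per-partition test, which
is the saving the flag is meant to buy.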

I guess that works, but I had imagined that the test wouldn't be
particularly expensive. I'd have thought that such a test costing a bit
more as more partitions are left unpruned would be unlikely to be
measurable, but then I've not written the code yet.

That's where my objection is, I think. IIUC, the tests aren't
especially expensive, one reason being that the multi-value list
partitioning case is out of scope. I was thinking that this flag
would allow us to keep this case in scope while not adding much
overhead, and could still be useful with future enhancements (though
saving some cycles with a huge number of partitions is probably, as
you said, a drop in the ocean).

Are you saying that you think this patch should have this? Or are you
happy to leave it until the next round?

I'd be happy if we can eventually handle ordered partitioned table
scans efficiently, including multi-value list partitioning. So if
that means that this optimization is not the best way to handle it,
or that it's just not the best time to implement it, I'm perfectly
fine with that.

#27

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: David Rowley (#25)

1 attachment(s)

Re: Ordered Partitioned Table Scans

I've attached a rebased patch which fixes up the recent conflicts. No
other changes were made.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v7-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (application/octet-stream)

From 7dff70a683151f721590af05bdab321d5123aece Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v7] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no default partition the subpaths of a
MergeAppend are always arranged in range order. This means that
MergeAppend, when sorting by the partition key or a superset of the
partition key, will always output tuples from earlier subpaths before later
subpaths.  LIST partitioned tables provide the same guarantee if they also
don't have a default partition, providing that none of the partitions are
defined to allow Datums with values which are interleaved with other
partitions.  For simplicity and speed of checking we currently just
disallow the optimization if any partition allows more than one Datum.
We may want to expand this later, but for now, it's a very cheap check to
implement.  A more thorough check would require performing analysis on the
partition bound.
---
 src/backend/nodes/outfuncs.c                  |   1 +
 src/backend/optimizer/path/allpaths.c         | 209 ++++++++++++++++++++-----
 src/backend/optimizer/path/costsize.c         |  51 ++++++-
 src/backend/optimizer/path/joinrels.c         |   2 +-
 src/backend/optimizer/path/pathkeys.c         |  59 +++++++
 src/backend/optimizer/plan/createplan.c       |  90 ++++++++---
 src/backend/optimizer/plan/planner.c          |   3 +-
 src/backend/optimizer/prep/prepunion.c        |   6 +-
 src/backend/optimizer/util/pathnode.c         |  23 ++-
 src/backend/partitioning/partbounds.c         |   4 +
 src/backend/partitioning/partprune.c          |  59 +++++++
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/optimizer/cost.h                  |   2 +-
 src/include/optimizer/pathnode.h              |   2 +-
 src/include/optimizer/paths.h                 |   2 +
 src/include/partitioning/partprune.h          |   2 +
 src/test/regress/expected/inherit.out         | 211 +++++++++++++++++++++++++-
 src/test/regress/expected/partition_prune.out |  64 ++++----
 src/test/regress/sql/inherit.sql              |  93 ++++++++++++
 src/test/regress/sql/partition_prune.sql      |  10 +-
 20 files changed, 770 insertions(+), 124 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 9d44e3e4c6..1117094b47 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1840,6 +1840,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 2144e14ec8..55307b6846 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -96,15 +96,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
 static void set_function_pathlist(PlannerInfo *root, RelOptInfo *rel,
@@ -1653,7 +1654,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1695,7 +1696,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1745,19 +1746,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1807,41 +1808,67 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provides tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
+ * cheapest total paths, and build a suitable path for each case.
  *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+												&partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection,
+											&partition_pathkeys_desc_partial);
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1850,6 +1877,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+						(!partition_pathkeys_partial &&
+						 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+								(pathkeys_contained_in(pathkeys,
+												partition_pathkeys_desc) ||
+						(!partition_pathkeys_desc_partial &&
+							pathkeys_contained_in(partition_pathkeys_desc,
+												  pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1892,26 +1942,81 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * When in partition order or descending partition order don't
+			 * flatten any sub-partition's paths unless they're an Append or
+			 * MergeAppend with a single subpath.  For the descending order
+			 * case we build the path list in reverse so that the Append scan
+			 * correctly scans the partitions in reverse order.
+			 */
+			if (partition_order)
+			{
+				/* Do the Append/MergeAppend flattening, when possible */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -2052,6 +2157,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		return 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -2072,7 +2205,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->pathlist = NIL;
 	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/*
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b8d406f230..61aaeabaa2 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1876,7 +1876,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1888,21 +1888,56 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
-
+		Path	   *isubpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 		/*
 		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
+		 * the first subpath. This may be overwritten below if the initial
+		 * path requires a sort.
 		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		apath->path.startup_cost = isubpath->startup_cost;
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
+		/*
+		 * Compute rows and costs as sums of subplan rows and costs taking
+		 * into account the cost of any sorts which may be required on
+		 * subplans.
+		 */
 		foreach(l, apath->subpaths)
 		{
 			Path	   *subpath = (Path *) lfirst(l);
 
 			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+
+			if (pathkeys != NIL &&
+				!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				/*
+				 * We'll need to insert a Sort node, so include cost for that.
+				 */
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  subpath->total_cost,
+						  subpath->parent->tuples,
+						  subpath->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  apath->limit_tuples);
+				apath->path.total_cost += sort_path.total_cost;
+
+				/*
+				 * When the first subpath needs sorting, set the startup cost
+				 * of the sort as the startup cost of the Append
+				 */
+				if (subpath == isubpath)
+					apath->path.startup_cost = sort_path.startup_cost;
+			}
+			else
+			{
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1910,6 +1945,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index dfbbfdac6d..de28228adf 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1237,7 +1237,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/* Set or update cheapest_total_path and related fields */
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..753a1c24a2 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -24,6 +24,7 @@
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
@@ -546,6 +547,64 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples never will be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme		partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											 ScanDirectionIsBackward(scandir),
+											 ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 1b4f7db649..65f2c8b35c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -200,8 +200,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1053,12 +1051,24 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	int			nodenumsortkeys;
+	AttrNumber *nodeSortColIdx;
+	Oid		   *nodeSortOperators;
+	Oid		   *nodeCollations;
+	bool	   *nodeNullsFirst;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1084,6 +1094,23 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1093,6 +1120,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1135,10 +1195,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5315,23 +5376,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b2239728cf..76c308f0e9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3926,6 +3926,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
@@ -6962,7 +6963,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * node, which would cause this relation to stop appearing to be a
 		 * dummy rel.)
 		 */
-		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL,
+		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL, NIL,
 													  NULL, 0, false, NIL,
 													  -1));
 		rel->partial_pathlist = NIL;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 55eeb5127c..c0ed9f5488 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b57de6b4c6..096c0b5132 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1220,7 +1220,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1254,7 +1254,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1264,10 +1264,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1275,6 +1279,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1288,7 +1301,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3623,7 +3636,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index d478ae7e19..df30959fcf 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1680,6 +1680,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1699,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 8c9721935d..9187e71399 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -175,7 +175,66 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in an earlier partition.  Returns
+ *		false if this is not possible, or if we have insufficient means to prove
+ *		it.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow when a DEFAULT
+		 * partition exists as this could contain tuples from either below or
+		 * above the defined range, or contain tuples belonging to gaps in the
+		 * defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
 
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's greater than one previously already seen, then
+		 * values could become out of order and we'd have to disable the
+		 * optimization.  For now, let's just keep it simple and just accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
 /*
  * make_partition_pruneinfo
  *		Builds a PartitionPruneInfo which can be used in the executor to allow
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index d3c477a542..4ca4753baa 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1326,6 +1326,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_PATH(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index d0c8f99d0a..13398d108e 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -65,7 +65,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 1b02b3b889..2bc454a214 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -201,6 +201,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index 397ffaab36..161ae21a43 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -72,6 +72,8 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partitions_are_ordered(struct PlannerInfo *root,
+					   struct RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
 						 struct RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f259d07535..8e0b0d2e1f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2001,7 +2001,210 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Append
+   ->  Sort
+         Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+         ->  Seq Scan on mcrparted0
+               Filter: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(11 rows)
+
+reset enable_seqscan;
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+explain (costs off) select * from bool_rp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_b_idx on bool_rp_false
+   ->  Index Only Scan using bool_rp_true_b_idx on bool_rp_true
+(3 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
@@ -2014,17 +2217,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 120b651bf5..32a7d7b93e 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3063,14 +3063,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3117,17 +3117,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3140,13 +3138,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3159,12 +3156,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3173,23 +3169,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 425052c1f4..58b9a055d5 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -700,8 +700,101 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+reset enable_seqscan;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+
+explain (costs off) select * from bool_rp order by b;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index dc327caffd..102893b6f6 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1

#28

Michael Paquier

michael@paquier.xyz

almost 7 years ago

In reply to: David Rowley (#27)

Re: Ordered Partitioned Table Scans

On Thu, Jan 31, 2019 at 04:29:56PM +1300, David Rowley wrote:

> I've attached a rebased patch which fixes up the recent conflicts. No
> other changes were made.

Moved to next CF, waiting for review.
--
Michael

#29

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: David Rowley (#27)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On Thu, 31 Jan 2019 at 16:29, David Rowley <david.rowley@2ndquadrant.com> wrote:

> I've attached a rebased patch which fixes up the recent conflicts. No
> other changes were made.

Rebased version due to a new call to create_append_path() added in
ab5fcf2b0. No other changes made.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v8-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (application/octet-stream)

From ce82e2c61207890a5be270e8f3dd904a9a30257b Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v8] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no default partition the subpaths of a
MergeAppend are always arranged in range order. This means that
MergeAppend, when sorting by the partition key or a superset of the
partition key, will always output tuples from earlier subpaths before later
subpaths.  LIST partitioned tables provide the same guarantee if they also
don't have a default partition, providing that none of the partitions are
defined to allow Datums with values which are interleaved with other
partitions.  For simplicity and speed of checking we currently just
disallow the optimization if any partition allows more than one Datum.
We may want to expand this later, but for now, it's a very cheap check to
implement.  A more thorough check would require performing analysis on the
partition bound.
---
 src/backend/nodes/outfuncs.c                  |   1 +
 src/backend/optimizer/path/allpaths.c         | 211 +++++++++++++++++++++-----
 src/backend/optimizer/path/costsize.c         |  51 ++++++-
 src/backend/optimizer/path/joinrels.c         |   2 +-
 src/backend/optimizer/path/pathkeys.c         |  59 +++++++
 src/backend/optimizer/plan/createplan.c       |  90 ++++++++---
 src/backend/optimizer/plan/planner.c          |   6 +-
 src/backend/optimizer/prep/prepunion.c        |   6 +-
 src/backend/optimizer/util/pathnode.c         |  23 ++-
 src/backend/partitioning/partbounds.c         |   4 +
 src/backend/partitioning/partprune.c          |  59 +++++++
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/optimizer/cost.h                  |   2 +-
 src/include/optimizer/pathnode.h              |   2 +-
 src/include/optimizer/paths.h                 |   2 +
 src/include/partitioning/partprune.h          |   2 +
 src/test/regress/expected/inherit.out         | 211 +++++++++++++++++++++++++-
 src/test/regress/expected/partition_prune.out |  64 ++++----
 src/test/regress/sql/inherit.sql              |  93 ++++++++++++
 src/test/regress/sql/partition_prune.sql      |  10 +-
 20 files changed, 773 insertions(+), 126 deletions(-)
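
The ordering argument in the commit message above can be illustrated with a small standalone sketch. It is not part of the patch; ToyPartition, toy_append and toy_is_sorted are hypothetical names made up for this illustration. Each partition's rows are assumed to be sorted already. Concatenating partitions in bound order stays sorted when the partitions hold non-overlapping ranges, but not when LIST partitions accept interleaved values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one partition's already-sorted scan output. */
typedef struct ToyPartition
{
	const int  *rows;			/* ascending order within the partition */
	size_t		nrows;
} ToyPartition;

/*
 * Emulate a plain Append: copy each partition's rows to 'out' in partition
 * order, exhausting one partition before moving to the next.  Returns the
 * number of rows written.  'out' must be large enough for all rows.
 */
static size_t
toy_append(const ToyPartition *parts, size_t nparts, int *out)
{
	size_t		n = 0;

	for (size_t p = 0; p < nparts; p++)
		for (size_t i = 0; i < parts[p].nrows; i++)
			out[n++] = parts[p].rows[i];
	return n;
}

/* True if 'vals' is in non-descending order. */
static bool
toy_is_sorted(const int *vals, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (vals[i - 1] > vals[i])
			return false;
	return true;
}
```

With range-like partitions {0,1,2} and {3,4,5} the appended output is sorted. With list-like partitions {1,4} and {2,3} it is not, which is the interleaving case the patch refuses to optimize.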

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 65302fe65b..d94944e62f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1838,6 +1838,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0debac75c6..a1e592b1f8 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -96,15 +96,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
 static void set_function_pathlist(PlannerInfo *root, RelOptInfo *rel,
@@ -1550,7 +1551,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1592,7 +1593,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1642,19 +1643,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1704,41 +1705,67 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provides tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+												&partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection,
+											&partition_pathkeys_desc_partial);
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1747,6 +1774,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partition pathkeys happen to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+						(!partition_pathkeys_partial &&
+						 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+								(pathkeys_contained_in(pathkeys,
+												partition_pathkeys_desc) ||
+						(!partition_pathkeys_desc_partial &&
+							pathkeys_contained_in(partition_pathkeys_desc,
+												  pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1789,26 +1839,81 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * When in partition order or descending partition order don't
+			 * flatten any sub-partition's paths unless they're an Append or
+			 * MergeAppend with a single subpath.  For the descending order
+			 * case we build the path list in reverse so that the Append scan
+			 * correctly scans the partitions in reverse order.
+			 */
+			if (partition_order)
+			{
+				/* Do the Append/MergeAppend flattening, when possible */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1949,6 +2054,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		return 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -1969,7 +2102,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->pathlist = NIL;
 	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/*
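
The partition_order test in generate_orderedappend_paths() above boils down to two prefix checks. Here is a minimal sketch of that decision, assuming a toy model in which a pathkey list is just an array of column numbers. The toy_* names are hypothetical and not part of the patch. The real pathkeys_contained_in() works on lists of canonical PathKey nodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: a pathkey list is an array of column numbers, most significant
 * first.  Returns true if 'a' is a prefix of (or equal to) 'b', mirroring
 * how the patch uses pathkeys_contained_in(a, b).
 */
static bool
toy_keys_contained_in(const int *a, size_t na, const int *b, size_t nb)
{
	if (na > nb)
		return false;
	for (size_t i = 0; i < na; i++)
		if (a[i] != b[i])
			return false;
	return true;
}

/*
 * An ordered Append is usable if the required ordering is a prefix of the
 * partition ordering, or if the complete (non-partial) partition ordering
 * is a prefix of the required ordering.
 */
static bool
toy_append_preserves_order(const int *query_keys, size_t nquery,
						   const int *part_keys, size_t npart,
						   bool part_keys_partial)
{
	if (toy_keys_contained_in(query_keys, nquery, part_keys, npart))
		return true;
	return !part_keys_partial &&
		toy_keys_contained_in(part_keys, npart, query_keys, nquery);
}
```

For example, with a table partitioned on column 1, ORDER BY (1, 3) qualifies only when the partition pathkeys cover the whole partition key. If they were truncated to a prefix, the second check is skipped.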
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..00b0f8a619 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,56 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
-
+		Path	   *isubpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 		/*
 		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
+		 * the first subpath. This may be overwritten below if the initial
+		 * path requires a sort.
 		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		apath->path.startup_cost = isubpath->startup_cost;
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
+		/*
+		 * Compute rows and costs as sums of subplan rows and costs taking
+		 * into account the cost of any sorts which may be required on
+		 * subplans.
+		 */
 		foreach(l, apath->subpaths)
 		{
 			Path	   *subpath = (Path *) lfirst(l);
 
 			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+
+			if (pathkeys != NIL &&
+				!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				/*
+				 * We'll need to insert a Sort node, so include cost for that.
+				 */
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  subpath->total_cost,
+						  subpath->parent->tuples,
+						  subpath->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  apath->limit_tuples);
+				apath->path.total_cost += sort_path.total_cost;
+
+				/*
+				 * When the first subpath needs to be sorted, use the startup
+				 * cost of the sort as the startup cost of the Append.
+				 */
+				if (subpath == isubpath)
+					apath->path.startup_cost = sort_path.startup_cost;
+			}
+			else
+			{
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1941,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
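
The non-parallel branch of cost_append() above can be summarised with a simplified sketch. It is only an illustration under stated assumptions: ToySubpath and ToyAppendCost are hypothetical types, and the sort costs are passed in as plain numbers where the patch would call cost_sort().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one Append child. */
typedef struct ToySubpath
{
	double		startup_cost;
	double		total_cost;
	bool		ordered;			/* already satisfies the Append's pathkeys? */
	double		sort_startup_cost;	/* assumed cost_sort() result if not */
	double		sort_total_cost;
} ToySubpath;

typedef struct ToyAppendCost
{
	double		startup_cost;
	double		total_cost;
} ToyAppendCost;

/*
 * Startup cost comes from the first child (or from its Sort, if it needs
 * one); total cost is the sum over all children, substituting the Sort's
 * total cost for any child that is not suitably ordered.
 */
static ToyAppendCost
toy_cost_ordered_append(const ToySubpath *subpaths, size_t nsubpaths)
{
	ToyAppendCost cost = {0.0, 0.0};

	for (size_t i = 0; i < nsubpaths; i++)
	{
		const ToySubpath *sp = &subpaths[i];
		double		startup = sp->ordered ? sp->startup_cost : sp->sort_startup_cost;
		double		total = sp->ordered ? sp->total_cost : sp->sort_total_cost;

		if (i == 0)
			cost.startup_cost = startup;
		cost.total_cost += total;
	}
	return cost;
}
```

Note that an unsorted first child pushes the whole Append's startup cost up to the Sort's startup cost, which matters for LIMIT queries that depend on a cheap startup.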
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index dfbbfdac6d..de28228adf 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1237,7 +1237,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/* Set or update cheapest_total_path and related fields */
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..753a1c24a2 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -24,6 +24,7 @@
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
@@ -546,6 +547,64 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples will never be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme		partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											 ScanDirectionIsBackward(scandir),
+											 ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
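
The *partialkeys output of build_partition_pathkeys() above follows from the loop's early return. A rough sketch of that control flow, with a hypothetical toy_build_partition_keys() and a per-column flag standing in for whether make_pathkey_from_sortinfo() returned a pathkey (the redundancy check is left out):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of the loop: key_usable[i] says whether a pathkey could be made
 * for partition key column i.  Returns how many leading columns got a
 * pathkey and sets *partialkeys to true when the loop had to stop early.
 */
static size_t
toy_build_partition_keys(const bool *key_usable, size_t nkeys,
						 bool *partialkeys)
{
	size_t		built = 0;

	for (size_t i = 0; i < nkeys; i++)
	{
		if (!key_usable[i])
		{
			/* Only a prefix of the partition key is covered. */
			*partialkeys = true;
			return built;
		}
		built++;
	}

	/* Every partition key column is covered. */
	*partialkeys = false;
	return built;
}
```

As the code reads, *partialkeys ends up true exactly when the returned list covers only a leading part of the partition key.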
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 236f506cfb..3a90934933 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,24 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	int			nodenumsortkeys;
+	AttrNumber *nodeSortColIdx;
+	Oid		   *nodeSortOperators;
+	Oid		   *nodeCollations;
+	bool	   *nodeNullsFirst;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1099,23 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1125,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1200,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5281,23 +5342,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bc81535905..5ffbad48d9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1602,7 +1602,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -3915,6 +3916,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
@@ -6951,7 +6953,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * node, which would cause this relation to stop appearing to be a
 		 * dummy rel.)
 		 */
-		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL,
+		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL, NIL,
 													  NULL, 0, false, NIL,
 													  -1));
 		rel->partial_pathlist = NIL;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 55eeb5127c..c0ed9f5488 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 169e51e792..356d2469c4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,7 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1252,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1263,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1276,7 +1289,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3708,7 +3721,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index e71eb3793b..b32529090b 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1682,6 +1682,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1699,6 +1701,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 8c9721935d..9187e71399 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -175,7 +175,66 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in an earlier partition.  Returns
+ *		false if this is not possible, or if we have insufficient means to
+ *		prove it.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow when a DEFAULT
+		 * partition exists as this could contain tuples from either below or
+		 * above the defined range, or contain tuples belonging to gaps in the
+		 * defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
 
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's greater than one previously already seen, then
+		 * values could become out of order and we'd have to disable the
+		 * optimization.  For now, let's just keep it simple and just accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
 /*
  * make_partition_pruneinfo
  *		Builds a PartitionPruneInfo which can be used in the executor to allow
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index a008ae07da..66e34a162a 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1359,6 +1359,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_PATH(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..1bcd0e4235 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 040335a7c5..66246136d9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -195,6 +195,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index 2f75717ffb..53e492108b 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -74,6 +74,8 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partitions_are_ordered(struct PlannerInfo *root,
+					   struct RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
 						 struct RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 565d947b6d..11f7ec1888 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2042,7 +2042,210 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Append
+   ->  Sort
+         Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+         ->  Seq Scan on mcrparted0
+               Filter: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(11 rows)
+
+reset enable_seqscan;
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+explain (costs off) select * from bool_rp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_b_idx on bool_rp_false
+   ->  Index Only Scan using bool_rp_true_b_idx on bool_rp_true
+(3 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
@@ -2055,17 +2258,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 30946f77b6..2c9e42b51a 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3063,14 +3063,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3117,17 +3117,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3140,13 +3138,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3159,12 +3156,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3173,23 +3169,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..f696808a98 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,101 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is uses when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+reset enable_seqscan;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+
+explain (costs off) select * from bool_rp order by b;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index dc327caffd..102893b6f6 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1
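
For readers skimming the patch above, the decision made by partitions_are_ordered() can be modeled in isolation. This is only a sketch under stated assumptions: BoundModel, Strategy and model_partitions_are_ordered() are hypothetical stand-ins invented for illustration, not PostgreSQL types or functions. The real function reads the same facts from PartitionBoundInfo and RelOptInfo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the partition strategy constants. */
typedef enum
{
	STRATEGY_HASH,
	STRATEGY_LIST,
	STRATEGY_RANGE
} Strategy;

/* Hypothetical summary of the facts the patch reads from the bound info. */
typedef struct BoundModel
{
	Strategy	strategy;
	bool		has_default;	/* a DEFAULT partition exists */
	bool		accepts_nulls;	/* some LIST partition accepts NULL */
	int			ndatums;		/* number of bound datums */
	int			nparts;			/* number of partitions */
} BoundModel;

/*
 * Returns true when scanning the partitions in bound order is guaranteed
 * to visit tuples in partition key order, following the rules described
 * in the patch comments.
 */
static bool
model_partitions_are_ordered(const BoundModel *b)
{
	assert(b != NULL);

	switch (b->strategy)
	{
		case STRATEGY_RANGE:
			/* A DEFAULT partition may hold tuples from any gap or end. */
			return !b->has_default;

		case STRATEGY_LIST:
			if (b->has_default)
				return false;
			/* Cheap test: exactly one accepted value per partition. */
			return b->ndatums + (b->accepts_nulls ? 1 : 0) == b->nparts;

		default:
			return false;
	}
}
```

With the mclparted table from the regression tests, two partitions accepting 1 and 2 give ndatums = 2 and nparts = 2, so the model allows an ordered Append. After adding the partitions for (3,5) and (4) there are five datums across four partitions, so it falls back to MergeAppend, which is what the expected output shows.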

#30

Antonin Houska

ah@cybertec.at

almost 7 years ago

In reply to: David Rowley (#15)

Re: Ordered Partitioned Table Scans

David Rowley <david.rowley@2ndquadrant.com> wrote:

On Mon, 5 Nov 2018 at 10:46, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 1 November 2018 at 22:05, Antonin Houska <ah@cybertec.at> wrote:

I think these conditions are too restrictive:

    /*
     * Determine if these pathkeys match the partition order, or reverse
     * partition order. It can't match both, so only go to the trouble of
     * checking the reverse order when it's not in ascending partition
     * order.
     */
    partition_order = pathkeys_contained_in(pathkeys,
                                            partition_pathkeys);
    partition_order_desc = !partition_order &&
        pathkeys_contained_in(pathkeys,
                              partition_pathkeys_desc);

The problem with doing that is that if the partition keys are better
than the pathkeys then we'll most likely fail to generate any
partition keys at all due to lack of any existing eclass to use for
the pathkeys. It's unsafe to use just the prefix in this case, as the
eclass may not have been found due to, for example, one of the
partition keys having a different collation than the required sort
order of the query. In other words, we can't rely on a failure to
create the pathkey meaning that a more strict sort order is not
required.

I had another look at this patch and it seems okay just to add a new
flag to build_partition_pathkeys() to indicate whether the pathkey List
was truncated. In generate_mergeappend_paths() we can then just
check that flag before checking if the partition pathkeys are
contained in pathkeys. It's fine if the partition keys were truncated
for the reverse of that check.
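
A minimal standalone sketch of that gating, assuming pathkey lists can be modeled as plain arrays of integer key ids. KeyList, keys_contained_in() and matches_partition_order() are hypothetical names used only for illustration; the real code works on Lists of PathKey nodes and calls pathkeys_contained_in().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a pathkey list is an array of integer key ids. */
typedef struct KeyList
{
	const int  *keys;
	size_t		nkeys;
} KeyList;

/*
 * True if 'a' is a leading prefix of 'b', that is, 'b' is at least as
 * well sorted as 'a'.  Models pathkeys_contained_in(a, b).
 */
static bool
keys_contained_in(KeyList a, KeyList b)
{
	assert(a.nkeys == 0 || a.keys != NULL);
	assert(b.nkeys == 0 || b.keys != NULL);

	if (a.nkeys > b.nkeys)
		return false;
	for (size_t i = 0; i < a.nkeys; i++)
	{
		if (a.keys[i] != b.keys[i])
			return false;
	}
	return true;
}

/*
 * Decide whether the partition order satisfies the query's required order.
 * 'part_keys_truncated' models the new flag: it is true when some
 * partition key could not be turned into a pathkey.
 */
static bool
matches_partition_order(KeyList query_keys, KeyList part_keys,
						bool part_keys_truncated)
{
	/* Query order is a prefix of the partition order: always safe. */
	if (keys_contained_in(query_keys, part_keys))
		return true;

	/* Partition order is a prefix of the query order: needs all keys. */
	return !part_keys_truncated &&
		keys_contained_in(part_keys, query_keys);
}
```

The flag only guards the second test. When the partition pathkeys were cut short, treating them as a prefix of the query's pathkeys would ignore a partition key that may sort differently, for example because of a different collation.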

I've done this in the attached and added additional regression tests
for this case.

I agree with this approach and I'm also fine with your other comments / code
changes to address my review.

As for the latest version (v8-0001-...) I've only caught a small typo: "When
the first subpath needs sorted ...". It was probably meant "... needs sort
...".

--
Antonin Houska
https://www.cybertec-postgresql.com

#31

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Antonin Houska (#30)

1 attachment(s)

Re: Ordered Partitioned Table Scans

Thanks a lot for taking the time to look at this.

On Tue, 5 Mar 2019 at 03:03, Antonin Houska <ah@cybertec.at> wrote:

As for the latest version (v8-0001-...) I've only caught a small typo: "When
the first subpath needs sorted ...". It was probably meant "... needs sort
...".

That was a sort of short way of saying "needs [to be] sorted". I've
added in the missing "to be" in the attached.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v9-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch (application/octet-stream)

From 50303bdd7f316c4247efaaa510071681185893df Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v9] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no default partition the subpaths of a
MergeAppend are always arranged in range order. This means that
MergeAppend, when sorting by the partition key or a superset of the
partition key, will always output tuples from earlier subpaths before later
subpaths.  LIST partitioned tables provide the same guarantee if they also
don't have a default partition, providing that none of the partitions are
defined to allow Datums with values which are interleaved with other
partitions.  For simplicity and speed of checking we currently just
disallow the optimization if any partition allows more than one Datum.
We may want to expand this later, but for now, it's a very cheap check to
implement.  A more thorough check would require performing analysis on the
partition bound.
---
 src/backend/nodes/outfuncs.c                  |   1 +
 src/backend/optimizer/path/allpaths.c         | 211 +++++++++++++++++++++-----
 src/backend/optimizer/path/costsize.c         |  51 ++++++-
 src/backend/optimizer/path/joinrels.c         |   2 +-
 src/backend/optimizer/path/pathkeys.c         |  59 +++++++
 src/backend/optimizer/plan/createplan.c       |  90 ++++++++---
 src/backend/optimizer/plan/planner.c          |   6 +-
 src/backend/optimizer/prep/prepunion.c        |   6 +-
 src/backend/optimizer/util/pathnode.c         |  23 ++-
 src/backend/partitioning/partbounds.c         |   4 +
 src/backend/partitioning/partprune.c          |  59 +++++++
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/optimizer/cost.h                  |   2 +-
 src/include/optimizer/pathnode.h              |   2 +-
 src/include/optimizer/paths.h                 |   2 +
 src/include/partitioning/partprune.h          |   2 +
 src/test/regress/expected/inherit.out         | 211 +++++++++++++++++++++++++-
 src/test/regress/expected/partition_prune.out |  64 ++++----
 src/test/regress/sql/inherit.sql              |  93 ++++++++++++
 src/test/regress/sql/partition_prune.sql      |  10 +-
 20 files changed, 773 insertions(+), 126 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 65302fe65b..d94944e62f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1838,6 +1838,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0debac75c6..a1e592b1f8 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -96,15 +96,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
 static void set_function_pathlist(PlannerInfo *root, RelOptInfo *rel,
@@ -1550,7 +1551,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1592,7 +1593,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1642,19 +1643,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1704,41 +1705,67 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provides tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+												&partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection,
+											&partition_pathkeys_desc_partial);
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1747,6 +1774,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partition pathkeys happen to be contained in pathkeys then this
+		 * is fine too, provided that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out-of-order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+						(!partition_pathkeys_partial &&
+						 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+								(pathkeys_contained_in(pathkeys,
+												partition_pathkeys_desc) ||
+						(!partition_pathkeys_desc_partial &&
+							pathkeys_contained_in(partition_pathkeys_desc,
+												  pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1789,26 +1839,81 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * When in partition order or descending partition order don't
+			 * flatten any sub-partition's paths unless they're an Append or
+			 * MergeAppend with a single subpath.  For the descending order
+			 * case we build the path list in reverse so that the Append scan
+			 * correctly scans the partitions in reverse order.
+			 */
+			if (partition_order)
+			{
+				/* Do the Append/MergeAppend flattening, when possible */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1949,6 +2054,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend, or
+ *		returns 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -1969,7 +2102,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->pathlist = NIL;
 	rel->partial_pathlist = NIL;
 
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/*
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..bf45b15453 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,56 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
-
+		Path	   *isubpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 		/*
 		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
+		 * the first subpath. This may be overwritten below if the initial
+		 * path requires a sort.
 		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		apath->path.startup_cost = isubpath->startup_cost;
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
+		/*
+		 * Compute rows and costs as sums of subplan rows and costs taking
+		 * into account the cost of any sorts which may be required on
+		 * subplans.
+		 */
 		foreach(l, apath->subpaths)
 		{
 			Path	   *subpath = (Path *) lfirst(l);
 
 			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+
+			if (pathkeys != NIL &&
+				!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				/*
+				 * We'll need to insert a Sort node, so include cost for that.
+				 */
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  subpath->total_cost,
+						  subpath->parent->tuples,
+						  subpath->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  apath->limit_tuples);
+				apath->path.total_cost += sort_path.total_cost;
+
+				/*
+				 * When the first subpath needs to be sorted, set the startup
+				 * cost of the sort as the startup cost of the Append.
+				 */
+				if (subpath == isubpath)
+					apath->path.startup_cost = sort_path.startup_cost;
+			}
+			else
+			{
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1941,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index dfbbfdac6d..de28228adf 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1237,7 +1237,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/* Set or update cheapest_total_path and related fields */
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..753a1c24a2 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -24,6 +24,7 @@
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
@@ -546,6 +547,64 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples will never be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme		partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											 ScanDirectionIsBackward(scandir),
+											 ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 236f506cfb..3a90934933 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,24 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	int			nodenumsortkeys;
+	AttrNumber *nodeSortColIdx;
+	Oid		   *nodeSortOperators;
+	Oid		   *nodeCollations;
+	bool	   *nodeNullsFirst;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1099,23 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1125,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1200,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5281,23 +5342,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bc81535905..5ffbad48d9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1602,7 +1602,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -3915,6 +3916,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
@@ -6951,7 +6953,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		 * node, which would cause this relation to stop appearing to be a
 		 * dummy rel.)
 		 */
-		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL,
+		rel->pathlist = list_make1(create_append_path(root, rel, NIL, NIL, NIL,
 													  NULL, 0, false, NIL,
 													  -1));
 		rel->partial_pathlist = NIL;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index eb815c2f12..65d77a5402 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 169e51e792..356d2469c4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,7 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1252,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1263,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1276,7 +1289,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3708,7 +3721,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index e71eb3793b..b32529090b 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1682,6 +1682,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1699,6 +1701,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 8c9721935d..9187e71399 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -175,7 +175,66 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in an earlier partition.  Returns
+ *		false if this is not possible, or if we have insufficient means to
+ *		prove it.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow this when a
+		 * DEFAULT partition exists as this could contain tuples from either
+		 * below or above the defined range, or contain tuples belonging to
+		 * gaps in the defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
 
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this by looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's greater than one previously already seen, then
+		 * values could become out of order and we'd have to disable the
+		 * optimization.  For now, let's just keep it simple and just accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
 /*
  * make_partition_pruneinfo
  *		Builds a PartitionPruneInfo which can be used in the executor to allow
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index a008ae07da..66e34a162a 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1359,6 +1359,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_PATH(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..1bcd0e4235 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 040335a7c5..66246136d9 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -195,6 +195,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index 2f75717ffb..53e492108b 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -74,6 +74,8 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partitions_are_ordered(struct PlannerInfo *root,
+					   struct RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
 						 struct RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 565d947b6d..11f7ec1888 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2042,7 +2042,210 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Append
+   ->  Sort
+         Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+         ->  Seq Scan on mcrparted0
+               Filter: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(11 rows)
+
+reset enable_seqscan;
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+explain (costs off) select * from bool_rp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_b_idx on bool_rp_false
+   ->  Index Only Scan using bool_rp_true_b_idx on bool_rp_true
+(3 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
@@ -2055,17 +2258,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 30946f77b6..2c9e42b51a 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3063,14 +3063,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3117,17 +3117,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3140,13 +3138,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3159,12 +3156,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3173,23 +3169,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..f696808a98 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,101 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+reset enable_seqscan;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_rp (b bool) partition by list(b);
+create table bool_rp_true partition of bool_rp for values in(true);
+create table bool_rp_false partition of bool_rp for values in(false);
+create index on bool_rp (b);
+
+explain (costs off) select * from bool_rp order by b;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index dc327caffd..102893b6f6 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1

#32

Robert Haas

robertmhaas@gmail.com

almost 7 years ago

In reply to: David Rowley (#23)

Re: Ordered Partitioned Table Scans

On Wed, Dec 19, 2018 at 5:08 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

With my idea for using live_parts, we'll process the partitions
looking for interleaved values on each query, after pruning takes
place. In this case, we'll see the partitions are naturally ordered. I
don't really foresee any issues with that additional processing since
it will only be a big effort when there are a large number of
partitions, and in those cases the planner already has lots of work to
do. Such processing is just a drop in the ocean when compared to path
generation for all those partitions.

I agree that partitions_are_ordered() is cheap enough in this patch
that it probably doesn't matter whether we cache the result. On the
other hand, that's mostly because you haven't handled the hard cases -
e.g. interleaved list partitions. If you did, then it would be
expensive, and it probably *would* be worth caching the result. Now
maybe those hard cases aren't worth handling anyway.

You also seem to be saying that since run-time partition pruning
might change the answer, caching the initial answer is pointless. But
I think Julien has made a good argument for why that's wrong: if the
initial answer is that the partitions are ordered, which will often be
true, then we can skip all later checks.

So I am OK with the fact that this patch doesn't choose to cache it,
but I don't really buy any of your arguments for why it would be a bad
idea.
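
To make the "hard case" above concrete, here is a minimal sketch of what a check for interleaved LIST partitions might look like. It is not code from the patch: the ListPart struct and list_parts_are_ordered() are hypothetical names, values are plain ints rather than Datums, and each partition's accepted values are assumed to be sorted ascending. The per-partition loop is the extra work that might make caching the result worthwhile.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for one LIST partition's bound: the values it
 * accepts, sorted ascending.  Not a PostgreSQL structure.
 */
typedef struct ListPart
{
	const int  *values;
	size_t		nvalues;
} ListPart;

/*
 * Returns true if scanning parts[0], parts[1], ... in array order can only
 * yield ascending values, i.e. no partition accepts a value lower than the
 * highest value accepted by the partition before it.  Comparing neighbours
 * is enough: if every partition starts above the end of its predecessor,
 * it also starts above every earlier partition.
 */
bool
list_parts_are_ordered(const ListPart *parts, size_t nparts)
{
	for (size_t i = 1; i < nparts; i++)
	{
		const ListPart *prev = &parts[i - 1];

		assert(prev->nvalues >= 1 && parts[i].nvalues >= 1);

		/* Highest value of the previous partition vs. lowest of this one */
		if (parts[i].values[0] < prev->values[prev->nvalues - 1])
			return false;
	}

	return true;
}
```

With the mclparted partitions from the regression test (1), (2), (3,5) and (4), this returns false, because 4 sorts below the 5 accepted by the preceding partition. With only (1) and (2) it returns true.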

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#33

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Robert Haas (#32)

Re: Ordered Partitioned Table Scans

On Wed, 6 Mar 2019 at 07:17, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 19, 2018 at 5:08 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

With my idea for using live_parts, we'll process the partitions
looking for interleaved values on each query, after pruning takes
place. In this case, we'll see the partitions are naturally ordered. I
don't really foresee any issues with that additional processing since
it will only be a big effort when there are a large number of
partitions, and in those cases the planner already has lots of work to
do. Such processing is just a drop in the ocean when compared to path
generation for all those partitions.

I agree that partitions_are_ordered() is cheap enough in this patch
that it probably doesn't matter whether we cache the result. On the
other hand, that's mostly because you haven't handled the hard cases -
e.g. interleaved list partitions. If you did, then it would be
expensive, and it probably *would* be worth caching the result. Now
maybe those hard cases aren't worth handling anyway.

I admit that I didn't understand the idea of the flag at the time. I
failed to see the point of it, since I had thought the flag would be
useless if partitions are pruned at plan time. However, as Julien
explained, "Yes" would mean that ordered scans are always okay, and
"No" would mean "recheck, using only the non-pruned partitions, when
some partitions have been pruned". That seems fine and very sane to me
now that I understand it. FWIW, my moment of realisation came in [1].

However, my thoughts are that adding new flags and the live_parts
field in RelOptInfo raises the bar a bit for this patch. There's
already quite a number of partition-related fields in RelOptInfo.
Understanding what each of those does is not trivial, so I figured
that this patch would be much easier to consider if I skipped that
part for the first cut version. I feared that Amit's planner
improvement patches would leave the set of available fields quite
unstable, and I didn't want to depend on WIP patches. I had to deal
with that last year on run-time pruning and it turned out not to be fun.
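
For readers following the flag discussion, the "Yes means Yes, No means recheck" semantics could be sketched roughly as below. This is only an illustration under stated assumptions, not anything in the patch: PartRange, the live[] array (standing in for a live_parts set) and ordered_scan_ok() are all hypothetical, and bounds are reduced to a lowest and highest accepted int per partition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical summary of one partition's bound: the lowest and highest
 * value it accepts.  Not a PostgreSQL structure.
 */
typedef struct PartRange
{
	int			lo;
	int			hi;
} PartRange;

/*
 * all_parts_ordered plays the role of the cached flag: true means an
 * ordered scan is always fine, whatever gets pruned.  false means the
 * answer depends on which partitions survive pruning, so recheck using
 * only the live ones.
 */
bool
ordered_scan_ok(const PartRange *parts, const bool *live, size_t nparts,
				bool all_parts_ordered)
{
	bool		have_prev = false;
	int			prev_hi = 0;

	if (all_parts_ordered)
		return true;			/* "Yes" means "Yes": skip all checks */

	for (size_t i = 0; i < nparts; i++)
	{
		if (!live[i])
			continue;			/* pruned partitions cannot interleave */

		assert(parts[i].lo <= parts[i].hi);

		if (have_prev && parts[i].lo < prev_hi)
			return false;		/* overlaps the previous live partition */

		prev_hi = parts[i].hi;
		have_prev = true;
	}

	return true;
}
```

With the partitions (1), (2), (3,5) and (4), the flag would be false. The recheck fails while all four are live, but succeeds once the (3,5) partition has been pruned.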

You also seem to be saying that since run-time partition pruning
might change the answer, caching the initial answer is pointless. But
I think Julien has made a good argument for why that's wrong: if the
initial answer is that the partitions are ordered, which will often be
true, then we can skip all later checks.

So I am OK with the fact that this patch doesn't choose to cache it,
but I don't really buy any of your arguments for why it would be a bad
idea.

OK, good. I agree. For the record, I want to steer clear of the flag
in this first cut version, especially now, given what time it is.

[1]: /messages/by-id/CAKJS1f_r51OAPsN1oC4i36D7vznnihNk+1wiDFG0qRVb_eOKWg@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#34

Tom Lane

tgl@sss.pgh.pa.us

almost 7 years ago

In reply to: David Rowley (#31)

Re: Ordered Partitioned Table Scans

David Rowley <david.rowley@2ndquadrant.com> writes:

[ v9-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch ]

I took a quick look through this and I'm not very happy with it.
It seems to me that the premise ought to be "just use an Append
if we can prove the output would be ordered anyway", but that's not
what we actually have here: instead you're adding more infrastructure
onto Append, which notably involves invasive changes to the API of
create_append_path, which is the main reason why the patch keeps breaking.
(It's broken again as of HEAD, though the cfbot doesn't seem to have
noticed yet.) Likewise there's a bunch of added complication in
cost_append, create_append_plan, etc. I think you should remove all that
and restrict this optimization to the case where all the subpaths are
natively ordered --- if we have to insert Sorts, it's hardly going to move
the needle to worry about simplifying the parent MergeAppend to Append.

There also seem to be bits that duplicate functionality of the
drop-single-child-[Merge]Append patch (specifically I'm looking
at get_singleton_append_subpath). Why do we need that?

The logic in build_partition_pathkeys is noticeably stupider than
build_index_pathkeys, in particular it's not bright about boolean columns.
Maybe that's fine, but if so it deserves a comment explaining why we're
not bothering. Also, the comment for build_index_pathkeys specifies that
callers should do truncate_useless_pathkeys, which they do; why is that
not relevant here?

regards, tom lane

#35

Julien Rouhaud

rjuju123@gmail.com

almost 7 years ago

In reply to: Tom Lane (#34)

Re: Ordered Partitioned Table Scans

On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <david.rowley@2ndquadrant.com> writes:

[ v9-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch ]

I think you should remove all that
and restrict this optimization to the case where all the subpaths are
natively ordered --- if we have to insert Sorts, it's hardly going to move
the needle to worry about simplifying the parent MergeAppend to Append.

This can be a huge win for queries of the form "ORDER BY partkey LIMIT
x". Even if the first subpath(s) aren't natively ordered, not all of
the sorts should actually be performed.
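
To illustrate the LIMIT point, here is a toy C model (not PostgreSQL code; every name in it is hypothetical). It counts how many child subplans must be started, and therefore possibly sorted, before `limit` rows can be returned. It assumes an ordered Append drains its children one at a time, while a MergeAppend has to fetch the first tuple from every child up front.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of children an ordered Append has to start in order to return
 * 'limit' rows, given the row count of each child in partition order.
 */
static size_t
toy_append_children_started(const size_t *child_rows, size_t nchildren,
                            size_t limit)
{
    size_t started = 0;
    size_t returned = 0;

    assert(child_rows != NULL || nchildren == 0);

    for (size_t i = 0; i < nchildren && returned < limit; i++)
    {
        started++;                  /* this child runs (and sorts, if needed) */
        returned += child_rows[i];  /* it can supply at most this many rows */
    }
    return started;
}

/*
 * A MergeAppend needs the first tuple of every child before it can emit
 * anything, so every child is started as soon as one row is requested.
 */
static size_t
toy_merge_append_children_started(size_t nchildren, size_t limit)
{
    return limit > 0 ? nchildren : 0;
}
```

With three partitions of 100 rows each and LIMIT 10, the Append model starts one child and the MergeAppend model starts all three; any Sort sitting above the second or third child is never executed in the Append case.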

#36

Tom Lane

tgl@sss.pgh.pa.us

almost 7 years ago

In reply to: Julien Rouhaud (#35)

Re: Ordered Partitioned Table Scans

Julien Rouhaud <rjuju123@gmail.com> writes:

On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think you should remove all that
and restrict this optimization to the case where all the subpaths are
natively ordered --- if we have to insert Sorts, it's hardly going to move
the needle to worry about simplifying the parent MergeAppend to Append.

This can be a huge win for queries of the form "ORDER BY partkey LIMIT
x". Even if the first subpath(s) aren't natively ordered, not all of
the sorts should actually be performed.

[ shrug... ] We've got no realistic chance of estimating such situations
properly, so I'd have no confidence in a plan choice based on such a
thing. Nor do I believe that this case is all that important.

regards, tom lane

#37

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Tom Lane (#36)

Re: Ordered Partitioned Table Scans

On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Julien Rouhaud <rjuju123@gmail.com> writes:

On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think you should remove all that
and restrict this optimization to the case where all the subpaths are
natively ordered --- if we have to insert Sorts, it's hardly going to move
the needle to worry about simplifying the parent MergeAppend to Append.

This can be a huge win for queries of the form "ORDER BY partkey LIMIT
x". Even if the first subpath(s) aren't natively ordered, not all of
the sorts should actually be performed.

[ shrug... ] We've got no realistic chance of estimating such situations
properly, so I'd have no confidence in a plan choice based on such a
thing.

With all due respect, I'd say that's not even close to being true.

A MergeAppend's startup cost ends up set to the sum of all of its
subplans' startup costs, plus the cost of any Sort that will be required
if a subpath is not sufficiently ordered already. An Append's startup
cost will just be the startup cost of the first subpath. This works
because, unlike MergeAppend, we don't need to pull the first tuple out
of every subnode to find the lowest one. In Julien's case, such an
Append plan has the potential to weigh in massively cheaper than a
MergeAppend plan. Just imagine some large sorts in some later
subpath.
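
The difference in startup cost can be sketched with a small C model. This is a simplification under stated assumptions, not the planner's actual costing: the struct and function names are hypothetical, the Sort is reduced to one extra startup figure per subpath, and the MergeAppend's own heap overhead is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the cost fields of a subpath. */
typedef struct ToySubpathCost
{
    double startup_cost;  /* cost before the subpath returns its first tuple */
    double sort_startup;  /* extra startup cost if a Sort is needed, else 0 */
} ToySubpathCost;

/* MergeAppend: every child must produce its first tuple before output. */
static double
toy_merge_append_startup(const ToySubpathCost *subpaths, size_t n)
{
    double total = 0.0;

    for (size_t i = 0; i < n; i++)
        total += subpaths[i].startup_cost + subpaths[i].sort_startup;
    return total;
}

/* Ordered Append: only the first child has to be started. */
static double
toy_ordered_append_startup(const ToySubpathCost *subpaths, size_t n)
{
    assert(n > 0);
    return subpaths[0].startup_cost + subpaths[0].sort_startup;
}
```

For subpaths with startup costs 1, 2 and 3, where only the last one needs a Sort costing 500 to start, the MergeAppend model reports 506 and the Append model reports 1.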

Can you explain why you think that's not properly being estimated in the patch?

Nor do I believe that this case is all that important.

Can you explain why you believe that?

I see you were the author of b1577a7c78d, which was committed over 19
years ago, so I'm surprised to hear you say cheap startup plans are
not important. Or is it just that you don't think they're important
for partitioned tables?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#38

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Tom Lane (#34)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On Sat, 9 Mar 2019 at 09:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I took a quick look through this and I'm not very happy with it.
It seems to me that the premise ought to be "just use an Append
if we can prove the output would be ordered anyway", but that's not
what we actually have here: instead you're adding more infrastructure
onto Append, which notably involves invasive changes to the API of
create_append_path, which is the main reason why the patch keeps breaking.

Can you suggest how else we could teach higher paths that an Append is
ordered by some path keys without giving the append some pathkeys?
That's what pathkeys are for, so I struggle to imagine how else this
could work. If we don't do this, then how is a MergeJoin going to
know it does not need to sort before joining?

As for "the patch keeps breaking"... those are just conflicts
with other changes that have been made in master. That seems fairly
normal to me.

(It's broken again as of HEAD, though the cfbot doesn't seem to have
noticed yet.)

I think it's not been updating itself for a few days.

Likewise there's a bunch of added complication in
cost_append, create_append_plan, etc. I think you should remove all that
and restrict this optimization to the case where all the subpaths are
natively ordered --- if we have to insert Sorts, it's hardly going to move
the needle to worry about simplifying the parent MergeAppend to Append.

I think the patch would be less than half as useful if we do that.
Can you explain why you think that fast startup plans are less
important for partitioned tables?

I could perhaps understand an argument against this if the patch added
masses of complex code to achieve the goal, but in my opinion, the
code is fairly easy to understand and there's not very much extra code
added.

There also seem to be bits that duplicate functionality of the
drop-single-child-[Merge]Append patch (specifically I'm looking
at get_singleton_append_subpath). Why do we need that?

Hmm, that patch is separate functionality. The patch you're talking
about, as you know, just removes Append/MergeAppends that have a
single subpath. Over here we add smarts to allow conversion of
MergeAppends into Appends when the order of the partitions is defined
the same as the required order of the would-be MergeAppend path.

get_singleton_append_subpath() allows a sub-partitioned table's
MergeAppend or Append subpath to be pulled into the top-level Append
when that MergeAppend or Append contains just a single subpath.

An example from the tests:

Append
  -> Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
  -> Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
  -> Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
  -> Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
  -> Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
  -> Merge Append
       Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
       -> Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
       -> Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def

If the nested MergeAppend path had just a single subpath then
get_singleton_append_subpath() would have pulled that subpath into the
top-level Append. However, in this example, since the MergeAppend has
multiple subpaths, the pull-up would be invalid since the top-level
Append can't guarantee the sort order of those MergeAppend subpaths.
In fact, the test directly after that one drops the mcrparted5_def
table which turns the plan into:

Append
  -> Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
  -> Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
  -> Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
  -> Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
  -> Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
  -> Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
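
The pull-up rule described above can be modelled in a few lines of C. This is a toy sketch with hypothetical types, not the planner's Path nodes: a nested Append or MergeAppend is replaced by its child only when it has exactly one child, and is otherwise left in place so that it keeps enforcing its own ordering.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a planner path tree. */
typedef enum ToyPathType
{
    TOY_SCAN,
    TOY_APPEND,
    TOY_MERGE_APPEND
} ToyPathType;

typedef struct ToyPath
{
    ToyPathType      type;
    struct ToyPath **children;   /* NULL for scans */
    size_t           nchildren;
} ToyPath;

/*
 * Return the only child of a single-child Append/MergeAppend, or the
 * path itself when it is a scan or has more than one child.
 */
static ToyPath *
toy_singleton_subpath(ToyPath *path)
{
    assert(path != NULL);

    if ((path->type == TOY_APPEND || path->type == TOY_MERGE_APPEND) &&
        path->nchildren == 1)
        return path->children[0];

    return path;
}
```

In the first plan above the nested Merge Append has two children, so it stays; once mcrparted5_def is dropped it has one child, and that index scan moves up into the top-level Append.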

The logic in build_partition_pathkeys is noticeably stupider than
build_index_pathkeys, in particular it's not bright about boolean columns.
Maybe that's fine, but if so it deserves a comment explaining why we're
not bothering.

Good point. That's required to allow cases like:

SELECT * FROM parttable WHERE boolcol = true ORDER BY boolcol, ordercol;

I've fixed that in the attached.

Also, the comment for build_index_pathkeys specifies that
callers should do truncate_useless_pathkeys, which they do; why is that
not relevant here?

I've neglected to explain that in the comments. The problem with that
is that doing so would break cases where we use an Append when the
partition keys are a prefix of the query's pathkeys. Say we had a
range-partitioned table on (a, b) and an index on (a, b, c):

SELECT * FROM range_ab ORDER BY a, b, c;

With the current patch, we can use an Append for that as no earlier
value of (a, b) can come in a later partition. If we
truncate_useless_pathkeys() then it'll strip out all pathkeys as the
query pathkeys are not contained within the partition pathkeys.

Maybe the patch should perform truncate_useless_pathkeys for the check
where we see if the query pathkeys are contained in the partition's
pathkeys. However, we can't do it for the reverse check.
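
The two directions of that check can be sketched in C as follows. This is a toy model, not the patch's code: pathkeys are reduced to column names, "contained in" is taken to mean "is a leading prefix of or equal to", and all function names are hypothetical. The second, reverse check is only allowed when pathkeys were built for the whole partition key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if keys1 is a leading prefix of, or equal to, keys2. */
static bool
toy_keys_contained_in(const char *const *keys1, size_t n1,
                      const char *const *keys2, size_t n2)
{
    if (n1 > n2)
        return false;

    for (size_t i = 0; i < n1; i++)
    {
        if (strcmp(keys1[i], keys2[i]) != 0)
            return false;
    }
    return true;
}

/*
 * An Append can stand in for a MergeAppend when the query ordering is a
 * prefix of the partition ordering, or when the complete partition
 * ordering is a prefix of the query ordering.
 */
static bool
toy_can_use_append(const char *const *query_keys, size_t nquery,
                   const char *const *part_keys, size_t npart,
                   bool part_keys_partial)
{
    assert(query_keys != NULL && part_keys != NULL);

    if (toy_keys_contained_in(query_keys, nquery, part_keys, npart))
        return true;

    return !part_keys_partial &&
        toy_keys_contained_in(part_keys, npart, query_keys, nquery);
}
```

For a table partitioned on (a, b), ORDER BY a and ORDER BY a, b, c both qualify in this model, while ORDER BY c does not, and ORDER BY a, b, c stops qualifying if the partition pathkeys are only partial.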

I've added a comment to explain about the lack of
truncate_useless_pathkeys() in the attached.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v10-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-.patch (application/octet-stream)

From d0f9484bf5467256d7840f12f83c5af3bb997d9f Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v10] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no default partition the subpaths of a
MergeAppend are always arranged in range order. This means that
MergeAppend, when sorting by the partition key or a superset of the
partition key, will always output tuples from earlier subpaths before later
subpaths.  LIST partitioned tables provide the same guarantee if they also
don't have a default partition, providing that none of the partitions are
defined to allow Datums with values which are interleaved with other
partitions.  For simplicity and speed of checking we currently just
disallow the optimization if any partition allows more than one Datum.
We may want to expand this later, but for now, it's a very cheap check to
implement.  A more thorough check would require performing analysis on the
partition bound.
---
 src/backend/nodes/outfuncs.c                  |   1 +
 src/backend/optimizer/path/allpaths.c         | 223 +++++++++++++++++++-----
 src/backend/optimizer/path/costsize.c         |  51 +++++-
 src/backend/optimizer/path/joinrels.c         |   2 +-
 src/backend/optimizer/path/pathkeys.c         |  71 ++++++++
 src/backend/optimizer/plan/createplan.c       |  90 +++++++---
 src/backend/optimizer/plan/planner.c          |   4 +-
 src/backend/optimizer/prep/prepunion.c        |   6 +-
 src/backend/optimizer/util/pathnode.c         |  23 ++-
 src/backend/partitioning/partbounds.c         |   4 +
 src/backend/partitioning/partprune.c          | 137 ++++++++++++++-
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/optimizer/cost.h                  |   2 +-
 src/include/optimizer/pathnode.h              |   2 +-
 src/include/optimizer/paths.h                 |   2 +
 src/include/partitioning/partprune.h          |   4 +
 src/test/regress/expected/inherit.out         | 233 +++++++++++++++++++++++++-
 src/test/regress/expected/partition_prune.out |  64 ++++---
 src/test/regress/sql/inherit.sql              | 103 ++++++++++++
 src/test/regress/sql/partition_prune.sql      |  10 +-
 20 files changed, 907 insertions(+), 126 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 69179a07c3..df6f7d08a9 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1840,6 +1840,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index d8ba7add13..6f4bce2e5d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -96,15 +96,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_dummy_rel_pathlist(RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
@@ -1551,7 +1552,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1593,7 +1594,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1643,19 +1644,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1705,41 +1706,79 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provide tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+												&partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection,
+											&partition_pathkeys_desc_partial);
+
+		/*
+		 * You might think we should truncate_useless_pathkeys here, but
+		 * allowing partition keys which are a subset of the query's pathkeys
+		 * can often be useful.  For example, a RANGE partitioned table on
+		 * (a, b), and a query with an ORDER BY a, b, c.  We can still allow
+		 * an Append scan in this case.  Imagine a partition has a btree
+		 * index on (a, b, c); scanning that index still provides tuples in
+		 * the correct order and using an Append in place of a MergeAppend is
+		 * still valid since lower-order (a, b) tuples will still come before
+		 * higher-order ones over all partitions.
+		 */
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1748,6 +1787,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+						(!partition_pathkeys_partial &&
+						 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+								(pathkeys_contained_in(pathkeys,
+												partition_pathkeys_desc) ||
+						(!partition_pathkeys_desc_partial &&
+							pathkeys_contained_in(partition_pathkeys_desc,
+												  pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1790,26 +1852,81 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * When in partition order or descending partition order don't
+			 * flatten any sub-partition's paths unless they're an Append or
+			 * MergeAppend with a single subpath.  For the descending order
+			 * case we build the path list in reverse so that the Append scan
+			 * correctly scans the partitions in reverse order.
+			 */
+			if (partition_order)
+			{
+				/* Do the Append/MergeAppend flattening, when possible */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1950,6 +2067,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		return 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -1973,7 +2118,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/*
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..1cfb285b67 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,56 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
-
+		Path	   *isubpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 		/*
 		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
+		 * the first subpath.  This may be overwritten below if the initial
+		 * path requires a sort.
 		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		apath->path.startup_cost = isubpath->startup_cost;
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
+		/*
+		 * Compute rows and costs as sums of subplan rows and costs taking
+		 * into account the cost of any sorts which may be required on
+		 * subplans.
+		 */
 		foreach(l, apath->subpaths)
 		{
 			Path	   *subpath = (Path *) lfirst(l);
 
 			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+
+			if (pathkeys != NIL &&
+				!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				/*
+				 * We'll need to insert a Sort node, so include cost for that.
+				 */
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  subpath->total_cost,
+						  subpath->parent->tuples,
+						  subpath->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  apath->limit_tuples);
+				apath->path.total_cost += sort_path.total_cost;
+
+				/*
+				 * When the first subpath needs to be sorted, set the startup
+				 * cost of the sort as the startup cost of the Append.
+				 */
+				if (subpath == isubpath)
+					apath->path.startup_cost = sort_path.startup_cost;
+			}
+			else
+			{
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1941,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 699a34d6cf..4edf7e3748 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1264,7 +1264,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NULL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL, NULL,
 											  0, false, NIL, -1));
 
 	/* Set or update cheapest_total_path and related fields */
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..847e6a819d 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -24,6 +24,8 @@
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
+#include "partitioning/partprune.h"
 #include "utils/lsyscache.h"
 
 
@@ -546,6 +548,75 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples never will be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme		partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											 ScanDirectionIsBackward(scandir),
+											 ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			/*
+			 * Boolean partition keys might be redundant even if they do not
+			 * appear in an EquivalenceClass, because of our special treatment
+			 * of boolean equality conditions --- see the comment for
+			 * partkey_is_bool_constant_for_query().  If that applies, we can
+			 * continue to examine lower-order partition keys.  Otherwise, we
+			 * must abort and return any partial matches we've found so far.
+			 */
+			if (partkey_is_bool_constant_for_query(partrel, i))
+				continue;
+
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9fbe5b2a5f..9dd7f54e6e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,24 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	int			nodenumsortkeys;
+	AttrNumber *nodeSortColIdx;
+	Oid		   *nodeSortOperators;
+	Oid		   *nodeCollations;
+	bool	   *nodeNullsFirst;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1099,23 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1125,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1200,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5281,23 +5342,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5e3a7120ff..ddee11902b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1606,7 +1606,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -3887,6 +3888,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index eb815c2f12..65d77a5402 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 169e51e792..356d2469c4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,7 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1252,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1263,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1276,7 +1289,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3708,7 +3721,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index e71eb3793b..b32529090b 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1682,6 +1682,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1699,6 +1701,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index b5c0889935..bd3337ca37 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -109,7 +109,9 @@ typedef struct PruneStepResult
 	bool		scan_null;		/* Scan the partition for NULL values? */
 } PruneStepResult;
 
-
+static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
+								 int partkeycol,
+								 RelOptInfo *partrel);
 static List *make_partitionedrel_pruneinfo(PlannerInfo *root,
 							  RelOptInfo *parentrel,
 							  int *relid_subplan_map,
@@ -176,7 +178,140 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then we needn't take the key into consideration
+ * when checking that scanning partitions in order can't cause lower-order
+ * values to appear in later partitions.  Restriction clauses like "WHERE
+ * partkeycol = constant" get turned into an EquivalenceClass containing a
+ * constant, which is recognized as redundant by build_partition_pathkeys().
+ * But if the partition column is a boolean variable (or expression), then we
+ * are not going to see WHERE partkeycol = constant, because expression
+ * preprocessing will have simplified that to "WHERE partkeycol" or
+ * "WHERE NOT partkeycol".  So we are not going to have a matching
+ * EquivalenceClass (unless the query also contains "ORDER BY partkeycol").
+ * To allow such cases to work the same as they would for non-boolean values,
+ * this function is provided to detect whether the specified partkey column
+ * matches a boolean restriction clause.
+ */
+bool
+partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)
+{
+	PartitionScheme		partscheme;
+	ListCell   *lc;
+
+	partscheme = partrel->part_scheme;
 
+	/* If the partkey isn't boolean, we can't possibly get a match */
+	if (!IsBooleanOpfamily(partscheme->partopfamily[partkeycol]))
+		return false;
+
+	/* Check each restriction clause for partrel */
+	foreach(lc, partrel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+		/* Skip pseudoconstant quals */
+		if (rinfo->pseudoconstant)
+			continue;
+
+		/* See if we can match the clause's expression to the partkey column */
+		if (matches_boolean_partition_clause(rinfo, partkeycol, partrel))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * matches_boolean_partition_clause
+ *		Determine if rinfo matches partrel's 'partkeycol' partition key
+ *		column.
+ */
+static bool
+matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
+								 RelOptInfo *partrel)
+{
+	Node	   *clause = (Node *) rinfo->clause;
+	Expr	   *partexpr = (Expr *) linitial(partrel->partexprs[partkeycol]);
+
+	/* Direct match? */
+	if (equal(partexpr, clause))
+		return true;
+	/* NOT clause? */
+	else if (is_notclause(clause))
+	{
+		Node	   *arg = (Node *) get_notclausearg((Expr *) clause);
+
+		if (equal(partexpr, arg))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in an earlier partition.  Returns
+ *		false if this is not possible, or if we have insufficient means to prove
+ *		it.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow this when a DEFAULT
+		 * partition exists as this could contain tuples from either below or
+		 * above the defined range, or contain tuples belonging to gaps in the
+		 * defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
+
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this by looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's greater than one previously already seen, then
+		 * values could become out of order and we'd have to disable the
+		 * optimization.  For now, let's just keep it simple and just accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
 /*
  * make_partition_pruneinfo
  *		Builds a PartitionPruneInfo which can be used in the executor to allow
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 253e0b7e48..0bab42e853 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1361,6 +1361,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_APPEND(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..1bcd0e4235 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 36d12bc376..0e858097c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -194,6 +194,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index 2f75717ffb..09a9884d7c 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -74,6 +74,10 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partkey_is_bool_constant_for_query(struct RelOptInfo *partrel,
+								   int partkeycol);
+extern bool partitions_are_ordered(struct PlannerInfo *root,
+					   struct RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
 						 struct RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 565d947b6d..4bf9ca156b 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2042,7 +2042,232 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partition bounds guarantee that tuples in earlier partitions sort before
+-- tuples in later partitions.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Append
+   ->  Sort
+         Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+         ->  Seq Scan on mcrparted0
+               Filter: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(11 rows)
+
+reset enable_seqscan;
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+explain (costs off) select * from bool_lp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_lp_false_b_idx on bool_lp_false
+   ->  Index Only Scan using bool_lp_true_b_idx on bool_lp_true
+(3 rows)
+
+drop table bool_lp;
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true partition of bool_rp for values from (true,0) to (true,1000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_true_b_a_idx on bool_rp_true
+         Index Cond: (b = true)
+(3 rows)
+
+explain (costs off) select * from bool_rp where b = false order by b,a;
+                             QUERY PLAN                             
+--------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_b_a_idx on bool_rp_false
+         Index Cond: (b = false)
+(3 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
@@ -2055,17 +2280,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 30946f77b6..2c9e42b51a 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3063,14 +3063,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3117,17 +3117,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3140,13 +3138,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3159,12 +3156,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3173,23 +3169,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..c20ea9e51e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,111 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+reset enable_seqscan;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+
+explain (costs off) select * from bool_lp order by b;
+
+drop table bool_lp;
+
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true partition of bool_rp for values from (true,0) to (true,1000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+explain (costs off) select * from bool_rp where b = false order by b,a;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index dc327caffd..102893b6f6 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1

#39

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: David Rowley (#38)

1 attachment(s)

Re: Ordered Partitioned Table Scans

I've attached an updated patch which fixes the conflict with 0a9d7e1f6d8.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v11-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-.patch (application/octet-stream)

From 72c06d332dc8cb9d5b0247f8f7fc5233e0f47268 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v11] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no default partition the subpaths of a
MergeAppend are always arranged in range order. This means that
MergeAppend, when sorting by the partition key or a superset of the
partition key, will always output tuples from earlier subpaths before later
subpaths.  LIST partitioned tables provide the same guarantee if they also
don't have a default partition, providing that none of the partitions are
defined to allow Datums with values which are interleaved with other
partitions.  For simplicity and speed of checking we currently just
disallow the optimization if any partition allows more than one Datum.
We may want to expand this later, but for now, it's a very cheap check to
implement.  A more thorough check would require performing analysis on the
partition bound.
---
 src/backend/nodes/outfuncs.c                  |   1 +
 src/backend/optimizer/path/allpaths.c         | 223 +++++++++++++++++++-----
 src/backend/optimizer/path/costsize.c         |  51 +++++-
 src/backend/optimizer/path/joinrels.c         |   2 +-
 src/backend/optimizer/path/pathkeys.c         |  71 ++++++++
 src/backend/optimizer/plan/createplan.c       |  90 +++++++---
 src/backend/optimizer/plan/planner.c          |   4 +-
 src/backend/optimizer/prep/prepunion.c        |   6 +-
 src/backend/optimizer/util/pathnode.c         |  23 ++-
 src/backend/partitioning/partbounds.c         |   4 +
 src/backend/partitioning/partprune.c          | 137 ++++++++++++++-
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/optimizer/cost.h                  |   2 +-
 src/include/optimizer/pathnode.h              |   2 +-
 src/include/optimizer/paths.h                 |   2 +
 src/include/partitioning/partprune.h          |   4 +
 src/test/regress/expected/inherit.out         | 233 +++++++++++++++++++++++++-
 src/test/regress/expected/partition_prune.out |  64 ++++---
 src/test/regress/sql/inherit.sql              | 103 ++++++++++++
 src/test/regress/sql/partition_prune.sql      |  10 +-
 20 files changed, 907 insertions(+), 126 deletions(-)
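
As a rough illustration of the cheap check described in the commit message above, the rule can be modelled as: no default partition, and for LIST partitioning no partition that accepts more than one Datum. This is a standalone sketch, not the patch's code. `ToyStrategy`, `ToyPartition` and `toy_partitions_are_ordered` are hypothetical names invented for this example, and the structures are deliberately much simpler than PostgreSQL's partition bound data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not PostgreSQL structures. */
typedef enum ToyStrategy
{
	TOY_RANGE,
	TOY_LIST
} ToyStrategy;

typedef struct ToyPartition
{
	int			ndatums;		/* LIST only: number of values accepted */
	bool		is_default;		/* true for a DEFAULT partition */
} ToyPartition;

/*
 * Return true if reading the partitions one after another, in bound order,
 * can never produce a lower-sorting tuple after a higher-sorting one.
 */
bool
toy_partitions_are_ordered(ToyStrategy strategy,
						   const ToyPartition *parts, int nparts)
{
	assert(nparts >= 0);
	assert(parts != NULL || nparts == 0);

	for (int i = 0; i < nparts; i++)
	{
		/* A default partition can hold values from anywhere in the key space. */
		if (parts[i].is_default)
			return false;

		/*
		 * Cheap LIST test: reject any partition accepting several values,
		 * because IN (3, 5) would interleave with a sibling IN (4).
		 */
		if (strategy == TOY_LIST && parts[i].ndatums > 1)
			return false;
	}

	return true;
}
```

Under these assumptions, LIST partitions IN (1) and IN (2) pass the check, while adding IN (3, 5) and IN (4) fails it. That mirrors the mclparted regression test, which expects a MergeAppend once a partition with interleaved datums exists. The test is conservative: a partition such as IN (1, 2) next to IN (3) is rejected even though its values do not interleave.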

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 69179a07c3..df6f7d08a9 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1840,6 +1840,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b2c5c833f7..393b20f808 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -96,15 +96,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_dummy_rel_pathlist(RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
@@ -1551,7 +1552,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1593,7 +1594,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1643,19 +1644,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1705,41 +1706,79 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provides tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+												&partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection,
+											&partition_pathkeys_desc_partial);
+
+		/*
+		 * You might think we should truncate_useless_pathkeys here, but
+		 * allowing partition keys which are a subset of the query's pathkeys
+		 * can often be useful.  For example, a RANGE partitioned table on
+		 * (a, b), and a query with an ORDER BY a, b, c.  We can still allow
+		 * an Append scan in this case.  Imagine a partition has a btree
+		 * index on (a, b, c); scanning that index still provides tuples in
+		 * the correct order and using an Append in place of a MergeAppend is
+		 * still valid since lower-order (a, b) tuples will still come before
+		 * higher-order ones over all partitions.
+		 */
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1748,6 +1787,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+						(!partition_pathkeys_partial &&
+						 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+								(pathkeys_contained_in(pathkeys,
+												partition_pathkeys_desc) ||
+						(!partition_pathkeys_desc_partial &&
+							pathkeys_contained_in(partition_pathkeys_desc,
+												  pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1790,26 +1852,81 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * When in partition order or descending partition order don't
+			 * flatten any sub-partition's paths unless they're an Append or
+			 * MergeAppend with a single subpath.  For the descending order
+			 * case we build the path list in reverse so that the Append scan
+			 * correctly scans the partitions in reverse order.
+			 */
+			if (partition_order)
+			{
+				/* Do the Append/MergeAppend flattening, when possible */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1950,6 +2067,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		returns 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -1973,7 +2118,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..1cfb285b67 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,56 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
-
+		Path	   *isubpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 		/*
 		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
+		 * the first subpath.  This may be overwritten below if the initial
+		 * path requires a sort.
 		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		apath->path.startup_cost = isubpath->startup_cost;
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
+		/*
+		 * Compute rows and costs as sums of subplan rows and costs taking
+		 * into account the cost of any sorts which may be required on
+		 * subplans.
+		 */
 		foreach(l, apath->subpaths)
 		{
 			Path	   *subpath = (Path *) lfirst(l);
 
 			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+
+			if (pathkeys != NIL &&
+				!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				/*
+				 * We'll need to insert a Sort node, so include cost for that.
+				 */
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  subpath->total_cost,
+						  subpath->parent->tuples,
+						  subpath->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  apath->limit_tuples);
+				apath->path.total_cost += sort_path.total_cost;
+
+				/*
+				 * When the first subpath needs to be sorted, set the startup
+				 * cost of the sort as the startup cost of the Append.
+				 */
+				if (subpath == isubpath)
+					apath->path.startup_cost = sort_path.startup_cost;
+			}
+			else
+			{
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1941,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 9604a54b77..82a553beeb 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1264,7 +1264,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..847e6a819d 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -24,6 +24,8 @@
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
+#include "partitioning/partprune.h"
 #include "utils/lsyscache.h"
 
 
@@ -546,6 +548,75 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples never will be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme		partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											 ScanDirectionIsBackward(scandir),
+											 ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			/*
+			 * Boolean partition keys might be redundant even if they do not
+			 * appear in an EquivalenceClass, because of our special treatment
+			 * of boolean equality conditions --- see the comment for
+			 * partkey_is_bool_constant_for_query().  If that applies, we can
+			 * continue to examine lower-order partition keys.  Otherwise, we
+			 * must abort and return any partial matches we've found so far.
+			 */
+			if (partkey_is_bool_constant_for_query(partrel, i))
+				continue;
+
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9fbe5b2a5f..9dd7f54e6e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,24 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	int			nodenumsortkeys;
+	AttrNumber *nodeSortColIdx;
+	Oid		   *nodeSortOperators;
+	Oid		   *nodeCollations;
+	bool	   *nodeNullsFirst;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1099,23 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1125,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1200,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5281,23 +5342,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e408e77d6f..02f805129f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1597,7 +1597,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -3878,6 +3879,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index eb815c2f12..65d77a5402 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 169e51e792..356d2469c4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,7 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1252,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1263,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1276,7 +1289,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3708,7 +3721,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 5b897d50ee..3da775e101 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1680,6 +1680,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1699,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index b5c0889935..bd3337ca37 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -109,7 +109,9 @@ typedef struct PruneStepResult
 	bool		scan_null;		/* Scan the partition for NULL values? */
 } PruneStepResult;
 
-
+static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
+								 int partkeycol,
+								 RelOptInfo *partrel);
 static List *make_partitionedrel_pruneinfo(PlannerInfo *root,
 							  RelOptInfo *parentrel,
 							  int *relid_subplan_map,
@@ -176,7 +178,140 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then we needn't take the key into consideration
+ * when checking whether scanning partitions in order could cause lower-order
+ * values to appear in later partitions.  Restriction clauses like WHERE
+ * partkeycol = constant get turned into an EquivalenceClass containing a
+ * constant, which is recognized as redundant by build_partition_pathkeys().
+ * But if the partition column is a boolean variable (or expression), then we
+ * are not going to see WHERE partkeycol = constant, because expression
+ * preprocessing will have simplified that to "WHERE partkeycol" or
+ * "WHERE NOT partkeycol".  So we are not going to have a matching
+ * EquivalenceClass (unless the query also contains "ORDER BY partkeycol").
+ * To allow such cases to work the same as they would for non-boolean values,
+ * this function is provided to detect whether the specified partkey column
+ * matches a boolean restriction clause.
+ */
+bool
+partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)
+{
+	PartitionScheme		partscheme;
+	ListCell   *lc;
+
+	partscheme = partrel->part_scheme;
 
+	/* If the partkey isn't boolean, we can't possibly get a match */
+	if (!IsBooleanOpfamily(partscheme->partopfamily[partkeycol]))
+		return false;
+
+	/* Check each restriction clause for partrel */
+	foreach(lc, partrel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+		/* Skip pseudoconstant quals */
+		if (rinfo->pseudoconstant)
+			continue;
+
+		/* See if we can match the clause's expression to the partkey column */
+		if (matches_boolean_partition_clause(rinfo, partkeycol, partrel))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * matches_boolean_partition_clause
+ *		Determine if rinfo matches partrel's 'partkeycol' partition key
+ *		column.
+ */
+static bool
+matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
+								 RelOptInfo *partrel)
+{
+	Node	   *clause = (Node *) rinfo->clause;
+	Expr	   *partexpr = (Expr *) linitial(partrel->partexprs[partkeycol]);
+
+	/* Direct match? */
+	if (equal(partexpr, clause))
+		return true;
+	/* NOT clause? */
+	else if (is_notclause(clause))
+	{
+		Node	   *arg = (Node *) get_notclausearg((Expr *) clause);
+
+		if (equal(partexpr, arg))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in an earlier partition.  Returns
+ *		false if this is not possible, or if we have insufficient means to
+ *		prove it.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow when a DEFAULT
+		 * partition exists as this could contain tuples from either below or
+		 * above the defined range, or contain tuples belonging to gaps in the
+		 * defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
+
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's lower than one already seen in an earlier
+		 * partition, then values could become out of order and we'd have to
+		 * disable the optimization.  For now, let's keep it simple and accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
 /*
  * make_partition_pruneinfo
  *		Builds a PartitionPruneInfo which can be used in the executor to allow
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 253e0b7e48..0bab42e853 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1361,6 +1361,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_APPEND(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..1bcd0e4235 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 36d12bc376..0e858097c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -194,6 +194,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index 2f75717ffb..09a9884d7c 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -74,6 +74,10 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partkey_is_bool_constant_for_query(struct RelOptInfo *partrel,
+								   int partkeycol);
+extern bool partitions_are_ordered(struct PlannerInfo *root,
+					   struct RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
 						 struct RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 565d947b6d..4bf9ca156b 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2042,7 +2042,232 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Append
+   ->  Sort
+         Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+         ->  Seq Scan on mcrparted0
+               Filter: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(11 rows)
+
+reset enable_seqscan;
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+explain (costs off) select * from bool_lp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_lp_false_b_idx on bool_lp_false
+   ->  Index Only Scan using bool_lp_true_b_idx on bool_lp_true
+(3 rows)
+
+drop table bool_lp;
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true partition of bool_rp for values from (true,0) to (true,1000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_true_b_a_idx on bool_rp_true
+         Index Cond: (b = true)
+(3 rows)
+
+explain (costs off) select * from bool_rp where b = false order by b,a;
+                             QUERY PLAN                             
+--------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_b_a_idx on bool_rp_false
+         Index Cond: (b = false)
+(3 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
@@ -2055,17 +2280,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 30946f77b6..2c9e42b51a 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3063,14 +3063,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3117,17 +3117,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3140,13 +3138,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3159,12 +3156,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3173,23 +3169,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..c20ea9e51e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,111 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+reset enable_seqscan;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+
+explain (costs off) select * from bool_lp order by b;
+
+drop table bool_lp;
+
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true partition of bool_rp for values from (true,0) to (true,1000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+explain (costs off) select * from bool_rp where b = false order by b,a;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index dc327caffd..102893b6f6 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1

#40

Robert Haas

robertmhaas@gmail.com

almost 7 years ago

In reply to: Tom Lane (#34)

Re: Ordered Partitioned Table Scans

On Fri, Mar 8, 2019 at 3:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I took a quick look through this and I'm not very happy with it.
> It seems to me that the premise ought to be "just use an Append
> if we can prove the output would be ordered anyway", but that's not
> what we actually have here: instead you're adding more infrastructure
> onto Append, which notably involves invasive changes to the API of
> create_append_path, which is the main reason why the patch keeps breaking.
> (It's broken again as of HEAD, though the cfbot doesn't seem to have
> noticed yet.) Likewise there's a bunch of added complication in
> cost_append, create_append_plan, etc. I think you should remove all that
> and restrict this optimization to the case where all the subpaths are
> natively ordered --- if we have to insert Sorts, it's hardly going to move
> the needle to worry about simplifying the parent MergeAppend to Append.

Other people have already said that they don't think this is true; I
agree with those people. Even if you have to sort *every* path,
sorting a bunch of reasonably large data sets individually is possibly
better than sorting all the data together, because (1) you can start
emitting rows sooner, (2) it might make you fit in memory instead of
having to spill to disk, and (3) O(n lg n) is supralinear. Still, if
that were the only case this handled, I wouldn't be too excited,
because it seems at least plausible that lumping a bunch of small
partitions together and sorting it all at once could save some
start-up and tear-down costs vs. sorting them individually. But it
isn't; the ability to consider that sort of plan is just a fringe
benefit. If a substantial fraction of the partitions have indexes --
half, three-quarters, all-but-one -- sorting only the remaining ones
should win big.
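
To make the "sort each partition individually" case concrete, here is a
minimal sketch in C. It is not code from the patch: the array-of-ints model
of a partition and the names append_sorted_partitions and is_sorted are
assumptions made up for illustration. It assumes the partitions are passed in
bound order and have non-overlapping ranges, as with RANGE partitions that
have no default partition. Under that assumption, sorting each partition
separately and concatenating the results gives fully sorted output with no
merge step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Three-way comparison of two ints, in the form qsort() expects. */
static int
cmp_int(const void *a, const void *b)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	return (x > y) - (x < y);
}

/*
 * Hypothetical stand-in for an ordered Append over Sort nodes.
 *
 * parts[i] points to lens[i] values belonging to partition i.  The caller
 * must pass partitions in bound order with non-overlapping ranges; this
 * function does not check that.  Each partition is sorted in place and then
 * copied to 'out', which must have room for every value.  Returns the number
 * of values written.
 */
static size_t
append_sorted_partitions(int **parts, const size_t *lens, size_t nparts,
						 int *out)
{
	size_t		n = 0;

	assert(parts != NULL && lens != NULL && out != NULL);

	for (size_t i = 0; i < nparts; i++)
	{
		/* One small sort per partition instead of one sort over everything */
		qsort(parts[i], lens[i], sizeof(int), cmp_int);
		memcpy(out + n, parts[i], lens[i] * sizeof(int));
		n += lens[i];
	}

	return n;
}

/* Returns 1 if v[0..n-1] is in non-decreasing order, otherwise 0. */
static int
is_sorted(const int *v, size_t n)
{
	for (size_t i = 1; i < n; i++)
	{
		if (v[i - 1] > v[i])
			return 0;
	}
	return 1;
}
```

With partitions covering [0,10), [10,20) and [20,30), calling
append_sorted_partitions and then is_sorted on the output should report
sorted output. The first sorted partition can be emitted before the later
ones have been sorted at all, which is point (1) above, and each sort only
needs to hold one partition's rows at a time, which is point (2).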

Admittedly, I think this case is less common than it was a few years
ago, because with table inheritance one often ended up with a parent
partition that was empty and had no indexes so it produced a
dummy-seqscan in every plan, and that's gone with partitioning.
Moreover, because of Alvaro's work on cascaded CREATE INDEX, people
are probably now more likely to have matching indexes on all the
partitions. Still, it's not that hard to imagine a case where older
data that doesn't change much is more heavily indexed than tables that
are still suffering DML of whatever kind on a regular basis.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#41

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: David Rowley (#39)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On Sat, 16 Mar 2019 at 04:22, David Rowley <david.rowley@2ndquadrant.com> wrote:

> I've attached an updated patch which fixes the conflict with 0a9d7e1f6d8

... and here's the one that I should have sent. (renamed to v12 to
prevent confusion)

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v12-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-.patch (application/octet-stream)

From a2afa8b32de31f702565ff13fbd44e8f48b79650 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 26 Oct 2018 09:18:09 +1300
Subject: [PATCH v12] Allow Append to be used in place of MergeAppend for some
 cases

For RANGE partitioned tables with no default partition the subpaths of a
MergeAppend are always arranged in range order. This means that
MergeAppend, when sorting by the partition key or a superset of the
partition key, will always output tuples from earlier subpaths before later
subpaths.  LIST partitioned tables provide the same guarantee if they also
don't have a default partition, providing that none of the partitions are
defined to allow Datums with values which are interleaved with other
partitions.  For simplicity and speed of checking we currently just
disallow the optimization if any partition allows more than one Datum.
We may want to expand this later, but for now, it's a very cheap check to
implement.  A more thorough check would require performing analysis on the
partition bound.
---
 src/backend/nodes/outfuncs.c                  |   1 +
 src/backend/optimizer/path/allpaths.c         | 223 +++++++++++++++++++-----
 src/backend/optimizer/path/costsize.c         |  51 +++++-
 src/backend/optimizer/path/joinrels.c         |   2 +-
 src/backend/optimizer/path/pathkeys.c         |  71 ++++++++
 src/backend/optimizer/plan/createplan.c       |  90 +++++++---
 src/backend/optimizer/plan/planner.c          |   4 +-
 src/backend/optimizer/prep/prepunion.c        |   6 +-
 src/backend/optimizer/util/pathnode.c         |  23 ++-
 src/backend/partitioning/partbounds.c         |   4 +
 src/backend/partitioning/partprune.c          | 137 ++++++++++++++-
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/optimizer/cost.h                  |   2 +-
 src/include/optimizer/pathnode.h              |   2 +-
 src/include/optimizer/paths.h                 |   2 +
 src/include/partitioning/partprune.h          |   4 +
 src/test/regress/expected/inherit.out         | 233 +++++++++++++++++++++++++-
 src/test/regress/expected/partition_prune.out |  64 ++++---
 src/test/regress/sql/inherit.sql              | 103 ++++++++++++
 src/test/regress/sql/partition_prune.sql      |  10 +-
 20 files changed, 907 insertions(+), 126 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 69179a07c3..df6f7d08a9 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1840,6 +1840,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b2c5c833f7..393b20f808 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -96,15 +96,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_dummy_rel_pathlist(RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
@@ -1551,7 +1552,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1593,7 +1594,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1643,19 +1644,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1705,41 +1706,79 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provide tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+												&partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection,
+											&partition_pathkeys_desc_partial);
+
+		/*
+		 * You might think we should truncate_useless_pathkeys here, but
+		 * allowing partition keys which are a subset of the query's pathkeys
+		 * can often be useful.  For example, a RANGE partitioned table on
+		 * (a, b), and a query with an ORDER BY a, b, c.  We can still allow
+		 * an Append scan in this case.  Imagine a partition has a btree
+		 * index on (a, b, c): scanning that index still provides tuples in
+		 * the correct order, and using an Append in place of a MergeAppend is
+		 * still valid since lower-order (a, b) tuples will still come before
+		 * higher-order ones over all partitions.
+		 */
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1748,6 +1787,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+						(!partition_pathkeys_partial &&
+						 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+								(pathkeys_contained_in(pathkeys,
+												partition_pathkeys_desc) ||
+						(!partition_pathkeys_desc_partial &&
+							pathkeys_contained_in(partition_pathkeys_desc,
+												  pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1790,26 +1852,81 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * When in partition order or descending partition order don't
+			 * flatten any sub-partition's paths unless they're an Append or
+			 * MergeAppend with a single subpath.  For the descending order
+			 * case we build the path list in reverse so that the Append scan
+			 * correctly scans the partitions in reverse order.
+			 */
+			if (partition_order)
+			{
+				/* Do the Append/MergeAppend flattening, when possible */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1950,6 +2067,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		return 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -1973,7 +2118,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..1cfb285b67 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,56 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
-
+		Path	   *isubpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 		/*
 		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
+		 * the first subpath.  This may be overwritten below if the initial
+		 * path requires a sort.
 		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		apath->path.startup_cost = isubpath->startup_cost;
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
+		/*
+		 * Compute rows and costs as sums of subplan rows and costs taking
+		 * into account the cost of any sorts which may be required on
+		 * subplans.
+		 */
 		foreach(l, apath->subpaths)
 		{
 			Path	   *subpath = (Path *) lfirst(l);
 
 			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+
+			if (pathkeys != NIL &&
+				!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				/*
+				 * We'll need to insert a Sort node, so include cost for that.
+				 */
+				cost_sort(&sort_path,
+						  root,
+						  pathkeys,
+						  subpath->total_cost,
+						  subpath->parent->tuples,
+						  subpath->pathtarget->width,
+						  0.0,
+						  work_mem,
+						  apath->limit_tuples);
+				apath->path.total_cost += sort_path.total_cost;
+
+				/*
+				 * When the first subpath needs to be sorted, set the startup
+				 * cost of the sort as the startup cost of the Append.
+				 */
+				if (subpath == isubpath)
+					apath->path.startup_cost = sort_path.startup_cost;
+			}
+			else
+			{
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1941,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
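A note on the non-parallel branch of the cost_append() hunk above: each subpath that does not already satisfy the Append's pathkeys is charged for a Sort on top of it, and when the first subpath is such a subpath, the Sort's startup cost becomes the Append's startup cost. The following is only a minimal sketch of that accumulation. The SubPath and AppendCost structs and toy_cost_sort() are hypothetical stand-ins invented for illustration; in particular toy_cost_sort() is a made-up linear formula, not the model used by cost_sort().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a planner Path. */
typedef struct SubPath
{
	double		startup_cost;
	double		total_cost;
	double		tuples;
	bool		sorted;			/* already satisfies the Append's pathkeys? */
} SubPath;

/* Hypothetical result struct holding the two costs of the Append. */
typedef struct AppendCost
{
	double		startup_cost;
	double		total_cost;
} AppendCost;

/*
 * Toy replacement for cost_sort().  This is NOT PostgreSQL's cost model,
 * just a linear formula so that the arithmetic is easy to follow.
 */
void
toy_cost_sort(const SubPath *input, double *startup, double *total)
{
	*startup = input->total_cost + 2.0 * input->tuples;
	*total = *startup + 1.0 * input->tuples;
}

/*
 * Sum the subpath costs, charging a sort for every subpath that is not
 * already ordered.  The Append's startup cost is that of the first
 * subpath, after any sort that has to be put on top of it.
 */
AppendCost
cost_ordered_append_sketch(const SubPath *subpaths, size_t nsubpaths)
{
	AppendCost	result = {0.0, 0.0};
	size_t		i;

	assert(subpaths != NULL || nsubpaths == 0);

	for (i = 0; i < nsubpaths; i++)
	{
		const SubPath *subpath = &subpaths[i];
		double		startup = subpath->startup_cost;
		double		total = subpath->total_cost;

		/* An unordered subpath needs a Sort node, so use the sort's costs. */
		if (!subpath->sorted)
			toy_cost_sort(subpath, &startup, &total);

		if (i == 0)
			result.startup_cost = startup;
		result.total_cost += total;
	}

	return result;
}
```

With an unsorted first subpath of total cost 10 and 4 tuples, followed by a sorted subpath of total cost 20, the toy formula gives a startup cost of 18 and a total cost of 42. The figures only demonstrate the bookkeeping and say nothing about real planner estimates.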
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 9604a54b77..7044899dc1 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1264,7 +1264,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..847e6a819d 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -24,6 +24,8 @@
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
+#include "partitioning/partprune.h"
 #include "utils/lsyscache.h"
 
 
@@ -546,6 +548,75 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples will never be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme		partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											 ScanDirectionIsBackward(scandir),
+											 ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			/*
+			 * Boolean partition keys might be redundant even if they do not
+			 * appear in an EquivalenceClass, because of our special treatment
+			 * of boolean equality conditions --- see the comment for
+			 * partkey_is_bool_constant_for_query().  If that applies, we can
+			 * continue to examine lower-order partition keys.  Otherwise, we
+			 * must abort and return any partial matches we've found so far.
+			 */
+			if (partkey_is_bool_constant_for_query(partrel, i))
+				continue;
+
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
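A note on the control flow of build_partition_pathkeys() in the hunk above: the loop walks the partition key columns in order, skips a column that is a boolean constant for the query, and stops at the first column for which no pathkey can be made, reporting through *partialkeys whether it stopped early. The following is only a minimal sketch of that flow. The KeyState enum and the function name are hypothetical, each column's outcome is supplied by the caller instead of being computed, and the pathkey_is_redundant() check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-column outcome, supplied by the caller in this sketch. */
typedef enum KeyState
{
	KEY_HAS_PATHKEY,			/* a pathkey could be built */
	KEY_BOOL_CONSTANT,			/* no pathkey, but a bool constant in the query */
	KEY_NO_PATHKEY				/* no pathkey and not provably constant */
} KeyState;

/*
 * Returns the number of pathkeys that would be built.  Sets *partialkeys
 * to true if we had to stop before covering the whole partition key, and
 * to false otherwise.
 */
size_t
count_partition_pathkeys_sketch(const KeyState *keys, size_t nkeys,
								bool *partialkeys)
{
	size_t		built = 0;
	size_t		i;

	assert(partialkeys != NULL);

	for (i = 0; i < nkeys; i++)
	{
		if (keys[i] == KEY_BOOL_CONSTANT)
			continue;			/* redundant column, keep going */

		if (keys[i] == KEY_NO_PATHKEY)
		{
			*partialkeys = true;	/* only a prefix was covered */
			return built;
		}

		built++;
	}

	*partialkeys = false;
	return built;
}
```

For a two-column key where the first column is a boolean constant and the second yields a pathkey, the sketch returns 1 with *partialkeys set to false. That is the shape of the bool_rp test where the query has "where b = true order by b,a".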
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9fbe5b2a5f..9dd7f54e6e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,24 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	int			nodenumsortkeys;
+	AttrNumber *nodeSortColIdx;
+	Oid		   *nodeSortOperators;
+	Oid		   *nodeCollations;
+	bool	   *nodeNullsFirst;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1099,23 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1125,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1200,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5281,23 +5342,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e408e77d6f..02f805129f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1597,7 +1597,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -3878,6 +3879,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index eb815c2f12..65d77a5402 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 169e51e792..356d2469c4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,7 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1252,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1263,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1276,7 +1289,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3708,7 +3721,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 5b897d50ee..3da775e101 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1680,6 +1680,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1699,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index b5c0889935..bd3337ca37 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -109,7 +109,9 @@ typedef struct PruneStepResult
 	bool		scan_null;		/* Scan the partition for NULL values? */
 } PruneStepResult;
 
-
+static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
+								 int partkeycol,
+								 RelOptInfo *partrel);
 static List *make_partitionedrel_pruneinfo(PlannerInfo *root,
 							  RelOptInfo *parentrel,
 							  int *relid_subplan_map,
@@ -176,7 +178,140 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then we needn't take the key into consideration
+ * when checking if scanning partitions in order can't cause lower-order
+ * values to appear in later partitions.  Restriction clauses like WHERE
+ * partkeycol = constant, get turned into an EquivalenceClass containing a
+ * constant, which is recognized as redundant by build_partition_pathkeys().
+ * But if the partition column is a boolean variable (or expression), then we
+ * are not going to see WHERE partkeycol = constant, because expression
+ * preprocessing will have simplified that to "WHERE partkeycol" or
+ * "WHERE NOT partkeycol".  So we are not going to have a matching
+ * EquivalenceClass (unless the query also contains "ORDER BY partkeycol").
+ * To allow such cases to work the same as they would for non-boolean values,
+ * this function is provided to detect whether the specified partkey column
+ * matches a boolean restriction clause.
+ */
+bool
+partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)
+{
+	PartitionScheme		partscheme;
+	ListCell   *lc;
+
+	partscheme = partrel->part_scheme;
 
+	/* If the partkey isn't boolean, we can't possibly get a match */
+	if (!IsBooleanOpfamily(partscheme->partopfamily[partkeycol]))
+		return false;
+
+	/* Check each restriction clause for partrel */
+	foreach(lc, partrel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+		/* Skip pseudoconstant quals */
+		if (rinfo->pseudoconstant)
+			continue;
+
+		/* See if we can match the clause's expression to the partkey column */
+		if (matches_boolean_partition_clause(rinfo, partkeycol, partrel))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * matches_boolean_partition_clause
+ *		Determine if rinfo matches partrel's 'partkeycol' partition key
+ *		column.
+ */
+static bool
+matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
+								 RelOptInfo *partrel)
+{
+	Node	   *clause = (Node *) rinfo->clause;
+	Expr	   *partexpr = (Expr *) linitial(partrel->partexprs[partkeycol]);
+
+	/* Direct match? */
+	if (equal(partexpr, clause))
+		return true;
+	/* NOT clause? */
+	else if (is_notclause(clause))
+	{
+		Node	   *arg = (Node *) get_notclausearg((Expr *) clause);
+
+		if (equal(partexpr, arg))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in an earlier partition.  Returns
+ *		false if this is not possible, or if we have insufficient means to prove
+ *		it.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow this when a
+		 * DEFAULT partition exists, as it could contain tuples from either
+		 * below or above the defined range, or tuples belonging to gaps in
+		 * the defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
+
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's greater than one previously already seen, then
+		 * values could become out of order and we'd have to disable the
+		 * optimization.  For now, let's just keep it simple and just accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
 /*
  * make_partition_pruneinfo
  *		Builds a PartitionPruneInfo which can be used in the executor to allow
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 253e0b7e48..0bab42e853 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1361,6 +1361,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_APPEND(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..1bcd0e4235 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 36d12bc376..0e858097c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -194,6 +194,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index 2f75717ffb..09a9884d7c 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -74,6 +74,10 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partkey_is_bool_constant_for_query(struct RelOptInfo *partrel,
+								   int partkeycol);
+extern bool partitions_are_ordered(struct PlannerInfo *root,
+					   struct RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
 						 struct RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 565d947b6d..4bf9ca156b 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2042,7 +2042,232 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Append
+   ->  Sort
+         Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+         ->  Seq Scan on mcrparted0
+               Filter: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(11 rows)
+
+reset enable_seqscan;
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+explain (costs off) select * from bool_lp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_lp_false_b_idx on bool_lp_false
+   ->  Index Only Scan using bool_lp_true_b_idx on bool_lp_true
+(3 rows)
+
+drop table bool_lp;
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true partition of bool_rp for values from (true,0) to (true,1000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_true_b_a_idx on bool_rp_true
+         Index Cond: (b = true)
+(3 rows)
+
+explain (costs off) select * from bool_rp where b = false order by b,a;
+                             QUERY PLAN                             
+--------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_b_a_idx on bool_rp_false
+         Index Cond: (b = false)
+(3 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
@@ -2055,17 +2280,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 30946f77b6..2c9e42b51a 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3063,14 +3063,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3117,17 +3117,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3140,13 +3138,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3159,12 +3156,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3173,23 +3169,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..c20ea9e51e 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,111 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+set enable_seqscan = 0;
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+reset enable_seqscan;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+
+explain (costs off) select * from bool_lp order by b;
+
+drop table bool_lp;
+
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true partition of bool_rp for values from (true,0) to (true,1000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+explain (costs off) select * from bool_rp where b = false order by b,a;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index dc327caffd..102893b6f6 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
-- 
2.16.2.windows.1

#42

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Tom Lane (#36)

Re: Ordered Partitioned Table Scans

On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Julien Rouhaud <rjuju123@gmail.com> writes:

On Fri, Mar 8, 2019 at 9:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think you should remove all that
and restrict this optimization to the case where all the subpaths are
natively ordered --- if we have to insert Sorts, it's hardly going to move
the needle to worry about simplifying the parent MergeAppend to Append.

This can be a huge win for queries of the form "ORDER BY partkey LIMIT
x". Even if the first subpath(s) aren't natively ordered, not all of
the sorts should actually be performed.

[ shrug... ] We've got no realistic chance of estimating such situations
properly, so I'd have no confidence in a plan choice based on such a
thing. Nor do I believe that this case is all that important.

Hi Tom,

Wondering if you can provide more details on why you think there's no
realistic chance of the planner costing these cases correctly? It
would be unfortunate to reject this patch based on a sentence that
starts with "[ shrug... ]", especially so when three people have stood
up and disagreed with you.

I've explained why I think you're wrong. Would you be able to explain
to me why you think I'm wrong?

You also mentioned that you didn't like the fact I'd changed the API
for create_append_plan(). Could you suggest why you think passing
pathkeys in is the wrong thing to do? The Append path obviously needs
pathkeys so that upper paths know what order the path guarantees.
Passing pathkeys in allows us to verify that pathkeys are a valid
thing to have for the AppendPath. They're not valid in the Parallel
Append case, for example, so setting them afterwards does not seem
like an improvement. They also allow us to cost the cheaper startup
cost properly. However, you did seem to argue that you have no
confidence in cheap startup plans, which I'm still confused by.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#43

Tom Lane

tgl@sss.pgh.pa.us

almost 7 years ago

In reply to: David Rowley (#42)

Re: Ordered Partitioned Table Scans

David Rowley <david.rowley@2ndquadrant.com> writes:

On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:

This can be a huge win for queries of the form "ORDER BY partkey LIMIT
x". Even if the first subpath(s) aren't natively ordered, not all of
the sorts should actually be performed.

[ shrug... ] We've got no realistic chance of estimating such situations
properly, so I'd have no confidence in a plan choice based on such a
thing. Nor do I believe that this case is all that important.

Wondering if you can provide more details on why you think there's no
realistic chance of the planner costing these cases correctly?

The reason why I'm skeptical about LIMIT with a plan of the form
append-some-sorts-together is that there are going to be large
discontinuities in the cost-vs-number-of-rows-returned graph,
ie, every time you finish one child plan and start the next one,
there'll be a hiccup while the sort happens. This is something
that our plan cost approximation (startup cost followed by linear
output up to total cost) can't even represent. Sticking a
LIMIT on top of that is certainly not going to lead to any useful
estimate of the actual cost, meaning that if you get a correct
plan choice it'll just be by luck, and if you don't there'll be
nothing to be done about it.

If we don't incorporate that, then the situation that the planner
will have to model is a MergeAppend with possibly some sorts in
front of it, and it will correctly cost that as if all the sorts
happen before any output occurs, and so you can hope that reasonable
plan choices will ensue.

I believe that the cases where a LIMIT query actually ought to go
for a fast-start plan are where *all* the child rels have fast-start
(non-sort) paths --- which is exactly the cases we'd get if we only
allow "sorted" Appends when none of the inputs require a sort.
Imagining that we'll get good results in cases where some of them
need to be sorted is just wishful thinking.

It would be unfortunate to reject this patch based on a sentence that
starts with "[ shrug... ]", especially so when three people have stood
up and disagreed with you.

I don't want to reject the patch altogether, I just want it to not
include a special hack to allow Append rather than MergeAppend in such
cases. I am aware that I'm probably going to be shouted down on this
point, but that will not change my opinion that the shouters are wrong.

regards, tom lane
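
The discontinuity described in this message can be made concrete with a small numeric sketch. The Python below is illustrative only and is not PostgreSQL code: the `Child` class, both functions, and every cost number are assumptions invented for the example. It compares the linear approximation (startup cost, then linear output up to total cost) with the cost of actually pulling k rows through an ordered Append whose second child needs a sort.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Child:
    startup: float  # cost paid before the first row appears (e.g. an explicit Sort)
    run: float      # cost of emitting all of the child's rows after startup
    rows: int


def linear_estimate(children: List[Child], k: int) -> float:
    """Linear model: the Append starts as cheaply as its first child,
    then cost grows in a straight line up to the total cost."""
    startup = children[0].startup
    total = sum(c.startup + c.run for c in children)
    rows = sum(c.rows for c in children)
    return startup + (total - startup) * min(k, rows) / rows


def stepwise_cost(children: List[Child], k: int) -> float:
    """Cost of really fetching k rows: each child's startup cost is paid
    in full at the moment that child is first touched."""
    cost, remaining = 0.0, k
    for c in children:
        if remaining <= 0:
            break
        cost += c.startup
        take = min(remaining, c.rows)
        cost += c.run * take / c.rows
        remaining -= take
    return cost


# Hypothetical inputs: a small pre-ordered child followed by a large child
# that must be sorted before it can return anything.
children = [
    Child(startup=0.0, run=10.0, rows=100),
    Child(startup=1000.0, run=100.0, rows=10000),
]

for k in (100, 101):
    print(k, linear_estimate(children, k), stepwise_cost(children, k))
```

With these made-up numbers the linear estimate barely moves between 100 and 101 rows, while the stepwise cost jumps by the whole sort cost as soon as the second child is touched.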

#44

Simon Riggs

simon@2ndquadrant.com

almost 7 years ago

In reply to: Tom Lane (#43)

Re: Ordered Partitioned Table Scans

On Fri, 22 Mar 2019 at 11:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <david.rowley@2ndquadrant.com> writes:

On Sat, 9 Mar 2019 at 10:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:

This can be a huge win for queries of the form "ORDER BY partkey LIMIT
x". Even if the first subpath(s) aren't natively ordered, not all of
the sorts should actually be performed.

[ shrug... ] We've got no realistic chance of estimating such situations
properly, so I'd have no confidence in a plan choice based on such a
thing. Nor do I believe that this case is all that important.

Wondering if you can provide more details on why you think there's no
realistic chance of the planner costing these cases correctly?

The reason why I'm skeptical about LIMIT with a plan of the form
append-some-sorts-together is that there are going to be large
discontinuities in the cost-vs-number-of-rows-returned graph,
ie, every time you finish one child plan and start the next one,
there'll be a hiccup while the sort happens. This is something
that our plan cost approximation (startup cost followed by linear
output up to total cost) can't even represent. Sticking a
LIMIT on top of that is certainly not going to lead to any useful
estimate of the actual cost, meaning that if you get a correct
plan choice it'll just be by luck, and if you don't there'll be
nothing to be done about it.

If we don't incorporate that, then the situation that the planner
will have to model is a MergeAppend with possibly some sorts in
front of it, and it will correctly cost that as if all the sorts
happen before any output occurs, and so you can hope that reasonable
plan choices will ensue.

I believe that the cases where a LIMIT query actually ought to go
for a fast-start plan are where *all* the child rels have fast-start
(non-sort) paths --- which is exactly the cases we'd get if we only
allow "sorted" Appends when none of the inputs require a sort.
Imagining that we'll get good results in cases where some of them
need to be sorted is just wishful thinking.

It would be unfortunate to reject this patch based on a sentence that
starts with "[ shrug... ]", especially so when three people have stood
up and disagreed with you.

I don't want to reject the patch altogether, I just want it to not
include a special hack to allow Append rather than MergeAppend in such
cases. I am aware that I'm probably going to be shouted down on this
point, but that will not change my opinion that the shouters are wrong.

I agree that the issue of mixing sorts at various points will make nonsense
of the startup cost/total cost results.

What you say about LIMIT is exactly right. But ISTM that LIMIT itself is
the issue there and it needs more smarts to correctly calculate costs.

I don't see LIMIT costing being broken as a reason to restrict this
optimization. I would ask that we allow improvements to the important use
case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#45

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Tom Lane (#43)

Re: Ordered Partitioned Table Scans

On Sat, 23 Mar 2019 at 04:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The reason why I'm skeptical about LIMIT with a plan of the form
append-some-sorts-together is that there are going to be large
discontinuities in the cost-vs-number-of-rows-returned graph,
ie, every time you finish one child plan and start the next one,
there'll be a hiccup while the sort happens. This is something
that our plan cost approximation (startup cost followed by linear
output up to total cost) can't even represent. Sticking a
LIMIT on top of that is certainly not going to lead to any useful
estimate of the actual cost, meaning that if you get a correct
plan choice it'll just be by luck, and if you don't there'll be
nothing to be done about it.

Thanks for explaining. I see where you're coming from now. I think
this point would carry more weight if using the Append instead of the
MergeAppend were worse in some cases as we could end up using an
inferior plan accidentally. However, that's not the case. The Append
plan should always perform better both for startup and pulling a
single row all the way to pulling the final row. The underlying
subplans are the same in each case, but Append has the additional
saving of not having to determine to perform a sort on the top row
from each subpath.

I also think that cost-vs-number-of-rows-returned is not any worse
than a SeqScan where the required rows are unevenly distributed
throughout the table. In fact, the SeqScan case is much worse as we
could end up choosing that over an index scan, which could be
significantly better, but as mentioned above, and benchmarked in the
initial post of this thread, Append always wins over MergeAppend, so I
don't quite understand your reasoning here. I could understand it if
Append needed the sorts but MergeAppend did not, but they both need
the sorts if there's not a path that already provides the required
ordering.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
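
The difference between the two node types discussed in this thread can be pictured with a short Python sketch. It is an analogy under stated assumptions, not the executor's implementation: the function names and the sample data are hypothetical, `heapq.merge` stands in for a MergeAppend that compares the head row of every input, and `itertools.chain` stands in for an Append that simply exhausts each input in turn.

```python
import heapq
from itertools import chain
from typing import List

# Hypothetical partitions: each list is sorted and the ranges do not overlap,
# so partition order already matches the required sort order.
partitions = [[0, 1, 2], [10, 11], [20, 25, 29]]


def merge_append(parts: List[List[int]]) -> List[int]:
    """MergeAppend-like: repeatedly pick the smallest head row across all inputs."""
    return list(heapq.merge(*parts))


def ordered_append(parts: List[List[int]]) -> List[int]:
    """Append-like: read each input to the end, one after another, no comparisons."""
    return list(chain.from_iterable(parts))


# Same output for ordered, non-overlapping partitions.
print(merge_append(partitions) == ordered_append(partitions))

# Interleaved list bounds, like partitions for (1), (2), (3, 5) and (4):
# plain concatenation is no longer sorted, so a merge is still required.
interleaved = [[1], [2], [3, 5], [4]]
print(merge_append(interleaved))
print(ordered_append(interleaved))
```

The second case mirrors the regression test in the patch where a LIST partition holding 3 and 5 forces a Merge Append.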

#46

Tom Lane

tgl@sss.pgh.pa.us

almost 7 years ago

In reply to: Simon Riggs (#44)

Re: Ordered Partitioned Table Scans

Simon Riggs <simon@2ndquadrant.com> writes:

I agree that the issue of mixing sorts at various points will make nonsense
of the startup cost/total cost results.

Right.

I don't see LIMIT costing being broken as a reason to restrict this
optimization. I would ask that we allow improvements to the important use
case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly.

There's not time to reinvent LIMIT costing for v12. I'd be happy to
see some work done on that in the future, and when it does get done,
I'd be happy to see Append planning extended to allow this case.
I just don't think it's wise to ship one without the other.

regards, tom lane

#47

Simon Riggs

simon@2ndquadrant.com

almost 7 years ago

In reply to: Tom Lane (#46)

Re: Ordered Partitioned Table Scans

On Fri, 22 Mar 2019 at 11:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

I agree that the issue of mixing sorts at various points will make nonsense
of the startup cost/total cost results.

Right.

I don't see LIMIT costing being broken as a reason to restrict this
optimization. I would ask that we allow improvements to the important use
case of ORDER BY/LIMIT, then spend time on making LIMIT work correctly.

There's not time to reinvent LIMIT costing for v12. I'd be happy to
see some work done on that in the future, and when it does get done,
I'd be happy to see Append planning extended to allow this case.
I just don't think it's wise to ship one without the other.

I was hoping to motivate you to look at this personally, and soon. LIMIT is
so broken that any improvements count as bug fixes in my book.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#48

Tom Lane

tgl@sss.pgh.pa.us

almost 7 years ago

In reply to: David Rowley (#45)

Re: Ordered Partitioned Table Scans

David Rowley <david.rowley@2ndquadrant.com> writes:

Thanks for explaining. I see where you're coming from now. I think
this point would carry more weight if using the Append instead of the
MergeAppend were worse in some cases as we could end up using an
inferior plan accidentally. However, that's not the case. The Append
plan should always perform better both for startup and pulling a
single row all the way to pulling the final row. The underlying
subplans are the same in each case, but Append has the additional
saving of not having to determine to perform a sort on the top row
from each subpath.

Uh, what? sorted-Append and MergeAppend would need pre-sorts on
exactly the same set of children. It's true that the Append path
might not have to actually execute some of those sorts, if it's
able to stop in an earlier child. The problem here is basically
that it's hard to predict whether that will happen.

Append always wins over MergeAppend, so I
don't quite understand your reasoning here.

The problem is that the planner is likely to favor a "fast-start"
Append *too much*, and prefer it over some other plan altogether.

In cases where, say, the first child requires no sort but also doesn't
emit very many rows, while the second child requires an expensive sort,
the planner will have a ridiculously optimistic opinion of the cost of
fetching slightly more rows than are available from the first child.
This might lead it to wrongly choose a merge join over a hash for example.

Yes, there are cases where Append-with-some-sorts is preferable to
MergeAppend-with-some-sorts, and maybe I'd even believe that it
always is. But I don't believe that it's necessarily preferable
to plans that don't require a sort at all, and I'm afraid that we
are likely to find the planner making seriously bad choices when
it's presented with such situations. I'd rather we leave that
case out for now, until we have some better way of modelling it.

The fact that the patch also requires a lot of extra hacking just
to support this case badly doesn't make me any more favorably
disposed.

regards, tom lane

#49

Robert Haas

robertmhaas@gmail.com

almost 7 years ago

In reply to: Tom Lane (#48)

Re: Ordered Partitioned Table Scans

On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

In cases where, say, the first child requires no sort but also doesn't
emit very many rows, while the second child requires an expensive sort,
the planner will have a ridiculously optimistic opinion of the cost of
fetching slightly more rows than are available from the first child.
This might lead it to wrongly choose a merge join over a hash for example.

I think this is very much a valid point, especially in view of the
fact that we already choose supposedly fast-start plans too often. I
don't know whether it's a death sentence for this patch, but it should
at least make us stop and think hard.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#50

Tom Lane

tgl@sss.pgh.pa.us

almost 7 years ago

In reply to: Robert Haas (#49)

Re: Ordered Partitioned Table Scans

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

In cases where, say, the first child requires no sort but also doesn't
emit very many rows, while the second child requires an expensive sort,
the planner will have a ridiculously optimistic opinion of the cost of
fetching slightly more rows than are available from the first child.
This might lead it to wrongly choose a merge join over a hash for example.

I think this is very much a valid point, especially in view of the
fact that we already choose supposedly fast-start plans too often. I
don't know whether it's a death sentence for this patch, but it should
at least make us stop and think hard.

Once again: this objection is not a "death sentence for this patch".
I simply wish to suppress the option to generate an ordered Append
when some of the inputs would require an added sort step. As long
as we have pre-ordered paths for all children, go for it.

regards, tom lane

#51

Robert Haas

robertmhaas@gmail.com

almost 7 years ago

In reply to: Tom Lane (#50)

Re: Ordered Partitioned Table Scans

On Fri, Mar 22, 2019 at 12:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Once again: this objection is not a "death sentence for this patch".
I simply wish to suppress the option to generate an ordered Append
when some of the inputs would require an added sort step. As long
as we have pre-ordered paths for all children, go for it.

I stand corrected.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#52

Tom Lane

tgl@sss.pgh.pa.us

almost 7 years ago

In reply to: Robert Haas (#49)

Re: Ordered Partitioned Table Scans

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

In cases where, say, the first child requires no sort but also doesn't
emit very many rows, while the second child requires an expensive sort,
the planner will have a ridiculously optimistic opinion of the cost of
fetching slightly more rows than are available from the first child.
This might lead it to wrongly choose a merge join over a hash for example.

I think this is very much a valid point, especially in view of the
fact that we already choose supposedly fast-start plans too often. I
don't know whether it's a death sentence for this patch, but it should
at least make us stop and think hard.

BTW, another thing we could possibly do to answer this objection is to
give the ordered-Append node an artificially pessimistic startup cost,
such as the sum or the max of its children's startup costs. That's
pretty ugly and unprincipled, but maybe it's better than not having the
ability to generate the plan shape at all?

regards, tom lane

#53

Robert Haas

robertmhaas@gmail.com

almost 7 years ago

In reply to: Tom Lane (#52)

Re: Ordered Partitioned Table Scans

On Fri, Mar 22, 2019 at 12:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

In cases where, say, the first child requires no sort but also doesn't
emit very many rows, while the second child requires an expensive sort,
the planner will have a ridiculously optimistic opinion of the cost of
fetching slightly more rows than are available from the first child.
This might lead it to wrongly choose a merge join over a hash for example.

I think this is very much a valid point, especially in view of the
fact that we already choose supposedly fast-start plans too often. I
don't know whether it's a death sentence for this patch, but it should
at least make us stop and think hard.

BTW, another thing we could possibly do to answer this objection is to
give the ordered-Append node an artificially pessimistic startup cost,
such as the sum or the max of its children's startup costs. That's
pretty ugly and unprincipled, but maybe it's better than not having the
ability to generate the plan shape at all?

Yeah, I'm not sure whether that's a good idea or not. I think one of
the problems with a cost-based optimizer is that once you start
putting things in with the wrong cost because you think it will give
the right answer, you're sorta playing with fire, because you can't
necessarily predict how things are going to turn out in more
complex scenarios. On the other hand, it may sometimes be the right
thing to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#54

Julien Rouhaud

rjuju123@gmail.com

almost 7 years ago

In reply to: Robert Haas (#53)

Re: Ordered Partitioned Table Scans

On Fri, Mar 22, 2019 at 7:19 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 22, 2019 at 12:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Mar 22, 2019 at 11:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

In cases where, say, the first child requires no sort but also doesn't
emit very many rows, while the second child requires an expensive sort,
the planner will have a ridiculously optimistic opinion of the cost of
fetching slightly more rows than are available from the first child.
This might lead it to wrongly choose a merge join over a hash for example.

I think this is very much a valid point, especially in view of the
fact that we already choose supposedly fast-start plans too often. I
don't know whether it's a death sentence for this patch, but it should
at least make us stop and think hard.

BTW, another thing we could possibly do to answer this objection is to
give the ordered-Append node an artificially pessimistic startup cost,
such as the sum or the max of its children's startup costs. That's
pretty ugly and unprincipled, but maybe it's better than not having the
ability to generate the plan shape at all?

Yeah, I'm not sure whether that's a good idea or not. I think one of
the problems with a cost-based optimizer is that once you start
putting things in with the wrong cost because you think it will give
the right answer, you're sorta playing with fire, because you can't
necessarily predict how things are going to turn out in more
complex scenarios. On the other hand, it may sometimes be the right
thing to do.

I've been bitten too many times with super inefficient plans of the
form "let's use the wrong index instead of the good one because I'll
probably find there the tuple I want very quickly", due to LIMIT
assuming an even distribution. Since those queries can end up taking
dozens of minutes instead of less than a ms, without a lot of control to
fix this kind of problem I definitely don't want to introduce another
similar source of pain for users.

However, what we're talking about here is still a corner case. People
having partitioned tables with a mix of partitions with and without
indexes suitable for ORDER BY x LIMIT y queries should already have at
best average performance. And trying to handle this case cannot hurt
cases where all partitions have suitable indexes, so that may be an
acceptable risk?

I also have mixed feelings about this artificial startup cost penalty,
but if we go this way we can for sure accumulate the startup cost of all
sorts that we think we'll have to perform (according to each path's
estimated rows and the given limit_tuples). That probably won't be
enough though, especially with fractional paths.
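
One way to read this suggestion, as a rough sketch: walk the children in scan order and add up startup costs only for the children that the limit is expected to reach. ChildEst, limited_startup_cost and the convention that a negative limit_tuples means "no limit" are assumptions made for this example, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-child estimate; not a PostgreSQL struct. */
typedef struct ChildEst
{
    double startup_cost;   /* e.g. the cost of a Sort before its first row */
    double rows;           /* estimated rows the child returns */
} ChildEst;

/*
 * Sum the startup costs of the children expected to be started before
 * limit_tuples rows have been produced.  A negative limit_tuples is
 * treated as "no limit", so every child is counted.
 */
static double
limited_startup_cost(const ChildEst *children, size_t n, double limit_tuples)
{
    double startup = 0.0;
    double rows_seen = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
    {
        startup += children[i].startup_cost;
        rows_seen += children[i].rows;

        /* Stop once the children seen so far can satisfy the limit. */
        if (limit_tuples >= 0 && rows_seen >= limit_tuples)
            break;
    }
    return startup;
}
```

The result depends entirely on the per-child row estimates, so a bad estimate for an early child changes which startup costs are counted.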

#55

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Tom Lane (#48)

Re: Ordered Partitioned Table Scans

On Sat, 23 Mar 2019 at 04:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <david.rowley@2ndquadrant.com> writes:

Append has the additional
saving of not having to determine to perform a sort on the top row
from each subpath.

Uh, what? sorted-Append and MergeAppend would need pre-sorts on
exactly the same set of children.

I was talking about the binary heap code that MergeAppend uses to
decide which subplan to pull from next.
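
For illustration, here is a small sketch of that difference. It is not the executor's code, and Run, merge_runs, append_runs and the comparison counter are hypothetical names. It merges already sorted runs through a binary min-heap, as a merge has to when it cannot assume anything about the runs, and compares that with plain concatenation, which is enough when the runs do not overlap and are already in order.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RUNS 16

/* One sorted input, standing in for one child subplan. */
typedef struct Run
{
    const int *vals;
    size_t     len;
    size_t     pos;        /* next value to return */
} Run;

static long heap_comparisons = 0;   /* comparisons spent picking a run */

static int
run_head(const Run *r)
{
    return r->vals[r->pos];
}

/* Restore the min-heap property below slot i; heap[] holds run indexes. */
static void
sift_down(size_t *heap, size_t n, size_t i, const Run *runs)
{
    for (;;)
    {
        size_t l = 2 * i + 1;
        size_t r = l + 1;
        size_t min = i;
        size_t tmp;

        if (l < n)
        {
            heap_comparisons++;
            if (run_head(&runs[heap[l]]) < run_head(&runs[heap[min]]))
                min = l;
        }
        if (r < n)
        {
            heap_comparisons++;
            if (run_head(&runs[heap[r]]) < run_head(&runs[heap[min]]))
                min = r;
        }
        if (min == i)
            break;
        tmp = heap[i];
        heap[i] = heap[min];
        heap[min] = tmp;
        i = min;
    }
}

/* Heap-based merge of sorted runs; returns the number of values written. */
static size_t
merge_runs(Run *runs, size_t nruns, int *out)
{
    size_t heap[MAX_RUNS];
    size_t n = 0;
    size_t nout = 0;
    size_t i;

    assert(nruns <= MAX_RUNS);
    for (i = 0; i < nruns; i++)
    {
        runs[i].pos = 0;
        if (runs[i].len > 0)
            heap[n++] = i;
    }
    for (i = n / 2; i-- > 0;)
        sift_down(heap, n, i, runs);

    while (n > 0)
    {
        Run *top = &runs[heap[0]];

        out[nout++] = top->vals[top->pos++];
        if (top->pos == top->len)
            heap[0] = heap[--n];    /* run exhausted, drop it */
        sift_down(heap, n, 0, runs);
    }
    return nout;
}

/* Plain concatenation: no comparisons between runs at all. */
static size_t
append_runs(const Run *runs, size_t nruns, int *out)
{
    size_t nout = 0;
    size_t i, j;

    for (i = 0; i < nruns; i++)
        for (j = 0; j < runs[i].len; j++)
            out[nout++] = runs[i].vals[j];
    return nout;
}
```

When the runs are ordered and non-overlapping, both functions produce the same output, but only the merge pays for heap maintenance on every value.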

In cases where, say, the first child requires no sort but also doesn't
emit very many rows, while the second child requires an expensive sort,
the planner will have a ridiculously optimistic opinion of the cost of
fetching slightly more rows than are available from the first child.
This might lead it to wrongly choose a merge join over a hash for example.

umm.. Yeah, that's a good point. I seem to have failed to consider
that the fast startup plan could lower the cost of a merge join with a
limit. I agree with that concern. I also find it slightly annoying
since we already make other plan shapes that can suffer from similar
problems, e.g. Index scan + filter + limit, but I agree we don't need
any more of those as they're pretty painful when they hit.

I'll change the patch around to pull out the code you've mentioned.

Thanks for spelling out your point to me.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#56

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Tom Lane (#52)

Re: Ordered Partitioned Table Scans

On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, another thing we could possibly do to answer this objection is to
give the ordered-Append node an artificially pessimistic startup cost,
such as the sum or the max of its children's startup costs. That's
pretty ugly and unprincipled, but maybe it's better than not having the
ability to generate the plan shape at all?

I admit to having thought of that while trying to get to sleep last
night, but I was too scared to even suggest it. It's pretty much how
MergeAppend would cost it anyway. I agree it's not pretty to lie
about the startup cost, but it does kinda seem silly to fall back on a
more expensive MergeAppend when we know fine well Append is cheaper.
Probably the danger would be that someone pulls it out thinking it's a
bug. So we'd need to clearly comment why we're doing it.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#57

Tom Lane

tgl@sss.pgh.pa.us

almost 7 years ago

In reply to: David Rowley (#56)

Re: Ordered Partitioned Table Scans

David Rowley <david.rowley@2ndquadrant.com> writes:

On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, another thing we could possibly do to answer this objection is to
give the ordered-Append node an artificially pessimistic startup cost,
such as the sum or the max of its children's startup costs. That's
pretty ugly and unprincipled, but maybe it's better than not having the
ability to generate the plan shape at all?

I admit to having thought of that while trying to get to sleep last
night, but I was too scared to even suggest it. It's pretty much how
MergeAppend would cost it anyway. I agree it's not pretty to lie
about the startup cost, but it does kinda seem silly to fall back on a
more expensive MergeAppend when we know fine well Append is cheaper.

Yeah. I'm starting to think that this might actually be the way to go,
and here's why: my argument here is basically that a child plan that
has a large startup cost is going to screw up our ability to estimate
whether the parent Append is really a fast-start plan or not. Now, if
we have to insert a Sort to make a child plan be correctly ordered, that
clearly is a case where the child could have a large startup cost ...
but what if a correctly-ordered child has a large startup cost for
some other reason? Simply refusing to insert Sort nodes won't keep us
out of the weeds if that's true. However, if we stick in a hack like
the one suggested above, that will keep us from being too optimistic
about the fast-start properties of the Append node no matter whether
the problem arises from an added Sort node or is intrinsic to the
child plan.
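
The two variants of the hack can be sketched as follows. This is an illustration only, assuming each child is summarised by a startup and a total cost; PathCost and both function names are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost summary for one child path. */
typedef struct PathCost
{
    double startup_cost;
    double total_cost;
} PathCost;

/* Pessimistic startup cost: the sum of every child's startup cost. */
static double
startup_cost_sum(const PathCost *children, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += children[i].startup_cost;
    return sum;
}

/* Milder variant: the largest startup cost among the children. */
static double
startup_cost_max(const PathCost *children, size_t n)
{
    double max = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (children[i].startup_cost > max)
            max = children[i].startup_cost;
    }
    return max;
}
```

Either variant charges the expensive child up front no matter where it sits in the scan order, which is what keeps the Append from looking like a fast-start plan when it is not.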

It may well be that as things stand today, this scenario is only
hypothetical, because we can only prove that a plain-Append plan
is correctly sorted if it's arising from a suitably partitioned table,
and the child plans in such cases will all be IndexScans with
minimal startup cost. But we should look ahead to scenarios where
that's not true. (mumble maybe a foreign table as a partition
is already a counterexample? mumble)

regards, tom lane

#58

Julien Rouhaud

rjuju123@gmail.com

almost 7 years ago

In reply to: David Rowley (#21)

Re: Ordered Partitioned Table Scans

On Wed, Dec 19, 2018 at 3:01 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Thu, 20 Dec 2018 at 01:58, Julien Rouhaud <rjuju123@gmail.com> wrote:

The multi-level partitioning case is another
thing that would need to be handled for instance (and that's the main
reason I couldn't submit a new patch when I was working on it), and
I'm definitely not arguing to cover it in this patch.

As far as I'm aware, the multi-level partitioning should work just
fine with the current patch. I added code for that a while ago. There
are regression tests to exercise it. I'm not aware of any cases where
it does not work.

Sorry to come back this late. What I was mentioning about
sub-partitioning is when a whole partition hierarchy is natively
ordered, we could avoid generating merge appends. But unless I'm
missing something with your patch, that won't happen.

Considering

CREATE TABLE nested (id1 integer, id2 integer, val text) PARTITION BY
LIST (id1);

CREATE TABLE nested_1 PARTITION OF nested FOR VALUES IN (1) PARTITION
BY RANGE (id2);
CREATE TABLE nested_1_1 PARTITION OF nested_1 FOR VALUES FROM (1) TO (100000);
CREATE TABLE nested_1_2 PARTITION OF nested_1 FOR VALUES FROM (100000)
TO (200000);
CREATE TABLE nested_1_3 PARTITION OF nested_1 FOR VALUES FROM (200000)
TO (300000);

CREATE TABLE nested_2 PARTITION OF nested FOR VALUES IN (2) PARTITION
BY RANGE (id2);
CREATE TABLE nested_2_1 PARTITION OF nested_2 FOR VALUES FROM (1) TO (100000);
CREATE TABLE nested_2_2 PARTITION OF nested_2 FOR VALUES FROM (100000)
TO (200000);
CREATE TABLE nested_2_3 PARTITION OF nested_2 FOR VALUES FROM (200000)
TO (300000);

CREATE INDEX ON nested(id1, id2);

ISTM that a query like
SELECT * FROM nested ORDER BY 1, 2;
could simply append all the partitions in the right order (or generate
a tree of ordered appends), but:

QUERY PLAN
-------------------------------------------------------------------
Append
-> Merge Append
Sort Key: nested_1_1.id1, nested_1_1.id2
-> Index Scan using nested_1_1_id1_id2_idx on nested_1_1
-> Index Scan using nested_1_2_id1_id2_idx on nested_1_2
-> Index Scan using nested_1_3_id1_id2_idx on nested_1_3
-> Merge Append
Sort Key: nested_2_1.id1, nested_2_1.id2
-> Index Scan using nested_2_1_id1_id2_idx on nested_2_1
-> Index Scan using nested_2_2_id1_id2_idx on nested_2_2
-> Index Scan using nested_2_3_id1_id2_idx on nested_2_3
(11 rows)

Also, a query like
SELECT * FROM nested_1 ORDER BY 1, 2;
could generate an append path, since the first column is guaranteed to
be identical in all partitions, but instead:

QUERY PLAN
-------------------------------------------------------------
Merge Append
Sort Key: nested_1_1.id1, nested_1_1.id2
-> Index Scan using nested_1_1_id1_id2_idx on nested_1_1
-> Index Scan using nested_1_2_id1_id2_idx on nested_1_2
-> Index Scan using nested_1_3_id1_id2_idx on nested_1_3
(5 rows)

and of course

# EXPLAIN (costs off) SELECT * FROM nested_1 ORDER BY 2;
QUERY PLAN
------------------------------------
Sort
Sort Key: nested_1_1.id2
-> Append
-> Seq Scan on nested_1_1
-> Seq Scan on nested_1_2
-> Seq Scan on nested_1_3
(6 rows)

I admit that I didn't re-read the whole thread, so maybe I'm missing
something (if that's the case my apologies, and feel free to point me
to any relevant discussion). I'm just trying to make sure that we don't
miss some cases, as those seem possible and useful to handle. Or is
that out of scope?

#59

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Julien Rouhaud (#58)

Re: Ordered Partitioned Table Scans

On Sun, 24 Mar 2019 at 05:16, Julien Rouhaud <rjuju123@gmail.com> wrote:

ISTM that a query like
SELECT * FROM nested ORDER BY 1, 2;
could simply append all the partitions in the right order (or generate
a tree of ordered appends), but:

QUERY PLAN
-------------------------------------------------------------------
Append
-> Merge Append
Sort Key: nested_1_1.id1, nested_1_1.id2
-> Index Scan using nested_1_1_id1_id2_idx on nested_1_1
-> Index Scan using nested_1_2_id1_id2_idx on nested_1_2
-> Index Scan using nested_1_3_id1_id2_idx on nested_1_3
-> Merge Append
Sort Key: nested_2_1.id1, nested_2_1.id2
-> Index Scan using nested_2_1_id1_id2_idx on nested_2_1
-> Index Scan using nested_2_2_id1_id2_idx on nested_2_2
-> Index Scan using nested_2_3_id1_id2_idx on nested_2_3
(11 rows)

Also, a query like
SELECT * FROM nested_1 ORDER BY 1, 2;
could generate an append path, since the first column is guaranteed to
be identical in all partitions, but instead:

QUERY PLAN
-------------------------------------------------------------
Merge Append
Sort Key: nested_1_1.id1, nested_1_1.id2
-> Index Scan using nested_1_1_id1_id2_idx on nested_1_1
-> Index Scan using nested_1_2_id1_id2_idx on nested_1_2
-> Index Scan using nested_1_3_id1_id2_idx on nested_1_3
(5 rows)

and of course

# EXPLAIN (costs off) SELECT * FROM nested_1 ORDER BY 2;
QUERY PLAN
------------------------------------
Sort
Sort Key: nested_1_1.id2
-> Append
-> Seq Scan on nested_1_1
-> Seq Scan on nested_1_2
-> Seq Scan on nested_1_3
(6 rows)

I think both these cases could be handled, but I think the way it
would likely have to be done would be to run the partition constraints
through equivalence class processing. Likely doing that would need
some new field in EquivalenceClass that indicated that the eclass did
not need to be applied to the partition. If it was done that way then
pathkey_is_redundant() would be true for the id1 column's pathkey in
the sub-partitioned tables. The last plan you show above could also
use an index scan too since build_index_pathkeys() would also find the
pathkey redundant. Doing this would also mean that a query like: select *
from nested_1_1 where id1=1; would not apply "id1 = 1" as a base qual
to the partition. That's good for 2 reasons. 1) No wasted effort
filtering rows that always match; and 2) A Seq scan can be used
instead of the planner possibly thinking that an index scan might be
useful to filter rows. Stats might tell the planner that anyway...
but...

I suggested some changes to equivalence classes a few years ago in [1]
and I failed to get that idea floating. In some ways, this is similar as
it requires having equivalence classes that are not used in all cases.
I think to get something working a week before code cutoff is a step
too far for this, but certainly, it would be interesting to look into
fixing it in a later release.

[1]: /messages/by-id/CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#60

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Tom Lane (#57)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On Sat, 23 Mar 2019 at 19:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <david.rowley@2ndquadrant.com> writes:

On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, another thing we could possibly do to answer this objection is to
give the ordered-Append node an artificially pessimistic startup cost,
such as the sum or the max of its children's startup costs. That's
pretty ugly and unprincipled, but maybe it's better than not having the
ability to generate the plan shape at all?

I admit to having thought of that while trying to get to sleep last
night, but I was too scared to even suggest it. It's pretty much how
MergeAppend would cost it anyway. I agree it's not pretty to lie
about the startup cost, but it does kinda seem silly to fall back on a
more expensive MergeAppend when we know fine well Append is cheaper.

Yeah. I'm starting to think that this might actually be the way to go,

Here's a version with it done that way.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

mergeappend_to_append_conversion_v13.patch (application/octet-stream)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 910a738c20..755ef43caa 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1847,6 +1847,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b2c5c833f7..a9f406331b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -96,15 +96,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_dummy_rel_pathlist(RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
@@ -1551,7 +1552,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1593,7 +1594,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1643,19 +1644,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1705,41 +1706,79 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provide tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+												&partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection,
+											&partition_pathkeys_desc_partial);
+
+		/*
+		 * You might think we should truncate_useless_pathkeys here, but
+		 * allowing partition keys which are a subset of the query's pathkeys
+		 * can often be useful.  For example, a RANGE partitioned table on
+		 * (a, b), and a query with an ORDER BY a, b, c.  We can still allow
+		 * an Append scan in this case.  Imagine each partition has a btree
+		 * index on (a, b, c), scanning those indexes still provides tuples in
+		 * the correct order and using an Append in place of a MergeAppend is
+		 * still valid since lower-order  (a, b) tuples will still come before
+		 * higher-order ones over all partitions.
+		 */
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1748,6 +1787,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+						(!partition_pathkeys_partial &&
+						 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+								(pathkeys_contained_in(pathkeys,
+												partition_pathkeys_desc) ||
+						(!partition_pathkeys_desc_partial &&
+							pathkeys_contained_in(partition_pathkeys_desc,
+												  pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1790,26 +1852,86 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * Build an Append path when in partition order.  If in reverse
+			 * partition order we build a reverse list of subpaths so that we
+			 * scan them in the opposite order.
+			 */
+			if (partition_order)
+			{
+				/*
+				 * Attempt to flatten subpaths that are themselves Appends or
+				 * MergeAppends.  We can do this providing the Append or
+				 * MergeAppend has just a single subpath.  If there are
+				 * multiple subpaths then we can't make guarantees about the
+				 * order tuples in those subpaths, so we must leave the
+				 * Append/MergeAppend in place.
+				 */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1950,6 +2072,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		return 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -1973,7 +2123,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..68882287df 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,72 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 
-		/*
-		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
-		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		if (pathkeys == NIL)
+		{
+			Path	   *subpath = (Path *) linitial(apath->subpaths);
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
-		foreach(l, apath->subpaths)
+			/*
+			 * When there are no pathkeys the startup cost of
+			 * non-parallel-aware Append is the startup cost of the first
+			 * subpath.
+			 */
+			apath->path.startup_cost = subpath->startup_cost;
+
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+
+				apath->path.rows += subpath->rows;
+				apath->path.total_cost += subpath->total_cost;
+			}
+		}
+		else
 		{
-			Path	   *subpath = (Path *) lfirst(l);
+			/*
+			 * Otherwise we make the Append's startup cost the sum of the
+			 * startup costs of all the subpaths.  It may appear that we should
+			 * do the same as above and take the startup cost of just the
+			 * initial subpath.  However, when the query has a LIMIT clause,
+			 * that could end up favoring these ordered Append paths too
+			 * much.  Imagine a scenario where the initial subpath is already
+			 * ordered and is estimated to contain just 10 rows, and the 2nd
+			 * subpath requires a sort and is estimated to have 10 million
+			 * rows.  If the query has LIMIT 11 then we could end up
+			 * performing an expensive sort for just a single row without
+			 * having considered the startup cost for the 2nd subpath.  Such a
+			 * scenario could end up favoring a MergeJoin plan instead of a
+			 * Hash Join plan.
+			 */
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+				{
+					/*
+					 * We'll need to insert a Sort node, so include cost for
+					 * that.
+					 */
+					cost_sort(&sort_path,
+								root,
+								pathkeys,
+								subpath->total_cost,
+								subpath->parent->tuples,
+								subpath->pathtarget->width,
+								0.0,
+								work_mem,
+								apath->limit_tuples);
+
+					subpath = &sort_path;
+				}
 
-			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+				apath->path.rows += subpath->rows;
+				apath->path.startup_cost += subpath->startup_cost;
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1957,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 9604a54b77..7044899dc1 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1264,7 +1264,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..847e6a819d 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -24,6 +24,8 @@
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
+#include "partitioning/partprune.h"
 #include "utils/lsyscache.h"
 
 
@@ -546,6 +548,75 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples will never be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme		partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											 ScanDirectionIsBackward(scandir),
+											 ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			/*
+			 * Boolean partition keys might be redundant even if they do not
+			 * appear in an EquivalenceClass, because of our special treatment
+			 * of boolean equality conditions --- see the comment for
+			 * partkey_is_bool_constant_for_query().  If that applies, we can
+			 * continue to examine lower-order partition keys.  Otherwise, we
+			 * must abort and return any partial matches we've found so far.
+			 */
+			if (partkey_is_bool_constant_for_query(partrel, i))
+				continue;
+
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 93c56c657c..32f39d0355 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,20 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	AttrNumber *nodeSortColIdx;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1095,28 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		int			nodenumsortkeys;
+		Oid		   *nodeSortOperators;
+		Oid		   *nodeCollations;
+		bool	   *nodeNullsFirst;
+
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1126,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1201,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5300,23 +5362,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e408e77d6f..02f805129f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1597,7 +1597,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -3878,6 +3879,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index eb815c2f12..65d77a5402 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 169e51e792..356d2469c4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,7 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
-	pathnode->path.pathkeys = NIL;	/* result is always considered unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1252,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1263,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1276,7 +1289,7 @@ create_append_path(PlannerInfo *root,
 
 	Assert(!parallel_aware || pathnode->path.parallel_safe);
 
-	cost_append(pathnode);
+	cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3708,7 +3721,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 803c23aaf5..cb6247f95a 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1680,6 +1680,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1699,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index c7f3ca2a20..eefc9f27f3 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -109,7 +109,9 @@ typedef struct PruneStepResult
 	bool		scan_null;		/* Scan the partition for NULL values? */
 } PruneStepResult;
 
-
+static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
+								 int partkeycol,
+								 RelOptInfo *partrel);
 static List *make_partitionedrel_pruneinfo(PlannerInfo *root,
 							  RelOptInfo *parentrel,
 							  int *relid_subplan_map,
@@ -176,7 +178,140 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then we needn't take the key into consideration
+ * when checking whether scanning partitions in order could cause lower-order
+ * values to appear in later partitions.  Restriction clauses like WHERE
+ * partkeycol = constant, get turned into an EquivalenceClass containing a
+ * constant, which is recognized as redundant by build_partition_pathkeys().
+ * But if the partition column is a boolean variable (or expression), then we
+ * are not going to see WHERE partkeycol = constant, because expression
+ * preprocessing will have simplified that to "WHERE partkeycol" or
+ * "WHERE NOT partkeycol".  So we are not going to have a matching
+ * EquivalenceClass (unless the query also contains "ORDER BY partkeycol").
+ * To allow such cases to work the same as they would for non-boolean values,
+ * this function is provided to detect whether the specified partkey column
+ * matches a boolean restriction clause.
+ */
+bool
+partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)
+{
+	PartitionScheme		partscheme;
+	ListCell   *lc;
+
+	partscheme = partrel->part_scheme;
 
+	/* If the partkey isn't boolean, we can't possibly get a match */
+	if (!IsBooleanOpfamily(partscheme->partopfamily[partkeycol]))
+		return false;
+
+	/* Check each restriction clause for partrel */
+	foreach(lc, partrel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+		/* Skip pseudoconstant quals */
+		if (rinfo->pseudoconstant)
+			continue;
+
+		/* See if we can match the clause's expression to the partkey column */
+		if (matches_boolean_partition_clause(rinfo, partkeycol, partrel))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * matches_boolean_partition_clause
+ *		Determine if rinfo matches partrel's 'partkeycol' partition key
+ *		column.
+ */
+static bool
+matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
+								 RelOptInfo *partrel)
+{
+	Node	   *clause = (Node *) rinfo->clause;
+	Expr	   *partexpr = (Expr *) linitial(partrel->partexprs[partkeycol]);
+
+	/* Direct match? */
+	if (equal(partexpr, clause))
+		return true;
+	/* NOT clause? */
+	else if (is_notclause(clause))
+	{
+		Node	   *arg = (Node *) get_notclausearg((Expr *) clause);
+
+		if (equal(partexpr, arg))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in an earlier partition.  Returns
+ *		false if this is not possible, or if we have insufficient means to
+ *		prove it.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow when a DEFAULT
+		 * partition exists as this could contain tuples from either below or
+		 * above the defined range, or contain tuples belonging to gaps in the
+		 * defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
+
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this by looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's lower than one already seen in an earlier
+		 * partition, then values could become out of order and we'd have to
+		 * disable the optimization.  For now, let's keep it simple and just
+		 * accept LIST partitions without a DEFAULT partition which only
+		 * accept a single Datum per partition.  This is cheap as it does not
+		 * require any per-partition processing.  Maybe we'd like to handle
+		 * more complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
 /*
  * make_partition_pruneinfo
  *		Builds a PartitionPruneInfo which can be used in the executor to allow
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 253e0b7e48..0bab42e853 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1361,6 +1361,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_APPEND(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..1bcd0e4235 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 36d12bc376..0e858097c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -194,6 +194,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index 2f75717ffb..09a9884d7c 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -74,6 +74,10 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partkey_is_bool_constant_for_query(struct RelOptInfo *partrel,
+								   int partkeycol);
+extern bool partitions_are_ordered(struct PlannerInfo *root,
+					   struct RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
 						 struct RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 565d947b6d..0b06f89104 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2042,7 +2042,231 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Limit
+   ->  Append
+         ->  Sort
+               Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+               ->  Seq Scan on mcrparted0
+                     Filter: (a < 20)
+         ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+               Index Cond: (a < 20)
+(12 rows)
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+explain (costs off) select * from bool_lp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_lp_false_b_idx on bool_lp_false
+   ->  Index Only Scan using bool_lp_true_b_idx on bool_lp_true
+(3 rows)
+
+drop table bool_lp;
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true partition of bool_rp for values from (true,0) to (true,1000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_true_b_a_idx on bool_rp_true
+         Index Cond: (b = true)
+(3 rows)
+
+explain (costs off) select * from bool_rp where b = false order by b,a;
+                             QUERY PLAN                             
+--------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_b_a_idx on bool_rp_false
+         Index Cond: (b = false)
+(3 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
@@ -2055,17 +2279,15 @@ explain (costs off) select min(a), max(a) from parted_minmax where b = '12345';
  Result
    InitPlan 1 (returns $0)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1.a
+           ->  Append
                  ->  Index Only Scan using parted_minmax1i on parted_minmax1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
    InitPlan 2 (returns $1)
      ->  Limit
-           ->  Merge Append
-                 Sort Key: parted_minmax1_1.a DESC
+           ->  Append
                  ->  Index Only Scan Backward using parted_minmax1i on parted_minmax1 parted_minmax1_1
                        Index Cond: ((a IS NOT NULL) AND (b = '12345'::text))
-(13 rows)
+(11 rows)
 
 select min(a), max(a) from parted_minmax where b = '12345';
  min | max 
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 30946f77b6..2c9e42b51a 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3063,14 +3063,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3117,17 +3117,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3140,13 +3138,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3159,12 +3156,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3173,23 +3169,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..897f199b9d 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,109 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+
+explain (costs off) select * from bool_lp order by b;
+
+drop table bool_lp;
+
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true partition of bool_rp for values from (true,0) to (true,1000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+explain (costs off) select * from bool_rp where b = false order by b,a;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index dc327caffd..102893b6f6 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
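
The mclparted tests in the patch above expect a plain Append for LIST partitions (1) and (2), and a Merge Append once partitions (3,5) and (4) are added. The following is a minimal sketch of why the second case is unordered. It is not the patch's partitions_are_ordered() code: the ListPartition struct and all function names are hypothetical, and default partitions and NULL handling are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one LIST partition: the values it accepts. */
typedef struct
{
	const int  *values;
	size_t		nvalues;
} ListPartition;

/* Smallest value accepted by a non-empty partition. */
static int
partition_min(const ListPartition *p)
{
	int			min;
	size_t		i;

	assert(p->nvalues > 0);
	min = p->values[0];
	for (i = 1; i < p->nvalues; i++)
		if (p->values[i] < min)
			min = p->values[i];
	return min;
}

/* Largest value accepted by a non-empty partition. */
static int
partition_max(const ListPartition *p)
{
	int			max;
	size_t		i;

	assert(p->nvalues > 0);
	max = p->values[0];
	for (i = 1; i < p->nvalues; i++)
		if (p->values[i] > max)
			max = p->values[i];
	return max;
}

/*
 * Returns true when every value in each partition sorts after every value
 * in the partition before it, so that reading the partitions one after
 * another (each in sorted order) yields globally sorted output.
 *
 * Assumes parts[] is already ordered by smallest accepted value.
 */
static bool
list_partitions_are_ordered(const ListPartition *parts, size_t nparts)
{
	size_t		i;

	for (i = 1; i < nparts; i++)
		if (partition_min(&parts[i]) <= partition_max(&parts[i - 1]))
			return false;
	return true;
}
```

With partitions (1), (2), (3,5), (4), the check fails at the last step: partition (4) starts at 4, which is not greater than 5, the largest value of the preceding partition (3,5). Simply concatenating the partitions would emit 5 before 4, so a merge is required.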

#61

Julien Rouhaud

rjuju123@gmail.com

almost 7 years ago

In reply to: David Rowley (#60)

Re: Ordered Partitioned Table Scans

On Sun, Mar 24, 2019 at 11:06 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Sat, 23 Mar 2019 at 19:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <david.rowley@2ndquadrant.com> writes:

On Sat, 23 Mar 2019 at 05:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, another thing we could possibly do to answer this objection is to
give the ordered-Append node an artificially pessimistic startup cost,
such as the sum or the max of its children's startup costs. That's
pretty ugly and unprincipled, but maybe it's better than not having the
ability to generate the plan shape at all?

I admit to having thought of that while trying to get to sleep last
night, but I was too scared to even suggest it. It's pretty much how
MergeAppend would cost it anyway. I agree it's not pretty to lie
about the startup cost, but it does kinda seem silly to fall back on a
more expensive MergeAppend when we know fine well Append is cheaper.

Yeah. I'm starting to think that this might actually be the way to go,

Here's a version with it done that way.

FTR this patch doesn't apply since single child [Merge]Append
suppression (8edd0e7946) has been pushed.
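
To make the costing idea quoted above concrete, here is a rough sketch of the candidate startup-cost rules for an ordered Append: the first child's startup cost, or the "artificially pessimistic" max or sum of all children's startup costs. This is only an illustration of the alternatives discussed, not the code in the attached patch; the ChildCost struct, the StartupPolicy enum and the function name are all made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-child cost; not the planner's Path struct. */
typedef struct
{
	double		startup_cost;
} ChildCost;

/* The three candidate rules (names invented for this sketch). */
typedef enum
{
	STARTUP_FIRST_CHILD,		/* cost of the child that is scanned first */
	STARTUP_MAX_OF_CHILDREN,	/* pessimistic: largest child startup cost */
	STARTUP_SUM_OF_CHILDREN		/* pessimistic: sum of child startup costs */
} StartupPolicy;

/* Compute the startup cost of an ordered Append under the given rule. */
static double
ordered_append_startup_cost(const ChildCost *children, size_t nchildren,
							StartupPolicy policy)
{
	double		result;
	size_t		i;

	assert(nchildren > 0);
	result = children[0].startup_cost;

	if (policy == STARTUP_FIRST_CHILD)
		return result;

	for (i = 1; i < nchildren; i++)
	{
		if (policy == STARTUP_SUM_OF_CHILDREN)
			result += children[i].startup_cost;
		else if (children[i].startup_cost > result)
			result = children[i].startup_cost;
	}
	return result;
}
```

For children with startup costs 0.5, 2.0 and 1.0, the three rules give 0.5, 2.0 and 3.5 respectively. The pessimistic rules make the ordered Append look no cheaper to start than the children it might have to open.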

#62

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Julien Rouhaud (#61)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On Tue, 26 Mar 2019 at 09:02, Julien Rouhaud <rjuju123@gmail.com> wrote:

FTR this patch doesn't apply since single child [Merge]Append
suppression (8edd0e7946) has been pushed.

Thanks for letting me know. I've attached v14 based on current master.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

mergeappend_to_append_conversion_v14.patch (application/octet-stream)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 910a738c20..755ef43caa 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1847,6 +1847,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index da0d778721..c4ac4e2ee9 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -96,15 +96,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_dummy_rel_pathlist(RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
@@ -1551,7 +1552,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1593,7 +1594,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1643,19 +1644,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1705,7 +1706,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 
@@ -1734,44 +1735,82 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 				continue;
 
 			appendpath = create_append_path(root, rel, NIL, list_make1(path),
-											NULL, path->parallel_workers,
-											true,
-											partitioned_rels, partial_rows);
+											NIL, NULL, path->parallel_workers,
+											true, partitioned_rels,
+											partial_rows);
 			add_partial_path(rel, (Path *) appendpath);
 		}
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provide tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+												&partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+													BackwardScanDirection,
+											&partition_pathkeys_desc_partial);
+
+		/*
+		 * You might think we should truncate_useless_pathkeys here, but
+		 * allowing partition keys which are a subset of the query's pathkeys
+		 * can often be useful.  For example, a RANGE partitioned table on
+		 * (a, b), and a query with an ORDER BY a, b, c.  We can still allow
+		 * an Append scan in this case.  Imagine each partition has a btree
+		 * index on (a, b, c), scanning those indexes still provides tuples in
+		 * the correct order and using an Append in place of a MergeAppend is
+		 * still valid since lower-order (a, b) tuples will still come before
+		 * higher-order ones over all partitions.
+		 */
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1780,6 +1819,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+						(!partition_pathkeys_partial &&
+						 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+								(pathkeys_contained_in(pathkeys,
+												partition_pathkeys_desc) ||
+						(!partition_pathkeys_desc_partial &&
+							pathkeys_contained_in(partition_pathkeys_desc,
+												  pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1822,26 +1884,86 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * Build an Append path when in partition order.  If in reverse
+			 * partition order we build a reverse list of subpaths so that we
+			 * scan them in the opposite order.
+			 */
+			if (partition_order)
+			{
+				/*
+				 * Attempt to flatten subpaths that are themselves Appends or
+				 * MergeAppends.  We can do this providing the Append or
+				 * MergeAppend has just a single subpath.  If there are
+				 * multiple subpaths then we can't make guarantees about the
+				 * order of tuples in those subpaths, so we must leave the
+				 * Append/MergeAppend in place.
+				 */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
 														rel,
 														startup_subpaths,
+														NIL,
 														pathkeys,
 														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
-			add_path(rel, (Path *) create_merge_append_path(root,
+														0,
+														false,
+														partitioned_rels,
+														-1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
 															rel,
 															total_subpaths,
+															NIL,
+															pathkeys,
+															NULL,
+															0,
+															false,
+															partitioned_rels,
+															-1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
+			add_path(rel, (Path *) create_merge_append_path(root,
+															rel,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1982,6 +2104,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend, or
+ *		returns 'path' if it's not a single sub-path Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -2005,7 +2155,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..f3f9c421a3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,72 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 
-		/*
-		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
-		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		if (pathkeys == NIL)
+		{
+			Path	   *subpath = (Path *) linitial(apath->subpaths);
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
-		foreach(l, apath->subpaths)
+			/*
+			 * When there are no pathkeys the startup cost of
+			 * non-parallel-aware Append is the startup cost of the first
+			 * subpath.
+			 */
+			apath->path.startup_cost = subpath->startup_cost;
+
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+
+				apath->path.rows += subpath->rows;
+				apath->path.total_cost += subpath->total_cost;
+			}
+		}
+		else
 		{
-			Path	   *subpath = (Path *) lfirst(l);
+			/*
+			 * Otherwise we make the Append's startup cost the sum of the
+			 * startup cost of all the subpaths.  It may appear like we should
+			 * just be doing the same as above and take the startup cost of
+			 * just the initial subpath, however, it is possible that when a
+			 * LIMIT clause exists in the query that we could end up favoring
+			 * these ordered Append paths too much.  Imagine a scenario where
+			 * the initial subpath is already ordered and is estimated to
+			 * contain just 10 rows and the 2nd subpath requires a sort and is
+			 * estimated to have 10 million rows, if the query has LIMIT 11
+			 * then we could end up performing an expensive sort for just a
+			 * single row without having considered the startup cost for the
+			 * 2nd subpath.  Such a scenario could end up favoring a MergeJoin
+			 * plan instead of a Hash Join plan.
+			 */
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+				{
+					/*
+					 * We'll need to insert a Sort node, so include cost for
+					 * that.
+					 */
+					cost_sort(&sort_path,
+							  root,
+							  pathkeys,
+							  subpath->total_cost,
+							  subpath->parent->tuples,
+							  subpath->pathtarget->width,
+							  0.0,
+							  work_mem,
+							  apath->limit_tuples);
+
+					subpath = &sort_path;
+				}
 
-			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+				apath->path.rows += subpath->rows;
+				apath->path.startup_cost += subpath->startup_cost;
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1957,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 9604a54b77..7044899dc1 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1264,7 +1264,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..847e6a819d 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -24,6 +24,8 @@
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
+#include "partitioning/partprune.h"
 #include "utils/lsyscache.h"
 
 
@@ -546,6 +548,75 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower-order tuples will never be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme		partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											 ScanDirectionIsBackward(scandir),
+											 ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			/*
+			 * Boolean partition keys might be redundant even if they do not
+			 * appear in an EquivalenceClass, because of our special treatment
+			 * of boolean equality conditions --- see the comment for
+			 * partkey_is_bool_constant_for_query().  If that applies, we can
+			 * continue to examine lower-order partition keys.  Otherwise, we
+			 * must abort and return any partial matches we've found so far.
+			 */
+			if (partkey_is_bool_constant_for_query(partrel, i))
+				continue;
+
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 979c3c212f..49f02ee1e6 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,20 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	AttrNumber *nodeSortColIdx;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1095,28 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		int			nodenumsortkeys;
+		Oid		   *nodeSortOperators;
+		Oid		   *nodeCollations;
+		bool	   *nodeNullsFirst;
+
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1126,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1201,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * won't match the parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5300,23 +5362,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e408e77d6f..02f805129f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1597,7 +1597,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -3878,6 +3879,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index eb815c2f12..65d77a5402 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 56de8fc370..ed253288f8 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,6 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1251,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths
+	 * when the Append has valid pathkeys.  The order they're listed in
+	 * is critical to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1262,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1291,10 +1305,7 @@ create_append_path(PlannerInfo *root,
 		pathnode->path.pathkeys = child->pathkeys;
 	}
 	else
-	{
-		pathnode->path.pathkeys = NIL;	/* unsorted if more than 1 subpath */
-		cost_append(pathnode);
-	}
+		cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3736,7 +3747,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 803c23aaf5..cb6247f95a 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1680,6 +1680,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1699,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index af3f91133e..9fe9b3762b 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -109,7 +109,9 @@ typedef struct PruneStepResult
 	bool		scan_null;		/* Scan the partition for NULL values? */
 } PruneStepResult;
 
-
+static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
+								 int partkeycol,
+								 RelOptInfo *partrel);
 static List *make_partitionedrel_pruneinfo(PlannerInfo *root,
 							  RelOptInfo *parentrel,
 							  int *relid_subplan_map,
@@ -176,7 +178,140 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then we needn't take the key into consideration
+ * when checking if scanning partitions in order can't cause lower-order
+ * values to appear in later partitions.  Restriction clauses like WHERE
+ * partkeycol = constant, get turned into an EquivalenceClass containing a
+ * constant, which is recognized as redundant by build_partition_pathkeys().
+ * But if the partition column is a boolean variable (or expression), then we
+ * are not going to see WHERE partkeycol = constant, because expression
+ * preprocessing will have simplified that to "WHERE partkeycol" or
+ * "WHERE NOT partkeycol".  So we are not going to have a matching
+ * EquivalenceClass (unless the query also contains "ORDER BY partkeycol").
+ * To allow such cases to work the same as they would for non-boolean values,
+ * this function is provided to detect whether the specified partkey column
+ * matches a boolean restriction clause.
+ */
+bool
+partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)
+{
+	PartitionScheme		partscheme;
+	ListCell   *lc;
+
+	partscheme = partrel->part_scheme;
 
+	/* If the partkey isn't boolean, we can't possibly get a match */
+	if (!IsBooleanOpfamily(partscheme->partopfamily[partkeycol]))
+		return false;
+
+	/* Check each restriction clause for partrel */
+	foreach(lc, partrel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+		/* Skip pseudoconstant quals */
+		if (rinfo->pseudoconstant)
+			continue;
+
+		/* See if we can match the clause's expression to the partkey column */
+		if (matches_boolean_partition_clause(rinfo, partkeycol, partrel))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * matches_boolean_partition_clause
+ *		Determine if rinfo matches partrel's 'partkeycol' partition key
+ *		column.
+ */
+static bool
+matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
+								 RelOptInfo *partrel)
+{
+	Node	   *clause = (Node *) rinfo->clause;
+	Expr	   *partexpr = (Expr *) linitial(partrel->partexprs[partkeycol]);
+
+	/* Direct match? */
+	if (equal(partexpr, clause))
+		return true;
+	/* NOT clause? */
+	else if (is_notclause(clause))
+	{
+		Node	   *arg = (Node *) get_notclausearg((Expr *) clause);
+
+		if (equal(partexpr, arg))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in an earlier partition.  Returns
+ *		false if this is not possible, or if we have insufficient means to
+ *		prove it.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo	boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		/*
+		 * RANGE type partitions guarantee that the partitions can be scanned
+		 * in the order that they're defined in the PartitionDesc to provide
+		 * non-overlapping ranges of tuples.  We must disallow when a DEFAULT
+		 * partition exists as this could contain tuples from either below or
+		 * above the defined range, or contain tuples belonging to gaps in the
+		 * defined range.
+		 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
+
+		/*
+		 * LIST partitions can also guarantee ordering, but we'd need to
+		 * ensure that partitions don't allow interleaved values.  We could
+		 * likely check for this looking at each partition, in order, and
+		 * checking which Datums are accepted.  If we find a Datum in a
+		 * partition that's greater than one previously already seen, then
+		 * values could become out of order and we'd have to disable the
+		 * optimization.  For now, let's just keep it simple and just accept
+		 * LIST partitions without a DEFAULT partition which only accept a
+		 * single Datum per partition.  This is cheap as it does not require
+		 * any per-partition processing.  Maybe we'd like to handle more
+		 * complex cases in the future.
+		 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
 /*
  * make_partition_pruneinfo
  *		Builds a PartitionPruneInfo which can be used in the executor to allow
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 253e0b7e48..0bab42e853 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1361,6 +1361,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_APPEND(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..1bcd0e4235 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 36d12bc376..0e858097c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -194,6 +194,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index 2f75717ffb..09a9884d7c 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -74,6 +74,10 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partkey_is_bool_constant_for_query(struct RelOptInfo *partrel,
+								   int partkeycol);
+extern bool partitions_are_ordered(struct PlannerInfo *root,
+					   struct RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
 						 struct RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 7518148df0..a94f44a652 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2037,7 +2037,237 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Limit
+   ->  Append
+         ->  Sort
+               Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+               ->  Seq Scan on mcrparted0
+                     Filter: (a < 20)
+         ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+               Index Cond: (a < 20)
+(12 rows)
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+explain (costs off) select * from bool_lp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_lp_false_b_idx on bool_lp_false
+   ->  Index Only Scan using bool_lp_true_b_idx on bool_lp_true
+(3 rows)
+
+drop table bool_lp;
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_true_1k_b_a_idx on bool_rp_true_1k
+         Index Cond: (b = true)
+   ->  Index Only Scan using bool_rp_true_2k_b_a_idx on bool_rp_true_2k
+         Index Cond: (b = true)
+(5 rows)
+
+explain (costs off) select * from bool_rp where b = false order by b,a;
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_1k_b_a_idx on bool_rp_false_1k
+         Index Cond: (b = false)
+   ->  Index Only Scan using bool_rp_false_2k_b_a_idx on bool_rp_false_2k
+         Index Cond: (b = false)
+(5 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 50ca03b9e3..837a57e817 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3024,14 +3024,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3078,17 +3078,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3101,13 +3099,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3120,12 +3117,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3134,23 +3130,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..89a0f8c229 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,111 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+
+explain (costs off) select * from bool_lp order by b;
+
+drop table bool_lp;
+
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+explain (costs off) select * from bool_rp where b = false order by b,a;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index a5514c7506..227dab630d 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;

#63

Julien Rouhaud

rjuju123@gmail.com

almost 7 years ago

In reply to: David Rowley (#62)

Re: Ordered Partitioned Table Scans

On Tue, Mar 26, 2019 at 3:13 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Tue, 26 Mar 2019 at 09:02, Julien Rouhaud <rjuju123@gmail.com> wrote:

FTR this patch doesn't apply since single child [Merge]Append
suppression (8edd0e7946) has been pushed.

Thanks for letting me know. I've attached v14 based on current master.

Thanks!

So, AFAICT everything works as intended. I don't see any problem in
the code, and the special costing heuristic should avoid dramatic
plans.

A few, mostly nitpicking, comments:

+   if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+       partitions_are_ordered(root, rel))

Shouldn't the test be IS_PARTITIONED_REL(rel) instead of testing
part_scheme? I'm thinking of 1d33858406 and related discussions.

+ * partitions_are_ordered
+ *     For the partitioned table given in 'partrel', returns true if the
+ *     partitioned table guarantees that tuples which sort earlier according
+ *     to the partition bound are stored in an earlier partition.  Returns
+ *     false this is not possible, or if we have insufficient means to prove
+ *     it.
[...]
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then we needn't take the key into consideration
+ * when checking if scanning partitions in order can't cause lower-order
+ * values to appear in later partitions.

Maybe it's because I'm not a native English speaker, but I had to read
those comments multiple times. I'd also add a note about default_partition
to the partitions_are_ordered comments (even though the function is
super short).

+           if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+               return false;

There are a few overly long lines; maybe a pgindent run would be useful.

+ * build_partition_pathkeys
+ *   Build a pathkeys list that describes the ordering induced by the
+ *   partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *   table guarantees that lower order tuples never will be found in a
+ *   later partition.).  Sets *partialkeys to false if pathkeys were only
+ *   built for a prefix of the partition key, otherwise sets it to true.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+                        ScanDirection scandir, bool *partialkeys)

Maybe also add an assert on partitions_are_ordered?

And finally, should this optimisation be mentioned in ddl.sgml (or
somewhere else)?

#64

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Julien Rouhaud (#63)

1 attachment(s)

Re: Ordered Partitioned Table Scans

Thanks for having another look.

On Wed, 27 Mar 2019 at 00:22, Julien Rouhaud <rjuju123@gmail.com> wrote:

A few, mostly nitpicking, comments:
+   if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+       partitions_are_ordered(root, rel))
Shouldn't the test be IS_PARTITIONED_REL(rel) instead of testing
part_scheme? I'm thinking of 1d33858406 and related discussions.

I don't think it's really needed. There must be > 0 partitions in this
case: if there were either 0 partitions or all partitions had been
pruned, the partitioned table would have been turned into a dummy
rel in set_append_rel_size(), and we'd never try to generate paths for
it. There are also quite a number of other places in
add_paths_to_append_rel() where we do the same.

+ * partitions_are_ordered
+ *     For the partitioned table given in 'partrel', returns true if the
+ *     partitioned table guarantees that tuples which sort earlier according
+ *     to the partition bound are stored in an earlier partition.  Returns
+ *     false this is not possible, or if we have insufficient means to prove
+ *     it.
[...]
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then we needn't take the key into consideration
+ * when checking if scanning partitions in order can't cause lower-order
+ * values to appear in later partitions.

Maybe it's because I'm not a native English speaker, but I had to read
those comments multiple times.

I've changed the wording of these a bit. I ended up aligning
partkey_is_bool_constant_for_query() with its cousin
indexcol_is_bool_constant_for_query(). Previously I'd tried to make
the comment contain a bit more detail about what calls it, but I've
now removed that part and replaced it with "then it's irrelevant for
sort-order considerations".

I'd also add a note about default_partition to the
partitions_are_ordered comments (even though the function is
super short).

Hmm. The comments there do mention default partitions in each place
it's relevant. It's not relevant to mention anything about default
partitions in the header comment of the function since callers don't
need to know about implementation details. They just need details of
what the function does and what callers need to know. If we invent
some other naturally ordered partition strategy in the future that
does not allow default partitions then a comment in the function
header about default partitions would be not only irrelevant but also
confusing.

+           if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != partrel->nparts)
+               return false;
There are a few overly long lines; maybe a pgindent run would be useful.

I've run pgindent. It won't wrap that line, so I wrapped it manually.
I don't think it's any prettier for it, though.
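
For readers following along, the quoted condition can be read as "every
LIST partition accepts exactly one value (or only NULL)". What follows is
only a minimal sketch of that counting argument, not the patch's code:
ListBoundSketch and list_partitions_are_ordered_sketch() are hypothetical
names, the struct is a simplified stand-in for the real bound info, and
default partitions are deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for the bound information of a
 * LIST-partitioned table.  Not the real PostgreSQL structure.
 */
typedef struct ListBoundSketch
{
	int			ndatums;		/* distinct non-NULL values over all bounds */
	bool		accepts_nulls;	/* true if some partition accepts NULL */
} ListBoundSketch;

/*
 * Every partition accepts at least one value (a datum or NULL).  If the
 * total number of accepted values equals the number of partitions, each
 * partition must accept exactly one, so no partition can hold values
 * that interleave with those of another partition.
 */
static bool
list_partitions_are_ordered_sketch(const ListBoundSketch *bound, int nparts)
{
	assert(bound != NULL && nparts >= 0);

	return bound->ndatums + (bound->accepts_nulls ? 1 : 0) == nparts;
}
```

With two partitions holding the values 1 and 2, the counts match and the
sketch returns true. Once partitions for (3,5) and (4) are added there are
five values across four partitions, so it returns false. That lines up
with the mclparted regression test, where the plan changes from Append to
Merge Append.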

+ * build_partition_pathkeys
+ *   Build a pathkeys list that describes the ordering induced by the
+ *   partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *   table guarantees that lower order tuples never will be found in a
+ *   later partition.).  Sets *partialkeys to false if pathkeys were only
+ *   built for a prefix of the partition key, otherwise sets it to true.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+                        ScanDirection scandir, bool *partialkeys)

Maybe also add an assert on partitions_are_ordered?

Added that.

And finally, should this optimisation be mentioned in ddl.sgml (or
somewhere else)?

I'm not too sure about this. We don't generally describe planner
optimisations in detail in the docs. However, maybe it's worth users
knowing about it, as it may influence their design choices for partition
hierarchies. I'm just not sure where exactly something should be
written. I suppose ideally there'd be a section in the docs for
planner optimisations, which could contain a section on partitioned
tables that we could reference from the partitioned table docs in
ddl.sgml. That would be asking a bit much for this patch though.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

mergeappend_to_append_conversion_v15.patch (application/octet-stream)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 910a738c20..755ef43caa 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1847,6 +1847,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index da0d778721..27699a59d1 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -96,15 +96,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_dummy_rel_pathlist(RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
@@ -1551,7 +1552,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1593,7 +1594,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1643,19 +1644,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1705,7 +1706,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 
@@ -1734,44 +1735,82 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 				continue;
 
 			appendpath = create_append_path(root, rel, NIL, list_make1(path),
-											NULL, path->parallel_workers,
-											true,
-											partitioned_rels, partial_rows);
+											NIL, NULL, path->parallel_workers,
+											true, partitioned_rels,
+											partial_rows);
 			add_partial_path(rel, (Path *) appendpath);
 		}
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provides tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(root, rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+													  &partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+														   BackwardScanDirection,
+														   &partition_pathkeys_desc_partial);
+
+		/*
+		 * You might think we should truncate_useless_pathkeys here, but
+		 * allowing partition keys which are a subset of the query's pathkeys
+		 * can often be useful.  For example, a RANGE partitioned table on (a,
+		 * b), and a query with an ORDER BY a, b, c.  We can still allow an
+		 * Append scan in this case.  Imagine each partition has a btree index
+		 * on (a, b, c), scanning those indexes still provides tuples in the
+		 * correct order and using an Append in place of a MergeAppend is
+		 * still valid since lower-order (a, b) tuples will still come before
+		 * higher-order ones over all partitions.
+		 */
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1780,6 +1819,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+			(!partition_pathkeys_partial &&
+			 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+			(pathkeys_contained_in(pathkeys,
+								   partition_pathkeys_desc) ||
+			 (!partition_pathkeys_desc_partial &&
+			  pathkeys_contained_in(partition_pathkeys_desc,
+									pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1822,26 +1884,86 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * Build an Append path when in partition order.  If in reverse
+			 * partition order we build a reverse list of subpaths so that we
+			 * scan them in the opposite order.
+			 */
+			if (partition_order)
+			{
+				/*
+				 * Attempt to flatten subpaths that are themselves Appends or
+				 * MergeAppends.  We can do this providing the Append or
+				 * MergeAppend has just a single subpath.  If there are
+				 * multiple subpaths then we can't make guarantees about the
+				 * order of tuples in those subpaths, so we must leave the
+				 * Append/MergeAppend in place.
+				 */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
-														rel,
-														startup_subpaths,
-														pathkeys,
-														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
+													  rel,
+													  startup_subpaths,
+													  NIL,
+													  pathkeys,
+													  NULL,
+													  0,
+													  false,
+													  partitioned_rels,
+													  -1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
+														  rel,
+														  total_subpaths,
+														  NIL,
+														  pathkeys,
+														  NULL,
+														  0,
+														  false,
+														  partitioned_rels,
+														  -1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
 			add_path(rel, (Path *) create_merge_append_path(root,
 															rel,
-															total_subpaths,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1982,6 +2104,34 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend, or
+ *		'path' itself if it's not a single-subpath Append/MergeAppend.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -2005,7 +2155,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..f3f9c421a3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,72 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 
-		/*
-		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
-		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		if (pathkeys == NIL)
+		{
+			Path	   *subpath = (Path *) linitial(apath->subpaths);
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
-		foreach(l, apath->subpaths)
+			/*
+			 * When there are no pathkeys the startup cost of
+			 * non-parallel-aware Append is the startup cost of the first
+			 * subpath.
+			 */
+			apath->path.startup_cost = subpath->startup_cost;
+
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+
+				apath->path.rows += subpath->rows;
+				apath->path.total_cost += subpath->total_cost;
+			}
+		}
+		else
 		{
-			Path	   *subpath = (Path *) lfirst(l);
+			/*
+			 * Otherwise we make the Append's startup cost the sum of the
+			 * startup cost of all the subpaths.  It may appear like we should
+			 * just be doing the same as above and take the startup cost of
+			 * just the initial subpath, however, it is possible that when a
+			 * LIMIT clause exists in the query that we could end up favoring
+			 * these ordered Append paths too much.  Imagine a scenario where
+			 * the initial subpath is already ordered and is estimated to
+			 * contain just 10 rows and the 2nd subpath requires a sort and is
+			 * estimated to have 10 million rows, if the query has LIMIT 11
+			 * then we could end up performing an expensive sort for just a
+			 * single row without having considered the startup cost for the
+			 * 2nd subpath.  Such a scenario could end up favoring a MergeJoin
+			 * plan instead of a Hash Join plan.
+			 */
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+				{
+					/*
+					 * We'll need to insert a Sort node, so include cost for
+					 * that.
+					 */
+					cost_sort(&sort_path,
+							  root,
+							  pathkeys,
+							  subpath->total_cost,
+							  subpath->parent->tuples,
+							  subpath->pathtarget->width,
+							  0.0,
+							  work_mem,
+							  apath->limit_tuples);
+
+					subpath = &sort_path;
+				}
 
-			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+				apath->path.rows += subpath->rows;
+				apath->path.startup_cost += subpath->startup_cost;
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1957,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 9604a54b77..7044899dc1 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1264,7 +1264,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..e5580966f6 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -24,6 +24,8 @@
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
+#include "partitioning/partprune.h"
 #include "utils/lsyscache.h"
 
 
@@ -546,6 +548,77 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples never will be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	Assert(partitions_are_ordered(root, partrel));
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											  ScanDirectionIsBackward(scandir),
+											  ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			/*
+			 * Boolean partition keys might be redundant even if they do not
+			 * appear in an EquivalenceClass, because of our special treatment
+			 * of boolean equality conditions --- see the comment for
+			 * partkey_is_bool_constant_for_query().  If that applies, we can
+			 * continue to examine lower-order partition keys.  Otherwise, we
+			 * must abort and return any partial matches we've found so far.
+			 */
+			if (partkey_is_bool_constant_for_query(partrel, i))
+				continue;
+
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 979c3c212f..49f02ee1e6 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,20 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	AttrNumber *nodeSortColIdx;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1095,28 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		int			nodenumsortkeys;
+		Oid		   *nodeSortOperators;
+		Oid		   *nodeCollations;
+		bool	   *nodeNullsFirst;
+
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1126,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1201,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * won't match the parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5300,23 +5362,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e408e77d6f..02f805129f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1597,7 +1597,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -3878,6 +3879,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index eb815c2f12..65d77a5402 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 56de8fc370..5e3309955f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,6 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1251,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths when
+	 * the Append has valid pathkeys.  The order they're listed in is critical
+	 * to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1262,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1291,10 +1305,7 @@ create_append_path(PlannerInfo *root,
 		pathnode->path.pathkeys = child->pathkeys;
 	}
 	else
-	{
-		pathnode->path.pathkeys = NIL;	/* unsorted if more than 1 subpath */
-		cost_append(pathnode);
-	}
+		cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3736,7 +3747,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index bdd0d23854..630e619786 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -1680,6 +1680,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1699,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index af3f91133e..4f971a323c 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -109,7 +109,9 @@ typedef struct PruneStepResult
 	bool		scan_null;		/* Scan the partition for NULL values? */
 } PruneStepResult;
 
-
+static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
+								 int partkeycol,
+								 RelOptInfo *partrel);
 static List *make_partitionedrel_pruneinfo(PlannerInfo *root,
 							  RelOptInfo *parentrel,
 							  int *relid_subplan_map,
@@ -176,6 +178,138 @@ static bool partkey_datum_from_expr(PartitionPruneContext *context,
 						Expr *expr, int stateidx,
 						Datum *value, bool *isnull);
 
+/*
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then it's irrelevant for sort-order
+ * considerations.  Restriction clauses like WHERE partkeycol = constant, get
+ * turned into an EquivalenceClass containing a constant, which is recognized
+ * as redundant by build_partition_pathkeys().  But if the partition column is
+ * a boolean variable (or expression), then we are not going to see WHERE
+ * partkeycol = constant, because expression preprocessing will have
+ * simplified that to "WHERE partkeycol" or "WHERE NOT partkeycol".  So we are
+ * not going to have a matching EquivalenceClass (unless the query also
+ * contains "ORDER BY partkeycol").  To allow such cases to work the same as
+ * they would for non-boolean values, this function is provided to detect
+ * whether the specified partkey column matches a boolean restriction clause.
+ */
+bool
+partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)
+{
+	PartitionScheme partscheme;
+	ListCell   *lc;
+
+	partscheme = partrel->part_scheme;
+
+	/* If the partkey isn't boolean, we can't possibly get a match */
+	if (!IsBooleanOpfamily(partscheme->partopfamily[partkeycol]))
+		return false;
+
+	/* Check each restriction clause for partrel */
+	foreach(lc, partrel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+		/* Skip pseudoconstant quals */
+		if (rinfo->pseudoconstant)
+			continue;
+
+		/* See if we can match the clause's expression to the partkey column */
+		if (matches_boolean_partition_clause(rinfo, partkeycol, partrel))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * matches_boolean_partition_clause
+ *		Determine if rinfo matches partrel's 'partkeycol' partition key
+ *		column.
+ */
+static bool
+matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
+								 RelOptInfo *partrel)
+{
+	Node	   *clause = (Node *) rinfo->clause;
+	Expr	   *partexpr = (Expr *) linitial(partrel->partexprs[partkeycol]);
+
+	/* Direct match? */
+	if (equal(partexpr, clause))
+		return true;
+	/* NOT clause? */
+	else if (is_notclause(clause))
+	{
+		Node	   *arg = (Node *) get_notclausearg((Expr *) clause);
+
+		if (equal(partexpr, arg))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in a partition that comes earlier in
+ *		the relation's PartitionDesc.  Otherwise return false.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
+{
+	PartitionBoundInfo boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+			/*
+			 * RANGE type partitions guarantee that the partitions can be
+			 * scanned in the order that they're defined in the PartitionDesc
+			 * to provide non-overlapping ranges of tuples.  We must disallow
+			 * when a DEFAULT partition exists as this could contain tuples
+			 * from either below or above the defined range, or contain tuples
+			 * belonging to gaps in the defined range.
+			 */
+		case PARTITION_STRATEGY_RANGE:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
+
+			/*
+			 * LIST partitions can also guarantee ordering, but we'd need to
+			 * ensure that partitions don't allow interleaved values.  We
+			 * could likely check for this looking at each partition, in
+			 * order, and checking which Datums are accepted.  If we find a
+			 * Datum in a partition that's greater than one previously already
+			 * seen, then values could become out of order and we'd have to
+			 * disable the optimization.  For now, let's just keep it simple
+			 * and just accept LIST partitions without a DEFAULT partition
+			 * which only accept a single Datum per partition.  This is cheap
+			 * as it does not require any per-partition processing.  Maybe
+			 * we'd like to handle more complex cases in the future.
+			 */
+		case PARTITION_STRATEGY_LIST:
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo)
+				!= partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
 
 /*
  * make_partition_pruneinfo
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 253e0b7e48..0bab42e853 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1361,6 +1361,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_APPEND(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 574bb85b50..1bcd0e4235 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 36d12bc376..0e858097c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -194,6 +194,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index 2f75717ffb..09a9884d7c 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -74,6 +74,10 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+extern bool partkey_is_bool_constant_for_query(struct RelOptInfo *partrel,
+								   int partkeycol);
+extern bool partitions_are_ordered(struct PlannerInfo *root,
+					   struct RelOptInfo *partrel);
 extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
 						 struct RelOptInfo *parentrel,
 						 List *subpaths,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 7518148df0..a94f44a652 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2037,7 +2037,237 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Limit
+   ->  Append
+         ->  Sort
+               Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+               ->  Seq Scan on mcrparted0
+                     Filter: (a < 20)
+         ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+               Index Cond: (a < 20)
+(12 rows)
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+explain (costs off) select * from bool_lp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_lp_false_b_idx on bool_lp_false
+   ->  Index Only Scan using bool_lp_true_b_idx on bool_lp_true
+(3 rows)
+
+drop table bool_lp;
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_true_1k_b_a_idx on bool_rp_true_1k
+         Index Cond: (b = true)
+   ->  Index Only Scan using bool_rp_true_2k_b_a_idx on bool_rp_true_2k
+         Index Cond: (b = true)
+(5 rows)
+
+explain (costs off) select * from bool_rp where b = false order by b,a;
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_1k_b_a_idx on bool_rp_false_1k
+         Index Cond: (b = false)
+   ->  Index Only Scan using bool_rp_false_2k_b_a_idx on bool_rp_false_2k
+         Index Cond: (b = false)
+(5 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 50ca03b9e3..837a57e817 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3024,14 +3024,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3078,17 +3078,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3101,13 +3099,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3120,12 +3117,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3134,23 +3130,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..89a0f8c229 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,111 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+
+explain (costs off) select * from bool_lp order by b;
+
+drop table bool_lp;
+
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+explain (costs off) select * from bool_rp where b = false order by b,a;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index a5514c7506..227dab630d 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
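
As a reading aid for the patch above, the checks made by partitions_are_ordered() can be modeled outside the planner. The sketch below is not PostgreSQL code: ToyStrategy, ToyBoundInfo and toy_partitions_are_ordered() are hypothetical stand-ins. The bound information is reduced to the facts the patch's function consults: the strategy, whether a DEFAULT partition exists, and, for LIST, whether ndatums plus a NULL-accepting partition equals nparts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the partition strategy codes. */
typedef enum
{
    TOY_STRATEGY_HASH,
    TOY_STRATEGY_LIST,
    TOY_STRATEGY_RANGE
} ToyStrategy;

/* Hypothetical, reduced view of a partition bound description. */
typedef struct ToyBoundInfo
{
    ToyStrategy strategy;
    int         ndatums;        /* number of non-NULL bound datums */
    bool        has_default;    /* does a DEFAULT partition exist? */
    bool        accepts_nulls;  /* does some LIST partition accept NULL? */
} ToyBoundInfo;

/*
 * Returns true when scanning the partitions in bound order is expected to
 * visit tuples in partition-key order, following the rules in the patch.
 */
static bool
toy_partitions_are_ordered(const ToyBoundInfo *bound, int nparts)
{
    assert(bound != NULL);

    switch (bound->strategy)
    {
        case TOY_STRATEGY_RANGE:
            /* A DEFAULT partition may hold tuples from any gap or end. */
            return !bound->has_default;

        case TOY_STRATEGY_LIST:
            if (bound->has_default)
                return false;
            /* Only accept exactly one datum (or NULL) per partition. */
            return bound->ndatums + (bound->accepts_nulls ? 1 : 0) == nparts;

        default:
            /* HASH partitioning gives no ordering guarantee. */
            return false;
    }
}
```

Applied to the mclparted test in the patch: two single-value partitions give ndatums = 2 and nparts = 2, so the model reports the partitions as ordered. Adding the partitions for (3,5) and (4) gives ndatums = 5 and nparts = 4, so it does not, which agrees with the Merge Append in the expected output.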

#65

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

almost 7 years ago

In reply to: David Rowley (#64)

Re: Ordered Partitioned Table Scans

Hi David,

Sorry if this was discussed before, but why does this patch add any new
code to partprune.c? AFAICT, there's no functionality changes to the
pruning code.

Both

+bool
+partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)

and

+static bool
+matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
+                                 RelOptInfo *partrel)

seem like their logic is specialized enough to be confined to pathkeys.c,
only because it's needed there.
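
For readers following along, the matching that these two functions perform can be sketched in isolation. This is a minimal model, assuming a toy expression node in place of the planner's Node tree; ToyExpr, toy_equal() and toy_matches_boolean_partition_clause() are hypothetical names, not part of the patch. It reflects the patch comment's point that a qual such as "b = true" has already been simplified to "b", and "b = false" to "NOT b", by the time restriction clauses are examined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal expression node: a column reference or NOT(arg). */
typedef enum
{
    TOY_VAR,
    TOY_NOT
} ToyNodeTag;

typedef struct ToyExpr
{
    ToyNodeTag  tag;
    int         varattno;           /* column number when tag == TOY_VAR */
    const struct ToyExpr *arg;      /* operand when tag == TOY_NOT */
} ToyExpr;

/* Structural equality, standing in for the planner's equal(). */
static bool
toy_equal(const ToyExpr *a, const ToyExpr *b)
{
    if (a == NULL || b == NULL)
        return a == b;
    if (a->tag != b->tag)
        return false;
    if (a->tag == TOY_VAR)
        return a->varattno == b->varattno;
    return toy_equal(a->arg, b->arg);
}

/*
 * True if 'clause' is the partition key expression itself ("WHERE b") or
 * its negation ("WHERE NOT b"); either pins the boolean key to a constant.
 */
static bool
toy_matches_boolean_partition_clause(const ToyExpr *clause,
                                     const ToyExpr *partexpr)
{
    assert(clause != NULL && partexpr != NULL);

    /* Direct match? */
    if (toy_equal(partexpr, clause))
        return true;

    /* NOT clause wrapping the partition key? */
    if (clause->tag == TOY_NOT && toy_equal(partexpr, clause->arg))
        return true;

    return false;
}
```

In this model a clause on a different column, negated or not, does not match, so only the boolean partition key itself is treated as constant for the query.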

Regarding

+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)

I think this could simply be:

bool
partitions_are_ordered(PartitionBoundInfo *boundinfo)

and be defined in partitioning/partbounds.c. If you think any future
modifications to this will require access to the partition key info in
PartitionScheme, maybe the following is fine:

bool
partitions_are_ordered(RelOptInfo *partrel)

Thanks,
Amit

#66

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

almost 7 years ago

In reply to: Amit Langote (#65)

Re: Ordered Partitioned Table Scans

On 2019/03/27 15:48, Amit Langote wrote:

> Hi David,
>
> Sorry if this was discussed before, but why does this patch add any new
> code to partprune.c? AFAICT, there's no functionality changes to the
> pruning code.
>
> Both
> +bool
> +partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)
> and
> +static bool
> +matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
> +                                 RelOptInfo *partrel)
> seem like their logic is specialized enough to be confined to pathkeys.c,
> only because it's needed there.
>
> Regarding
> +bool
> +partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
> I think this could simply be:
>
> bool
> partitions_are_ordered(PartitionBoundInfo *boundinfo)
>
> and be defined in partitioning/partbounds.c. If you think any future
> modifications to this will require access to the partition key info in
> PartitionScheme, maybe the following is fine:
>
> bool
> partitions_are_ordered(RelOptInfo *partrel)

Noticed a typo.

+                 * multiple subpaths then we can't make guarantees about the
+                 * order tuples in those subpaths, so we must leave the

order of tuples?

Again, sorry if this was discussed, but I got curious about why
partitions_are_ordered() thinks it can say true even for an otherwise
ordered list-partitioned table, but containing a null-only partition,
which is *always* scanned last. If a query says ORDER BY partkey NULLS
FIRST, then it's not alright to proceed with assuming partitions are
ordered even if partitions_are_ordered() said so.

Related, why does build_partition_pathkeys() pass what it does for
nulls_first parameter of make_pathkey_from_sortinfo()?

cpathkey = make_pathkey_from_sortinfo(root,
                                      keyCol,
                                      NULL,
                                      partscheme->partopfamily[i],
                                      partscheme->partopcintype[i],
                                      partscheme->partcollation[i],
                                      ScanDirectionIsBackward(scandir),
                                  ==> ScanDirectionIsBackward(scandir),
                                      0,
                                      partrel->relids,
                                      false);

I think null values are almost always placed in the last partition, unless
the null-accepting list partition also accepts some other non-null value.
I'm not sure exactly how we can determine the correct value to pass here,
but what's there in the patch now doesn't seem to be it.

Thanks,
Amit

#67

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Amit Langote (#65)

Re: Ordered Partitioned Table Scans

On Wed, 27 Mar 2019 at 19:48, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Sorry if this was discussed before, but why does this patch add any new
code to partprune.c? AFAICT, there's no functionality changes to the
pruning code.

You're right. It probably shouldn't be there. There's a bit of a lack
of a good home for partition code relating to the planner it seems.

seem like their logic is specialized enough to be confined to pathkeys.c,
only because it's needed there.

Yeah maybe.

Regarding
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
I think this could simply be:

bool
partitions_are_ordered(PartitionBoundInfo *boundinfo)

and be defined in partitioning/partbounds.c. If you think any future
modifications to this will require access to the partition key info in
PartitionScheme, maybe the following is fine:

bool
partitions_are_ordered(RelOptInfo *partrel)

It does need to know how many partitions the partitioned table has,
which it gets from partrel->nparts, so yeah, RelOptInfo is probably
needed. I don't think passing in int nparts is a good solution to
that. The problem with moving it to partbounds.c is that nothing
there knows about RelOptInfo currently.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#68

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Amit Langote (#66)

Re: Ordered Partitioned Table Scans

On Wed, 27 Mar 2019 at 21:24, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Noticed a typo.

+                 * multiple subpaths then we can't make guarantees about the
+                 * order tuples in those subpaths, so we must leave the

order of tuples?

I'll fix that. Thanks.

Again, sorry if this was discussed, but I got curious about why
partitions_are_ordered() thinks it can say true even for an otherwise
ordered list-partitioned table, but containing a null-only partition,
which is *always* scanned last. If a query says ORDER BY partkey NULLS
FIRST, then it's not alright to proceed with assuming partitions are
ordered even if partitions_are_ordered() said so.

If it's *always* scanned last then it's fine for ORDER BY partkey
NULLS LAST. If they have ORDER BY partkey NULLS FIRST then we won't
match on the pathkeys.

Or if they do ORDER BY partkey DESC NULLS FIRST, then we're also fine,
since we reverse the order of the subpaths list in that case. ORDER
BY partkey DESC NULLS LAST is not okay, and we don't optimise that
since it won't match the pathkeys we generate in
build_partition_pathkeys().
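
The four ORDER BY cases discussed here can be summarised in a small standalone sketch. This is not planner code: the SortSpec struct, the PartitionScanOrder enum and both function names are hypothetical, and the only thing modelled is the rule that the forward partition pathkey is ASC NULLS LAST and the backward one is DESC NULLS FIRST (that is, nulls_first is the same flag as "backward").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one ORDER BY column; not a PostgreSQL type. */
typedef struct SortSpec
{
    bool desc;          /* true for DESC, false for ASC */
    bool nulls_first;   /* true for NULLS FIRST, false for NULLS LAST */
} SortSpec;

typedef enum PartitionScanOrder
{
    PART_ORDER_NONE,     /* no match: a Sort or MergeAppend is needed */
    PART_ORDER_FORWARD,  /* Append over partitions in bound order */
    PART_ORDER_BACKWARD  /* Append over the reversed partition list */
} PartitionScanOrder;

/*
 * Model of the ordering the patch generates for a scan direction.  The
 * NULL partition is listed last, so nulls_first is just "is backward".
 */
SortSpec
partition_sort_spec(bool backward)
{
    SortSpec spec;

    spec.desc = backward;
    spec.nulls_first = backward;
    return spec;
}

/* Decide whether a requested ordering matches either partition ordering. */
PartitionScanOrder
match_partition_order(const SortSpec *wanted)
{
    SortSpec fwd = partition_sort_spec(false);
    SortSpec bwd = partition_sort_spec(true);

    assert(wanted != NULL);

    if (wanted->desc == fwd.desc && wanted->nulls_first == fwd.nulls_first)
        return PART_ORDER_FORWARD;
    if (wanted->desc == bwd.desc && wanted->nulls_first == bwd.nulls_first)
        return PART_ORDER_BACKWARD;
    return PART_ORDER_NONE;
}
```

In this model ASC NULLS LAST maps to a forward Append, DESC NULLS FIRST to a backward Append, and the two mixed combinations fall through to PART_ORDER_NONE.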

Related, why does build_partition_pathkeys() pass what it does for
nulls_first parameter of make_pathkey_from_sortinfo()?

cpathkey = make_pathkey_from_sortinfo(root,
                                      keyCol,
                                      NULL,
                                      partscheme->partopfamily[i],
                                      partscheme->partopcintype[i],
                                      partscheme->partcollation[i],
                                      ScanDirectionIsBackward(scandir),
                                  ==> ScanDirectionIsBackward(scandir),
                                      0,
                                      partrel->relids,
                                      false);

I think null values are almost always placed in the last partition, unless
the null-accepting list partition also accepts some other non-null value.
I'm not sure exactly how we can determine the correct value to pass here,
but what's there in the patch now doesn't seem to be it.

The code looks okay to me. It'll generate pathkeys like ORDER BY
partkey NULLS LAST for forward scans and ORDER BY partkey DESC NULLS
FIRST for backwards scans.

Can you explain what cases you think the code gets wrong?

Here's a preview of the actual and expected behaviour:

# explain (costs off) select * from listp order by a asc nulls last;
                         QUERY PLAN
------------------------------------------------------------
 Append
   ->  Index Only Scan using listp1_a_idx on listp1
   ->  Index Only Scan using listp2_a_idx on listp2
   ->  Index Only Scan using listp_null_a_idx on listp_null
(4 rows)

# explain (costs off) select * from listp order by a asc nulls first;
-- not optimised
             QUERY PLAN
------------------------------------
 Sort
   Sort Key: listp1.a NULLS FIRST
   ->  Append
         ->  Seq Scan on listp1
         ->  Seq Scan on listp2
         ->  Seq Scan on listp_null
(6 rows)

# explain (costs off) select * from listp order by a desc nulls first;
-- subpath list is simply reversed in this case.
                             QUERY PLAN
---------------------------------------------------------------------
 Append
   ->  Index Only Scan Backward using listp_null_a_idx on listp_null
   ->  Index Only Scan Backward using listp2_a_idx on listp2
   ->  Index Only Scan Backward using listp1_a_idx on listp1
(4 rows)

# explain (costs off) select * from listp order by a desc nulls last;
-- not optimised
              QUERY PLAN
--------------------------------------
 Sort
   Sort Key: listp1.a DESC NULLS LAST
   ->  Append
         ->  Seq Scan on listp1
         ->  Seq Scan on listp2
         ->  Seq Scan on listp_null
(6 rows)

We could likely improve the two cases that are not optimized by
putting the NULL partition in the correct place in the append
subpaths, but for now, we don't really have an efficient means to
identify which subpath that is. I've not looked at your partition
planning improvements patch for a while to see if you're storing a
Bitmapset of the non-pruned partitions in RelOptInfo. Something like
that would allow us to make this better. Julien and I have talked
about other possible cases to optimise if we have that, e.g. if the
default partition is pruned then we can optimise a RANGE partitioned
table with a default. So there's definitely more to be done on this. I
think there's a general consensus that what we're doing in the patch
already is enough to be useful.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#69

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

almost 7 years ago

In reply to: David Rowley (#67)

Re: Ordered Partitioned Table Scans

Hi,

On 2019/03/28 7:29, David Rowley wrote:

On Wed, 27 Mar 2019 at 19:48, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Sorry if this was discussed before, but why does this patch add any new
code to partprune.c? AFAICT, there's no functionality changes to the
pruning code.

You're right. It probably shouldn't be there. There's a bit of a lack
of a good home for partition code relating to the planner it seems.

partprune.c is outside the optimizer sub-directory, but exports
planning-related functions like prune_append_rel_partitions(),
make_partition_pruneinfo(), etc.

Similarly, partbounds.c can grow a bit of planning-related functionality.

Regarding
+bool
+partitions_are_ordered(PlannerInfo *root, RelOptInfo *partrel)
I think this could simply be:

bool
partitions_are_ordered(PartitionBoundInfo *boundinfo)

and be defined in partitioning/partbounds.c. If you think any future
modifications to this will require access to the partition key info in
PartitionScheme, maybe the following is fine:

bool
partitions_are_ordered(RelOptInfo *partrel)

It does need to know how many partitions the partitioned table has,
which it gets from partrel->nparts, so yeah, RelOptInfo is probably
needed. I don't think passing in int nparts is a good solution to
that. The problem with moving it to partbounds.c is that nothing
there knows about RelOptInfo currently.

Maybe, this could be a start. Also, there is a patch in nearby thread
which adds additional functionality to partbounds.c to be used by
partitionwise join code in the optimizer [1].

Thanks,
Amit

[1]: https://commitfest.postgresql.org/22/1553/

#70

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Amit Langote (#69)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On Thu, 28 Mar 2019 at 14:34, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2019/03/28 7:29, David Rowley wrote:

On Wed, 27 Mar 2019 at 19:48, Amit Langote
It does need to know how many partitions the partitioned table has,
which it gets from partrel->nparts, so yeah, RelOptInfo is probably
needed. I don't think passing in int nparts is a good solution to
that. The problem with moving it to partbounds.c is that nothing
there knows about RelOptInfo currently.

Maybe, this could be a start. Also, there is a patch in nearby thread
which adds additional functionality to partbounds.c to be used by
partitionwise join code in the optimizer [1].

Thanks for the review. I've attached a patch that mostly just moved
the code around.

I also changed the comment in build_partition_pathkeys() to explain
about the nulls_first argument and why I just pass
ScanDirectionIsBackward(scandir).

I also added a comment to struct PlannerInfo mentioning the guarantee
that partitions which come earlier in PartitionDesc also appear earlier
in append_rel_list.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

mergeappend_to_append_conversion_v16.patch (application/octet-stream)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 910a738c20..755ef43caa 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1847,6 +1847,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 56a5084312..a1677a125d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -44,6 +44,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_clause.h"
 #include "parser/parsetree.h"
+#include "partitioning/partbounds.h"
 #include "partitioning/partprune.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -96,15 +97,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_dummy_rel_pathlist(RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
@@ -1551,7 +1553,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1593,7 +1595,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1643,19 +1645,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1705,7 +1707,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 
@@ -1734,44 +1736,82 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 				continue;
 
 			appendpath = create_append_path(root, rel, NIL, list_make1(path),
-											NULL, path->parallel_workers,
-											true,
-											partitioned_rels, partial_rows);
+											NIL, NULL, path->parallel_workers,
+											true, partitioned_rels,
+											partial_rows);
 			add_partial_path(rel, (Path *) appendpath);
 		}
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provide tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+													  &partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+														   BackwardScanDirection,
+														   &partition_pathkeys_desc_partial);
+
+		/*
+		 * You might think we should truncate_useless_pathkeys here, but
+		 * allowing partition keys which are a subset of the query's pathkeys
+		 * can often be useful.  For example, a RANGE partitioned table on (a,
+		 * b), and a query with an ORDER BY a, b, c.  We can still allow an
+		 * Append scan in this case.  Imagine each partition has a btree index
+		 * on (a, b, c), scanning those indexes still provides tuples in the
+		 * correct order and using an Append in place of a MergeAppend is
+		 * still valid since lower-order  (a, b) tuples will still come before
+		 * higher-order ones over all partitions.
+		 */
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1780,6 +1820,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+			(!partition_pathkeys_partial &&
+			 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+			(pathkeys_contained_in(pathkeys,
+								   partition_pathkeys_desc) ||
+			 (!partition_pathkeys_desc_partial &&
+			  pathkeys_contained_in(partition_pathkeys_desc,
+									pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1822,26 +1885,86 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * Build an Append path when in partition order.  If in reverse
+			 * partition order we build a reverse list of subpaths so that we
+			 * scan them in the opposite order.
+			 */
+			if (partition_order)
+			{
+				/*
+				 * Attempt to flatten subpaths that are themselves Appends or
+				 * MergeAppends.  We can do this providing the Append or
+				 * MergeAppend has just a single subpath.  If there are
+				 * multiple subpaths then we can't make guarantees about the
+				 * order tuples in those subpaths, so we must leave the
+				 * Append/MergeAppend in place.
+				 */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
-														rel,
-														startup_subpaths,
-														pathkeys,
-														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
+													  rel,
+													  startup_subpaths,
+													  NIL,
+													  pathkeys,
+													  NULL,
+													  0,
+													  false,
+													  partitioned_rels,
+													  -1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
+														  rel,
+														  total_subpaths,
+														  NIL,
+														  pathkeys,
+														  NULL,
+														  0,
+														  false,
+														  partitioned_rels,
+														  -1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
 			add_path(rel, (Path *) create_merge_append_path(root,
 															rel,
-															total_subpaths,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1982,6 +2105,36 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		return 'path' if it's not a single sub-path Append/MergeAppend.
+ *
+ * Note: 'path' must not be a parallel aware path.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -2005,7 +2158,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..f3f9c421a3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,72 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 
-		/*
-		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
-		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		if (pathkeys == NIL)
+		{
+			Path	   *subpath = (Path *) linitial(apath->subpaths);
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
-		foreach(l, apath->subpaths)
+			/*
+			 * When there are no pathkeys the startup cost of
+			 * non-parallel-aware Append is the startup cost of the first
+			 * subpath.
+			 */
+			apath->path.startup_cost = subpath->startup_cost;
+
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+
+				apath->path.rows += subpath->rows;
+				apath->path.total_cost += subpath->total_cost;
+			}
+		}
+		else
 		{
-			Path	   *subpath = (Path *) lfirst(l);
+			/*
+			 * Otherwise we make the Append's startup cost the sum of the
+			 * startup cost of all the subpaths.  It may appear like we should
+			 * just be doing the same as above and take the startup cost of
+			 * just the initial subpath, however, it is possible that when a
+			 * LIMIT clause exists in the query that we could end up favoring
+			 * these ordered Append paths too much.  Imagine a scenario where
+			 * the initial subpath is already ordered and is estimated to
+			 * contain just 10 rows and the 2nd subpath requires a sort and is
+			 * estimated to have 10 million rows, if the query has LIMIT 11
+			 * then we could end up performing an expensive sort for just a
+			 * single row without having considered the startup cost for the
+			 * 2nd subpath.  Such a scenario could end up favoring a MergeJoin
+			 * plan instead of a Hash Join plan.
+			 */
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+				{
+					/*
+					 * We'll need to insert a Sort node, so include cost for
+					 * that.
+					 */
+					cost_sort(&sort_path,
+							  root,
+							  pathkeys,
+							  subpath->total_cost,
+							  subpath->parent->tuples,
+							  subpath->pathtarget->width,
+							  0.0,
+							  work_mem,
+							  apath->limit_tuples);
+
+					subpath = &sort_path;
+				}
 
-			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+				apath->path.rows += subpath->rows;
+				apath->path.startup_cost += subpath->startup_cost;
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1957,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 9604a54b77..7044899dc1 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1264,7 +1264,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..aef11e0832 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -18,16 +18,20 @@
 #include "postgres.h"
 
 #include "access/stratnum.h"
+#include "catalog/pg_opfamily.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
+static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
+								 int partkeycol, RelOptInfo *partrel);
 static bool right_merge_direction(PlannerInfo *root, PathKey *pathkey);
 
 
@@ -546,6 +550,153 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then it's irrelevant for sort-order
+ * considerations.  Restriction clauses like WHERE partkeycol = constant, get
+ * turned into an EquivalenceClass containing a constant, which is recognized
+ * as redundant by build_partition_pathkeys().  But if the partition column is
+ * a boolean variable (or expression), then we are not going to see WHERE
+ * partkeycol = constant, because expression preprocessing will have
+ * simplified that to "WHERE partkeycol" or "WHERE NOT partkeycol".  So we are
+ * not going to have a matching EquivalenceClass (unless the query also
+ * contains "ORDER BY partkeycol").  To allow such cases to work the same as
+ * they would for non-boolean values, this function is provided to detect
+ * whether the specified partkey column matches a boolean restriction clause.
+ */
+static bool
+partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)
+{
+	PartitionScheme partscheme;
+	ListCell   *lc;
+
+	partscheme = partrel->part_scheme;
+
+	/* If the partkey isn't boolean, we can't possibly get a match */
+	if (!IsBooleanOpfamily(partscheme->partopfamily[partkeycol]))
+		return false;
+
+	/* Check each restriction clause for partrel */
+	foreach(lc, partrel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *)lfirst(lc);
+
+		/* Skip pseudoconstant quals */
+		if (rinfo->pseudoconstant)
+			continue;
+
+		/* See if we can match the clause's expression to the partkey column */
+		if (matches_boolean_partition_clause(rinfo, partkeycol, partrel))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * matches_boolean_partition_clause
+ *		Determine if rinfo matches partrel's 'partkeycol' partition key
+ *		column.
+ */
+static bool
+matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
+	RelOptInfo *partrel)
+{
+	Node	   *clause = (Node *)rinfo->clause;
+	Expr	   *partexpr = (Expr *)linitial(partrel->partexprs[partkeycol]);
+
+	/* Direct match? */
+	if (equal(partexpr, clause))
+		return true;
+	/* NOT clause? */
+	else if (is_notclause(clause))
+	{
+		Node	   *arg = (Node *)get_notclausearg((Expr *)clause);
+
+		if (equal(partexpr, arg))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples never will be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	Assert(partitions_are_ordered(partrel));
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 * A PartitionDesc always lists any NULL partition last, so we can
+		 * simply pass the ScanDirectionIsBackward(scandir) for nulls_first
+		 * since NULLS FIRST is the default for DESC, and NULLS LAST is the
+		 * default for ASC sort orders.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											  ScanDirectionIsBackward(scandir),
+											  ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			/*
+			 * Boolean partition keys might be redundant even if they do not
+			 * appear in an EquivalenceClass, because of our special treatment
+			 * of boolean equality conditions --- see the comment for
+			 * partkey_is_bool_constant_for_query().  If that applies, we can
+			 * continue to examine lower-order partition keys.  Otherwise, we
+			 * must abort and return any partial matches we've found so far.
+			 */
+			if (partkey_is_bool_constant_for_query(partrel, i))
+				continue;
+
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 979c3c212f..49f02ee1e6 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,20 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	AttrNumber *nodeSortColIdx;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1095,28 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		int			nodenumsortkeys;
+		Oid		   *nodeSortOperators;
+		Oid		   *nodeCollations;
+		bool	   *nodeNullsFirst;
+
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1126,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1201,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * won't match the parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5300,23 +5362,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ca7a0fbbf5..30fb9b9c14 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1593,7 +1593,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -3866,6 +3867,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index eb815c2f12..65d77a5402 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 56de8fc370..5e3309955f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,6 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1251,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths when
+	 * the Append has valid pathkeys.  The order they're listed in is critical
+	 * to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1262,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1291,10 +1305,7 @@ create_append_path(PlannerInfo *root,
 		pathnode->path.pathkeys = child->pathkeys;
 	}
 	else
-	{
-		pathnode->path.pathkeys = NIL;	/* unsorted if more than 1 subpath */
-		cost_append(pathnode);
-	}
+		cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3736,7 +3747,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index bdd0d23854..d5ce7079d4 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -25,6 +25,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
 #include "parser/parse_coerce.h"
 #include "partitioning/partbounds.h"
 #include "partitioning/partdesc.h"
@@ -861,6 +862,69 @@ partition_bounds_copy(PartitionBoundInfo src,
 	return dest;
 }
 
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in a partition that comes earlier in
+ *		the relation's PartitionDesc.  Otherwise return false.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(RelOptInfo *partrel)
+{
+	PartitionBoundInfo boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		case PARTITION_STRATEGY_RANGE:
+			/*
+			 * RANGE type partitions guarantee that the partitions can be
+			 * scanned in the order that they're defined in the PartitionDesc
+			 * to provide non-overlapping ranges of tuples.  We must disallow
+			 * this when a DEFAULT partition exists, as it could contain tuples
+			 * from either below or above the defined range, or contain tuples
+			 * belonging to gaps in the defined range.
+			 */
+
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
+
+		case PARTITION_STRATEGY_LIST:
+			/*
+			 * LIST partitions can also guarantee ordering, but we'd need to
+			 * ensure that partitions don't allow interleaved values.  We
+			 * could likely check for this looking at each partition, in
+			 * order, and checking which Datums are accepted.  If we find a
+			 * Datum in a partition that's lower than one previously already
+			 * seen, then values could become out of order and we'd have to
+			 * disable the optimization.  For now, let's just keep it simple
+			 * and just accept LIST partitions without a DEFAULT partition
+			 * which only accept a single Datum per partition.  This is cheap
+			 * as it does not require any per-partition processing.  Maybe
+			 * we'd like to handle more complex cases in the future.
+			 */
+
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo)
+				!= partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * check_new_partition_bound
  *
@@ -1680,6 +1744,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1763,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 88c8973f3c..426d7ec3aa 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -280,7 +280,13 @@ struct PlannerInfo
 
 	List	   *join_info_list; /* list of SpecialJoinInfos */
 
-	List	   *append_rel_list;	/* list of AppendRelInfos */
+	/*
+	 * list of AppendRelInfos.  For AppendRelInfos belonging to partitions of
+	 * a partitioned table, this list guarantees that partitions that come
+	 * earlier in the partitioned table's PartitionDesc will come earlier in
+	 * this list.
+	 */
+	List	   *append_rel_list;
 
 	List	   *rowMarks;		/* list of PlanRowMarks */
 
@@ -1366,6 +1372,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_APPEND(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9e79e1cd63..e0c7aa5f23 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 36d12bc376..0e858097c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -194,6 +194,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index 683e1574ea..526f1e5e77 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -88,6 +88,7 @@ extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
 					   PartitionBoundInfo b2);
 extern PartitionBoundInfo partition_bounds_copy(PartitionBoundInfo src,
 					  PartitionKey key);
+extern bool partitions_are_ordered(struct RelOptInfo *partrel);
 extern void check_new_partition_bound(char *relname, Relation parent,
 						  PartitionBoundSpec *spec);
 extern void check_default_partition_contents(Relation parent,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 7518148df0..a94f44a652 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2037,7 +2037,237 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Limit
+   ->  Append
+         ->  Sort
+               Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+               ->  Seq Scan on mcrparted0
+                     Filter: (a < 20)
+         ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+               Index Cond: (a < 20)
+(12 rows)
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+explain (costs off) select * from bool_lp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_lp_false_b_idx on bool_lp_false
+   ->  Index Only Scan using bool_lp_true_b_idx on bool_lp_true
+(3 rows)
+
+drop table bool_lp;
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_true_1k_b_a_idx on bool_rp_true_1k
+         Index Cond: (b = true)
+   ->  Index Only Scan using bool_rp_true_2k_b_a_idx on bool_rp_true_2k
+         Index Cond: (b = true)
+(5 rows)
+
+explain (costs off) select * from bool_rp where b = false order by b,a;
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_1k_b_a_idx on bool_rp_false_1k
+         Index Cond: (b = false)
+   ->  Index Only Scan using bool_rp_false_2k_b_a_idx on bool_rp_false_2k
+         Index Cond: (b = false)
+(5 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 50ca03b9e3..837a57e817 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3024,14 +3024,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3078,17 +3078,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3101,13 +3099,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3120,12 +3117,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3134,23 +3130,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..89a0f8c229 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,111 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+
+explain (costs off) select * from bool_lp order by b;
+
+drop table bool_lp;
+
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+explain (costs off) select * from bool_rp where b = false order by b,a;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index a5514c7506..227dab630d 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
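
The range_parted regression test in the patch above (partitioned by RANGE on (a, b), queried with ORDER BY a, b, c) relies on a prefix check between the query's sort keys and the partition keys. The following is a minimal standalone sketch of that decision, not the planner code itself. KeyList, keys_contained_in and can_use_ordered_append are hypothetical names used only for illustration, and plain integers stand in for sort columns; the real planner compares PathKey lists with pathkeys_contained_in().

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a pathkey list: an array of column ids,
 * most significant sort column first.
 */
typedef struct KeyList
{
	const int  *keys;
	size_t		nkeys;
} KeyList;

/*
 * Returns true if 'a' is equal to, or a leading prefix of, 'b'.  This models
 * the sense in which one ordering is "contained in" another: anything sorted
 * by 'b' is also sorted by 'a'.
 */
static bool
keys_contained_in(KeyList a, KeyList b)
{
	if (a.nkeys > b.nkeys)
		return false;

	for (size_t i = 0; i < a.nkeys; i++)
	{
		if (a.keys[i] != b.keys[i])
			return false;
	}
	return true;
}

/*
 * Decide whether a plain Append over partitions scanned in bound order can
 * satisfy the query ordering.  Either the query keys are a prefix of the
 * partition keys, or the partition keys are a prefix of the query keys and
 * are complete (part_keys_partial is false), so no partition can hold a
 * tuple that sorts before a tuple in an earlier partition.
 */
static bool
can_use_ordered_append(KeyList query_keys, KeyList part_keys,
					   bool part_keys_partial)
{
	return keys_contained_in(query_keys, part_keys) ||
		(!part_keys_partial && keys_contained_in(part_keys, query_keys));
}
```

With columns numbered a = 1, b = 2, c = 3, a query ordering of {1, 2, 3} against partition keys {1, 2} is accepted only when the partition keys are complete, which corresponds to the Append plans expected for range_parted. An ordering that matches neither way, such as {2, 1}, would fall back to MergeAppend.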

#71

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: David Rowley (#70)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On Thu, 28 Mar 2019 at 15:40, David Rowley <david.rowley@2ndquadrant.com> wrote:

> Thanks for the review. I've attached a patch that mostly just moved
> the code around.

Here's one that fixes up the compiler warning from the last one.
Thanks CF bot...

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

mergeappend_to_append_conversion_v17.patch (application/octet-stream)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 910a738c20..755ef43caa 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1847,6 +1847,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 56a5084312..a1677a125d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -44,6 +44,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_clause.h"
 #include "parser/parsetree.h"
+#include "partitioning/partbounds.h"
 #include "partitioning/partprune.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -96,15 +97,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_dummy_rel_pathlist(RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
@@ -1551,7 +1553,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1593,7 +1595,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1643,19 +1645,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1705,7 +1707,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 
@@ -1734,44 +1736,82 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 				continue;
 
 			appendpath = create_append_path(root, rel, NIL, list_make1(path),
-											NULL, path->parallel_workers,
-											true,
-											partitioned_rels, partial_rows);
+											NIL, NULL, path->parallel_workers,
+											true, partitioned_rels,
+											partial_rows);
 			add_partial_path(rel, (Path *) appendpath);
 		}
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provide tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+													  &partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+														   BackwardScanDirection,
+														   &partition_pathkeys_desc_partial);
+
+		/*
+		 * You might think we should truncate_useless_pathkeys here, but
+		 * allowing partition keys which are a subset of the query's pathkeys
+		 * can often be useful.  For example, a RANGE partitioned table on (a,
+		 * b), and a query with an ORDER BY a, b, c.  We can still allow an
+		 * Append scan in this case.  Imagine each partition has a btree index
+		 * on (a, b, c), scanning those indexes still provides tuples in the
+		 * correct order and using an Append in place of a MergeAppend is
+		 * still valid since lower-order (a, b) tuples will still come before
+		 * higher-order ones over all partitions.
+		 */
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1780,6 +1820,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+			(!partition_pathkeys_partial &&
+			 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+			(pathkeys_contained_in(pathkeys,
+								   partition_pathkeys_desc) ||
+			 (!partition_pathkeys_desc_partial &&
+			  pathkeys_contained_in(partition_pathkeys_desc,
+									pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1822,26 +1885,86 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * Build an Append path when in partition order.  If in reverse
+			 * partition order we build a reverse list of subpaths so that we
+			 * scan them in the opposite order.
+			 */
+			if (partition_order)
+			{
+				/*
+				 * Attempt to flatten subpaths that are themselves Appends or
+				 * MergeAppends.  We can do this providing the Append or
+				 * MergeAppend has just a single subpath.  If there are
+				 * multiple subpaths then we can't make guarantees about the
+				 * order of tuples in those subpaths, so we must leave the
+				 * Append/MergeAppend in place.
+				 */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
-														rel,
-														startup_subpaths,
-														pathkeys,
-														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
+													  rel,
+													  startup_subpaths,
+													  NIL,
+													  pathkeys,
+													  NULL,
+													  0,
+													  false,
+													  partitioned_rels,
+													  -1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
+														  rel,
+														  total_subpaths,
+														  NIL,
+														  pathkeys,
+														  NULL,
+														  0,
+														  false,
+														  partitioned_rels,
+														  -1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
 			add_path(rel, (Path *) create_merge_append_path(root,
 															rel,
-															total_subpaths,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1982,6 +2105,36 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		returns 'path' if it's not a single sub-path Append/MergeAppend.
+ *
+ * Note: 'path' must not be a parallel aware path.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -2005,7 +2158,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..f3f9c421a3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,72 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 
-		/*
-		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
-		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		if (pathkeys == NIL)
+		{
+			Path	   *subpath = (Path *) linitial(apath->subpaths);
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
-		foreach(l, apath->subpaths)
+			/*
+			 * When there are no pathkeys the startup cost of
+			 * non-parallel-aware Append is the startup cost of the first
+			 * subpath.
+			 */
+			apath->path.startup_cost = subpath->startup_cost;
+
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+
+				apath->path.rows += subpath->rows;
+				apath->path.total_cost += subpath->total_cost;
+			}
+		}
+		else
 		{
-			Path	   *subpath = (Path *) lfirst(l);
+			/*
+			 * Otherwise we make the Append's startup cost the sum of the
+			 * startup cost of all the subpaths.  It may appear like we should
+			 * just be doing the same as above and take the startup cost of
+			 * just the initial subpath, however, it is possible that when a
+			 * LIMIT clause exists in the query that we could end up favoring
+			 * these ordered Append paths too much.  Imagine a scenario where
+			 * the initial subpath is already ordered and is estimated to
+			 * contain just 10 rows and the 2nd subpath requires a sort and is
+			 * estimated to have 10 million rows, if the query has LIMIT 11
+			 * then we could end up performing an expensive sort for just a
+			 * single row without having considered the startup cost for the
+			 * 2nd subpath.  Such a scenario could end up favoring a MergeJoin
+			 * plan instead of a Hash Join plan.
+			 */
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+				{
+					/*
+					 * We'll need to insert a Sort node, so include cost for
+					 * that.
+					 */
+					cost_sort(&sort_path,
+							  root,
+							  pathkeys,
+							  subpath->total_cost,
+							  subpath->parent->tuples,
+							  subpath->pathtarget->width,
+							  0.0,
+							  work_mem,
+							  apath->limit_tuples);
+
+					subpath = &sort_path;
+				}
 
-			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+				apath->path.rows += subpath->rows;
+				apath->path.startup_cost += subpath->startup_cost;
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1957,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 9604a54b77..7044899dc1 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1264,7 +1264,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..aef11e0832 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -18,16 +18,20 @@
 #include "postgres.h"
 
 #include "access/stratnum.h"
+#include "catalog/pg_opfamily.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
+static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
+								 int partkeycol, RelOptInfo *partrel);
 static bool right_merge_direction(PlannerInfo *root, PathKey *pathkey);
 
 
@@ -546,6 +550,153 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then it's irrelevant for sort-order
+ * considerations.  Restriction clauses like WHERE partkeycol = constant get
+ * turned into an EquivalenceClass containing a constant, which is recognized
+ * as redundant by build_partition_pathkeys().  But if the partition column is
+ * a boolean variable (or expression), then we are not going to see WHERE
+ * partkeycol = constant, because expression preprocessing will have
+ * simplified that to "WHERE partkeycol" or "WHERE NOT partkeycol".  So we are
+ * not going to have a matching EquivalenceClass (unless the query also
+ * contains "ORDER BY partkeycol").  To allow such cases to work the same as
+ * they would for non-boolean values, this function is provided to detect
+ * whether the specified partkey column matches a boolean restriction clause.
+ */
+static bool
+partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)
+{
+	PartitionScheme partscheme;
+	ListCell   *lc;
+
+	partscheme = partrel->part_scheme;
+
+	/* If the partkey isn't boolean, we can't possibly get a match */
+	if (!IsBooleanOpfamily(partscheme->partopfamily[partkeycol]))
+		return false;
+
+	/* Check each restriction clause for partrel */
+	foreach(lc, partrel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *)lfirst(lc);
+
+		/* Skip pseudoconstant quals */
+		if (rinfo->pseudoconstant)
+			continue;
+
+		/* See if we can match the clause's expression to the partkey column */
+		if (matches_boolean_partition_clause(rinfo, partkeycol, partrel))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * matches_boolean_partition_clause
+ *		Determine if rinfo matches partrel's 'partkeycol' partition key
+ *		column.
+ */
+static bool
+matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
+	RelOptInfo *partrel)
+{
+	Node	   *clause = (Node *)rinfo->clause;
+	Expr	   *partexpr = (Expr *)linitial(partrel->partexprs[partkeycol]);
+
+	/* Direct match? */
+	if (equal(partexpr, clause))
+		return true;
+	/* NOT clause? */
+	else if (is_notclause(clause))
+	{
+		Node	   *arg = (Node *)get_notclausearg((Expr *)clause);
+
+		if (equal(partexpr, arg))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples will never be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	Assert(partitions_are_ordered(partrel));
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 * A PartitionDesc always lists any NULL partition last, so we can
+		 * simply pass ScanDirectionIsBackward(scandir) for nulls_first
+		 * since NULLS FIRST is the default for DESC, and NULLS LAST is the
+		 * default for ASC sort orders.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											  ScanDirectionIsBackward(scandir),
+											  ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			/*
+			 * Boolean partition keys might be redundant even if they do not
+			 * appear in an EquivalenceClass, because of our special treatment
+			 * of boolean equality conditions --- see the comment for
+			 * partkey_is_bool_constant_for_query().  If that applies, we can
+			 * continue to examine lower-order partition keys.  Otherwise, we
+			 * must abort and return any partial matches we've found so far.
+			 */
+			if (partkey_is_bool_constant_for_query(partrel, i))
+				continue;
+
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 979c3c212f..49f02ee1e6 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,20 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	AttrNumber *nodeSortColIdx;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1095,28 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		int			nodenumsortkeys;
+		Oid		   *nodeSortOperators;
+		Oid		   *nodeCollations;
+		bool	   *nodeNullsFirst;
+
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1126,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1201,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * won't match the parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5300,23 +5362,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ca7a0fbbf5..30fb9b9c14 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1593,7 +1593,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -3866,6 +3867,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index eb815c2f12..65d77a5402 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 56de8fc370..5e3309955f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,6 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1251,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths when
+	 * the Append has valid pathkeys.  The order they're listed in is critical
+	 * to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1262,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1291,10 +1305,7 @@ create_append_path(PlannerInfo *root,
 		pathnode->path.pathkeys = child->pathkeys;
 	}
 	else
-	{
-		pathnode->path.pathkeys = NIL;	/* unsorted if more than 1 subpath */
-		cost_append(pathnode);
-	}
+		cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3736,7 +3747,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index bdd0d23854..87c0c30988 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -25,6 +25,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
 #include "parser/parse_coerce.h"
 #include "partitioning/partbounds.h"
 #include "partitioning/partdesc.h"
@@ -861,6 +862,69 @@ partition_bounds_copy(PartitionBoundInfo src,
 	return dest;
 }
 
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that tuples which sort earlier according
+ *		to the partition bound are stored in a partition that comes earlier in
+ *		the relation's PartitionDesc.  Otherwise return false.
+ *
+ * This assumes nothing about the order of tuples inside the actual
+ * partitions.
+ */
+bool
+partitions_are_ordered(struct RelOptInfo *partrel)
+{
+	PartitionBoundInfo boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		case PARTITION_STRATEGY_RANGE:
+			/*
+			 * RANGE type partitions guarantee that the partitions can be
+			 * scanned in the order that they're defined in the PartitionDesc
+			 * to provide non-overlapping ranges of tuples.  We must disallow
+			 * this when a DEFAULT partition exists, as it could contain
+			 * tuples from either below or above the defined range, or tuples
+			 * belonging to gaps in the defined range.
+			 */
+
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
+
+		case PARTITION_STRATEGY_LIST:
+			/*
+			 * LIST partitions can also guarantee ordering, but we'd need to
+			 * ensure that partitions don't allow interleaved values.  We
+			 * could likely check for this by looking at each partition, in
+			 * order, and checking which Datums are accepted.  If we find a
+			 * Datum in a partition that's greater than one previously seen,
+			 * then values could become out of order and we'd have to
+			 * disable the optimization.  For now, let's keep it simple and
+			 * only accept LIST partitions without a DEFAULT partition
+			 * which only accept a single Datum per partition.  This is cheap
+			 * as it does not require any per-partition processing.  Maybe
+			 * we'd like to handle more complex cases in the future.
+			 */
+
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo)
+				!= partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * check_new_partition_bound
  *
@@ -1680,6 +1744,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1763,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 88c8973f3c..426d7ec3aa 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -280,7 +280,13 @@ struct PlannerInfo
 
 	List	   *join_info_list; /* list of SpecialJoinInfos */
 
-	List	   *append_rel_list;	/* list of AppendRelInfos */
+	/*
+	 * list of AppendRelInfos.  For AppendRelInfos belonging to partitions of
+	 * a partitioned table, this list guarantees that partitions that come
+	 * earlier in the partitioned table's PartitionDesc will come earlier in
+	 * this list.
+	 */
+	List	   *append_rel_list;
 
 	List	   *rowMarks;		/* list of PlanRowMarks */
 
@@ -1366,6 +1372,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_APPEND(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9e79e1cd63..e0c7aa5f23 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 36d12bc376..0e858097c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -194,6 +194,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index 683e1574ea..215a87f59b 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -75,6 +75,8 @@ typedef struct PartitionBoundInfoData
 #define partition_bound_accepts_nulls(bi) ((bi)->null_index != -1)
 #define partition_bound_has_default(bi) ((bi)->default_index != -1)
 
+struct RelOptInfo;
+
 extern int	get_hash_partition_greatest_modulus(PartitionBoundInfo b);
 extern uint64 compute_partition_hash_value(int partnatts, FmgrInfo *partsupfunc,
 							 Oid *partcollation,
@@ -88,6 +90,7 @@ extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
 					   PartitionBoundInfo b2);
 extern PartitionBoundInfo partition_bounds_copy(PartitionBoundInfo src,
 					  PartitionKey key);
+extern bool partitions_are_ordered(struct RelOptInfo *partrel);
 extern void check_new_partition_bound(char *relname, Relation parent,
 						  PartitionBoundSpec *spec);
 extern void check_default_partition_contents(Relation parent,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 7518148df0..a94f44a652 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2037,7 +2037,237 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Limit
+   ->  Append
+         ->  Sort
+               Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+               ->  Seq Scan on mcrparted0
+                     Filter: (a < 20)
+         ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+               Index Cond: (a < 20)
+(12 rows)
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+explain (costs off) select * from bool_lp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_lp_false_b_idx on bool_lp_false
+   ->  Index Only Scan using bool_lp_true_b_idx on bool_lp_true
+(3 rows)
+
+drop table bool_lp;
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_true_1k_b_a_idx on bool_rp_true_1k
+         Index Cond: (b = true)
+   ->  Index Only Scan using bool_rp_true_2k_b_a_idx on bool_rp_true_2k
+         Index Cond: (b = true)
+(5 rows)
+
+explain (costs off) select * from bool_rp where b = false order by b,a;
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_1k_b_a_idx on bool_rp_false_1k
+         Index Cond: (b = false)
+   ->  Index Only Scan using bool_rp_false_2k_b_a_idx on bool_rp_false_2k
+         Index Cond: (b = false)
+(5 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 50ca03b9e3..837a57e817 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3024,14 +3024,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3078,17 +3078,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3101,13 +3099,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3120,12 +3117,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3134,23 +3130,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..89a0f8c229 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,111 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+
+explain (costs off) select * from bool_lp order by b;
+
+drop table bool_lp;
+
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+explain (costs off) select * from bool_rp where b = false order by b,a;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index a5514c7506..227dab630d 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -768,15 +768,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -797,7 +797,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;

#72

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

almost 7 years ago

In reply to: David Rowley (#68)

Re: Ordered Partitioned Table Scans

Hi David,

On 2019/03/28 8:04, David Rowley wrote:

On Wed, 27 Mar 2019 at 21:24, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
Noticed a typo.
+                 * multiple subpaths then we can't make guarantees about the
+                 * order tuples in those subpaths, so we must leave the
order of tuples?
I'll fix that. Thanks.

Again, sorry if this was discussed, but I got curious about why
partitions_are_ordered() thinks it can say true even for an otherwise
ordered list-partitioned table, but containing a null-only partition,
which is *always* scanned last. If a query says ORDER BY partkey NULLS
FIRST, then it's not alright to proceed with assuming partitions are
ordered even if partitions_are_ordered() said so.

If it's *always* scanned last then it's fine for ORDER BY partkey
NULLS LAST. If they have ORDER BY partkey NULLS FIRST then we won't
match on the pathkeys.

Sorry, I had meant to say that null values may or may not appear in the
last partition depending on how the null-accepting partition is defined.
I see that the code in partitions_are_ordered() correctly returns false if
the null partition is not the last one, for example, for:

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2_null partition of p for values in (2, null);
create table p3 partition of p for values in (3);

Maybe, a small comment regarding how that works correctly would be nice.

Or if they do ORDER BY partkey DESC NULLS FIRST, then we're also fine,
since we reverse the order of the subpaths list in that case. ORDER
BY partkey DESC NULLS LAST is not okay, and we don't optimise that
since it won't match the pathkeys we generate in
build_partition_pathkeys().

OK.
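
The reversal described above can be sketched in a few lines of C. This is an illustration only, not code from the patch: SortedPart, append_scan and is_sorted are hypothetical names, each partition is modelled as an ascending array of non-NULL ints, and NULL handling is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one partition's index: values sorted ascending. */
typedef struct SortedPart
{
	const int  *vals;
	size_t		nvals;
} SortedPart;

/*
 * Emulate a plain Append over partitions listed in bound order.  For a
 * backward scan, visit the partitions in reverse order and read each one
 * backwards.  Returns the number of values written to out.
 */
static size_t
append_scan(const SortedPart *parts, size_t nparts, bool backward, int *out)
{
	size_t		n = 0;

	assert(parts != NULL && out != NULL);

	for (size_t i = 0; i < nparts; i++)
	{
		const SortedPart *p = backward ? &parts[nparts - 1 - i] : &parts[i];

		for (size_t j = 0; j < p->nvals; j++)
			out[n++] = backward ? p->vals[p->nvals - 1 - j] : p->vals[j];
	}
	return n;
}

/* Check that out[] is sorted in the requested direction. */
static bool
is_sorted(const int *out, size_t n, bool descending)
{
	for (size_t i = 1; i < n; i++)
	{
		if (descending ? out[i - 1] < out[i] : out[i - 1] > out[i])
			return false;
	}
	return true;
}
```

With non-overlapping, ordered partitions, both directions come out sorted without any merge step, which is the property the Append plan relies on.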

Related, why does build_partition_pathkeys() pass what it does for
nulls_first parameter of make_pathkey_from_sortinfo()?

cpathkey = make_pathkey_from_sortinfo(root,
keyCol,
NULL,
partscheme->partopfamily[i],
partscheme->partopcintype[i],
partscheme->partcollation[i],
ScanDirectionIsBackward(scandir),
==> ScanDirectionIsBackward(scandir),
0,
partrel->relids,
false);

I think null values are almost always placed in the last partition, unless
the null-accepting list partition also accepts some other non-null value.
I'm not sure exactly how we can determine the correct value to pass here,
but what's there in the patch now doesn't seem to be it.

The code looks okay to me. It'll generate pathkeys like ORDER BY
partkey NULLS LAST for forward scans and ORDER BY partkey DESC NULLS
FIRST for backwards scans.

Can you explain what cases you think the code gets wrong?

Here's a preview of the actual and expected behaviour:

# explain (costs off) select * from listp order by a asc nulls last;
QUERY PLAN
------------------------------------------------------------
Append
-> Index Only Scan using listp1_a_idx on listp1
-> Index Only Scan using listp2_a_idx on listp2
-> Index Only Scan using listp_null_a_idx on listp_null
(4 rows)

# explain (costs off) select * from listp order by a asc nulls first;
-- not optimised
QUERY PLAN
------------------------------------
Sort
Sort Key: listp1.a NULLS FIRST
-> Append
-> Seq Scan on listp1
-> Seq Scan on listp2
-> Seq Scan on listp_null
(6 rows)

# explain (costs off) select * from listp order by a desc nulls first;
-- subpath list is simply reversed in this case.
QUERY PLAN
---------------------------------------------------------------------
Append
-> Index Only Scan Backward using listp_null_a_idx on listp_null
-> Index Only Scan Backward using listp2_a_idx on listp2
-> Index Only Scan Backward using listp1_a_idx on listp1
(4 rows)

# explain (costs off) select * from listp order by a desc nulls last;
-- not optimised
QUERY PLAN
--------------------------------------
Sort
Sort Key: listp1.a DESC NULLS LAST
-> Append
-> Seq Scan on listp1
-> Seq Scan on listp2
-> Seq Scan on listp_null
(6 rows)
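
The four cases above follow from the quoted call passing ScanDirectionIsBackward(scandir) for both the reverse_sort and nulls_first arguments. A minimal C sketch of that matching rule, assuming a single sort column; SortSpec, AppendChoice, partition_sort_spec and choose_ordered_append are hypothetical names for illustration and are not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a one-column sort request. */
typedef struct SortSpec
{
	bool		reverse_sort;	/* true for DESC */
	bool		nulls_first;	/* true for NULLS FIRST */
} SortSpec;

typedef enum AppendChoice
{
	APPEND_NOT_USABLE = 0,		/* ordered Append cannot be used */
	APPEND_FORWARD,				/* subpaths in partition bound order */
	APPEND_BACKWARD				/* subpaths in reversed order */
} AppendChoice;

/*
 * Mirror the quoted call: the scan direction is used for both flags, so a
 * forward scan means ASC NULLS LAST and a backward scan DESC NULLS FIRST.
 */
static SortSpec
partition_sort_spec(bool backward)
{
	SortSpec	spec = {backward, backward};

	/* Both flags always move together in this model. */
	assert(spec.reverse_sort == spec.nulls_first);
	return spec;
}

static bool
sort_spec_equal(SortSpec a, SortSpec b)
{
	return a.reverse_sort == b.reverse_sort && a.nulls_first == b.nulls_first;
}

/* Decide whether the requested ordering matches either partition ordering. */
static AppendChoice
choose_ordered_append(SortSpec requested)
{
	if (sort_spec_equal(requested, partition_sort_spec(false)))
		return APPEND_FORWARD;
	if (sort_spec_equal(requested, partition_sort_spec(true)))
		return APPEND_BACKWARD;
	return APPEND_NOT_USABLE;
}
```

In this model ASC NULLS FIRST and DESC NULLS LAST match neither ordering, which corresponds to the two "not optimised" plans shown above.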

Ah, everything seems to be working correctly. Thanks for the explanation
and sorry about the noise.

We could likely improve the two cases that are not optimized by
putting the NULL partition in the correct place in the append
subpaths, but for now, we don't really have an efficient means to
identify which subpath that is. I've not looked at your partition
planning improvements patch for a while to see if you're storing a
Bitmapset of the non-pruned partitions in RelOptInfo. Something like
that would allow us to make this better. Julien and I have talked
about other possible cases to optimise if we have that, e.g. if the
default partition is pruned then we can optimise a RANGE partitioned
table with a default. So there's definitely more to be done on this. I
think there's a general consensus that what we're doing in the patch
already is enough to be useful.

Certainly. When trying out various combinations of ORDER BY ASC/DESC NULLS
FIRST/LAST yesterday, I wrongly thought the plan came out wrong in one or
two cases, but now see that that's not the case.

Also, the comment you added in the latest patch sheds some light on the
matter, so that helps too.

Thanks,
Amit

#73

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Amit Langote (#72)

Re: Ordered Partitioned Table Scans

On Fri, 29 Mar 2019 at 00:00, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2019/03/28 8:04, David Rowley wrote:

If it's *always* scanned last then it's fine for ORDER BY partkey
NULLS LAST. If they have ORDER BY partkey NULLS FIRST then we won't
match on the pathkeys.

Sorry, I had meant to say that null values may or may not appear in the
last partition depending on how the null-accepting partition is defined.
I see that the code in partitions_are_ordered() correctly returns false if
the null partition is not the last one, for example, for:

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2_null partition of p for values in (2, null);
create table p3 partition of p for values in (3);

Maybe, a small comment regarding how that works correctly would be nice.

Hmm, but there is a comment. It says:

/*
* LIST partitions can also guarantee ordering, but we'd need to
* ensure that partitions don't allow interleaved values. We
* could likely check for this looking at each partition, in
* order, and checking which Datums are accepted. If we find a
* Datum in a partition that's greater than one previously already
* seen, then values could become out of order and we'd have to
* disable the optimization. For now, let's just keep it simple
* and just accept LIST partitions without a DEFAULT partition
* which only accept a single Datum per partition. This is cheap
* as it does not require any per-partition processing. Maybe
* we'd like to handle more complex cases in the future.
*/

Your example above breaks the "don't allow interleaved values" and
"just accept LIST partitions without a DEFAULT partition which only
accept a single Datum per partition."

Do you think I need to add something like "and if there is a NULL
partition, that it only accepts NULL values"? I think that's implied
already, but if you think it's confusing then maybe it's worth adding
something along those lines.
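
As a rough illustration of the rule the comment describes, here is a minimal C sketch. It is not how the patch implements the check (the comment says the real test avoids per-partition processing); ListPart and list_partitions_are_ordered are hypothetical names, and the sketch assumes that a NULL-only partition, if one exists, is already listed last.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one LIST partition's accepted values. */
typedef struct ListPart
{
	int			ndatums;		/* number of non-NULL values accepted */
	bool		accepts_null;	/* does the partition accept NULL? */
	bool		is_default;		/* is this the DEFAULT partition? */
} ListPart;

/*
 * Simplified rule: no DEFAULT partition, and every partition accepts either
 * exactly one non-NULL value or only NULL.  Assumes the caller lists a
 * NULL-only partition last.
 */
static bool
list_partitions_are_ordered(const ListPart *parts, size_t nparts)
{
	assert(parts != NULL || nparts == 0);

	for (size_t i = 0; i < nparts; i++)
	{
		if (parts[i].is_default)
			return false;

		/* A NULL partition may hold nothing else; others hold one Datum. */
		if (parts[i].accepts_null ? parts[i].ndatums != 0
			: parts[i].ndatums != 1)
			return false;
	}
	return true;
}
```

Under this rule the p1 / p2_null / p3 example is rejected because p2_null accepts both 2 and NULL, while a table with one value per partition plus a NULL-only partition is accepted.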

David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#74

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

almost 7 years ago

In reply to: David Rowley (#73)

Re: Ordered Partitioned Table Scans

Hi,

On 2019/03/29 7:59, David Rowley wrote:

On Fri, 29 Mar 2019 at 00:00, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2019/03/28 8:04, David Rowley wrote:

If it's *always* scanned last then it's fine for ORDER BY partkey
NULLS LAST. If they have ORDER BY partkey NULLS FIRST then we won't
match on the pathkeys.

Sorry, I had meant to say that null values may or may not appear in the
last partition depending on how the null-accepting partition is defined.
I see that the code in partitions_are_ordered() correctly returns false if
the null partition is not the last one, for example, for:

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2_null partition of p for values in (2, null);
create table p3 partition of p for values in (3);

Maybe, a small comment regarding how that works correctly would be nice.

Hmm, but there is a comment. It says:

/*
* LIST partitions can also guarantee ordering, but we'd need to
* ensure that partitions don't allow interleaved values. We
* could likely check for this looking at each partition, in
* order, and checking which Datums are accepted. If we find a
* Datum in a partition that's greater than one previously already
* seen, then values could become out of order and we'd have to
* disable the optimization. For now, let's just keep it simple
* and just accept LIST partitions without a DEFAULT partition
* which only accept a single Datum per partition. This is cheap
* as it does not require any per-partition processing. Maybe
* we'd like to handle more complex cases in the future.
*/

Your example above breaks the "don't allow interleaved values" and
"just accept LIST partitions without a DEFAULT partition which only
accept a single Datum per partition."

Do you think I need to add something like "and if there is a NULL
partition, that it only accepts NULL values"? I think that's implied
already, but if you think it's confusing then maybe it's worth adding
something along those lines.

How about extending the sentence about when the optimization is disabled
as follows:

"If we find a Datum in a partition that's greater than one previously
already seen, then values could become out of order and we'd have to
disable the optimization. We'd also need to disable the optimization if
NULL values are interleaved with other Datum values, because the calling
code expects them to be present in the last partition."

Further, extend the "For now..." sentence as:

"For now, let's just keep it simple and just accept LIST partitioned table
without a DEFAULT partition where each partition only accepts a single
Datum or NULL. It's OK to always accept a NULL partition in that case,
because PartitionBoundInfo lists it as the last partition."

Thanks,
Amit

#75

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Amit Langote (#74)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On Tue, 2 Apr 2019 at 14:26, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

How about extending the sentence about when the optimization is disabled
as follows:

"If we find a Datum in a partition that's greater than one previously
already seen, then values could become out of order and we'd have to
disable the optimization. We'd also need to disable the optimization if
NULL values are interleaved with other Datum values, because the calling
code expects them to be present in the last partition."

Further, extend the "For now..." sentence as:

"For now, let's just keep it simple and just accept LIST partitioned table
without a DEFAULT partition where each partition only accepts a single
Datum or NULL. It's OK to always accept a NULL partition in that case,
because PartitionBoundInfo lists it as the last partition."

I ended up rewording the entire thing and working on the header
comment for the function too. I think previously it wasn't that well
defined what "ordered" meant. I added a mention that we expect that
NULLs, if possible, must come in the last partition.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

mergeappend_to_append_conversion_v18.patch (application/octet-stream)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 3282be0e4b..82ca6826ab 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1847,6 +1847,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 727da33338..bcafcba5ad 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -44,6 +44,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_clause.h"
 #include "parser/parsetree.h"
+#include "partitioning/partbounds.h"
 #include "partitioning/partprune.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -96,15 +97,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_dummy_rel_pathlist(RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
@@ -1520,7 +1522,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1562,7 +1564,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1612,19 +1614,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1674,7 +1676,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 
@@ -1703,44 +1705,82 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 				continue;
 
 			appendpath = create_append_path(root, rel, NIL, list_make1(path),
-											NULL, path->parallel_workers,
-											true,
-											partitioned_rels, partial_rows);
+											NIL, NULL, path->parallel_workers,
+											true, partitioned_rels,
+											partial_rows);
 			add_partial_path(rel, (Path *) appendpath);
 		}
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provide tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+													  &partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+														   BackwardScanDirection,
+														   &partition_pathkeys_desc_partial);
+
+		/*
+		 * You might think we should truncate_useless_pathkeys here, but
+		 * allowing partition keys which are a subset of the query's pathkeys
+		 * can often be useful.  For example, a RANGE partitioned table on (a,
+		 * b), and a query with an ORDER BY a, b, c.  We can still allow an
+		 * Append scan in this case.  Imagine each partition has a btree index
+		 * on (a, b, c), scanning those indexes still provides tuples in the
+		 * correct order and using an Append in place of a MergeAppend is
+		 * still valid since lower-order (a, b) tuples will still come before
+		 * higher-order ones over all partitions.
+		 */
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1749,6 +1789,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+			(!partition_pathkeys_partial &&
+			 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+			(pathkeys_contained_in(pathkeys,
+								   partition_pathkeys_desc) ||
+			 (!partition_pathkeys_desc_partial &&
+			  pathkeys_contained_in(partition_pathkeys_desc,
+									pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1791,26 +1854,86 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * Build an Append path when in partition order.  If in reverse
+			 * partition order we build a reverse list of subpaths so that we
+			 * scan them in the opposite order.
+			 */
+			if (partition_order)
+			{
+				/*
+				 * Attempt to flatten subpaths that are themselves Appends or
+				 * MergeAppends.  We can do this providing the Append or
+				 * MergeAppend has just a single subpath.  If there are
+				 * multiple subpaths then we can't make guarantees about the
+				 * order of tuples in those subpaths, so we must leave the
+				 * Append/MergeAppend in place.
+				 */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
-														rel,
-														startup_subpaths,
-														pathkeys,
-														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
+													  rel,
+													  startup_subpaths,
+													  NIL,
+													  pathkeys,
+													  NULL,
+													  0,
+													  false,
+													  partitioned_rels,
+													  -1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
+														  rel,
+														  total_subpaths,
+														  NIL,
+														  pathkeys,
+														  NULL,
+														  0,
+														  false,
+														  partitioned_rels,
+														  -1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
 			add_path(rel, (Path *) create_merge_append_path(root,
 															rel,
-															total_subpaths,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1951,6 +2074,36 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		returns 'path' if it's not a single sub-path Append/MergeAppend.
+ *
+ * Note: 'path' must not be a parallel aware path.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -1974,7 +2127,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..f3f9c421a3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,72 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 
-		/*
-		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
-		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		if (pathkeys == NIL)
+		{
+			Path	   *subpath = (Path *) linitial(apath->subpaths);
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
-		foreach(l, apath->subpaths)
+			/*
+			 * When there are no pathkeys the startup cost of
+			 * non-parallel-aware Append is the startup cost of the first
+			 * subpath.
+			 */
+			apath->path.startup_cost = subpath->startup_cost;
+
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+
+				apath->path.rows += subpath->rows;
+				apath->path.total_cost += subpath->total_cost;
+			}
+		}
+		else
 		{
-			Path	   *subpath = (Path *) lfirst(l);
+			/*
+			 * Otherwise we make the Append's startup cost the sum of the
+			 * startup cost of all the subpaths.  It may appear like we should
+			 * just be doing the same as above and take the startup cost of
+			 * just the initial subpath, however, it is possible that when a
+			 * LIMIT clause exists in the query we could end up favoring
+			 * these ordered Append paths too much.  Imagine a scenario where
+			 * the initial subpath is already ordered and is estimated to
+			 * contain just 10 rows and the 2nd subpath requires a sort and is
+			 * estimated to have 10 million rows, if the query has LIMIT 11
+			 * then we could end up performing an expensive sort for just a
+			 * single row without having considered the startup cost for the
+			 * 2nd subpath.  Such a scenario could end up favoring a MergeJoin
+			 * plan instead of a Hash Join plan.
+			 */
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+				{
+					/*
+					 * We'll need to insert a Sort node, so include cost for
+					 * that.
+					 */
+					cost_sort(&sort_path,
+							  root,
+							  pathkeys,
+							  subpath->total_cost,
+							  subpath->parent->tuples,
+							  subpath->pathtarget->width,
+							  0.0,
+							  work_mem,
+							  apath->limit_tuples);
+
+					subpath = &sort_path;
+				}
 
-			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+				apath->path.rows += subpath->rows;
+				apath->path.startup_cost += subpath->startup_cost;
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1957,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
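The startup-cost rule that the costsize.c hunk above applies to ordered
Append paths can be illustrated with a small standalone sketch. This is not
PostgreSQL code: SubpathCost and both function names are hypothetical, and
the cost figures in the usage note are invented purely for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the cost fields of a planner Path. */
typedef struct SubpathCost
{
	double		startup_cost;
	double		total_cost;
} SubpathCost;

/* Append without pathkeys: only the first subpath's startup cost counts. */
static double
unordered_append_startup_cost(const SubpathCost *subpaths, size_t n)
{
	return (n > 0) ? subpaths[0].startup_cost : 0.0;
}

/*
 * Append with pathkeys, following the rule in the patch: sum the startup
 * cost of every subpath, so that a later subpath which needs an expensive
 * sort is not hidden behind a cheap first subpath.
 */
static double
ordered_append_startup_cost(const SubpathCost *subpaths, size_t n)
{
	double		sum = 0.0;
	size_t		i;

	for (i = 0; i < n; i++)
		sum += subpaths[i].startup_cost;
	return sum;
}
```

With a cheap, already-ordered first subpath (startup 0.5) followed by a
subpath that must be sorted (startup 1000.0), the first function reports 0.5
while the second reports 1000.5, which is the conservative figure the patch
comment argues for when a LIMIT is present.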
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 34cc7dacdf..5f865ed01a 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1260,7 +1260,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..aef11e0832 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -18,16 +18,20 @@
 #include "postgres.h"
 
 #include "access/stratnum.h"
+#include "catalog/pg_opfamily.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
+static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
+								 int partkeycol, RelOptInfo *partrel);
 static bool right_merge_direction(PlannerInfo *root, PathKey *pathkey);
 
 
@@ -546,6 +550,153 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then it's irrelevant for sort-order
+ * considerations.  Restriction clauses like WHERE partkeycol = constant, get
+ * turned into an EquivalenceClass containing a constant, which is recognized
+ * as redundant by build_partition_pathkeys().  But if the partition column is
+ * a boolean variable (or expression), then we are not going to see WHERE
+ * partkeycol = constant, because expression preprocessing will have
+ * simplified that to "WHERE partkeycol" or "WHERE NOT partkeycol".  So we are
+ * not going to have a matching EquivalenceClass (unless the query also
+ * contains "ORDER BY partkeycol").  To allow such cases to work the same as
+ * they would for non-boolean values, this function is provided to detect
+ * whether the specified partkey column matches a boolean restriction clause.
+ */
+static bool
+partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)
+{
+	PartitionScheme partscheme;
+	ListCell   *lc;
+
+	partscheme = partrel->part_scheme;
+
+	/* If the partkey isn't boolean, we can't possibly get a match */
+	if (!IsBooleanOpfamily(partscheme->partopfamily[partkeycol]))
+		return false;
+
+	/* Check each restriction clause for partrel */
+	foreach(lc, partrel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *)lfirst(lc);
+
+		/* Skip pseudoconstant quals */
+		if (rinfo->pseudoconstant)
+			continue;
+
+		/* See if we can match the clause's expression to the partkey column */
+		if (matches_boolean_partition_clause(rinfo, partkeycol, partrel))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * matches_boolean_partition_clause
+ *		Determine if rinfo matches partrel's 'partkeycol' partition key
+ *		column.
+ */
+static bool
+matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
+	RelOptInfo *partrel)
+{
+	Node	   *clause = (Node *)rinfo->clause;
+	Expr	   *partexpr = (Expr *)linitial(partrel->partexprs[partkeycol]);
+
+	/* Direct match? */
+	if (equal(partexpr, clause))
+		return true;
+	/* NOT clause? */
+	else if (is_notclause(clause))
+	{
+		Node	   *arg = (Node *)get_notclausearg((Expr *)clause);
+
+		if (equal(partexpr, arg))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples never will be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	Assert(partitions_are_ordered(partrel));
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 * A PartitionDesc always lists any NULL partition last, so we can
+		 * simply pass the ScanDirectionIsBackward(scandir) for nulls_first
+		 * since NULLS FIRST is the default for DESC, and NULLS LAST is the
+		 * default for ASC sort orders.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											  ScanDirectionIsBackward(scandir),
+											  ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			/*
+			 * Boolean partition keys might be redundant even if they do not
+			 * appear in an EquivalenceClass, because of our special treatment
+			 * of boolean equality conditions --- see the comment for
+			 * partkey_is_bool_constant_for_query().  If that applies, we can
+			 * continue to examine lower-order partition keys.  Otherwise, we
+			 * must abort and return any partial matches we've found so far.
+			 */
+			if (partkey_is_bool_constant_for_query(partrel, i))
+				continue;
+
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
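How the partialkeys flag from build_partition_pathkeys() feeds the
"can we use a plain Append?" decision in generate_orderedappend_paths() may
be easier to follow in a reduced form. The sketch below is only a model under
stated assumptions: KeyList, keys_contained_in and append_is_ordered are
hypothetical names, and pathkeys are simplified to arrays of integer column
ids rather than real PathKey lists.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pathkey list: an array of column ids. */
typedef struct KeyList
{
	const int  *keys;
	size_t		nkeys;
} KeyList;

/* True if 'a' is a leading prefix of 'b', or equal to it. */
static bool
keys_contained_in(KeyList a, KeyList b)
{
	size_t		i;

	if (a.nkeys > b.nkeys)
		return false;
	for (i = 0; i < a.nkeys; i++)
	{
		if (a.keys[i] != b.keys[i])
			return false;
	}
	return true;
}

/*
 * An Append in partition order satisfies the query's ordering when either
 * the query keys are a prefix of the partition keys, or the partition keys
 * are complete (not partial) and are a prefix of the query keys.
 */
static bool
append_is_ordered(KeyList query_keys, KeyList part_keys, bool part_keys_partial)
{
	return keys_contained_in(query_keys, part_keys) ||
		(!part_keys_partial && keys_contained_in(part_keys, query_keys));
}
```

For a table partitioned on (a, b), ORDER BY a and ORDER BY a, b, c both
qualify in this model, while ORDER BY b does not. If pathkeys could only be
built for the prefix (a), then ORDER BY a, b, c no longer qualifies.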
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index cc222cb06c..83388b8104 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,20 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	AttrNumber *nodeSortColIdx;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1095,28 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		int			nodenumsortkeys;
+		Oid		   *nodeSortOperators;
+		Oid		   *nodeCollations;
+		bool	   *nodeNullsFirst;
+
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1126,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1201,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * won't match the parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5300,23 +5362,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 3a1b846217..61e2a539fa 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1721,7 +1721,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -3997,6 +3998,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index eb815c2f12..65d77a5402 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 56de8fc370..5e3309955f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,6 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1251,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths when
+	 * the Append has valid pathkeys.  The order they're listed in is critical
+	 * to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1262,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1291,10 +1305,7 @@ create_append_path(PlannerInfo *root,
 		pathnode->path.pathkeys = child->pathkeys;
 	}
 	else
-	{
-		pathnode->path.pathkeys = NIL;	/* unsorted if more than 1 subpath */
-		cost_append(pathnode);
-	}
+		cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3736,7 +3747,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index bdd0d23854..9dd378d7a0 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -25,6 +25,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
 #include "parser/parse_coerce.h"
 #include "partitioning/partbounds.h"
 #include "partitioning/partdesc.h"
@@ -861,6 +862,73 @@ partition_bounds_copy(PartitionBoundInfo src,
 	return dest;
 }
 
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that its direct partitions cannot allow
+ *		higher sort order tuples in a partition that comes earlier in the
+ *		PartitionDesc, i.e. if the partitions are scanned in order, then a
+ *		partition coming later in the PartitionDesc will only have tuples
+ *		that sort after those from all the previously scanned partitions.
+ *		if possible, must come in the last partition defined in the
+ *		PartitionDesc.  If out of order, or there are insufficient proofs to
+ *		know the order then we return false.
+ */
+bool
+partitions_are_ordered(struct RelOptInfo *partrel)
+{
+	PartitionBoundInfo boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		case PARTITION_STRATEGY_RANGE:
+
+			/*
+			 * RANGE type partitions guarantee that the partitions can be
+			 * scanned in the order that they're defined in the PartitionDesc
+			 * to provide non-overlapping ranges of tuples.  We must disallow
+			 * when a DEFAULT partition exists as this could contain tuples
+			 * from either below or above the defined range, or contain tuples
+			 * belonging to gaps in the defined range.
+			 */
+
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
+
+		case PARTITION_STRATEGY_LIST:
+
+			/*
+			 * LIST partitions can also guarantee ordering, but we'd need to
+			 * ensure that partitions don't allow interleaved values.  We
+			 * could likely check for this by looping over the PartitionBound's
+			 * indexes array checking that the indexes are in order.  For now,
+			 * let's just keep it simple and just accept LIST partitions
+			 * without a DEFAULT partition which only accept a single Datum
+			 * per partition and a NULL partition that does not accept any
+			 * other values.  Such a NULL partition will come last in the
+			 * PartitionDesc.  This is a cheap test to make as it does not
+			 * require any per-partition processing.  Maybe we'd like to
+			 * handle more complex cases in the future.
+			 */
+
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo)
+				!= partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * check_new_partition_bound
  *
@@ -1680,6 +1748,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1767,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
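The LIST-partition test in partitions_are_ordered() above boils down to a
counting argument: with no DEFAULT partition, the partitions are treated as
ordered only when each one accepts exactly one value, plus optionally one
NULL-only partition. A minimal sketch of that arithmetic follows. The
function name and its plain int/bool parameters are hypothetical, standing in
for the fields the patch reads from PartitionBoundInfo and RelOptInfo.

```c
#include <stdbool.h>

/*
 * Model of the LIST case: 'ndatums' is the number of distinct list values
 * across all partitions, 'accepts_nulls' says whether some partition
 * accepts NULL, and 'nparts' is the number of partitions.
 */
static bool
list_partitions_are_ordered(int ndatums, bool accepts_nulls,
							bool has_default, int nparts)
{
	/* A DEFAULT partition could hold values from anywhere in the order. */
	if (has_default)
		return false;

	/*
	 * One value per partition, and NULL (if accepted) in a partition of its
	 * own, means the counts must line up exactly.
	 */
	return ndatums + (accepts_nulls ? 1 : 0) == nparts;
}
```

Three partitions holding one value each pass the check. Three partitions
holding four values between them fail, as does a layout where NULL shares a
partition with an ordinary value.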
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 88c8973f3c..426d7ec3aa 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -280,7 +280,13 @@ struct PlannerInfo
 
 	List	   *join_info_list; /* list of SpecialJoinInfos */
 
-	List	   *append_rel_list;	/* list of AppendRelInfos */
+	/*
+	 * list of AppendRelInfos.  For AppendRelInfos belonging to partitions of
+	 * a partitioned table, this list guarantees that partitions that come
+	 * earlier in the partitioned table's PartitionDesc will come earlier in
+	 * this list.
+	 */
+	List	   *append_rel_list;
 
 	List	   *rowMarks;		/* list of PlanRowMarks */
 
@@ -1366,6 +1372,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_APPEND(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 3a803b3fd0..ba649facb5 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 36d12bc376..0e858097c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -194,6 +194,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index 683e1574ea..215a87f59b 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -75,6 +75,8 @@ typedef struct PartitionBoundInfoData
 #define partition_bound_accepts_nulls(bi) ((bi)->null_index != -1)
 #define partition_bound_has_default(bi) ((bi)->default_index != -1)
 
+struct RelOptInfo;
+
 extern int	get_hash_partition_greatest_modulus(PartitionBoundInfo b);
 extern uint64 compute_partition_hash_value(int partnatts, FmgrInfo *partsupfunc,
 							 Oid *partcollation,
@@ -88,6 +90,7 @@ extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
 					   PartitionBoundInfo b2);
 extern PartitionBoundInfo partition_bounds_copy(PartitionBoundInfo src,
 					  PartitionKey key);
+extern bool partitions_are_ordered(struct RelOptInfo *partrel);
 extern void check_new_partition_bound(char *relname, Relation parent,
 						  PartitionBoundSpec *spec);
 extern void check_default_partition_contents(Relation parent,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 7518148df0..a94f44a652 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2037,7 +2037,237 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Limit
+   ->  Append
+         ->  Sort
+               Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+               ->  Seq Scan on mcrparted0
+                     Filter: (a < 20)
+         ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+               Index Cond: (a < 20)
+(12 rows)
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+explain (costs off) select * from bool_lp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_lp_false_b_idx on bool_lp_false
+   ->  Index Only Scan using bool_lp_true_b_idx on bool_lp_true
+(3 rows)
+
+drop table bool_lp;
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_true_1k_b_a_idx on bool_rp_true_1k
+         Index Cond: (b = true)
+   ->  Index Only Scan using bool_rp_true_2k_b_a_idx on bool_rp_true_2k
+         Index Cond: (b = true)
+(5 rows)
+
+explain (costs off) select * from bool_rp where b = false order by b,a;
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_1k_b_a_idx on bool_rp_false_1k
+         Index Cond: (b = false)
+   ->  Index Only Scan using bool_rp_false_2k_b_a_idx on bool_rp_false_2k
+         Index Cond: (b = false)
+(5 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 7806ba1d47..0789b316eb 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3078,14 +3078,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3132,17 +3132,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3155,13 +3153,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3174,12 +3171,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3188,23 +3184,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..89a0f8c229 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,111 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+
+explain (costs off) select * from bool_lp order by b;
+
+drop table bool_lp;
+
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+explain (costs off) select * from bool_rp where b = false order by b,a;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index 2e4d2b483d..c30e58eef7 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -775,15 +775,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -804,7 +804,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;

#76

Amit Langote

amitlangote09@gmail.com

almost 7 years ago

In reply to: David Rowley (#75)

Re: Ordered Partitioned Table Scans

Hi David,

On Tue, Apr 2, 2019 at 8:49 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:

I ended up rewording the entire thing and working on the header
comment for the function too. I think previously it wasn't that well
defined what "ordered" meant. I added a mention that we expect that
NULLs, if possible, must come in the last partition.
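
For LIST partitioning, the meaning of "ordered" described here can be
sketched in C roughly as follows. This is only a minimal illustration
modelled on the check in the patch's partitions_are_ordered(): the
DemoListBounds struct and demo_list_parts_are_ordered() are hypothetical
stand-ins, not the real PartitionBoundInfoData or the patch's function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for PartitionBoundInfoData. */
typedef struct DemoListBounds
{
	int			ndatums;		/* number of distinct non-NULL list values */
	int			null_index;		/* partition accepting NULL, or -1 if none */
	int			default_index;	/* DEFAULT partition, or -1 if none */
} DemoListBounds;

/*
 * Return true when scanning the partitions in bound order yields rows in
 * partition key order: there is no DEFAULT partition, every partition
 * holds exactly one list value, and any NULL partition holds nothing else.
 */
bool
demo_list_parts_are_ordered(const DemoListBounds *bi, int nparts)
{
	bool		accepts_nulls;

	assert(bi != NULL);

	/* A DEFAULT partition may hold values that sort anywhere. */
	if (bi->default_index != -1)
		return false;

	accepts_nulls = (bi->null_index != -1);

	/*
	 * With one datum per partition (plus an optional NULL-only partition)
	 * the datum count and the partition count must line up exactly.
	 */
	return bi->ndatums + (accepts_nulls ? 1 : 0) == nparts;
}
```

Under these assumptions, two partitions FOR VALUES IN (1) and IN (2)
pass the check, while adding a partition FOR VALUES IN (3,5) next to
one for IN (4) gives five datums across four partitions and fails it,
which matches the mclparted cases in the regression tests.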

Thanks for the updated patch.

New descriptions look good, although I was amused by this:

diff --git a/src/backend/partitioning/partbounds.c
b/src/backend/partitioning/partbounds.c
index bdd0d23854..9dd378d7a0 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -25,6 +25,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
...
+partitions_are_ordered(struct RelOptInfo *partrel)

Maybe, "struct" is unnecessary?

Thanks,
Amit

#77

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Amit Langote (#76)

1 attachment(s)

Re: Ordered Partitioned Table Scans

On Wed, 3 Apr 2019 at 01:26, Amit Langote <amitlangote09@gmail.com> wrote:

+#include "nodes/pathnodes.h"
...
+partitions_are_ordered(struct RelOptInfo *partrel)

Maybe, "struct" is unnecessary?

I just left it there so that the signature matched the header file.
Looking around for examples, I see make_partition_pruneinfo() has the
"struct" keyword only in the header file, so I guess that is how we do
things. I've changed it to match that in the attached.
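
The convention being discussed can be sketched as follows. This is a
single-file illustration with hypothetical names (DemoRelInfo,
demo_parts_are_ordered), not code from the patch: the header side only
forward-declares the struct, so it must spell out "struct" in the
prototype, while the .c side sees the full typedef and can use the
plain type name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Header side (sketch): the full definition is not included here, so the
 * type is forward-declared and the prototype must say "struct".
 */
struct DemoRelInfo;
extern int	demo_parts_are_ordered(struct DemoRelInfo *rel);

/*
 * Stand-in for the header that really defines the type (hypothetical
 * fields, for illustration only).
 */
typedef struct DemoRelInfo
{
	int			nparts;
	int			has_default;
} DemoRelInfo;

/*
 * .c side (sketch): the typedef is visible, so "struct" can be dropped.
 * Both spellings name the same type, so this definition is compatible
 * with the prototype above.
 */
int
demo_parts_are_ordered(DemoRelInfo *rel)
{
	assert(rel != NULL);
	return rel->nparts > 0 && !rel->has_default;
}
```

The forward declaration keeps the header from having to pull in the
header that defines the struct, at the cost of the slightly different
spelling between the prototype and the definition.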

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

mergeappend_to_append_conversion_v19.patch (application/octet-stream)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 3282be0e4b..82ca6826ab 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1847,6 +1847,7 @@ _outAppendPath(StringInfo str, const AppendPath *node)
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(subpaths);
 	WRITE_INT_FIELD(first_partial_path);
+	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
 }
 
 static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 727da33338..bcafcba5ad 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -44,6 +44,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_clause.h"
 #include "parser/parsetree.h"
+#include "partitioning/partbounds.h"
 #include "partitioning/partprune.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/lsyscache.h"
@@ -96,15 +97,16 @@ static void set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 					Index rti, RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels);
+static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels);
 static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
 									  RelOptInfo *rel,
 									  Relids required_outer);
 static void accumulate_append_subpath(Path *path,
 						  List **subpaths, List **special_subpaths);
+static Path *get_singleton_append_subpath(Path *path);
 static void set_dummy_rel_pathlist(RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
@@ -1520,7 +1522,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (subpaths_valid)
 		add_path(rel, (Path *) create_append_path(root, rel, subpaths, NIL,
-												  NULL, 0, false,
+												  NIL, NULL, 0, false,
 												  partitioned_rels, -1));
 
 	/*
@@ -1562,7 +1564,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Generate a partial append path. */
 		appendpath = create_append_path(root, rel, NIL, partial_subpaths,
-										NULL, parallel_workers,
+										NIL, NULL, parallel_workers,
 										enable_parallel_append,
 										partitioned_rels, -1);
 
@@ -1612,19 +1614,19 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 
 		appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
 										pa_partial_subpaths,
-										NULL, parallel_workers, true,
+										NIL, NULL, parallel_workers, true,
 										partitioned_rels, partial_rows);
 		add_partial_path(rel, (Path *) appendpath);
 	}
 
 	/*
-	 * Also build unparameterized MergeAppend paths based on the collected
+	 * Also build unparameterized ordered append paths based on the collected
 	 * list of child pathkeys.
 	 */
 	if (subpaths_valid)
-		generate_mergeappend_paths(root, rel, live_childrels,
-								   all_child_pathkeys,
-								   partitioned_rels);
+		generate_orderedappend_paths(root, rel, live_childrels,
+									 all_child_pathkeys,
+									 partitioned_rels);
 
 	/*
 	 * Build Append paths for each parameterization seen among the child rels.
@@ -1674,7 +1676,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		if (subpaths_valid)
 			add_path(rel, (Path *)
 					 create_append_path(root, rel, subpaths, NIL,
-										required_outer, 0, false,
+										NIL, required_outer, 0, false,
 										partitioned_rels, -1));
 	}
 
@@ -1703,44 +1705,82 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 				continue;
 
 			appendpath = create_append_path(root, rel, NIL, list_make1(path),
-											NULL, path->parallel_workers,
-											true,
-											partitioned_rels, partial_rows);
+											NIL, NULL, path->parallel_workers,
+											true, partitioned_rels,
+											partial_rows);
 			add_partial_path(rel, (Path *) appendpath);
 		}
 	}
 }
 
 /*
- * generate_mergeappend_paths
- *		Generate MergeAppend paths for an append relation
+ * generate_orderedappend_paths
+ *		Generate ordered append paths for an append relation.
  *
- * Generate a path for each ordering (pathkey list) appearing in
+ * Generally we generate MergeAppend paths here, but there are some special
+ * cases where we can generate simple Append paths when we're able to
+ * determine that the list of subpaths provides tuples in the required order.
+ *
+ * We generate a path for each ordering (pathkey list) appearing in
  * all_child_pathkeys.
  *
  * We consider both cheapest-startup and cheapest-total cases, ie, for each
  * interesting ordering, collect all the cheapest startup subpaths and all the
- * cheapest total paths, and build a MergeAppend path for each case.
- *
- * We don't currently generate any parameterized MergeAppend paths.  While
- * it would not take much more code here to do so, it's very unclear that it
- * is worth the planning cycles to investigate such paths: there's little
- * use for an ordered path on the inside of a nestloop.  In fact, it's likely
- * that the current coding of add_path would reject such paths out of hand,
- * because add_path gives no credit for sort ordering of parameterized paths,
- * and a parameterized MergeAppend is going to be more expensive than the
+ * cheapest total paths, and build a suitable path for each case.
+ *
+ * We don't currently generate any parameterized paths here.  While it would
+ * not take much more code to do so, it's very unclear that it is worth the
+ * planning cycles to investigate such paths: there's little use for an
+ * ordered path on the inside of a nestloop.  In fact, it's likely that the
+ * current coding of add_path would reject such paths out of hand, because
+ * add_path gives no credit for sort ordering of parameterized paths, and a
+ * parameterized MergeAppend is going to be more expensive than the
  * corresponding parameterized Append path.  If we ever try harder to support
  * parameterized mergejoin plans, it might be worth adding support for
- * parameterized MergeAppends to feed such joins.  (See notes in
+ * parameterized paths here to feed such joins.  (See notes in
  * optimizer/README for why that might not ever happen, though.)
  */
 static void
-generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
-						   List *live_childrels,
-						   List *all_child_pathkeys,
-						   List *partitioned_rels)
+generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
+							 List *live_childrels,
+							 List *all_child_pathkeys,
+							 List *partitioned_rels)
 {
 	ListCell   *lcp;
+	List	   *partition_pathkeys = NIL;
+	List	   *partition_pathkeys_desc = NIL;
+	bool		partition_pathkeys_partial = true;
+	bool		partition_pathkeys_desc_partial = true;
+
+	/*
+	 * Some partitioned table setups may allow us to use an Append node
+	 * instead of a MergeAppend.  This is possible in cases such as RANGE
+	 * partitioned tables where it's guaranteed that an earlier partition must
+	 * contain rows which come earlier in the sort order.
+	 */
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(rel))
+	{
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+													  &partition_pathkeys_partial);
+
+		partition_pathkeys_desc = build_partition_pathkeys(root, rel,
+														   BackwardScanDirection,
+														   &partition_pathkeys_desc_partial);
+
+		/*
+		 * You might think we should truncate_useless_pathkeys here, but
+		 * allowing partition keys which are a subset of the query's pathkeys
+		 * can often be useful.  For example, a RANGE partitioned table on (a,
+		 * b), and a query with an ORDER BY a, b, c.  We can still allow an
+		 * Append scan in this case.  Imagine each partition has a btree index
+		 * on (a, b, c), scanning those indexes still provides tuples in the
+		 * correct order and using an Append in place of a MergeAppend is
+		 * still valid since lower-order (a, b) tuples will still come before
+		 * higher-order ones over all partitions.
+		 */
+	}
 
 	foreach(lcp, all_child_pathkeys)
 	{
@@ -1749,6 +1789,29 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *total_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
+		bool		partition_order;
+		bool		partition_order_desc;
+
+		/*
+		 * Determine if these pathkeys are contained in the partition pathkeys
+		 * for both ascending and descending partition order.  If the
+		 * partitioned pathkeys happened to be contained in pathkeys then this
+		 * is fine too, providing that the partition pathkeys are complete and
+		 * not just a prefix of the partition order.  In this case an Append
+		 * scan cannot produce any out of order tuples.
+		 */
+		partition_order = pathkeys_contained_in(pathkeys,
+												partition_pathkeys) ||
+			(!partition_pathkeys_partial &&
+			 pathkeys_contained_in(partition_pathkeys, pathkeys));
+
+		partition_order_desc = !partition_order &&
+			(pathkeys_contained_in(pathkeys,
+								   partition_pathkeys_desc) ||
+			 (!partition_pathkeys_desc_partial &&
+			  pathkeys_contained_in(partition_pathkeys_desc,
+									pathkeys)));
+
 
 		/* Select the child paths for this ordering... */
 		foreach(lcr, live_childrels)
@@ -1791,26 +1854,86 @@ generate_mergeappend_paths(PlannerInfo *root, RelOptInfo *rel,
 			if (cheapest_startup != cheapest_total)
 				startup_neq_total = true;
 
-			accumulate_append_subpath(cheapest_startup,
-									  &startup_subpaths, NULL);
-			accumulate_append_subpath(cheapest_total,
-									  &total_subpaths, NULL);
+			/*
+			 * Build an Append path when in partition order.  If in reverse
+			 * partition order we build a reverse list of subpaths so that we
+			 * scan them in the opposite order.
+			 */
+			if (partition_order)
+			{
+				/*
+				 * Attempt to flatten subpaths that are themselves Appends or
+				 * MergeAppends.  We can do this providing the Append or
+				 * MergeAppend has just a single subpath.  If there are
+				 * multiple subpaths then we can't make guarantees about the
+				 * order of tuples in those subpaths, so we must leave the
+				 * Append/MergeAppend in place.
+				 */
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lappend(startup_subpaths, cheapest_startup);
+				total_subpaths = lappend(total_subpaths, cheapest_total);
+			}
+			else if (partition_order_desc)
+			{
+				cheapest_startup = get_singleton_append_subpath(cheapest_startup);
+				cheapest_total = get_singleton_append_subpath(cheapest_total);
+
+				startup_subpaths = lcons(cheapest_startup, startup_subpaths);
+				total_subpaths = lcons(cheapest_total, total_subpaths);
+			}
+			else
+			{
+				accumulate_append_subpath(cheapest_startup,
+										  &startup_subpaths, NULL);
+				accumulate_append_subpath(cheapest_total,
+										  &total_subpaths, NULL);
+			}
 		}
 
-		/* ... and build the MergeAppend paths */
-		add_path(rel, (Path *) create_merge_append_path(root,
-														rel,
-														startup_subpaths,
-														pathkeys,
-														NULL,
-														partitioned_rels));
-		if (startup_neq_total)
+		/* Build simple Append paths if in partition asc/desc order */
+		if (partition_order || partition_order_desc)
+		{
+			add_path(rel, (Path *) create_append_path(root,
+													  rel,
+													  startup_subpaths,
+													  NIL,
+													  pathkeys,
+													  NULL,
+													  0,
+													  false,
+													  partitioned_rels,
+													  -1));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_append_path(root,
+														  rel,
+														  total_subpaths,
+														  NIL,
+														  pathkeys,
+														  NULL,
+														  0,
+														  false,
+														  partitioned_rels,
+														  -1));
+		}
+		else
+		{
+			/* else just build the MergeAppend paths */
 			add_path(rel, (Path *) create_merge_append_path(root,
 															rel,
-															total_subpaths,
+															startup_subpaths,
 															pathkeys,
 															NULL,
 															partitioned_rels));
+			if (startup_neq_total)
+				add_path(rel, (Path *) create_merge_append_path(root,
+																rel,
+																total_subpaths,
+																pathkeys,
+																NULL,
+																partitioned_rels));
+		}
 	}
 }
 
@@ -1951,6 +2074,36 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
 	*subpaths = lappend(*subpaths, path);
 }
 
+/*
+ * get_singleton_append_subpath
+ *		Returns the singleton subpath of an Append/MergeAppend or
+ *		returns 'path' if it's not a single sub-path Append/MergeAppend.
+ *
+ * Note: 'path' must not be a parallel aware path.
+ */
+static Path *
+get_singleton_append_subpath(Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+
+		Assert(!apath->path.parallel_aware);
+
+		if (list_length(apath->subpaths) == 1)
+			return (Path *) linitial(apath->subpaths);
+	}
+	else if (IsA(path, MergeAppendPath))
+	{
+		MergeAppendPath *mpath = (MergeAppendPath *) path;
+
+		if (list_length(mpath->subpaths) == 1)
+			return (Path *) linitial(mpath->subpaths);
+	}
+
+	return path;
+}
+
 /*
  * set_dummy_rel_pathlist
  *	  Build a dummy path for a relation that's been excluded by constraints
@@ -1974,7 +2127,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4b9be13f08..f3f9c421a3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1872,7 +1872,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
  *	  Determines and returns the cost of an Append node.
  */
 void
-cost_append(AppendPath *apath)
+cost_append(PlannerInfo *root, AppendPath *apath)
 {
 	ListCell   *l;
 
@@ -1884,21 +1884,72 @@ cost_append(AppendPath *apath)
 
 	if (!apath->path.parallel_aware)
 	{
-		Path	   *subpath = (Path *) linitial(apath->subpaths);
+		List	   *pathkeys = apath->path.pathkeys;
 
-		/*
-		 * Startup cost of non-parallel-aware Append is the startup cost of
-		 * first subpath.
-		 */
-		apath->path.startup_cost = subpath->startup_cost;
+		if (pathkeys == NIL)
+		{
+			Path	   *subpath = (Path *) linitial(apath->subpaths);
 
-		/* Compute rows and costs as sums of subplan rows and costs. */
-		foreach(l, apath->subpaths)
+			/*
+			 * When there are no pathkeys the startup cost of
+			 * non-parallel-aware Append is the startup cost of the first
+			 * subpath.
+			 */
+			apath->path.startup_cost = subpath->startup_cost;
+
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+
+				apath->path.rows += subpath->rows;
+				apath->path.total_cost += subpath->total_cost;
+			}
+		}
+		else
 		{
-			Path	   *subpath = (Path *) lfirst(l);
+			/*
+			 * Otherwise we make the Append's startup cost the sum of the
+			 * startup cost of all the subpaths.  It may appear like we should
+			 * just be doing the same as above and take the startup cost of
+			 * just the initial subpath, however, it is possible that when a
+			 * LIMIT clause exists in the query that we could end up favoring
+			 * these ordered Append paths too much.  Imagine a scenario where
+			 * the initial subpath is already ordered and is estimated to
+			 * contain just 10 rows and the 2nd subpath requires a sort and is
+			 * estimated to have 10 million rows, if the query has LIMIT 11
+			 * then we could end up performing an expensive sort for just a
+			 * single row without having considered the startup cost for the
+			 * 2nd subpath.  Such a scenario could end up favoring a MergeJoin
+			 * plan instead of a Hash Join plan.
+			 */
+			foreach(l, apath->subpaths)
+			{
+				Path	   *subpath = (Path *) lfirst(l);
+				Path		sort_path;	/* dummy for result of cost_sort */
+
+				if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+				{
+					/*
+					 * We'll need to insert a Sort node, so include cost for
+					 * that.
+					 */
+					cost_sort(&sort_path,
+							  root,
+							  pathkeys,
+							  subpath->total_cost,
+							  subpath->parent->tuples,
+							  subpath->pathtarget->width,
+							  0.0,
+							  work_mem,
+							  apath->limit_tuples);
+
+					subpath = &sort_path;
+				}
 
-			apath->path.rows += subpath->rows;
-			apath->path.total_cost += subpath->total_cost;
+				apath->path.rows += subpath->rows;
+				apath->path.startup_cost += subpath->startup_cost;
+				apath->path.total_cost += subpath->total_cost;
+			}
 		}
 	}
 	else						/* parallel-aware */
@@ -1906,6 +1957,8 @@ cost_append(AppendPath *apath)
 		int			i = 0;
 		double		parallel_divisor = get_parallel_divisor(&apath->path);
 
+		Assert(apath->path.pathkeys == NIL);
+
 		/* Calculate startup cost. */
 		foreach(l, apath->subpaths)
 		{
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 34cc7dacdf..5f865ed01a 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1260,7 +1260,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->partial_pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL,
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL, NIL,
 											  rel->lateral_relids,
 											  0, false, NIL, -1));
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 56d839bb31..aef11e0832 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -18,16 +18,20 @@
 #include "postgres.h"
 
 #include "access/stratnum.h"
+#include "catalog/pg_opfamily.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/plannodes.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
+#include "partitioning/partbounds.h"
 #include "utils/lsyscache.h"
 
 
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
+static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
+								 int partkeycol, RelOptInfo *partrel);
 static bool right_merge_direction(PlannerInfo *root, PathKey *pathkey);
 
 
@@ -546,6 +550,153 @@ build_index_pathkeys(PlannerInfo *root,
 	return retval;
 }
 
+/*
+ * partkey_is_bool_constant_for_query
+ *
+ * If a partition key column is constrained to have a constant value by the
+ * query's WHERE conditions, then it's irrelevant for sort-order
+ * considerations.  Restriction clauses like WHERE partkeycol = constant, get
+ * turned into an EquivalenceClass containing a constant, which is recognized
+ * as redundant by build_partition_pathkeys().  But if the partition column is
+ * a boolean variable (or expression), then we are not going to see WHERE
+ * partkeycol = constant, because expression preprocessing will have
+ * simplified that to "WHERE partkeycol" or "WHERE NOT partkeycol".  So we are
+ * not going to have a matching EquivalenceClass (unless the query also
+ * contains "ORDER BY partkeycol").  To allow such cases to work the same as
+ * they would for non-boolean values, this function is provided to detect
+ * whether the specified partkey column matches a boolean restriction clause.
+ */
+static bool
+partkey_is_bool_constant_for_query(RelOptInfo *partrel, int partkeycol)
+{
+	PartitionScheme partscheme;
+	ListCell   *lc;
+
+	partscheme = partrel->part_scheme;
+
+	/* If the partkey isn't boolean, we can't possibly get a match */
+	if (!IsBooleanOpfamily(partscheme->partopfamily[partkeycol]))
+		return false;
+
+	/* Check each restriction clause for partrel */
+	foreach(lc, partrel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *)lfirst(lc);
+
+		/* Skip pseudoconstant quals */
+		if (rinfo->pseudoconstant)
+			continue;
+
+		/* See if we can match the clause's expression to the partkey column */
+		if (matches_boolean_partition_clause(rinfo, partkeycol, partrel))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * matches_boolean_partition_clause
+ *		Determine if rinfo matches partrel's 'partkeycol' partition key
+ *		column.
+ */
+static bool
+matches_boolean_partition_clause(RestrictInfo *rinfo, int partkeycol,
+	RelOptInfo *partrel)
+{
+	Node	   *clause = (Node *)rinfo->clause;
+	Expr	   *partexpr = (Expr *)linitial(partrel->partexprs[partkeycol]);
+
+	/* Direct match? */
+	if (equal(partexpr, clause))
+		return true;
+	/* NOT clause? */
+	else if (is_notclause(clause))
+	{
+		Node	   *arg = (Node *)get_notclausearg((Expr *)clause);
+
+		if (equal(partexpr, arg))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * build_partition_pathkeys
+ *	  Build a pathkeys list that describes the ordering induced by the
+ *	  partitions of 'partrel'.  (Callers must ensure that this partitioned
+ *	  table guarantees that lower order tuples never will be found in a
+ *	  later partition.)  Sets *partialkeys to true if pathkeys were only
+ *	  built for a prefix of the partition key, otherwise sets it to false.
+ */
+List *
+build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys)
+{
+	PartitionScheme partscheme;
+	List	   *retval = NIL;
+	int			i;
+
+	Assert(partitions_are_ordered(partrel));
+
+	partscheme = partrel->part_scheme;
+
+	for (i = 0; i < partscheme->partnatts; i++)
+	{
+		PathKey    *cpathkey;
+		Expr	   *keyCol = linitial(partrel->partexprs[i]);
+
+		/*
+		 * OK, try to make a canonical pathkey for this part key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 * A PartitionDesc always lists any NULL partition last, so we can
+		 * simply pass the ScanDirectionIsBackward(scandir) for nulls_first
+		 * since NULLS FIRST is the default for DESC, and NULLS LAST is the
+		 * default for ASC sort orders.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  keyCol,
+											  NULL,
+											  partscheme->partopfamily[i],
+											  partscheme->partopcintype[i],
+											  partscheme->partcollation[i],
+											  ScanDirectionIsBackward(scandir),
+											  ScanDirectionIsBackward(scandir),
+											  0,
+											  partrel->relids,
+											  false);
+
+		/*
+		 * When unable to create the pathkey we'll just need to return
+		 * whatever ones we have so far.
+		 */
+		if (cpathkey == NULL)
+		{
+			/*
+			 * Boolean partition keys might be redundant even if they do not
+			 * appear in an EquivalenceClass, because of our special treatment
+			 * of boolean equality conditions --- see the comment for
+			 * partkey_is_bool_constant_for_query().  If that applies, we can
+			 * continue to examine lower-order partition keys.  Otherwise, we
+			 * must abort and return any partial matches we've found so far.
+			 */
+			if (partkey_is_bool_constant_for_query(partrel, i))
+				continue;
+
+			*partialkeys = true;
+			return retval;
+		}
+
+		/* Add it to list, unless it's redundant. */
+		if (!pathkey_is_redundant(cpathkey, retval))
+			retval = lappend(retval, cpathkey);
+	}
+
+	*partialkeys = false;
+	return retval;
+}
+
 /*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index cc222cb06c..83388b8104 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -205,8 +205,6 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -1058,12 +1056,20 @@ create_join_plan(PlannerInfo *root, JoinPath *best_path)
 static Plan *
 create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
-	Append	   *plan;
+	Append	   *node = makeNode(Append);
+	Plan	   *plan = &node->plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
+	List	   *pathkeys = best_path->path.pathkeys;
 	List	   *subplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	PartitionPruneInfo *partpruneinfo = NULL;
+	AttrNumber *nodeSortColIdx;
+
+	plan->targetlist = tlist;
+	plan->qual = NIL;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1089,6 +1095,28 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
+	if (pathkeys != NIL)
+	{
+		int			nodenumsortkeys;
+		Oid		   *nodeSortOperators;
+		Oid		   *nodeCollations;
+		bool	   *nodeNullsFirst;
+
+		/*
+		 * Compute sort column info, and adjust the Append's tlist as needed.
+		 * We only need the 'nodeSortColIdx' from all of the output params.
+		 */
+		(void) prepare_sort_from_pathkeys(plan, pathkeys,
+										  best_path->path.parent->relids,
+										  NULL,
+										  true,
+										  &nodenumsortkeys,
+										  &nodeSortColIdx,
+										  &nodeSortOperators,
+										  &nodeCollations,
+										  &nodeNullsFirst);
+	}
+
 	/* Build the plan for each child */
 	foreach(subpaths, best_path->subpaths)
 	{
@@ -1098,6 +1126,39 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
+		/*
+		 * Now, for appends with pathkeys, insert a Sort node if subplan isn't
+		 * sufficiently ordered.
+		 */
+		if (pathkeys != NIL)
+		{
+			int			numsortkeys;
+			AttrNumber *sortColIdx;
+			Oid		   *sortOperators;
+			Oid		   *collations;
+			bool	   *nullsFirst;
+
+			/* Compute sort column info, and adjust subplan's tlist as needed */
+			subplan = prepare_sort_from_pathkeys(subplan, pathkeys,
+												 subpath->parent->relids,
+												 nodeSortColIdx,
+												 false,
+												 &numsortkeys,
+												 &sortColIdx,
+												 &sortOperators,
+												 &collations,
+												 &nullsFirst);
+
+			if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+			{
+				Sort	   *sort = make_sort(subplan, numsortkeys,
+											 sortColIdx, sortOperators,
+											 collations, nullsFirst);
+
+				label_sort_with_costsize(root, sort, best_path->limit_tuples);
+				subplan = (Plan *) sort;
+			}
+		}
 		subplans = lappend(subplans, subplan);
 	}
 
@@ -1140,10 +1201,11 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * won't match the parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, partpruneinfo);
+	node->appendplans = subplans;
+	node->first_partial_plan = best_path->first_partial_path;
+	node->part_prune_info = partpruneinfo;
 
-	copy_generic_path_info(&plan->plan, (Path *) best_path);
+	copy_generic_path_info(plan, (Path *) best_path);
 
 	return (Plan *) plan;
 }
@@ -5300,23 +5362,6 @@ make_foreignscan(List *qptlist,
 	return node;
 }
 
-static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, PartitionPruneInfo *partpruneinfo)
-{
-	Append	   *node = makeNode(Append);
-	Plan	   *plan = &node->plan;
-
-	plan->targetlist = tlist;
-	plan->qual = NIL;
-	plan->lefttree = NULL;
-	plan->righttree = NULL;
-	node->appendplans = appendplans;
-	node->first_partial_plan = first_partial_plan;
-	node->part_prune_info = partpruneinfo;
-	return node;
-}
-
 static RecursiveUnion *
 make_recursive_union(List *tlist,
 					 Plan *lefttree,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c430189572..6a6a9d4380 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1721,7 +1721,8 @@ inheritance_planner(PlannerInfo *root)
 
 		/* Make a dummy path, cf set_dummy_rel_pathlist() */
 		dummy_path = (Path *) create_append_path(NULL, final_rel, NIL, NIL,
-												 NULL, 0, false, NIL, -1);
+												 NIL, NULL, 0, false, NIL,
+												 -1);
 
 		/* These lists must be nonempty to make a valid ModifyTable node */
 		subpaths = list_make1(dummy_path);
@@ -4003,6 +4004,7 @@ create_degenerate_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 							   grouped_rel,
 							   paths,
 							   NIL,
+							   NIL,
 							   NULL,
 							   0,
 							   false,
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index eb815c2f12..65d77a5402 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -616,7 +616,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/*
@@ -671,7 +671,7 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 		Assert(parallel_workers > 0);
 
 		ppath = (Path *)
-			create_append_path(root, result_rel, NIL, partial_pathlist,
+			create_append_path(root, result_rel, NIL, partial_pathlist, NIL,
 							   NULL, parallel_workers, enable_parallel_append,
 							   NIL, -1);
 		ppath = (Path *)
@@ -782,7 +782,7 @@ generate_nonunion_paths(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	path = (Path *) create_append_path(root, result_rel, pathlist, NIL,
+	path = (Path *) create_append_path(root, result_rel, pathlist, NIL, NIL,
 									   NULL, 0, false, NIL, -1);
 
 	/* Identify the grouping semantics */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1ea89ff54c..db8dc02887 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1208,7 +1208,7 @@ AppendPath *
 create_append_path(PlannerInfo *root,
 				   RelOptInfo *rel,
 				   List *subpaths, List *partial_subpaths,
-				   Relids required_outer,
+				   List *pathkeys, Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows)
 {
@@ -1242,6 +1242,7 @@ create_append_path(PlannerInfo *root,
 	pathnode->path.parallel_aware = parallel_aware;
 	pathnode->path.parallel_safe = rel->consider_parallel;
 	pathnode->path.parallel_workers = parallel_workers;
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->partitioned_rels = list_copy(partitioned_rels);
 
 	/*
@@ -1251,10 +1252,14 @@ create_append_path(PlannerInfo *root,
 	 * costs.  There may be some paths that require to do startup work by a
 	 * single worker.  In such case, it's better for workers to choose the
 	 * expensive ones first, whereas the leader should choose the cheapest
-	 * startup plan.
+	 * startup plan.  Note: We mustn't fiddle with the order of subpaths when
+	 * the Append has valid pathkeys.  The order they're listed in is critical
+	 * to keeping the pathkeys valid.
 	 */
 	if (pathnode->path.parallel_aware)
 	{
+		Assert(pathkeys == NIL);
+
 		subpaths = list_qsort(subpaths, append_total_cost_compare);
 		partial_subpaths = list_qsort(partial_subpaths,
 									  append_startup_cost_compare);
@@ -1262,6 +1267,15 @@ create_append_path(PlannerInfo *root,
 	pathnode->first_partial_path = list_length(subpaths);
 	pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 
+	/*
+	 * Apply query-wide LIMIT if known and path is for sole base relation.
+	 * (Handling this at this low level is a bit klugy.)
+	 */
+	if (root != NULL && bms_equal(rel->relids, root->all_baserels))
+		pathnode->limit_tuples = root->limit_tuples;
+	else
+		pathnode->limit_tuples = -1.0;
+
 	foreach(l, pathnode->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
@@ -1291,10 +1305,7 @@ create_append_path(PlannerInfo *root,
 		pathnode->path.pathkeys = child->pathkeys;
 	}
 	else
-	{
-		pathnode->path.pathkeys = NIL;	/* unsorted if more than 1 subpath */
-		cost_append(pathnode);
-	}
+		cost_append(root, pathnode);
 
 	/* If the caller provided a row estimate, override the computed value. */
 	if (rows >= 0)
@@ -3759,7 +3770,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				}
 				return (Path *)
 					create_append_path(root, rel, childpaths, partialpaths,
-									   required_outer,
+									   NIL, required_outer,
 									   apath->path.parallel_workers,
 									   apath->path.parallel_aware,
 									   apath->partitioned_rels,
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index bdd0d23854..a24d849d35 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -25,6 +25,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
 #include "parser/parse_coerce.h"
 #include "partitioning/partbounds.h"
 #include "partitioning/partdesc.h"
@@ -861,6 +862,73 @@ partition_bounds_copy(PartitionBoundInfo src,
 	return dest;
 }
 
+/*
+ * partitions_are_ordered
+ *		For the partitioned table given in 'partrel', returns true if the
+ *		partitioned table guarantees that its direct partitions cannot allow
+ *		higher sort order tuples in a partition that comes earlier in the
+ *		PartitionDesc, i.e. if the partitions are scanned in order, then a
+ *		partition coming later in the PartitionDesc will only have tuples >
+ *		than tuples from all the previously scanned partitions.  NULL values,
+ *		if possible, must come in the last partition defined in the
+ *		PartitionDesc.  If out of order, or there are insufficient proofs to
+ *		know the order then we return false.
+ */
+bool
+partitions_are_ordered(RelOptInfo *partrel)
+{
+	PartitionBoundInfo boundinfo = partrel->boundinfo;
+
+	Assert(boundinfo != NULL);
+
+	switch (boundinfo->strategy)
+	{
+		case PARTITION_STRATEGY_RANGE:
+
+			/*
+			 * RANGE type partitions guarantee that the partitions can be
+			 * scanned in the order that they're defined in the PartitionDesc
+			 * to provide non-overlapping ranges of tuples.  We must disallow
+			 * when a DEFAULT partition exists as this could contain tuples
+			 * from either below or above the defined range, or contain tuples
+			 * belonging to gaps in the defined range.
+			 */
+
+			if (partition_bound_has_default(boundinfo))
+				return false;
+			break;
+
+		case PARTITION_STRATEGY_LIST:
+
+			/*
+			 * LIST partitions can also guarantee ordering, but we'd need to
+			 * ensure that partitions don't allow interleaved values.  We
+			 * could likely check for this looping over the PartitionBound's
+			 * indexes array checking that the indexes are in order.  For now,
+			 * let's just keep it simple and just accept LIST partitions
+			 * without a DEFAULT partition which only accept a single Datum
+			 * per partition and a NULL partition that does not accept any
+			 * other values.  Such a NULL partition will come last in the
+			 * PartitionDesc.  This is a cheap test to make as it does not
+			 * require any per-partition processing.  Maybe we'd like to
+			 * handle more complex cases in the future.
+			 */
+
+			if (partition_bound_has_default(boundinfo))
+				return false;
+
+			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo)
+				!= partrel->nparts)
+				return false;
+			break;
+
+		default:
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * check_new_partition_bound
  *
@@ -1680,6 +1748,8 @@ qsort_partition_hbound_cmp(const void *a, const void *b)
  * qsort_partition_list_value_cmp
  *
  * Compare two list partition bound datums.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
@@ -1697,6 +1767,8 @@ qsort_partition_list_value_cmp(const void *a, const void *b, void *arg)
  * qsort_partition_rbound_cmp
  *
  * Used when sorting range bounds across all range partitions.
+ *
+ * Note: If changing this, see build_partition_pathkeys()
  */
 static int32
 qsort_partition_rbound_cmp(const void *a, const void *b, void *arg)
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 15314a8cfe..066449954d 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -280,7 +280,13 @@ struct PlannerInfo
 
 	List	   *join_info_list; /* list of SpecialJoinInfos */
 
-	List	   *append_rel_list;	/* list of AppendRelInfos */
+	/*
+	 * list of AppendRelInfos.  For AppendRelInfos belonging to partitions of
+	 * a partitioned table, this list guarantees that partitions that come
+	 * earlier in the partitioned table's PartitionDesc will come earlier in
+	 * this list.
+	 */
+	List	   *append_rel_list;
 
 	List	   *rowMarks;		/* list of PlanRowMarks */
 
@@ -1366,6 +1372,7 @@ typedef struct AppendPath
 
 	/* Index of first partial path in subpaths */
 	int			first_partial_path;
+	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } AppendPath;
 
 #define IS_DUMMY_APPEND(p) \
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index ac6de0f6be..9da3f19c6e 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -103,7 +103,7 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 		  List *pathkeys, Cost input_cost, double tuples, int width,
 		  Cost comparison_cost, int sort_mem,
 		  double limit_tuples);
-extern void cost_append(AppendPath *path);
+extern void cost_append(PlannerInfo *root, AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 				  List *pathkeys, int n_streams,
 				  Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 5b577c12ec..abc69011df 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -64,7 +64,7 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals, Relids required_outer);
 extern AppendPath *create_append_path(PlannerInfo *root, RelOptInfo *rel,
-				   List *subpaths, List *partial_subpaths,
+				   List *subpaths, List *partial_subpaths, List *pathkeys,
 				   Relids required_outer,
 				   int parallel_workers, bool parallel_aware,
 				   List *partitioned_rels, double rows);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 36d12bc376..0e858097c8 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -194,6 +194,8 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 					 ScanDirection scandir);
+extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
+						 ScanDirection scandir, bool *partialkeys);
 extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
 						 Relids nullable_relids, Oid opno,
 						 Relids rel, bool create_it);
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index 683e1574ea..215a87f59b 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -75,6 +75,8 @@ typedef struct PartitionBoundInfoData
 #define partition_bound_accepts_nulls(bi) ((bi)->null_index != -1)
 #define partition_bound_has_default(bi) ((bi)->default_index != -1)
 
+struct RelOptInfo;
+
 extern int	get_hash_partition_greatest_modulus(PartitionBoundInfo b);
 extern uint64 compute_partition_hash_value(int partnatts, FmgrInfo *partsupfunc,
 							 Oid *partcollation,
@@ -88,6 +90,7 @@ extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
 					   PartitionBoundInfo b2);
 extern PartitionBoundInfo partition_bounds_copy(PartitionBoundInfo src,
 					  PartitionKey key);
+extern bool partitions_are_ordered(struct RelOptInfo *partrel);
 extern void check_new_partition_bound(char *relname, Relation parent,
 						  PartitionBoundSpec *spec);
 extern void check_default_partition_contents(Relation parent,
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 7518148df0..a94f44a652 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2037,7 +2037,237 @@ explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mc
          Filter: ((c > 20) AND (a = 20))
 (9 rows)
 
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Merge Append
+   Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan using mcrparted_def_a_abs_c_idx on mcrparted_def
+(9 rows)
+
+drop table mcrparted_def;
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5_a_abs_c_idx on mcrparted5
+(7 rows)
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Scan Backward using mcrparted5_a_abs_c_idx on mcrparted5
+   ->  Index Scan Backward using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan Backward using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan Backward using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan Backward using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan Backward using mcrparted0_a_abs_c_idx on mcrparted0
+(7 rows)
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+                                QUERY PLAN                                 
+---------------------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Merge Append
+         Sort Key: mcrparted5a.a, (abs(mcrparted5a.b)), mcrparted5a.c
+         ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+         ->  Index Scan using mcrparted5_def_a_abs_c_idx on mcrparted5_def
+(10 rows)
+
+drop table mcrparted5_def;
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+   ->  Index Scan using mcrparted4_a_abs_c_idx on mcrparted4
+   ->  Index Scan using mcrparted5a_a_abs_c_idx on mcrparted5a
+(7 rows)
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted0_a_abs_c_idx on mcrparted0
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a < 20)
+   ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+         Index Cond: (a < 20)
+(9 rows)
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+(3 rows)
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+explain (costs off) select * from mclparted order by a;
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted1.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4
+(6 rows)
+
+drop table mclparted;
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+                               QUERY PLAN                                
+-------------------------------------------------------------------------
+ Limit
+   ->  Append
+         ->  Sort
+               Sort Key: mcrparted0.a, (abs(mcrparted0.b)), mcrparted0.c
+               ->  Seq Scan on mcrparted0
+                     Filter: (a < 20)
+         ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+               Index Cond: (a < 20)
+         ->  Index Scan using mcrparted3_a_abs_c_idx on mcrparted3
+               Index Cond: (a < 20)
+(12 rows)
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Append
+   ->  Index Scan using mcrparted1_a_abs_c_idx on mcrparted1
+         Index Cond: (a = 10)
+   ->  Index Scan using mcrparted2_a_abs_c_idx on mcrparted2
+         Index Cond: (a = 10)
+(5 rows)
+
+reset enable_bitmapscan;
 drop table mcrparted;
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+explain (costs off) select * from bool_lp order by b;
+                            QUERY PLAN                            
+------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_lp_false_b_idx on bool_lp_false
+   ->  Index Only Scan using bool_lp_true_b_idx on bool_lp_true
+(3 rows)
+
+drop table bool_lp;
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_true_1k_b_a_idx on bool_rp_true_1k
+         Index Cond: (b = true)
+   ->  Index Only Scan using bool_rp_true_2k_b_a_idx on bool_rp_true_2k
+         Index Cond: (b = true)
+(5 rows)
+
+explain (costs off) select * from bool_rp where b = false order by b,a;
+                                QUERY PLAN                                
+--------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using bool_rp_false_1k_b_a_idx on bool_rp_false_1k
+         Index Cond: (b = false)
+   ->  Index Only Scan using bool_rp_false_2k_b_a_idx on bool_rp_false_2k
+         Index Cond: (b = false)
+(5 rows)
+
+drop table bool_rp;
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+explain (costs off) select * from range_parted order by a,b,c;
+                              QUERY PLAN                              
+----------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using range_parted1_a_b_c_idx on range_parted1
+   ->  Index Only Scan using range_parted2_a_b_c_idx on range_parted2
+(3 rows)
+
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+                                  QUERY PLAN                                   
+-------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan Backward using range_parted2_a_b_c_idx on range_parted2
+   ->  Index Only Scan Backward using range_parted1_a_b_c_idx on range_parted1
+(3 rows)
+
+drop table range_parted;
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 7806ba1d47..0789b316eb 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3078,14 +3078,14 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
 execute mt_q1(0);
@@ -3132,17 +3132,15 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(15);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=2 loops=1)
-   Sort Key: ma_test_p2.a
+   Sort Key: ma_test_p2.b
    Subplans Removed: 1
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
          Rows Removed by Filter: 9
-(11 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(9 rows)
 
 execute mt_q1(15);
  a  
@@ -3155,13 +3153,12 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(25);
                                   QUERY PLAN                                   
 -------------------------------------------------------------------------------
  Merge Append (actual rows=1 loops=1)
-   Sort Key: ma_test_p3.a
+   Sort Key: ma_test_p3.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=1 loops=1)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-         Rows Removed by Filter: 4
-(7 rows)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=1 loops=1)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+         Rows Removed by Filter: 9
+(6 rows)
 
 execute mt_q1(25);
  a  
@@ -3174,12 +3171,11 @@ explain (analyze, costs off, summary off, timing off) execute mt_q1(35);
                                QUERY PLAN                               
 ------------------------------------------------------------------------
  Merge Append (actual rows=0 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    Subplans Removed: 2
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-         Filter: ((a % 10) = 5)
-(6 rows)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: ((a >= $1) AND ((a % 10) = 5))
+(5 rows)
 
 execute mt_q1(35);
  a 
@@ -3188,23 +3184,23 @@ execute mt_q1(35);
 
 deallocate mt_q1;
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
                                                  QUERY PLAN                                                 
 ------------------------------------------------------------------------------------------------------------
  Merge Append (actual rows=20 loops=1)
-   Sort Key: ma_test_p1.a
+   Sort Key: ma_test_p1.b
    InitPlan 2 (returns $1)
      ->  Result (actual rows=1 loops=1)
            InitPlan 1 (returns $0)
              ->  Limit (actual rows=1 loops=1)
-                   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
-                         Index Cond: (a IS NOT NULL)
-   ->  Index Scan using ma_test_p1_a_idx on ma_test_p1 (never executed)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p2_a_idx on ma_test_p2 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
-   ->  Index Scan using ma_test_p3_a_idx on ma_test_p3 (actual rows=10 loops=1)
-         Index Cond: (a >= $1)
+                   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 ma_test_p2_1 (actual rows=1 loops=1)
+                         Index Cond: (b IS NOT NULL)
+   ->  Index Scan using ma_test_p1_b_idx on ma_test_p1 (never executed)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p2_b_idx on ma_test_p2 (actual rows=10 loops=1)
+         Filter: (a >= $1)
+   ->  Index Scan using ma_test_p3_b_idx on ma_test_p3 (actual rows=10 loops=1)
+         Filter: (a >= $1)
 (14 rows)
 
 reset enable_seqscan;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5480fe7db4..89a0f8c229 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -715,8 +715,111 @@ explain (costs off) select * from mcrparted where abs(b) = 5;	-- scans all parti
 explain (costs off) select * from mcrparted where a > -1;	-- scans all partitions
 explain (costs off) select * from mcrparted where a = 20 and abs(b) = 10 and c > 10;	-- scans mcrparted4
 explain (costs off) select * from mcrparted where a = 20 and c > 20; -- scans mcrparted3, mcrparte4, mcrparte5, mcrparted_def
+
+-- Test code that uses Append nodes in place of MergeAppend when the
+-- partitions guarantee earlier partitions means lower sort order of the
+-- tuples contained within.
+create index mcrparted_a_abs_c_idx on mcrparted (a, abs(b), c);
+
+-- check MergeAppend is used when a default partition exists
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted_def;
+
+-- check Append is used for RANGE partitioned table with no default and no subpartitions
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+-- check Append is used with subpaths in reverse order with backwards index scans.
+explain (costs off) select * from mcrparted order by a desc, abs(b) desc, c desc;
+
+-- check that Append plan is used containing a MergeAppend for sub-partitions
+-- that are unordered.
+drop table mcrparted5;
+create table mcrparted5 partition of mcrparted for values from (20, 20, 20) to (maxvalue, maxvalue, maxvalue) partition by list (a);
+create table mcrparted5a partition of mcrparted5 for values in(20);
+create table mcrparted5_def partition of mcrparted5 default;
+
+explain (costs off) select * from mcrparted order by a, abs(b), c;
+
+drop table mcrparted5_def;
+
+-- check that an Append plan is used and the sub-partitions are flattened
+-- into the main Append when the sub-partition is unordered but contains
+-- just a single sub-partition.
+explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
+
+-- check that Append is used when the sub-partitioned tables are pruned during planning.
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
+
+create table mclparted (a int) partition by list(a);
+create table mclparted1 partition of mclparted for values in(1);
+create table mclparted2 partition of mclparted for values in(2);
+create index on mclparted (a);
+
+-- Ensure an Append is used to for a list partition with an order by.
+explain (costs off) select * from mclparted order by a;
+
+-- Ensure a MergeAppend is used when a partition exists with interleaved
+-- datums in the partition bound.
+create table mclparted3_5 partition of mclparted for values in(3,5);
+create table mclparted4 partition of mclparted for values in(4);
+
+explain (costs off) select * from mclparted order by a;
+
+drop table mclparted;
+
+-- Ensure subplans which don't have a path with the correct pathkeys get
+-- sorted correctly.
+drop index mcrparted_a_abs_c_idx;
+create index on mcrparted1 (a, abs(b), c);
+create index on mcrparted2 (a, abs(b), c);
+create index on mcrparted3 (a, abs(b), c);
+create index on mcrparted4 (a, abs(b), c);
+
+explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c limit 1;
+
+set enable_bitmapscan = 0;
+-- Ensure Append node can be used when the partition is ordered by some
+-- pathkeys which were deemed redundant.
+explain (costs off) select * from mcrparted where a = 10 order by a, abs(b), c;
+reset enable_bitmapscan;
+
 drop table mcrparted;
 
+-- Ensure LIST partitions allow an Append to be used instead of a MergeAppend
+create table bool_lp (b bool) partition by list(b);
+create table bool_lp_true partition of bool_lp for values in(true);
+create table bool_lp_false partition of bool_lp for values in(false);
+create index on bool_lp (b);
+
+explain (costs off) select * from bool_lp order by b;
+
+drop table bool_lp;
+
+-- Ensure const bool quals can be properly detected as redundant
+create table bool_rp (b bool, a int) partition by range(b,a);
+create table bool_rp_false_1k partition of bool_rp for values from (false,0) to (false,1000);
+create table bool_rp_true_1k partition of bool_rp for values from (true,0) to (true,1000);
+create table bool_rp_false_2k partition of bool_rp for values from (false,1000) to (false,2000);
+create table bool_rp_true_2k partition of bool_rp for values from (true,1000) to (true,2000);
+create index on bool_rp (b,a);
+explain (costs off) select * from bool_rp where b = true order by b,a;
+explain (costs off) select * from bool_rp where b = false order by b,a;
+
+drop table bool_rp;
+
+-- Ensure an Append scan is chosen when the partition order is a subset of
+-- the required order.
+create table range_parted (a int, b int, c int) partition by range(a, b);
+create table range_parted1 partition of range_parted for values from (0,0) to (10,10);
+create table range_parted2 partition of range_parted for values from (10,10) to (20,20);
+create index on range_parted (a,b,c);
+
+explain (costs off) select * from range_parted order by a,b,c;
+explain (costs off) select * from range_parted order by a desc,b desc,c desc;
+
+drop table range_parted;
+
 -- check that partitioned table Appends cope with being referenced in
 -- subplans
 create table parted_minmax (a int, b varchar(16)) partition by range (a);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index 2e4d2b483d..c30e58eef7 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -775,15 +775,15 @@ drop table boolp;
 --
 set enable_seqscan = off;
 set enable_sort = off;
-create table ma_test (a int) partition by range (a);
+create table ma_test (a int, b int) partition by range (a);
 create table ma_test_p1 partition of ma_test for values from (0) to (10);
 create table ma_test_p2 partition of ma_test for values from (10) to (20);
 create table ma_test_p3 partition of ma_test for values from (20) to (30);
-insert into ma_test select x from generate_series(0,29) t(x);
-create index on ma_test (a);
+insert into ma_test select x,x from generate_series(0,29) t(x);
+create index on ma_test (b);
 
 analyze ma_test;
-prepare mt_q1 (int) as select * from ma_test where a >= $1 and a % 10 = 5 order by a;
+prepare mt_q1 (int) as select a from ma_test where a >= $1 and a % 10 = 5 order by b;
 
 -- Execute query 5 times to allow choose_custom_plan
 -- to start considering a generic plan.
@@ -804,7 +804,7 @@ execute mt_q1(35);
 deallocate mt_q1;
 
 -- ensure initplan params properly prune partitions
-explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(a) from ma_test_p2) order by a;
+explain (analyze, costs off, summary off, timing off) select * from ma_test where a >= (select min(b) from ma_test_p2) order by b;
 
 reset enable_seqscan;
 reset enable_sort;
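
For readers skimming the patch, the decision made by partitions_are_ordered() can be modelled outside the planner roughly as follows. This is only a sketch under stated assumptions: BoundSketch, Strategy and partitions_are_ordered_sketch are hypothetical stand-ins invented for illustration, not PostgreSQL types or functions, and the DEFAULT-partition check is hoisted above the switch for brevity (the patch performs it inside each case, with the same result).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the partition strategy; not a PostgreSQL type. */
typedef enum
{
	STRATEGY_HASH,
	STRATEGY_LIST,
	STRATEGY_RANGE
} Strategy;

/* Hypothetical, flattened view of the few fields the check looks at. */
typedef struct
{
	Strategy	strategy;
	int			ndatums;		/* LIST: number of distinct bound values */
	int			null_index;		/* partition accepting NULLs, or -1 */
	int			default_index;	/* DEFAULT partition, or -1 */
	int			nparts;			/* number of direct partitions */
} BoundSketch;

/*
 * Returns true when scanning the partitions in bound order is guaranteed
 * to yield tuples in sort order, so an Append can stand in for a
 * MergeAppend.
 */
static bool
partitions_are_ordered_sketch(const BoundSketch *b)
{
	assert(b != NULL);

	/* A DEFAULT partition can hold values from anywhere, so give up. */
	if (b->default_index != -1)
		return false;

	switch (b->strategy)
	{
		case STRATEGY_RANGE:
			/* Non-overlapping ranges are stored in ascending order. */
			return true;

		case STRATEGY_LIST:
			/*
			 * Accept only the simple case: exactly one value per
			 * partition, plus optionally one partition holding only NULLs.
			 */
			return b->ndatums + (b->null_index != -1 ? 1 : 0) == b->nparts;

		default:
			/* HASH partitions carry no ordering guarantee. */
			return false;
	}
}
```

With the regression-test tables above, mclparted with partitions for (1) and (2) has two datums and two partitions and so passes the LIST test; once mclparted3_5 and mclparted4 are added there are five datums across four partitions, the test fails, and the planner falls back to a MergeAppend.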

#78

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

almost 7 years ago

In reply to: David Rowley (#77)

Re: Ordered Partitioned Table Scans

On 2019/04/03 3:10, David Rowley wrote:

On Wed, 3 Apr 2019 at 01:26, Amit Langote <amitlangote09@gmail.com> wrote:

+#include "nodes/pathnodes.h"
...
+partitions_are_ordered(struct RelOptInfo *partrel)

Maybe, "struct" is unnecessary?

I just left it there so that the signature matched the header file.
Looking around for examples I see make_partition_pruneinfo() has the
structs only in the header file, so I guess that is how we do things,
so changed to that in the attached.

Ah, I see. Thanks for updating the patch.

I don't have any more comments.

Thanks,
Amit
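
The "struct" question in the exchange quoted above can be illustrated with a
small single-file sketch. It assumes the usual C arrangement: the prototype in
a header names `struct RelOptInfo` so that the header need not include the
file defining the type, while the definition in the .c file uses the typedef
name. The two struct fields and the function body are hypothetical
placeholders, not the real PostgreSQL definitions or the check the patch
performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Header part (sketch): a pointer parameter only needs the incomplete type
 * "struct RelOptInfo", so the header does not have to include the file that
 * defines it.
 */
struct RelOptInfo;
extern bool partitions_are_ordered(struct RelOptInfo *partrel);

/*
 * Stand-in for the defining header.  These two fields are hypothetical and
 * exist only so the example compiles; the real struct is far larger.
 */
typedef struct RelOptInfo
{
    int  nparts;
    bool has_default_part;
} RelOptInfo;

/*
 * .c part (sketch): the typedef is visible here, so "struct" can be dropped.
 * RelOptInfo and struct RelOptInfo are the same type, so this definition is
 * compatible with the prototype above.  The body is placeholder logic only.
 */
bool
partitions_are_ordered(RelOptInfo *partrel)
{
    assert(partrel != NULL);
    return partrel->nparts > 0 && !partrel->has_default_part;
}
```

Because both spellings denote the same type, the compiler accepts the
prototype and the definition as matching; the only difference is which files
must be included to see the typedef.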

#79

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Amit Langote (#78)

Re: Ordered Partitioned Table Scans

On Wed, 3 Apr 2019 at 14:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

I don't have any more comments.

Great. Many thanks for having a look at this. Going by [1], Julien
also seems pretty happy, so I'm going to change this over to ready for
committer.

[1]: /messages/by-id/CAOBaU_YpTQbFqcP5jYJZETPL6mgYuTwVTVVBZKZKC6XDBTDkfQ@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#80

Julien Rouhaud

rjuju123@gmail.com

almost 7 years ago

In reply to: David Rowley (#79)

Re: Ordered Partitioned Table Scans

Hi,

On Wed, Apr 3, 2019 at 7:47 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Wed, 3 Apr 2019 at 14:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

I don't have any more comments.

Great. Many thanks for having a look at this. Going by [1], Julien
also seems pretty happy, so I'm going to change this over to ready for
committer.

Yes, I'm very sorry I didn't have time to come back to this thread
yesterday. I was fine with the patch (I didn't notice the
partprune.c part though), and I'm happy with Amit's further comments
and this last patch, so +1 for me!

#81

Tom Lane

tgl@sss.pgh.pa.us

almost 7 years ago

In reply to: David Rowley (#79)

Re: Ordered Partitioned Table Scans

David Rowley <david.rowley@2ndquadrant.com> writes:

On Wed, 3 Apr 2019 at 14:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

I don't have any more comments.

Great. Many thanks for having a look at this. Going by [1], Julien
also seems pretty happy, so I'm going to change this over to ready for
committer.

Pushed with some hacking, mostly trying to improve the comments.
The only really substantive thing I changed was that you'd missed
teaching ExecSetTupleBound about descending through Append.
Now that we can have plans that are Limit-above-Append-above-Sort,
that's important.

regards, tom lane
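
A toy model may help show why the bound has to be passed down through Append.
With a plan shaped Limit above Append above Sort, each Sort can only be told
to keep the first N tuples if the Limit's requirement reaches it. The node
types, field names, and the function name `set_tuple_bound` below are
simplified stand-ins invented for this sketch; they are not PostgreSQL's
executor structures, and this is not the committed ExecSetTupleBound code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced plan node used only for this illustration. */
typedef enum { NODE_APPEND, NODE_SORT, NODE_SCAN } NodeTag;

typedef struct PlanNode
{
    NodeTag           tag;
    struct PlanNode **children;   /* used by NODE_APPEND only */
    int               nchildren;
    long              bound;      /* used by NODE_SORT only; -1 = unbounded */
} PlanNode;

/*
 * Tell the subtree rooted at "node" that its parent will read at most
 * "tuples_needed" tuples from it.
 */
void
set_tuple_bound(long tuples_needed, PlanNode *node)
{
    int i;

    assert(node != NULL);

    switch (node->tag)
    {
        case NODE_SORT:
            /* A Sort can remember the bound and keep only that many rows. */
            node->bound = tuples_needed;
            break;

        case NODE_APPEND:
            /*
             * An Append returns at most tuples_needed rows in total, so no
             * single child can be asked for more than that.  Without this
             * case the Sorts below an Append would never learn the bound.
             */
            for (i = 0; i < node->nchildren; i++)
                set_tuple_bound(tuples_needed, node->children[i]);
            break;

        default:
            /* Other node types simply ignore the hint in this sketch. */
            break;
    }
}
```

In this model, calling `set_tuple_bound(10, append)` on an Append with two
Sort children and one plain scan sets both Sorts' bound to 10 and leaves the
scan and the Append itself untouched.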

#82

David Rowley

david.rowley@2ndquadrant.com

almost 7 years ago

In reply to: Tom Lane (#81)

Re: Ordered Partitioned Table Scans

On Sat, 6 Apr 2019 at 12:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Pushed with some hacking, mostly trying to improve the comments.

Great! Many thanks for improving those and pushing it.

Many thanks to Julien, Antonin for their detailed reviews on this.
Thanks Amit for your input on this as well. Much appreciated.

The only really substantive thing I changed was that you'd missed
teaching ExecSetTupleBound about descending through Append.
Now that we can have plans that are Limit-above-Append-above-Sort,
that's important.

Oops. I wasn't aware of that until now.


#83

Julien Rouhaud

rjuju123@gmail.com

almost 7 years ago

In reply to: David Rowley (#82)

Re: Ordered Partitioned Table Scans

On Sat, Apr 6, 2019 at 2:45 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Sat, 6 Apr 2019 at 12:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Pushed with some hacking, mostly trying to improve the comments.

Great! Many thanks for improving those and pushing it.

Many thanks to Julien, Antonin for their detailed reviews on this.
Thanks Amit for your input on this as well. Much appreciated.

Thanks! I'm glad that we'll have this optimization for pg12.

#84

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

almost 7 years ago

In reply to: Julien Rouhaud (#83)

Re: Ordered Partitioned Table Scans

On 2019/04/06 18:06, Julien Rouhaud wrote:

On Sat, Apr 6, 2019 at 2:45 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On Sat, 6 Apr 2019 at 12:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Pushed with some hacking, mostly trying to improve the comments.

Great! Many thanks for improving those and pushing it.

Many thanks to Julien, Antonin for their detailed reviews on this.
Thanks Amit for your input on this as well. Much appreciated.

Thanks! I'm glad that we'll have this optimization for pg12.

+1, thank you for working on this. It's nice to see partitioning being
useful to optimize query execution in even more cases.

Regards,
Amit