Path question

Started by Boszormenyi Zoltanover 15 years ago28 messages
#1Boszormenyi Zoltan
zb@cybertec.at

Hi,

we are experimenting with modifying table partitioning
so the ORDER BY clause can be pushed down to
child nodes on the grounds that:
1. smaller datasets are faster to sort, e.g. two datasets that almost
spill out to disk are faster to sort in memory and later merge them
than the union set that spills out to disk, or two larger sets
that spill out to disk are faster to sort individually than the
union dataset (because of the longer seeks, etc)
2. individual child nodes can have indexes that produce
the sorted output already

Currently I am abusing the AppendPath node but later I will add a new
executor node that will merge the sorted input coming from the children.

I added the pathkey to the AppendPath to reflect that it produces
sorted output and I am adding the Sort plan to the children.

My problem is:

zozo=# explain select * from t1 where d = '2008-01-01' order by d;
QUERY
PLAN
---------------------------------------------------------------------------------------------------
Result (cost=8.28..33.13 rows=4 width=40)
-> Append (cost=8.28..33.13 rows=4 width=40)
-> Sort (cost=8.28..8.28 rows=1 width=40)
Sort Key: public.t1.d
-> Index Scan using t1_d_key on t1 (cost=0.00..8.27
rows=1 width=40)
Index Cond: (d = '2008-01-01'::date)
-> Sort (cost=8.28..8.28 rows=1 width=40)
Sort Key: public.t1.d
-> Index Scan using t1_2008_d_key on t1_2008 t1
(cost=0.00..8.27 rows=1 width=40)
Index Cond: (d = '2008-01-01'::date)
-> Sort (cost=8.28..8.28 rows=1 width=40)
Sort Key: public.t1.d
-> Index Scan using t1_2009_d_key on t1_2009 t1
(cost=0.00..8.27 rows=1 width=40)
Index Cond: (d = '2008-01-01'::date)
-> Sort (cost=8.28..8.28 rows=1 width=40)
Sort Key: public.t1.d
-> Index Scan using t1_2010_d_key on t1_2010 t1
(cost=0.00..8.27 rows=1 width=40)
Index Cond: (d = '2008-01-01'::date)
(18 rows)

In one leaf, e.g.:

-> Sort (cost=8.28..8.28 rows=1 width=40)
Sort Key: public.t1.d
-> Index Scan using t1_2010_d_key on t1_2010 t1
(cost=0.00..8.27 rows=1 width=40)
Index Cond: (d = '2008-01-01'::date)

The plan is scanning the t_2010 child table, but the Sort is trying to
sort by the fully qualified parent table. I think this is a problem
but I don't know how to solve it. I have tried transforming the
parent query with

adjust_appendrel_attrs((Node *) parse, appinfo)

where parse is

Query *parse = root->parse;

in set_append_rel_pathlist() and the transformed query trees are
used for the children with

make_sort_from_sortclauses(root, query->sortClause, subplan)

in create_append_plan(). adjust_appendrel_attrs() seems to be prepared
to be called with a Query * , so I don't know why the above leaf plan
doesn't show "Sort Key: public.t1_2010.d" and so on.

Can someone help me?

Best regards,
Zolt�n B�sz�tm�nyi

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Boszormenyi Zoltan (#1)
Re: Path question

Boszormenyi Zoltan <zb@cybertec.at> writes:

we are experimenting with modifying table partitioning
so the ORDER BY clause can be pushed down to
child nodes on the grounds that:

This is really premature, and anything you do along those lines now will
probably never get committed. The problem is that the transformation
you propose is wrong unless the planner can prove that the different
child tables contain nonoverlapping ranges of the sort key. Now you
might be intending to add logic to try to prove that from inspection of
constraints, but I don't believe that reverse-engineering such knowledge
on the fly is a sane approach: it will be hugely expensive and will add
that cost even in many situations where the optimization fails to apply.

The project direction is that we are going to add some explicit
representation of partitioned tables. After that, the planner can just
know immediately that a range-partitioned sort key is amenable to this
treatment, and at that point it'll make sense to work on it.

regards, tom lane

#3Greg Stark
gsstark@mit.edu
In reply to: Boszormenyi Zoltan (#1)
Re: Path question

2010/9/1 Boszormenyi Zoltan <zb@cybertec.at>:

we are experimenting with modifying table partitioning
so the ORDER BY clause can be pushed down to
child nodes on the grounds that:
1. smaller datasets are faster to sort, e.g. two datasets that almost
   spill out to disk are faster to sort in memory and later merge them
   than the union set that spills out to disk, or two larger sets
   that spill out to disk are faster to sort individually than the
   union dataset (because of the longer seeks, etc)

For what it's worth this logic doesn't necessarily hold. Sorting two
data sets in memory is only faster if you only need one result set at
a time. If you know the first set contains only records which come
before the second then you can do this but if not then you'll need all
the result sets to be available at the same time which will require
spilling them to disk. If you have to do that then you might as well
do a tapesort anyways.

2. individual child nodes can have indexes that produce
   the sorted output already

This is the big win. Currently Append nodes mean throwing out any
ordering from the child nodes and that's important information to be
losing.

Currently I am abusing the AppendPath node but later I will add a new
executor node that will merge the sorted input coming from the children.

You realize I already did this a few years ago. Search for "merge
append" on the lists. The hard part -- as usual -- ends up being
figuring out in the planner when to do this optimization and how to
avoid exponential growth in the number of plans considered and the
amount of data retained about each plan.

For what it's worth I disagree with Tom. I think this is a situation
where we need *both* types of solution. Ideally we will be able to use
a plain Append node for cases where we know the relative ordering of
the data in different partitions, but there will always be cases where
the structured partition data doesn't actually match up with the
ordering requested and we'll need to fall back to a merge-append node.

--
greg

In reply to: Tom Lane (#2)
Re: Path question

On Sep 1, 2010, at 4:10 PM, Tom Lane wrote:

Boszormenyi Zoltan <zb@cybertec.at> writes:

we are experimenting with modifying table partitioning
so the ORDER BY clause can be pushed down to
child nodes on the grounds that:

This is really premature, and anything you do along those lines now will
probably never get committed. The problem is that the transformation
you propose is wrong unless the planner can prove that the different
child tables contain nonoverlapping ranges of the sort key. Now you
might be intending to add logic to try to prove that from inspection of
constraints, but I don't believe that reverse-engineering such knowledge
on the fly is a sane approach: it will be hugely expensive and will add
that cost even in many situations where the optimization fails to apply.

well, why non-overlapping? the idea is to make append smart enough to take the sorted lists from below and merge them which will give sorted output as well.
my original idea was what you described but given Martijn van Oosterhout's posting we were pretty confident that we can get along without non-overlapping partitions.

The project direction is that we are going to add some explicit
representation of partitioned tables. After that, the planner can just
know immediately that a range-partitioned sort key is amenable to this
treatment, and at that point it'll make sense to work on it.

can you outline some ideas here and maybe point to some useful discussion here?

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: PostgreSQL - Hans-Jürgen Schönig (#4)
Re: Path question

=?iso-8859-1?Q?PostgreSQL_-_Hans-J=FCrgen_Sch=F6nig?= <postgres@cybertec.at> writes:

On Sep 1, 2010, at 4:10 PM, Tom Lane wrote:

This is really premature, and anything you do along those lines now will
probably never get committed.

well, why non-overlapping? the idea is to make append smart enough to
take the sorted lists from below and merge them which will give sorted
output as well.

Well, an extra merge step is going to change the cost comparisons quite
a bit; see Greg Starks' comments. But in any case, my point wasn't that
this is something we should never do; it was that it makes more sense to
wait till something has happened with explicit partitioning.

The project direction is that we are going to add some explicit
representation of partitioned tables.

can you outline some ideas here and maybe point to some useful discussion here?

There's been boatloads of discussion of partitioning, and at least one
submitted patch, over the past year or so ...

regards, tom lane

In reply to: Tom Lane (#5)
1 attachment(s)
Re: Path question

hello tom,

yeah, we have followed quite a lot of discussion as well ...
and yes, no patches.

as far as this problem is concerned: we are working on a patch and did some prototyping inside the planner already (attached).
the code we have is pretty limited atm (such as checking for a sort clause explicitly and so on - it has no support for windowing related optimizations and so on so far).

the cost model is not our problem - it is a lot easier to read than the code we are fighting with (it is a different level of complexity). i think costs can be handled.

yes, this merging adds some costs for sure. we might see a hell amount of operators being called - but i think given a reasonable number of partitions it is still a lot cheaper than actually resorting ... and, it is a lot more space efficient.
in my practical case i cannot resort because we would simply run out of space. an index scan is expensive but needs no additional sort space ...
and, merge is O(n) which sort is clearly not.

advise is highly appreciated.

many thanks,

hans

Attachments:

push-down-sort-into-inh-2.patchapplication/octet-stream; name=push-down-sort-into-inh-2.patchDownload
diff -durpN pgsql.orig/src/backend/optimizer/path/allpaths.c pgsql/src/backend/optimizer/path/allpaths.c
--- pgsql.orig/src/backend/optimizer/path/allpaths.c	2010-03-29 00:59:32.000000000 +0200
+++ pgsql/src/backend/optimizer/path/allpaths.c	2010-08-25 11:56:10.000000000 +0200
@@ -170,6 +170,10 @@ set_rel_pathlist(PlannerInfo *root, RelO
 	{
 		/* It's an "append relation", process accordingly */
 		set_append_rel_pathlist(root, rel, rti, rte);
+#if 0
+		elog(NOTICE, "set_rel_pathlist setting sortClause to NIL");
+		root->parse->sortClause = NIL;
+#endif
 	}
 	else if (rel->rtekind == RTE_SUBQUERY)
 	{
@@ -282,8 +286,10 @@ static void
 set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte)
 {
+	Query	   *parse = root->parse;
 	int			parentRTindex = rti;
 	List	   *subpaths = NIL;
+	List	   *pathkeys = NIL;
 	double		parent_rows;
 	double		parent_size;
 	double	   *parent_attrsizes;
@@ -309,6 +315,14 @@ set_append_rel_pathlist(PlannerInfo *roo
 	nattrs = rel->max_attr - rel->min_attr + 1;
 	parent_attrsizes = (double *) palloc0(nattrs * sizeof(double));
 
+	if (parse->sortClause)
+	{
+		pathkeys = make_pathkeys_for_sortclauses(root,
+								parse->sortClause,
+								parse->targetList,
+								false);
+	}
+
 	/*
 	 * Generate access paths for each member relation, and pick the cheapest
 	 * path for each one.
@@ -423,6 +437,7 @@ set_append_rel_pathlist(PlannerInfo *roo
 		set_rel_pathlist(root, childrel, childRTindex, childRTE);
 
 		childpath = childrel->cheapest_total_path;
+
 		if (IsA(childpath, AppendPath))
 			subpaths = list_concat(subpaths,
 								   ((AppendPath *) childpath)->subpaths);
@@ -487,7 +502,7 @@ set_append_rel_pathlist(PlannerInfo *roo
 	 * the parent rel.	(Note: this is correct even if we have zero or one
 	 * live subpath due to constraint exclusion.)
 	 */
-	add_path(rel, (Path *) create_append_path(rel, subpaths));
+	add_path(rel, (Path *) create_append_path(rel, subpaths, pathkeys, true));
 
 	/* Select cheapest path (pretty easy in this case...) */
 	set_cheapest(rel);
@@ -507,7 +522,7 @@ set_dummy_rel_pathlist(RelOptInfo *rel)
 	rel->rows = 0;
 	rel->width = 0;
 
-	add_path(rel, (Path *) create_append_path(rel, NIL));
+	add_path(rel, (Path *) create_append_path(rel, NIL, NIL, false));
 
 	/* Select cheapest path (pretty easy in this case...) */
 	set_cheapest(rel);
diff -durpN pgsql.orig/src/backend/optimizer/path/joinpath.c pgsql/src/backend/optimizer/path/joinpath.c
--- pgsql.orig/src/backend/optimizer/path/joinpath.c	2010-04-19 09:42:28.000000000 +0200
+++ pgsql/src/backend/optimizer/path/joinpath.c	2010-08-25 10:51:50.000000000 +0200
@@ -955,7 +955,7 @@ best_appendrel_indexscan(PlannerInfo *ro
 		return NULL;
 
 	/* Form and return the completed Append path. */
-	return (Path *) create_append_path(rel, append_paths);
+	return (Path *) create_append_path(rel, append_paths, NIL, false);
 }
 
 /*
diff -durpN pgsql.orig/src/backend/optimizer/path/joinrels.c pgsql/src/backend/optimizer/path/joinrels.c
--- pgsql.orig/src/backend/optimizer/path/joinrels.c	2010-02-26 03:00:45.000000000 +0100
+++ pgsql/src/backend/optimizer/path/joinrels.c	2010-08-25 10:51:53.000000000 +0200
@@ -932,7 +932,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(rel, NIL));
+	add_path(rel, (Path *) create_append_path(rel, NIL, NIL, false));
 
 	/* Set or update cheapest_total_path */
 	set_cheapest(rel);
diff -durpN pgsql.orig/src/backend/optimizer/plan/createplan.c pgsql/src/backend/optimizer/plan/createplan.c
--- pgsql.orig/src/backend/optimizer/plan/createplan.c	2010-07-13 10:51:03.000000000 +0200
+++ pgsql/src/backend/optimizer/plan/createplan.c	2010-08-25 12:03:45.000000000 +0200
@@ -605,12 +605,40 @@ create_append_plan(PlannerInfo *root, Ap
 									NULL);
 	}
 
-	/* Normal case with multiple subpaths */
-	foreach(subpaths, best_path->subpaths)
+	if (best_path->inherited && root->parse->sortClause)
 	{
-		Path	   *subpath = (Path *) lfirst(subpaths);
+		/* Query from a partitioned table with ORDER BY */
+		foreach(subpaths, best_path->subpaths)
+		{
+			Path	   *subpath = (Path *) lfirst(subpaths);
+			Plan	   *subplan = create_plan_recurse(root, subpath);
+			ListCell   *parent_lc;
+			List	   *parent_tlist = root->parse->targetList;
+			ListCell   *child_lc;
+			List	   *child_tlist = subplan->targetlist;
 
-		subplans = lappend(subplans, create_plan_recurse(root, subpath));
+			/* Adjust ressortgroupref of the child targetlist */
+			forboth(parent_lc, parent_tlist, child_lc, child_tlist)
+			{
+				TargetEntry	   *parent_tle = (TargetEntry *) lfirst(parent_lc);
+				TargetEntry	   *child_tle  = (TargetEntry *) lfirst(child_lc);
+
+				child_tle->ressortgroupref = parent_tle->ressortgroupref;
+			}
+
+			subplans = lappend(subplans,
+							make_sort_from_sortclauses(root, root->parse->sortClause, subplan));
+		}
+	}
+	else
+	{
+		/* Normal case with multiple subpaths */
+		foreach(subpaths, best_path->subpaths)
+		{
+			Path	   *subpath = (Path *) lfirst(subpaths);
+
+			subplans = lappend(subplans, create_plan_recurse(root, subpath));
+		}
 	}
 
 	plan = make_append(subplans, tlist);
diff -durpN pgsql.orig/src/backend/optimizer/util/pathnode.c pgsql/src/backend/optimizer/util/pathnode.c
--- pgsql.orig/src/backend/optimizer/util/pathnode.c	2010-03-29 00:59:33.000000000 +0200
+++ pgsql/src/backend/optimizer/util/pathnode.c	2010-08-25 11:15:51.000000000 +0200
@@ -638,16 +638,17 @@ create_tidscan_path(PlannerInfo *root, R
  *	  pathnode.
  */
 AppendPath *
-create_append_path(RelOptInfo *rel, List *subpaths)
+create_append_path(RelOptInfo *rel, List *subpaths, List *pathkeys, bool inherited)
 {
 	AppendPath *pathnode = makeNode(AppendPath);
 	ListCell   *l;
 
 	pathnode->path.pathtype = T_Append;
 	pathnode->path.parent = rel;
-	pathnode->path.pathkeys = NIL;		/* result is always considered
+	pathnode->path.pathkeys = pathkeys;		/* result is always considered
 										 * unsorted */
 	pathnode->subpaths = subpaths;
+	pathnode->inherited = inherited;
 
 	pathnode->path.startup_cost = 0;
 	pathnode->path.total_cost = 0;
diff -durpN pgsql.orig/src/include/nodes/relation.h pgsql/src/include/nodes/relation.h
--- pgsql.orig/src/include/nodes/relation.h	2010-07-13 10:51:07.000000000 +0200
+++ pgsql/src/include/nodes/relation.h	2010-08-25 10:50:02.000000000 +0200
@@ -750,6 +750,7 @@ typedef struct AppendPath
 {
 	Path		path;
 	List	   *subpaths;		/* list of component Paths */
+	bool		inherited;
 } AppendPath;
 
 #define IS_DUMMY_PATH(p) \
diff -durpN pgsql.orig/src/include/optimizer/pathnode.h pgsql/src/include/optimizer/pathnode.h
--- pgsql.orig/src/include/optimizer/pathnode.h	2010-03-29 00:59:33.000000000 +0200
+++ pgsql/src/include/optimizer/pathnode.h	2010-08-25 10:50:23.000000000 +0200
@@ -46,7 +46,8 @@ extern BitmapOrPath *create_bitmap_or_pa
 					  List *bitmapquals);
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals);
-extern AppendPath *create_append_path(RelOptInfo *rel, List *subpaths);
+extern AppendPath *create_append_path(RelOptInfo *rel, List *subpaths,
+					List *pathkeys, bool inherited);
 extern ResultPath *create_result_path(List *quals);
 extern MaterialPath *create_material_path(RelOptInfo *rel, Path *subpath);
 extern UniquePath *create_unique_path(PlannerInfo *root, RelOptInfo *rel,
#7Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#3)
Re: Path question

On Sep 1, 2010, at 10:21 AM, Greg Stark <gsstark@mit.edu> wrote:

For what it's worth I disagree with Tom. I think this is a situation
where we need *both* types of solution. Ideally we will be able to use
a plain Append node for cases where we know the relative ordering of
the data in different partitions, but there will always be cases where
the structured partition data doesn't actually match up with the
ordering requested and we'll need to fall back to a merge-append node.

I agree. Explicit partitioning may open up some additional optimization possibilities in certain cases, but Merge Append is more general and extremely valuable in its own right.

...Robert

In reply to: Robert Haas (#7)
1 attachment(s)
Re: Path question

On Sep 2, 2010, at 1:20 AM, Robert Haas wrote:

On Sep 1, 2010, at 10:21 AM, Greg Stark <gsstark@mit.edu> wrote:

For what it's worth I disagree with Tom. I think this is a situation
where we need *both* types of solution. Ideally we will be able to use
a plain Append node for cases where we know the relative ordering of
the data in different partitions, but there will always be cases where
the structured partition data doesn't actually match up with the
ordering requested and we'll need to fall back to a merge-append node.

I agree. Explicit partitioning may open up some additional optimization possibilities in certain cases, but Merge Append is more general and extremely valuable in its own right.

...Robert
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

we have revised greg's wonderful work and ported the entire thing to head.
it solves the problem of merge_append. i did some testing earlier on today and it seems most important cases are working nicely.

here are some test cases:

test=# \d t_data
Table "public.t_data"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
tstamp | date |

test=# \d t_data_1
Table "public.t_data_1"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
tstamp | date |
Indexes:
"idx_1" btree (id)
Check constraints:
"t_data_1_id_check" CHECK (id >= 1 AND id <= 10000)
Inherits: t_data

test=# \d t_data_2
Table "public.t_data_2"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
tstamp | date |
Indexes:
"idx_2" btree (id)
Check constraints:
"t_data_2_id_check" CHECK (id >= 10001 AND id <= 20000)
Inherits: t_data

test=# \d t_data_3
Table "public.t_data_3"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
tstamp | date |
Indexes:
"idx_3" btree (id)
Check constraints:
"t_data_3_id_check" CHECK (id >= 20001 AND id <= 30000)
Inherits: t_data

simple windowing ...

test=# explain select *, max(id) OVER ( ORDER BY id) from t_data ;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
WindowAgg (cost=149.99..2154.43 rows=32140 width=8)
-> Result (cost=149.99..1672.33 rows=32140 width=8)
-> Append (cost=149.99..1672.33 rows=32140 width=8)
-> Sort (cost=149.78..155.13 rows=2140 width=8)
Sort Key: public.t_data.id
-> Seq Scan on t_data (cost=0.00..31.40 rows=2140 width=8)
-> Index Scan using idx_1 on t_data_1 t_data (cost=0.00..318.25 rows=10000 width=8)
-> Index Scan using idx_2 on t_data_2 t_data (cost=0.00..318.25 rows=10000 width=8)
-> Index Scan using idx_3 on t_data_3 t_data (cost=0.00..318.25 rows=10000 width=8)
(9 rows)

it does a nice index scan; merges the stuff and puts it up into the high level doing the windowing.

test=# select *, max(id) OVER ( ORDER BY id) from t_data LIMIT 10;
id | tstamp | max
----+------------+-----
1 | 2010-01-01 | 1
2 | 2010-01-01 | 2
3 | 2010-01-01 | 3
4 | 2010-01-01 | 4
5 | 2010-01-01 | 5
6 | 2010-01-01 | 6
7 | 2010-01-01 | 7
8 | 2010-01-01 | 8
9 | 2010-01-01 | 9
10 | 2010-01-01 | 10
(10 rows)

the cost model does what it should as well:

test=# explain select *, max(id) OVER ( ORDER BY id) from t_data ;
QUERY PLAN
---------------------------------------------------------------------------------------------
WindowAgg (cost=2872.41..3434.86 rows=32140 width=8)
-> Sort (cost=2872.41..2952.76 rows=32140 width=8)
Sort Key: public.t_data.id
-> Result (cost=0.00..466.40 rows=32140 width=8)
-> Append (cost=0.00..466.40 rows=32140 width=8)
-> Seq Scan on t_data (cost=0.00..31.40 rows=2140 width=8)
-> Seq Scan on t_data_1 t_data (cost=0.00..145.00 rows=10000 width=8)
-> Seq Scan on t_data_2 t_data (cost=0.00..145.00 rows=10000 width=8)
-> Seq Scan on t_data_3 t_data (cost=0.00..145.00 rows=10000 width=8)
(9 rows)

it has proven to be really valuable in my first tests.
maybe this is helpful for some people out there.

many thanks,

hans

Attachments:

merge-append-91-v1.diffapplication/octet-stream; name=merge-append-91-v1.diffDownload
diff -durpN pgsql.orig/src/backend/executor/nodeAppend.c pgsql/src/backend/executor/nodeAppend.c
--- pgsql.orig/src/backend/executor/nodeAppend.c	2010-07-13 10:51:01.000000000 +0200
+++ pgsql/src/backend/executor/nodeAppend.c	2010-09-03 10:50:49.000000000 +0200
@@ -59,9 +59,25 @@
 
 #include "executor/execdebug.h"
 #include "executor/nodeAppend.h"
+#include "utils/lsyscache.h"
+#include "access/nbtree.h"
+
+/* It gets quite confusing having a heap array (indexed by integers) which
+ * contains integers which index into the slots array. These typedefs try to
+ * clear it up but without making simple inline accessing functions they don't
+ * actually produce any warnings on mistakes */
+
+typedef int SlotNumber;
+typedef int HeapPosition;
+
+#define WHICHPLAN_PLANS_UNINITIALIZED (-1)
 
 static bool exec_append_initialize_next(AppendState *appendstate);
 
+static int heap_compare_slots(AppendState *node, SlotNumber slot1, SlotNumber slot2);
+
+static void heap_siftup_slot(AppendState *node);
+static void heap_insert_slot(AppendState *node, SlotNumber new_slot);
 
 /* ----------------------------------------------------------------
  *		exec_append_initialize_next
@@ -142,6 +158,7 @@ ExecInitAppend(Append *node, EState *est
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_is_ordered = node->isOrdered;
 
 	/*
 	 * Miscellaneous initialization
@@ -175,11 +192,53 @@ ExecInitAppend(Append *node, EState *est
 	ExecAssignResultTypeFromTL(&appendstate->ps);
 	appendstate->ps.ps_ProjInfo = NULL;
 
-	/*
-	 * initialize to scan first subplan
-	 */
-	appendstate->as_whichplan = 0;
-	exec_append_initialize_next(appendstate);
+	if (!appendstate->as_is_ordered)
+	{
+		/*
+		 * initialize to scan first subplan
+		 */
+		appendstate->as_whichplan = 0;
+		exec_append_initialize_next(appendstate);
+	}
+	else
+	{
+		/* set up scan keys and initialize *all* the subnodes */
+		int i;
+
+		appendstate->as_nkeys = node->numCols;
+		appendstate->as_scankeys = palloc(sizeof(ScanKeyData) *  node->numCols);
+		appendstate->as_slots = palloc(sizeof(TupleTableSlot *) * nplans);
+		appendstate->as_heap = palloc(sizeof(int) * nplans);
+		appendstate->as_heap_size = 0;
+
+		for (i=0; i < nplans; i++)
+		{
+			appendstate->as_whichplan = i;
+			exec_append_initialize_next(appendstate);
+		}
+
+		appendstate->as_whichplan = WHICHPLAN_PLANS_UNINITIALIZED;
+
+		for (i = 0; i < node->numCols; i++)
+		{
+			Oid sortFunction;
+			bool reverse;
+
+			get_compare_function_for_ordering_op(node->sortOperators[i],
+												 &sortFunction, &reverse);
+
+			ScanKeyInit(&appendstate->as_scankeys[i],
+						node->sortColIdx[i],
+						InvalidStrategy,
+						sortFunction,
+						(Datum)0);
+
+			if (reverse)
+				appendstate->as_scankeys[i].sk_flags |= SK_BT_DESC;
+			if (node->nullsFirst[i])
+				appendstate->as_scankeys[i].sk_flags |= SK_BT_NULLS_FIRST;
+		}
+	}
 
 	return appendstate;
 }
@@ -193,45 +252,169 @@ ExecInitAppend(Append *node, EState *est
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	for (;;)
+	if (!node->as_is_ordered)
 	{
-		PlanState  *subnode;
-		TupleTableSlot *result;
+		for (;;)
+		{
+			PlanState  *subnode;
+			TupleTableSlot *result;
 
-		/*
-		 * figure out which subplan we are currently processing
-		 */
-		subnode = node->appendplans[node->as_whichplan];
+			/*
+			 * figure out which subplan we are currently processing
+			 */
+			subnode = node->appendplans[node->as_whichplan];
 
-		/*
-		 * get a tuple from the subplan
-		 */
-		result = ExecProcNode(subnode);
+			/*
+			 * get a tuple from the subplan
+			 */
+			result = ExecProcNode(subnode);
+
+			if (!TupIsNull(result))
+			{
+				/*
+				 * If the subplan gave us something then return it as-is. We do
+				 * NOT make use of the result slot that was set up in
+				 * ExecInitAppend; there's no need for it.
+				 */
+				return result;
+			}
 
-		if (!TupIsNull(result))
-		{
 			/*
-			 * If the subplan gave us something then return it as-is. We do
-			 * NOT make use of the result slot that was set up in
-			 * ExecInitAppend; there's no need for it.
+			 * Go on to the "next" subplan in the appropriate direction. If no
+			 * more subplans, return the empty slot set up for us by
+			 * ExecInitAppend.
 			 */
-			return result;
+			if (ScanDirectionIsForward(node->ps.state->es_direction))
+				node->as_whichplan++;
+			else
+				node->as_whichplan--;
+			if (!exec_append_initialize_next(node))
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+
+			/* Else loop back and try to get a tuple from the new subplan */
 		}
+	}
+	else
+	{
+		TupleTableSlot *result;
+		SlotNumber i;
 
-		/*
-		 * Go on to the "next" subplan in the appropriate direction. If no
-		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
-		 */
-		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
-		else
-			node->as_whichplan--;
-		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		if (node->as_whichplan == WHICHPLAN_PLANS_UNINITIALIZED)
+		{
+			for (i = 0; i < node->as_nplans; i++)
+			{
+				node->as_slots[i] = ExecProcNode(node->appendplans[i]);
+				if (!TupIsNull(node->as_slots[i]))
+					heap_insert_slot(node, i);
+			}
+                }
+                else
+                {
+                	i = node->as_whichplan;
+                	node->as_slots[i] = ExecProcNode(node->appendplans[i]);
+                	if (TupIsNull(node->as_slots[i]))
+                		;
+			else if (node->as_heap_size <= 0 || heap_compare_slots(node, node->as_heap[0], i) >= 0)
+				return node->as_slots[i];
+			else 
+				heap_insert_slot(node, i);
+                }
 
-		/* Else loop back and try to get a tuple from the new subplan */
+                if (node->as_heap_size > 0)
+                {
+                	i = node->as_heap[0];
+                	node->as_whichplan = i;
+                	heap_siftup_slot(node);
+                	result = node->as_slots[i];
+		} else {
+			result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		}
+
+		return result;
+	}
+}
+
+static void
+heap_insert_slot(AppendState *node, SlotNumber new_slot)
+{
+	HeapPosition j;
+	SlotNumber *heap;
+
+	Assert(!TupIsNull(node->as_slots[new_slot]));
+
+	j = node->as_heap_size++;
+	heap = node->as_heap;
+
+	while (j > 0)
+	{
+		int i = (j-1)/2;
+		if (heap_compare_slots(node, new_slot, node->as_heap[i]) >= 0)
+			break;
+		heap[j] = heap[i];
+		j = i;
 	}
+	heap[j] = new_slot;
+}
+
+static void
+heap_siftup_slot(AppendState *node)
+{
+	HeapPosition i=0, j, n;
+
+	if (--node->as_heap_size <= 0)
+		return;
+	n = node->as_heap_size;
+
+	for (;;)
+	{
+		j = 2 * i + 1;
+		if (j >= n)
+			break;
+		if (j+1 < n && heap_compare_slots(node, node->as_heap[j], node->as_heap[j+1]) > 0)
+			j++;
+		if (heap_compare_slots(node, node->as_heap[n], node->as_heap[j]) <= 0)
+			break;
+		node->as_heap[i] = node->as_heap[j];
+		i = j;
+	}
+	node->as_heap[i] = node->as_heap[n];
+}
+
+static int
+heap_compare_slots(AppendState *node, SlotNumber slot1, SlotNumber slot2)
+{
+	int nkey;
+	TupleTableSlot *s1, *s2;
+
+	Assert(slot1 < node->as_nplans);
+	Assert(slot2 < node->as_nplans);
+
+	s1 = node->as_slots[slot1];
+	s2 = node->as_slots[slot2];
+	Assert(!TupIsNull(s1));
+	Assert(!TupIsNull(s2));
+
+	for (nkey = 0; nkey < node->as_nkeys; nkey++)
+	{
+		ScanKey scankey = node->as_scankeys + nkey;
+		AttrNumber attno = scankey->sk_attno;
+		Datum d1, d2;
+		bool  n1, n2;
+		int compare;
+
+		d1 = slot_getattr(s1, attno, &n1);
+		d2 = slot_getattr(s2, attno, &n2);
+
+		if (n1 && !n2)
+			return (scankey->sk_flags & SK_BT_NULLS_FIRST) ? -1 :  1;
+		if (!n1 && n2)
+			return (scankey->sk_flags & SK_BT_NULLS_FIRST) ?  1 : -1;
+
+		compare = FunctionCall2(&scankey->sk_func, d1, d2);
+		if (compare != 0)
+			return (scankey->sk_flags & SK_BT_DESC) ? -compare : compare;
+	}
+	return 0;
 }
 
 /* ----------------------------------------------------------------
diff -durpN pgsql.orig/src/backend/optimizer/path/allpaths.c pgsql/src/backend/optimizer/path/allpaths.c
--- pgsql.orig/src/backend/optimizer/path/allpaths.c	2010-03-29 00:59:32.000000000 +0200
+++ pgsql/src/backend/optimizer/path/allpaths.c	2010-09-03 10:55:18.000000000 +0200
@@ -51,7 +51,7 @@ static void set_plain_rel_pathlist(Plann
 					   RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void set_dummy_rel_pathlist(RelOptInfo *rel);
+static void set_dummy_rel_pathlist(PlannerInfo *root, RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
 static void set_function_pathlist(PlannerInfo *root, RelOptInfo *rel,
@@ -222,7 +222,7 @@ set_plain_rel_pathlist(PlannerInfo *root
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		relation_excluded_by_constraints(root, rel, rte))
 	{
-		set_dummy_rel_pathlist(rel);
+		set_dummy_rel_pathlist(root, rel);
 		return;
 	}
 
@@ -267,6 +267,22 @@ set_plain_rel_pathlist(PlannerInfo *root
 	set_cheapest(rel);
 }
 
+
+/* It's possible that the child is itself an appendrel, in which case
+ * we can "cut out the middleman" and just add its child paths to our
+ * own list.  (We don't try to do this earlier because we need to
+ * apply both levels of transformation to the quals.)
+ */
+
+#define LAPPEND_PATH_FLATTEN_APPENDPATHS(subpaths_, childpath_)			\
+	do {																\
+		if (IsA((childpath_), AppendPath))								\
+			(subpaths_) = list_concat((subpaths_),						\
+									 ((AppendPath *) (childpath_))->subpaths); \
+		else															\
+			subpaths_ = lappend(subpaths_, childpath_);					\
+	} while (0)
+
 /*
  * set_append_rel_pathlist
  *	  Build access paths for an "append relation"
@@ -283,12 +299,13 @@ set_append_rel_pathlist(PlannerInfo *roo
 						Index rti, RangeTblEntry *rte)
 {
 	int			parentRTindex = rti;
-	List	   *subpaths = NIL;
+	List	   *best_subpaths = NIL;	/* list of paths */
+	List	   *all_pathkeys = NIL;		/* union of all pathkeys in sub paths */
 	double		parent_rows;
 	double		parent_size;
 	double	   *parent_attrsizes;
 	int			nattrs;
-	ListCell   *l;
+	ListCell   *l, *ll, *lll;
 
 	/*
 	 * Initialize to compute size estimates for whole append relation.
@@ -366,7 +383,7 @@ set_append_rel_pathlist(PlannerInfo *roo
 			 * Restriction reduces to constant FALSE or constant NULL after
 			 * substitution, so this child need not be scanned.
 			 */
-			set_dummy_rel_pathlist(childrel);
+			set_dummy_rel_pathlist(root, childrel);
 			continue;
 		}
 		childquals = make_ands_implicit((Expr *) childqual);
@@ -381,7 +398,7 @@ set_append_rel_pathlist(PlannerInfo *roo
 			 * appendrel.  Mark it with a dummy cheapest-path though, in case
 			 * best_appendrel_indexscan() looks at it later.
 			 */
-			set_dummy_rel_pathlist(childrel);
+			set_dummy_rel_pathlist(root, childrel);
 			continue;
 		}
 
@@ -397,7 +414,9 @@ set_append_rel_pathlist(PlannerInfo *roo
 		 * We have to make child entries in the EquivalenceClass data
 		 * structures as well.
 		 */
+#ifdef FIXME
 		if (rel->has_eclass_joins)
+#endif
 		{
 			add_child_rel_equivalences(root, appinfo, rel, childrel);
 			childrel->has_eclass_joins = true;
@@ -413,21 +432,34 @@ set_append_rel_pathlist(PlannerInfo *roo
 
 		/*
 		 * Compute the child's access paths, and add the cheapest one to the
-		 * Append path we are constructing for the parent.
-		 *
-		 * It's possible that the child is itself an appendrel, in which case
-		 * we can "cut out the middleman" and just add its child paths to our
-		 * own list.  (We don't try to do this earlier because we need to
-		 * apply both levels of transformation to the quals.)
+ 		 * list of cheap paths (best_paths) and the pathkeys of all the paths
+ 		 * to the union, all_pathkeys.
 		 */
 		set_rel_pathlist(root, childrel, childRTindex, childRTE);
 
 		childpath = childrel->cheapest_total_path;
-		if (IsA(childpath, AppendPath))
-			subpaths = list_concat(subpaths,
-								   ((AppendPath *) childpath)->subpaths);
-		else
-			subpaths = lappend(subpaths, childpath);
+ 		LAPPEND_PATH_FLATTEN_APPENDPATHS(best_subpaths, childpath);
+ 
+		/* gather up all the pathkeys of all sub-paths
+		 * FIXME -- this is O(n^2)!
+		 */
+		foreach(ll, childrel->pathlist)
+		{
+			ListCell *lll;
+			Path *childpath = (Path *) lfirst(ll);
+			bool found = false;
+			
+			foreach (lll, all_pathkeys)
+			{
+				List *existing_pathkeys = (List *)lfirst(lll);
+ 
+				if (compare_pathkeys(existing_pathkeys,
+									 childpath->pathkeys) == PATHKEYS_EQUAL)
+					found = true;
+			}
+			if (!found)
+				all_pathkeys = lappend(all_pathkeys, childpath->pathkeys);
+		}
 
 		/*
 		 * Accumulate size information from each child.
@@ -487,10 +519,74 @@ set_append_rel_pathlist(PlannerInfo *roo
 	 * the parent rel.	(Note: this is correct even if we have zero or one
 	 * live subpath due to constraint exclusion.)
 	 */
-	add_path(rel, (Path *) create_append_path(rel, subpaths));
+	if (list_length(best_subpaths) == 1)
+	{
+		/* special case for a singleton append node, just grab the pathlist
+		 * from the child rels and use it directly.
+		 */
+		rel->pathlist = ((Path*) linitial(best_subpaths))->parent->pathlist;
+		set_cheapest(rel);
+		return;
+	}
 
-	/* Select cheapest path (pretty easy in this case...) */
-	set_cheapest(rel);
+	/* First add the unordered append path */
+	add_path(rel, (Path *) create_append_path(root, rel, best_subpaths, NIL));
+
+	/* Now try to add various ordered append paths */
+	foreach (ll, all_pathkeys)
+	{
+		List *pathkeys = (List *)lfirst(ll);
+		List *startup_subpaths = NIL; /* subpaths for minimal startup cost append path */
+		List *total_subpaths = NIL;   /* subpaths for minimal total cost append path */
+
+		/* populate both subpaths lists based on this pathkeys list */
+		foreach(lll, root->append_rel_list)
+		{
+			AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lll);
+			RelOptInfo *childrel;
+			Path	   *childpath;
+
+			/* append_rel_list contains all append rels; ignore others */
+			if (appinfo->parent_relid != parentRTindex)
+				continue;
+
+			childrel = find_base_rel(root, appinfo->child_relid);
+			Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL);
+
+			/* First the cheapest startup cost */
+			childpath = get_cheapest_path_for_pathkeys(childrel->pathlist, 
+													   pathkeys,
+													   TOTAL_COST);
+
+			/* If we can't find any plans with the right order just add the
+			 * cheapest total plan to both paths, we'll have to sort it
+			 * anyways. Perhaps consider not generating a startup path for such
+			 * pathkeys since it'll be unlikely to be the cheapest startup
+			 * cost. */
+
+			if (!childpath) {
+				childpath = childrel->cheapest_total_path;
+				LAPPEND_PATH_FLATTEN_APPENDPATHS(startup_subpaths, childpath);
+				LAPPEND_PATH_FLATTEN_APPENDPATHS(total_subpaths, childpath);
+				continue;
+			}
+
+			LAPPEND_PATH_FLATTEN_APPENDPATHS(total_subpaths, childpath);
+
+			/* Also do the the cheapest startup cost */
+			childpath = get_cheapest_path_for_pathkeys(childrel->pathlist, 
+													   pathkeys,
+													   STARTUP_COST);
+			Assert(childpath); /* above special case ought to have caught this */
+			LAPPEND_PATH_FLATTEN_APPENDPATHS(startup_subpaths, childpath);
+		}
+
+		add_path(rel, (Path *) create_append_path(root, rel, startup_subpaths, pathkeys));
+		add_path(rel, (Path *) create_append_path(root, rel, total_subpaths, pathkeys));
+	}
+
+	/* Select cheapest path */
+  	set_cheapest(rel);
 }
 
 /*
@@ -501,13 +597,13 @@ set_append_rel_pathlist(PlannerInfo *roo
  * AppendPath with no members (see also IS_DUMMY_PATH macro).
  */
 static void
-set_dummy_rel_pathlist(RelOptInfo *rel)
+set_dummy_rel_pathlist(PlannerInfo *root, RelOptInfo *rel)
 {
 	/* Set dummy size estimates --- we leave attr_widths[] as zeroes */
 	rel->rows = 0;
 	rel->width = 0;
 
-	add_path(rel, (Path *) create_append_path(rel, NIL));
+	add_path(rel, (Path *) create_append_path(root, rel, NIL, NIL));
 
 	/* Select cheapest path (pretty easy in this case...) */
 	set_cheapest(rel);
diff -durpN pgsql.orig/src/backend/optimizer/path/equivclass.c pgsql/src/backend/optimizer/path/equivclass.c
--- pgsql.orig/src/backend/optimizer/path/equivclass.c	2010-02-26 03:00:44.000000000 +0100
+++ pgsql/src/backend/optimizer/path/equivclass.c	2010-09-03 10:20:58.000000000 +0200
@@ -1634,9 +1634,10 @@ add_child_rel_equivalences(PlannerInfo *
 		 * Won't generate joinclauses if const or single-member (the latter
 		 * test covers the volatile case too)
 		 */
+#if FIXME
 		if (cur_ec->ec_has_const || list_length(cur_ec->ec_members) <= 1)
 			continue;
-
+#endif
 		/* No point in searching if parent rel not mentioned in eclass */
 		if (!bms_is_subset(parent_rel->relids, cur_ec->ec_relids))
 			continue;
diff -durpN pgsql.orig/src/backend/optimizer/path/joinpath.c pgsql/src/backend/optimizer/path/joinpath.c
--- pgsql.orig/src/backend/optimizer/path/joinpath.c	2010-04-19 09:42:28.000000000 +0200
+++ pgsql/src/backend/optimizer/path/joinpath.c	2010-09-03 10:20:58.000000000 +0200
@@ -955,7 +955,7 @@ best_appendrel_indexscan(PlannerInfo *ro
 		return NULL;
 
 	/* Form and return the completed Append path. */
-	return (Path *) create_append_path(rel, append_paths);
+	return (Path *) create_append_path(root, rel, append_paths, NIL);
 }
 
 /*
diff -durpN pgsql.orig/src/backend/optimizer/path/joinrels.c pgsql/src/backend/optimizer/path/joinrels.c
--- pgsql.orig/src/backend/optimizer/path/joinrels.c	2010-02-26 03:00:45.000000000 +0100
+++ pgsql/src/backend/optimizer/path/joinrels.c	2010-09-03 10:20:58.000000000 +0200
@@ -932,7 +932,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(rel, NIL));
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL));
 
 	/* Set or update cheapest_total_path */
 	set_cheapest(rel);
diff -durpN pgsql.orig/src/backend/optimizer/plan/createplan.c pgsql/src/backend/optimizer/plan/createplan.c
--- pgsql.orig/src/backend/optimizer/plan/createplan.c	2010-07-13 10:51:03.000000000 +0200
+++ pgsql/src/backend/optimizer/plan/createplan.c	2010-09-03 10:27:46.000000000 +0200
@@ -24,6 +24,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
+#include "optimizer/paths.h"
 #include "optimizer/plancat.h"
 #include "optimizer/planmain.h"
 #include "optimizer/predtest.h"
@@ -609,11 +610,19 @@ create_append_plan(PlannerInfo *root, Ap
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
+		Plan	   *subplan;
 
-		subplans = lappend(subplans, create_plan_recurse(root, subpath));
+		subplan = create_plan(root, subpath);
+		if (!pathkeys_contained_in(best_path->path.pathkeys, subpath->pathkeys))
+			subplan = (Plan *)make_sort_from_pathkeys(root,
+													  subplan,
+													  best_path->path.pathkeys,
+													  -1.0);
+
+		subplans = lappend(subplans, subplan);
 	}
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(root, subplans, tlist, best_path->path.pathkeys);
 
 	return (Plan *) plan;
 }
@@ -2784,7 +2793,8 @@ make_worktablescan(List *qptlist,
 }
 
 Append *
-make_append(List *appendplans, List *tlist)
+make_append(PlannerInfo *root, List *appendplans,
+				   List *tlist, List *pathkeys)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -2792,9 +2802,11 @@ make_append(List *appendplans, List *tli
 	ListCell   *subnode;
 
 	/*
-	 * Compute cost as sum of subplan costs.  We charge nothing extra for the
-	 * Append itself, which perhaps is too optimistic, but since it doesn't do
-	 * any selection or projection, it is a pretty cheap node.
+	 * Compute cost as sum of subplan costs. We charge nothing extra for a
+	 * plain Append itself, which perhaps is too optimistic, but since it
+	 * doesn't do any selection or projection, it is a pretty cheap node. In
+	 * the case of an ordered append we construct an equivalent bounded Sort
+	 * node and steal the cost calculations from it.
 	 */
 	plan->startup_cost = 0;
 	plan->total_cost = 0;
@@ -2804,8 +2816,10 @@ make_append(List *appendplans, List *tli
 	{
 		Plan	   *subplan = (Plan *) lfirst(subnode);
 
-		if (subnode == list_head(appendplans))	/* first node? */
-			plan->startup_cost = subplan->startup_cost;
+ 		/* If it's ordered then the startup cost is the sum of the startup
+ 		 * costs, otherwise it's the startup cost of just the first plan */
+ 		if (pathkeys || subnode == list_head(appendplans))
+ 			plan->startup_cost += subplan->startup_cost;
 		plan->total_cost += subplan->total_cost;
 		plan->plan_rows += subplan->plan_rows;
 		total_size += subplan->plan_width * subplan->plan_rows;
@@ -2821,6 +2835,30 @@ make_append(List *appendplans, List *tli
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
 
+	if (!pathkeys)
+		node->isOrdered = false;
+	else 
+	{
+		/* generate a throwaway sort node to find the sort functions */
+		Sort *tmp = make_sort_from_pathkeys(root, (Plan*)node, pathkeys, list_length(appendplans));
+		
+		node->isOrdered = true;
+		
+		node->numCols 		= tmp->numCols;
+		node->sortColIdx 	= tmp->sortColIdx;
+		node->sortOperators = tmp->sortOperators;
+		node->nullsFirst 	= tmp->nullsFirst;
+
+		/* a limited sort is the same kind of work (bounded heap sort) as an
+		 * ordered append with the bound set to the number of plans, so we just
+		 * use that to calculate the total cost. The startup cost is just the
+		 * sum of the startup costs of the nodes plus a bit to build the heap
+		 * (similar to processing that many rows in the bounded sort case).
+		 */
+		plan->total_cost = tmp->plan.total_cost - (enable_sort ? 0 : disable_cost);
+		plan->startup_cost += plan->total_cost / plan->plan_rows * list_length(appendplans);
+	}
+
 	return node;
 }
 
@@ -3182,9 +3220,10 @@ make_sort_from_pathkeys(PlannerInfo *roo
 			{
 				EquivalenceMember *em = (EquivalenceMember *) lfirst(j);
 
+#ifdef FIXME
 				if (em->em_is_const || em->em_is_child)
 					continue;
-
+#endif
 				tle = tlist_member((Node *) em->em_expr, tlist);
 				if (tle)
 				{
@@ -3218,8 +3257,10 @@ make_sort_from_pathkeys(PlannerInfo *roo
 					List	   *exprvars;
 					ListCell   *k;
 
+#ifdef FIXME
 					if (em->em_is_const || em->em_is_child)
 						continue;
+#endif
 					sortexpr = em->em_expr;
 					exprvars = pull_var_clause((Node *) sortexpr,
 											   PVC_INCLUDE_PLACEHOLDERS);
diff -durpN pgsql.orig/src/backend/optimizer/prep/prepunion.c pgsql/src/backend/optimizer/prep/prepunion.c
--- pgsql.orig/src/backend/optimizer/prep/prepunion.c	2010-07-11 11:14:51.000000000 +0200
+++ pgsql/src/backend/optimizer/prep/prepunion.c	2010-09-03 10:22:43.000000000 +0200
@@ -449,7 +449,7 @@ generate_union_plan(SetOperationStmt *op
 	/*
 	 * Append the child results together.
 	 */
-	plan = (Plan *) make_append(planlist, tlist);
+	plan = (Plan *) make_append(root, planlist, tlist, NIL);
 
 	/*
 	 * For UNION ALL, we just need the Append plan.  For UNION, need to add
@@ -540,7 +540,7 @@ generate_nonunion_plan(SetOperationStmt 
 	/*
 	 * Append the child results together.
 	 */
-	plan = (Plan *) make_append(planlist, tlist);
+	plan = (Plan *) make_append(root, planlist, tlist, NIL);
 
 	/* Identify the grouping semantics */
 	groupList = generate_setop_grouplist(op, tlist);
diff -durpN pgsql.orig/src/backend/optimizer/util/pathnode.c pgsql/src/backend/optimizer/util/pathnode.c
--- pgsql.orig/src/backend/optimizer/util/pathnode.c	2010-03-29 00:59:33.000000000 +0200
+++ pgsql/src/backend/optimizer/util/pathnode.c	2010-09-03 10:20:58.000000000 +0200
@@ -635,31 +635,74 @@ create_tidscan_path(PlannerInfo *root, R
 /*
  * create_append_path
  *	  Creates a path corresponding to an Append plan, returning the
- *	  pathnode.
+ *	  pathnode. 
+ *
+ *    Note that we must support create_append_path(null, rel, nil, nil) which
+ *    is used to create dummy plans.
  */
+#define LOG2(x)  (log(x) / 0.693147180559945)
+
 AppendPath *
-create_append_path(RelOptInfo *rel, List *subpaths)
+create_append_path(PlannerInfo *root, RelOptInfo *rel, List *subpaths, List *pathkeys)
 {
 	AppendPath *pathnode = makeNode(AppendPath);
 	ListCell   *l;
+	Cost total_cost=0, startup_cost=0;
+	Cost this_total_cost, this_startup_cost;
+	int npaths = 0;
 
 	pathnode->path.pathtype = T_Append;
 	pathnode->path.parent = rel;
-	pathnode->path.pathkeys = NIL;		/* result is always considered
-										 * unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->subpaths = subpaths;
 
-	pathnode->path.startup_cost = 0;
-	pathnode->path.total_cost = 0;
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
 
-		if (l == list_head(subpaths))	/* first node? */
-			pathnode->path.startup_cost = subpath->startup_cost;
-		pathnode->path.total_cost += subpath->total_cost;
+		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))  
+		{
+			this_startup_cost = subpath->startup_cost;
+			this_total_cost   = subpath->total_cost;
+		} else {
+			Path sort_path; /* dummy for result of cost_sort */
+			cost_sort(&sort_path,
+					  root,
+					  pathkeys,
+					  subpath->total_cost, 
+					  subpath->parent->tuples, 
+					  subpath->parent->width, 
+					  -1.0);
+			this_total_cost   = sort_path.total_cost;
+			this_startup_cost = sort_path.startup_cost;
+		}
+		
+		npaths++;
+		total_cost += this_total_cost;
+		/* If it's unsorted the startup cost is just the first subpath's
+		 * startup cost, otherwise it's the sum of all startup costs */
+		if (pathkeys != NIL || l == list_head(subpaths))
+			startup_cost += this_startup_cost;
+	}
+
+	/* The cost of merging using a heapsort */
+	if (pathkeys != NIL)
+	{
+		Path sort_path;
+		cost_sort(&sort_path,
+				  root,
+				  pathkeys,
+				  total_cost,
+				  rel->rows,
+				  rel->width, 
+				  npaths);
+		total_cost = sort_path.total_cost - (enable_sort ? 0 : disable_cost);
+		startup_cost += total_cost / rel->rows * npaths;
 	}
 
+	pathnode->path.total_cost = total_cost;
+	pathnode->path.startup_cost = startup_cost;
+	
 	return pathnode;
 }
 
diff -durpN pgsql.orig/src/include/nodes/execnodes.h pgsql/src/include/nodes/execnodes.h
--- pgsql.orig/src/include/nodes/execnodes.h	2010-07-29 11:33:00.000000000 +0200
+++ pgsql/src/include/nodes/execnodes.h	2010-09-03 10:46:22.000000000 +0200
@@ -1047,6 +1047,13 @@ typedef struct AppendState
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
 	int			as_nplans;
 	int			as_whichplan;
+
+	bool		as_is_ordered;
+	int			as_nkeys;
+	ScanKey		as_scankeys;		/* array of length as_nkeys */
+	TupleTableSlot **as_slots;		/* array of length as_nplans */
+	int			*as_heap;	/* array of length as_nplans */
+	int			as_heap_size;	/* how many slots are present in the heap */
 } AppendState;
 
 /* ----------------
diff -durpN pgsql.orig/src/include/nodes/plannodes.h pgsql/src/include/nodes/plannodes.h
--- pgsql.orig/src/include/nodes/plannodes.h	2010-07-13 10:51:07.000000000 +0200
+++ pgsql/src/include/nodes/plannodes.h	2010-09-03 10:47:41.000000000 +0200
@@ -180,6 +180,14 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	bool		isOrdered;
+
+	/* used only if the append is an ordered append */
+
+	int		numCols;	/* number of sort-key columns */
+	AttrNumber *sortColIdx;		/* their indexes in the target list */
+	Oid	   *sortOperators;	/* OIDs of operators to sort them by */
+	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Append;
 
 /* ----------------
diff -durpN pgsql.orig/src/include/optimizer/pathnode.h pgsql/src/include/optimizer/pathnode.h
--- pgsql.orig/src/include/optimizer/pathnode.h	2010-03-29 00:59:33.000000000 +0200
+++ pgsql/src/include/optimizer/pathnode.h	2010-09-03 10:20:58.000000000 +0200
@@ -46,7 +46,10 @@ extern BitmapOrPath *create_bitmap_or_pa
 					  List *bitmapquals);
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals);
-extern AppendPath *create_append_path(RelOptInfo *rel, List *subpaths);
+extern AppendPath *create_append_path(PlannerInfo *root,
+									  RelOptInfo *rel, 
+									  List *subpaths,
+									  List *pathkeys);
 extern ResultPath *create_result_path(List *quals);
 extern MaterialPath *create_material_path(RelOptInfo *rel, Path *subpath);
 extern UniquePath *create_unique_path(PlannerInfo *root, RelOptInfo *rel,
diff -durpN pgsql.orig/src/include/optimizer/planmain.h pgsql/src/include/optimizer/planmain.h
--- pgsql.orig/src/include/optimizer/planmain.h	2010-03-29 00:59:33.000000000 +0200
+++ pgsql/src/include/optimizer/planmain.h	2010-09-03 10:44:50.000000000 +0200
@@ -43,7 +43,8 @@ extern Node *fix_indexqual_operand(Node 
 extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
 				  Index scanrelid, Plan *subplan,
 				  List *subrtable, List *subrowmark);
-extern Append *make_append(List *appendplans, List *tlist);
+extern Append *make_append(PlannerInfo *root, List *appendplans,
+					 List *tlist, List *pathkeys);
 extern RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree, Plan *righttree, int wtParam,
 					 List *distinctList, long numGroups);
#9Robert Haas
robertmhaas@gmail.com
In reply to: Hans-Jürgen Schönig (#8)
Re: Path question

2010/9/3 Hans-Jürgen Schönig <hs@cybertec.at>:

On Sep 2, 2010, at 1:20 AM, Robert Haas wrote:

I agree. Explicit partitioning may open up some additional optimization possibilities in certain cases, but Merge Append is more general and extremely valuable in its own right.

we have revised greg's wonderful work and ported the entire thing to head.
it solves the problem of merge_append. i did some testing earlier on today and it seems most important cases are working nicely.

First, thanks for merging this up to HEAD. I took a look through this
patch tonight, and the previous reviews thereof that I was able to
find, most notably Tom's detailed review on 2009-07-26. I'm not sure
whether or not it's accidental that this didn't get added to the CF,
but here's an attempt to enumerate the things that seem like they need
to be fixed. The quotes labeled "TGL" are from the aforementioned
review by Tom.

1. The code in set_append_rel_pathlist() that accumulates the pathkeys
of all sub-paths is, as it says, and as previous discussed, O(n^2).
In a previous email on this topic, Tom suggested on possible approach
for this problem: choose the largest child relation and call it the
leader, and consider only the pathkeys for that relation rather than
all of them. I think it would be nice if we can find a way to be a
bit smarter, though, because that won't necessarily always find the
best path. One idea I had is to choose some arbitrary limit on how
long the all_pathkeys list is allowed to become and iterate over the
children from largest to smallest, stopping early if you hit that
limit. But thinking about it a little more, can't we just adjust the
way we do this so that it's not O(n^2)? It seems we're only concerned
with equality here, so what about using a hash table? We could hash
the PathKey pointers in each list, but not the lists or listcells
obviously.

2. TGL: "you need an explicit flag to say 'we'll do a merge', not just
rely on whether the path has pathkeys." This makes sense and doesn't
seem difficult.

3. TGL: "Speaking of sorting, it's not entirely clear to me how the
patch ensures that all the child plans produce the necessary sort keys
as output columns, and especially not how it ensures that they all get
produced in the *same* output columns. This might accidentally manage
to work because of the "throwaway" call to make_sort_from_pathkeys(),
but at the very least that's a misleading comment." I'm not sure what
needs to be done about this; I'm going to look at this further.

4. TGL: "In any case, I'm amazed that it's not failing regression
tests all over the place with those critical tests in
make_sort_from_pathkeys lobotomized by random #ifdef FIXMEs. Perhaps
we need some more regression tests...". Obviously, we need to remove
that lobotomy and insert the correct fix for whatever problem it was
trying to solve. Adding some regression tests seems wise, too.

5. TGL: "In the same vein, the hack to 'short circuit' the append
stuff for a single child node is simply wrong, because it doesn't
allow for column number variances. Please remove it." This seems
like straightforward cleanup, and maybe another candidate for a
regression test. (Actually, I notice that the patch has NO regression
tests at all, which surely can't be right for something of this type,
though admittedly since we didn't have EXPLAIN (COSTS OFF) when this
was first written it might have been difficult to write anything
meaningful at the time.)

6. The dummy call to cost_sort() seems like a crock; what that
function does doesn't seem particularly relevant to the cost of the
merge operation. Just off the top of my head, it looks like the cost
of the merge step will be roughly O(lg n) * the cost of comparing two
tuples * the total number of tuples from all child paths. In practice
it might be less, because once some of the paths run out of tuples the
number of comparisons will drop, I think. But the magnitude of that
effect seems difficult to predict, and may be rather small, so perhaps
we should just ignore it.

7. I think there's some basic code cleanup needed here, also: comment
formatting, variable naming, etc.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#10David Fetter
david@fetter.org
In reply to: Robert Haas (#9)
Re: Path question

On Mon, Sep 20, 2010 at 10:57:00PM -0400, Robert Haas wrote:

2010/9/3 Hans-J�rgen Sch�nig <hs@cybertec.at>:

On Sep 2, 2010, at 1:20 AM, Robert Haas wrote:

I agree. Explicit partitioning may open up some additional
optimization possibilities in certain cases, but Merge Append is
more general and extremely valuable in its own right.

we have revised greg's wonderful work and ported the entire thing
to head. it solves the problem of merge_append. i did some
testing earlier on today and it seems most important cases are
working nicely.

First, thanks for merging this up to HEAD. I took a look through
this patch tonight, and the previous reviews thereof that I was able
to find, most notably Tom's detailed review on 2009-07-26. I'm not
sure whether or not it's accidental that this didn't get added to
the CF,

It's because I missed putting it in, and oversight I've corrected. If
we need to bounce it on to the next one, them's the breaks.

[points elided]

7. I think there's some basic code cleanup needed here, also: comment
formatting, variable naming, etc.

Hans-J�rgen,

Will you be able to get to this in the next couple of days?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#11Robert Haas
robertmhaas@gmail.com
In reply to: David Fetter (#10)
Re: Path question

On Tue, Sep 21, 2010 at 12:29 AM, David Fetter <david@fetter.org> wrote:

On Mon, Sep 20, 2010 at 10:57:00PM -0400, Robert Haas wrote:

2010/9/3 Hans-Jürgen Schönig <hs@cybertec.at>:

On Sep 2, 2010, at 1:20 AM, Robert Haas wrote:

I agree. Explicit partitioning may open up some additional
optimization possibilities in certain cases, but Merge Append is
more general and extremely valuable in its own right.

we have revised greg's wonderful work and ported the entire thing
to head.  it solves the problem of merge_append. i did some
testing earlier on today and it seems most important cases are
working nicely.

First, thanks for merging this up to HEAD.  I took a look through
this patch tonight, and the previous reviews thereof that I was able
to find, most notably Tom's detailed review on 2009-07-26.  I'm not
sure whether or not it's accidental that this didn't get added to
the CF,

It's because I missed putting it in, and oversight I've corrected.  If
we need to bounce it on to the next one, them's the breaks.

[points elided]

7. I think there's some basic code cleanup needed here, also: comment
formatting, variable naming, etc.

Hans-Jürgen,

Will you be able to get to this in the next couple of days?

I don't see a response to this which I assume means "no" - I'm going
to take a crack at fixing some of these issues.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

In reply to: Robert Haas (#11)
Re: Path question

On Sep 23, 2010, at 3:29 PM, Robert Haas wrote:

On Tue, Sep 21, 2010 at 12:29 AM, David Fetter <david@fetter.org> wrote:

On Mon, Sep 20, 2010 at 10:57:00PM -0400, Robert Haas wrote:

2010/9/3 Hans-Jürgen Schönig <hs@cybertec.at>:

On Sep 2, 2010, at 1:20 AM, Robert Haas wrote:

I agree. Explicit partitioning may open up some additional
optimization possibilities in certain cases, but Merge Append is
more general and extremely valuable in its own right.

we have revised greg's wonderful work and ported the entire thing
to head. it solves the problem of merge_append. i did some
testing earlier on today and it seems most important cases are
working nicely.

First, thanks for merging this up to HEAD. I took a look through
this patch tonight, and the previous reviews thereof that I was able
to find, most notably Tom's detailed review on 2009-07-26. I'm not
sure whether or not it's accidental that this didn't get added to
the CF,

It's because I missed putting it in, and oversight I've corrected. If
we need to bounce it on to the next one, them's the breaks.

[points elided]

7. I think there's some basic code cleanup needed here, also: comment
formatting, variable naming, etc.

Hans-Jürgen,

Will you be able to get to this in the next couple of days?

I don't see a response to this which I assume means "no" - I'm going
to take a crack at fixing some of these issues.

hello ...

sorry for not getting back to you sooner. i am currently on the road for some days.
we got the top 3 things fixed already. however, some code seems to be relying on a sorted list somewhere(???).
we are in the process of sorting out most of the stuff.
i guess we will have something done next week.

sorry for the delay.

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de

#13Robert Haas
robertmhaas@gmail.com
In reply to: Hans-Jürgen Schönig (#12)
Re: Path question

2010/9/23 Hans-Jürgen Schönig <hs@cybertec.at>:

sorry for not getting back to you sooner. i am currently on the road for some days.
we got the top 3 things fixed already. however, some code seems to be relying on a sorted list somewhere(???).
we are in the process of sorting out most of the stuff.
i guess we will have something done next week.

Oh, cool. Is it possible for you to post your WIP any sooner?

I've been looking at #4 today. Further details to follow after I
finish beating my head against a wall.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#14Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#9)
1 attachment(s)
Re: Path question

2010/9/20 Robert Haas <robertmhaas@gmail.com>:

4. TGL: "In any case, I'm amazed that it's not failing regression
tests all over the place with those critical tests in
make_sort_from_pathkeys lobotomized by random #ifdef FIXMEs.  Perhaps
we need some more regression tests...".  Obviously, we need to remove
that lobotomy and insert the correct fix for whatever problem it was
trying to solve.  Adding some regression tests seems wise, too.

So, I spent a lot of time today trying to understand why these random
FIXMEs were inserted and just exactly what they break or, uh, fix. I
didn't get very far. The regression tests all pass with the latest
version of Merge Append, with and without these FIXMEs, and in fact
when I extracted just those changes and applied them to HEAD, the
regression tests still pass. So I said to myself: "Let's come up with
a regression test that illustrates why these changes are Evil and
Dangerous, and then we can add that to the regression test suite, per
Tom's suggestion." This proved to be easier said than done:
apparently our code is so good that it works even with random sections
commented out. Go us.

As a debugging aide, instead of actually diking these tests out, I
made them elog(LOG). This has the same effect as removing them
entirely but it makes it possible to see whether you've triggered the
relevant code path. Then I tried to trigger those code paths, which I
shall hereinafter refer to as FIXME #1, FIXME #2, FIXME #3, and FIXME
#4, as per the attached patch. It proved to be pretty easy to trigger
FIXME #3; indeed, just about every query I tried did the job.

CREATE TABLE parent (a integer NOT NULL, b integer NOT NULL);
CREATE TABLE child (b integer, a integer, c text NOT NULL) INHERITS (parent);
CREATE TABLE joinme (j integer NOT NULL);

I populated joinme with a single row with the value 1, and child with
a million rows with a running from 1 to a million, b running from 10
million down to 9 million and 1, and c getting random()::text ||
random()::text || random()::text || random()::text. Then you can
trigger FIXME #3 with either of these two (the second also triggers
FIXME #4):

select * from joinme, parent where a = j and j = 1;
select * from parent order by a;

I believe that the first case fires because of the "ec_has_const"
condition and the second from the "EC only has at most one member
condition". However, it's difficult to see what problem this actually
causes. From what I gather from reading the code, em_is_child
EquivalenceMembers are only supposed to affect the construction of
inner index-scan paths; and certainly if there's at most one
EquivalenceMember it's unclear to me how you'd be forming a join.
Maybe it's possible to break something in the ec_has_const case, but I
haven't been to puzzle out what that thing is.

FIXME #1 and FIXME #2 were much harder to trigger. In fact, barring
significant further lobotimization of the code, I couldn't. For FIXME
#1, the right sort of query seems to be something like this:

select 1 from joinme left join parent on a = j where j = 1 order by a;

...but the problem is that in every example I could construct, the
em_is_child EquivalenceMembers were *after* the members for the parent
rel, so you never get far enough in the loop to see them. For related
reasons, I couldn't get FIXME #2 to fire, either: in order to have a
chance of firing FIXME #2, you have to get all the way through the
loop where FIXME #1 is located without finding a match. I can't
believe all of this code is here just for fun, but in every example I
could come up with you quickly find a match in the first loop, and
never even finish visiting all the members of that list, let alone
reach the second loop. Somehow, you need to construct a case where
the values to be sorted aren't directly emitted by the node
immediately under the sort, but I couldn't figure out how to do that -
no matter how I rearranged things, the planner just computed the
necessary outputs one level down. Anyone have an idea?

I did find one apparent weirdness in the way explain outputs sort keys, namely:

rhaas=# explain (costs off) select 2 as x union all select 1 as x order by x;
QUERY PLAN
--------------------
Sort
Sort Key: (2)
-> Append
-> Result
-> Result
(5 rows)

This seems to have nothing to do with the problem at hand, but it
looks pretty weird. The planner is in fact sorting on the column, not
on the literal value 2, so this is (I think) just a display error.
"x" would be a more reasonable representation of the sort key.

All of this leaves me wondering why Greg ended up ifdefing out this
code in the first place. There's probably something I'm missing
here... but for now I can't think of a better idea than just removing
the #ifdefs and hoping that whatever problem they were causing was
limited to an earlier version of the code that no longer exists.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachments:

planner_lobotomy.patchapplication/octet-stream; name=planner_lobotomy.patchDownload
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 2d86da3..f4b2c8c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -397,7 +397,9 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		 * We have to make child entries in the EquivalenceClass data
 		 * structures as well.
 		 */
-		if (rel->has_eclass_joins)
+		if (!rel->has_eclass_joins)
+			elog(LOG, "FIXME #4");
+
 		{
 			add_child_rel_equivalences(root, appinfo, rel, childrel);
 			childrel->has_eclass_joins = true;
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index a20ed5f..b383d13 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -1635,7 +1635,7 @@ add_child_rel_equivalences(PlannerInfo *root,
 		 * test covers the volatile case too)
 		 */
 		if (cur_ec->ec_has_const || list_length(cur_ec->ec_members) <= 1)
-			continue;
+			elog(LOG, "FIXME #3");
 
 		/* No point in searching if parent rel not mentioned in eclass */
 		if (!bms_is_subset(parent_rel->relids, cur_ec->ec_relids))
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fa7b29f..ec0cc8c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -3183,7 +3183,7 @@ make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
 				EquivalenceMember *em = (EquivalenceMember *) lfirst(j);
 
 				if (em->em_is_const || em->em_is_child)
-					continue;
+					elog(LOG, "FIXME #1");
 
 				tle = tlist_member((Node *) em->em_expr, tlist);
 				if (tle)
@@ -3219,7 +3219,8 @@ make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
 					ListCell   *k;
 
 					if (em->em_is_const || em->em_is_child)
-						continue;
+						elog(LOG, "FIXME #2");
+
 					sortexpr = em->em_expr;
 					exprvars = pull_var_clause((Node *) sortexpr,
 											   PVC_INCLUDE_PLACEHOLDERS);
#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#14)
Re: Path question

Robert Haas <robertmhaas@gmail.com> writes:

FIXME #1 and FIXME #2 were much harder to trigger. In fact, barring
significant further lobotimization of the code, I couldn't.

It's not that hard if you just tweak equivclass.c to make the order of
equivalence-class lists different, viz

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index a20ed5f..9528d0b 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -353,7 +353,7 @@ add_eq_member(EquivalenceClass *ec, Expr *expr, Relids relids,
 	{
 		ec->ec_relids = bms_add_members(ec->ec_relids, relids);
 	}
-	ec->ec_members = lappend(ec->ec_members, em);
+	ec->ec_members = lcons(em, ec->ec_members);

return em;
}

Then for instance:

regression=# create table t1 (f1 int);
CREATE TABLE
regression=# create table t2 () inherits (t1);
CREATE TABLE
regression=# explain select * from t1 a join t1 b using (f1);
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1
WARNING: FIXME #1

Since the order of equivalence-class member lists isn't supposed to be
semantically significant, I claim that the code in createplan has to
be able to deal with this.

Note that what this is triggering is the em_is_child condition. I think
it may indeed be impossible to get a hit on the em_is_const case as the
system currently stands; the reason being that an EC containing a const
won't normally show up as a pathkey. It can only do so if it's
below_outer_join, as the comment notes. Now the calls to
make_sort_from_pathkeys in createplan.c are only used for constructing
subsidiary sorts for a mergejoin, and we won't consider building a
mergejoin with an EC that contains a const (see
eclass_useful_for_merging). There are some calls in planner.c that are
associated with doing a final sort or distinct, but I suspect they'd
never be applied with a below_outer_join EC. So given the current usage
of make_sort_from_pathkeys it might be pretty hard to get it applied to
an EC containing a constant. That's not a reason for it not to handle
the case, though.

regards, tom lane

#16Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#15)
Re: Path question

On Fri, Sep 24, 2010 at 5:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

It's not that hard if you just tweak equivclass.c to make the order of
equivalence-class lists different, viz

[...]

Since the order of equivalence-class member lists isn't supposed to be
semantically significant, I claim that the code in createplan has to
be able to deal with this.

[...]

That's not a reason for it not to handle
the case, though.

I'm not disputing that those tests are correct. All I'm saying is
that I can't figure out a way to write regression tests that fail if
they are removed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#17Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#14)
Re: Path question

2010/9/23 Robert Haas <robertmhaas@gmail.com>:

All of this leaves me wondering why Greg ended up ifdefing out this
code in the first place.  There's probably something I'm missing
here...  but for now I can't think of a better idea than just removing
the #ifdefs and hoping that whatever problem they were causing was
limited to an earlier version of the code that no longer exists.

...and FAIL. I missed the blindingly obvious here, which is that
without these tests commented out, while it passes regression tests,
the merge append stuff stops working. I think for some reason without
this stuff in there the appropriate index paths fail to get generated
for the child rels.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#18Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#17)
Re: Path question

2010/9/28 Robert Haas <robertmhaas@gmail.com>:

2010/9/23 Robert Haas <robertmhaas@gmail.com>:

All of this leaves me wondering why Greg ended up ifdefing out this
code in the first place.  There's probably something I'm missing
here...  but for now I can't think of a better idea than just removing
the #ifdefs and hoping that whatever problem they were causing was
limited to an earlier version of the code that no longer exists.

...and FAIL.  I missed the blindingly obvious here, which is that
without these tests commented out, while it passes regression tests,
the merge append stuff stops working.  I think for some reason without
this stuff in there the appropriate index paths fail to get generated
for the child rels.

Yep, that's the problem, all right. find_usable_indexes() calls
build_index_pathkeys() to generate pathkeys for each available index
on the child relation, and then calls truncate_useless_pathkeys() to
get rid of any that aren't useful. In a query like this:

explain select * from ma_parent order by name limit 10;

...what makes the pathkeys potentially useful is that they match the
root pathkeys for the query. However, without Greg's hacks, the
transposed child equivalence classes don't show up in the equivalence
member lists, so get_eclass_for_sort_expr() generates new equivalence
classes for them. Subsequently, when truncate_useless_pathkeys() is
called, those new equivalence classes are found not to be equal to the
overall ordering desired for the query, so it truncates them away.

I'm not presently sure quite what the best way to fix this is... any ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#18)
Re: Path question

Robert Haas <robertmhaas@gmail.com> writes:

...what makes the pathkeys potentially useful is that they match the
root pathkeys for the query. However, without Greg's hacks, the
transposed child equivalence classes don't show up in the equivalence
member lists, so get_eclass_for_sort_expr() generates new equivalence
classes for them. Subsequently, when truncate_useless_pathkeys() is
called, those new equivalence classes are found not to be equal to the
overall ordering desired for the query, so it truncates them away.

I'm not presently sure quite what the best way to fix this is... any ideas?

Probably what's needed is to extend the "child eclass member" stuff to
cover sort-key eclasses. Per relation.h:

* em_is_child signifies that this element was built by transposing a member
* for an inheritance parent relation to represent the corresponding expression
* on an inheritance child. The element should be ignored for all purposes
* except constructing inner-indexscan paths for the child relation.

Likely these need to be ignored a bit less. A couple of Greg's
undocumented hacks seem to be related to this point, but not all of
them. In any case you'd need some careful thought about when to
ignore child EMs and when not.

regards, tom lane

#20Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#19)
Re: Path question

On Tue, Sep 28, 2010 at 11:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

...what makes the pathkeys potentially useful is that they match the
root pathkeys for the query.  However, without Greg's hacks, the
transposed child equivalence classes don't show up in the equivalence
member lists, so get_eclass_for_sort_expr() generates new equivalence
classes for them.  Subsequently, when truncate_useless_pathkeys() is
called, those new equivalence classes are found not to be equal to the
overall ordering desired for the query, so it truncates them away.

I'm not presently sure quite what the best way to fix this is... any ideas?

Probably what's needed is to extend the "child eclass member" stuff to
cover sort-key eclasses.  Per relation.h:

 * em_is_child signifies that this element was built by transposing a member
 * for an inheritance parent relation to represent the corresponding expression
 * on an inheritance child.  The element should be ignored for all purposes
 * except constructing inner-indexscan paths for the child relation.

Likely these need to be ignored a bit less.  A couple of Greg's
undocumented hacks seem to be related to this point, but not all of
them.  In any case you'd need some careful thought about when to
ignore child EMs and when not.

Makes sense to me. It seems that there are actually two halves to
this problem: getting the child EMs to be generated in the first
place, and then getting them to be examined at the appropriate time.

The FIXME in set_append_rel_pathlist() is there to ensure that
add_child_rel_equivalences() always gets called, and the hack in
add_child_rel_equivalences() is there to ensure that child EMs are
generated unconditionally rather than only when they're useful for
merging. That doesn't seem entirely off-base. In
set_append_rel_pathlist(), I think that we shouldn't set
has_eclass_joins on the child relation unless it's set on the parent,
but calling add_child_rel_equivalences() unconditionally might be
reasonable. Similarly, in add_child_rel_equivalences(), I think that
the cur_ec->ec_has_const test should be retained, but the test on
list_length(cur_ec->ec_members) needs to be relaxed or eliminated. As
a matter of correctness, it seems like it should be OK to do this
always - the worst thing that can happen is you end up with some extra
EMs that don't (or shouldn't) do anything. However, it's possible
that, for performance, we might want to avoid generating EMs that
won't actually be needed. I'm not entirely clear on whether there is
an inexpensive way for us to know that. Perhaps we could replace the
list_length() test with a call to has_useful_pathkeys() and just
accept that there will be some false positives.

The two FIXMEs in make_sort_from_pathkeys() are there to ensure that
we find the child EMs when we go to generate a sort. There's no
reason that I can see to remove the em_has_const test, but it looks
like we need to remove the em_is_child test. We need to match each
pathkey to an element of the child's target list, which means we must
consider an EM that appears in the child's target list, which is means
a child EM or nothing. Testing reveals that if you keep Greg's hacks
to add the child EMs but remove the hacks from
make_sort_from_pathkeys(), it still works IF every child can use an
index scan, but if you need to do an index scan on some children and
sort others then you get:

ERROR: could not find pathkey item to sort

...which seems consistent with the above analysis. If we accept that
it's necessary, that still leaves the question of whether it's safe.
I'm a little less certain on this point, but it seems like it ought to
be OK. As far as I'm aware, we currently never sort append children.
Indeed, unless equivalence classes for those append children were
getting created by some mechanism other than
add_child_rel_equivalences(), it seems like it ought to fail with the
above error messages. And on the flip side it should be impossible
for us to pick up a child EM when we meant to get some other one
because they shouldn't be equal().

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#21Greg Stark
gsstark@mit.edu
In reply to: Robert Haas (#14)
Re: Path question

2010/9/23 Robert Haas <robertmhaas@gmail.com>:

All of this leaves me wondering why Greg ended up ifdefing out this
code in the first place.  There's probably something I'm missing
here...  but for now I can't think of a better idea than just removing
the #ifdefs and hoping that whatever problem they were causing was
limited to an earlier version of the code that no longer exists.

Sorry, I haven't gone completely AWOL, I just hadn't noticed this
thread. pgsql-hackers turns out to be kind of a lot of traffic if
you're not reading it continuously.

As you subsequently discovered I added these FIXMEs because without
them the paths just didn't compare equal when it was comparing the
paths of the children with the paths of the parent.

I think the reason you had difficulty demonstrating problems with at
least some of the FIXMEs was that they really aren't functionally
necessary. They're there because when Tom implemented the equivalence
classes he had a clear idea of what they were supposed to represent
and what variables they needed to include to represent that. And
including variables of child relations or subqueries of a UNION in an
equivalence class representing the parent relation just didn't fit
within that. It doesn't necessarily cause problems but it changes the
data structure representation invariant from what he had in mind.

I don't have a good grasp of exactly what the implications of changing
this data structure rep invariant are though. Is having hundreds or
thousands of variables in a single equivalence class (actually for a
whole bunch if the pathkey list is long) going to cause performance
problems? Is including variables that are only present for one child
of the relation going to limit the usefulness of the equivalence class
data structure for solving other problems where those columns really
aren't equivalent? Also, since I don't really understand what's going
on I wondered if there wasn't an obvious way to accomplish the same
thing perhaps more efficiently.

--
greg

#22Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#20)
1 attachment(s)
Re: Path question

On Wed, Sep 29, 2010 at 11:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Makes sense to me.  It seems that there are actually two halves to
this problem: getting the child EMs to be generated in the first
place, and then getting them to be examined at the appropriate time.

So I tried out the logic described in this email and, with a few
modifications, it seemed to work. Updated patch attached, any review
appreciated. There are still a bunch of other things that need to be
fixed here, but I think this is OK as far as this particular issue is
concerned. I fixed a few other things:

- create_append_plan must call create_plan_recurse rather than
create_plan. This appears to be an error introduced by rebasing; the
previous version of the patch seg faults when attempting to execute a
plan wherein an inner index scan has been pushed down through an
append node
- remove the hack to short-circuit the append node altogether if
there's only one child
- some miscellaneous code cleanup (more is needed)

3. TGL: "Speaking of sorting, it's not entirely clear to me how the
patch ensures that all the child plans produce the necessary sort keys
as output columns, and especially not how it ensures that they all get
produced in the *same* output columns. This might accidentally manage
to work because of the "throwaway" call to make_sort_from_pathkeys(),
but at the very least that's a misleading comment." I'm not sure what
needs to be done about this; I'm going to look at this further.

I spent some time looking at this complaint, and I'm still not sure
what needs to be done about it. Experimentation reveals that the
neither the throwaway call to make_sort_from_pathkeys() nor any other
call to that function is creating the relevant target list entries.
It appears that they're getting created when set_append_rel_pathlist()
calls adjust_appendrel_attrs(). I'm not sure whether that's
sufficient or not. make_sort_from_pathkeys() could add any missing
entries later, but I'm not too sure the order would match if that
happened.

Another awkwardness of this patch is that it makes
create_append_path() and consequently set_dummy_rel_pathlist() take an
additional "root" argument. While there's nothing terribly
unreasonable about this on its face, it's only necessary so that
create_append_path() can call cost_sort(), which takes "root" but
doesn't actually use it. I'm not sure whether it's better to leave
this as-is or to remove the root argument from cost_sort().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachments:

merge_append_2010_09_29.patchapplication/octet-stream; name=merge_append_2010_09_29.patchDownload
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 4f3d899..94359bb 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,9 +59,25 @@
 
 #include "executor/execdebug.h"
 #include "executor/nodeAppend.h"
+#include "utils/lsyscache.h"
+#include "access/nbtree.h"
+
+/* It gets quite confusing having a heap array (indexed by integers) which
+ * contains integers which index into the slots array. These typedefs try to
+ * clear it up but without making simple inline accessing functions they don't
+ * actually produce any warnings on mistakes */
+
+typedef int SlotNumber;
+typedef int HeapPosition;
+
+#define WHICHPLAN_PLANS_UNINITIALIZED (-1)
 
 static bool exec_append_initialize_next(AppendState *appendstate);
 
+static int heap_compare_slots(AppendState *node, SlotNumber slot1, SlotNumber slot2);
+
+static void heap_siftup_slot(AppendState *node);
+static void heap_insert_slot(AppendState *node, SlotNumber new_slot);
 
 /* ----------------------------------------------------------------
  *		exec_append_initialize_next
@@ -142,6 +158,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_is_ordered = node->isOrdered;
 
 	/*
 	 * Miscellaneous initialization
@@ -175,11 +192,53 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	ExecAssignResultTypeFromTL(&appendstate->ps);
 	appendstate->ps.ps_ProjInfo = NULL;
 
-	/*
-	 * initialize to scan first subplan
-	 */
-	appendstate->as_whichplan = 0;
-	exec_append_initialize_next(appendstate);
+	if (!appendstate->as_is_ordered)
+	{
+		/*
+		 * initialize to scan first subplan
+		 */
+		appendstate->as_whichplan = 0;
+		exec_append_initialize_next(appendstate);
+	}
+	else
+	{
+		/* set up scan keys and initialize *all* the subnodes */
+		int i;
+
+		appendstate->as_nkeys = node->numCols;
+		appendstate->as_scankeys = palloc(sizeof(ScanKeyData) *  node->numCols);
+		appendstate->as_slots = palloc(sizeof(TupleTableSlot *) * nplans);
+		appendstate->as_heap = palloc(sizeof(int) * nplans);
+		appendstate->as_heap_size = 0;
+
+		for (i=0; i < nplans; i++)
+		{
+			appendstate->as_whichplan = i;
+			exec_append_initialize_next(appendstate);
+		}
+
+		appendstate->as_whichplan = WHICHPLAN_PLANS_UNINITIALIZED;
+
+		for (i = 0; i < node->numCols; i++)
+		{
+			Oid sortFunction;
+			bool reverse;
+
+			get_compare_function_for_ordering_op(node->sortOperators[i],
+												 &sortFunction, &reverse);
+
+			ScanKeyInit(&appendstate->as_scankeys[i],
+						node->sortColIdx[i],
+						InvalidStrategy,
+						sortFunction,
+						(Datum)0);
+
+			if (reverse)
+				appendstate->as_scankeys[i].sk_flags |= SK_BT_DESC;
+			if (node->nullsFirst[i])
+				appendstate->as_scankeys[i].sk_flags |= SK_BT_NULLS_FIRST;
+		}
+	}
 
 	return appendstate;
 }
@@ -193,45 +252,169 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	for (;;)
+	if (!node->as_is_ordered)
 	{
-		PlanState  *subnode;
-		TupleTableSlot *result;
+		for (;;)
+		{
+			PlanState  *subnode;
+			TupleTableSlot *result;
 
-		/*
-		 * figure out which subplan we are currently processing
-		 */
-		subnode = node->appendplans[node->as_whichplan];
+			/*
+			 * figure out which subplan we are currently processing
+			 */
+			subnode = node->appendplans[node->as_whichplan];
 
-		/*
-		 * get a tuple from the subplan
-		 */
-		result = ExecProcNode(subnode);
+			/*
+			 * get a tuple from the subplan
+			 */
+			result = ExecProcNode(subnode);
+
+			if (!TupIsNull(result))
+			{
+				/*
+				 * If the subplan gave us something then return it as-is. We do
+				 * NOT make use of the result slot that was set up in
+				 * ExecInitAppend; there's no need for it.
+				 */
+				return result;
+			}
 
-		if (!TupIsNull(result))
-		{
 			/*
-			 * If the subplan gave us something then return it as-is. We do
-			 * NOT make use of the result slot that was set up in
-			 * ExecInitAppend; there's no need for it.
+			 * Go on to the "next" subplan in the appropriate direction. If no
+			 * more subplans, return the empty slot set up for us by
+			 * ExecInitAppend.
 			 */
-			return result;
+			if (ScanDirectionIsForward(node->ps.state->es_direction))
+				node->as_whichplan++;
+			else
+				node->as_whichplan--;
+			if (!exec_append_initialize_next(node))
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+
+			/* Else loop back and try to get a tuple from the new subplan */
 		}
+	}
+	else
+	{
+		TupleTableSlot *result;
+		SlotNumber i;
 
-		/*
-		 * Go on to the "next" subplan in the appropriate direction. If no
-		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
-		 */
-		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
-		else
-			node->as_whichplan--;
-		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
-
-		/* Else loop back and try to get a tuple from the new subplan */
+		if (node->as_whichplan == WHICHPLAN_PLANS_UNINITIALIZED)
+		{
+			for (i = 0; i < node->as_nplans; i++)
+			{
+				node->as_slots[i] = ExecProcNode(node->appendplans[i]);
+				if (!TupIsNull(node->as_slots[i]))
+					heap_insert_slot(node, i);
+			}
+                }
+                else
+                {
+                	i = node->as_whichplan;
+                	node->as_slots[i] = ExecProcNode(node->appendplans[i]);
+                	if (TupIsNull(node->as_slots[i]))
+                		;
+			else if (node->as_heap_size <= 0 || heap_compare_slots(node, node->as_heap[0], i) >= 0)
+				return node->as_slots[i];
+			else 
+				heap_insert_slot(node, i);
+                }
+
+                if (node->as_heap_size > 0)
+                {
+                	i = node->as_heap[0];
+                	node->as_whichplan = i;
+                	heap_siftup_slot(node);
+                	result = node->as_slots[i];
+		} else {
+			result = ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		}
+
+		return result;
+	}
+}
+
+static void
+heap_insert_slot(AppendState *node, SlotNumber new_slot)
+{
+	HeapPosition j;
+	SlotNumber *heap;
+
+	Assert(!TupIsNull(node->as_slots[new_slot]));
+
+	j = node->as_heap_size++;
+	heap = node->as_heap;
+
+	while (j > 0)
+	{
+		int i = (j-1)/2;
+		if (heap_compare_slots(node, new_slot, node->as_heap[i]) >= 0)
+			break;
+		heap[j] = heap[i];
+		j = i;
+	}
+	heap[j] = new_slot;
+}
+
+static void
+heap_siftup_slot(AppendState *node)
+{
+	HeapPosition i=0, j, n;
+
+	if (--node->as_heap_size <= 0)
+		return;
+	n = node->as_heap_size;
+
+	for (;;)
+	{
+		j = 2 * i + 1;
+		if (j >= n)
+			break;
+		if (j+1 < n && heap_compare_slots(node, node->as_heap[j], node->as_heap[j+1]) > 0)
+			j++;
+		if (heap_compare_slots(node, node->as_heap[n], node->as_heap[j]) <= 0)
+			break;
+		node->as_heap[i] = node->as_heap[j];
+		i = j;
+	}
+	node->as_heap[i] = node->as_heap[n];
+}
+
+static int
+heap_compare_slots(AppendState *node, SlotNumber slot1, SlotNumber slot2)
+{
+	int nkey;
+	TupleTableSlot *s1, *s2;
+
+	Assert(slot1 < node->as_nplans);
+	Assert(slot2 < node->as_nplans);
+
+	s1 = node->as_slots[slot1];
+	s2 = node->as_slots[slot2];
+	Assert(!TupIsNull(s1));
+	Assert(!TupIsNull(s2));
+
+	for (nkey = 0; nkey < node->as_nkeys; nkey++)
+	{
+		ScanKey scankey = node->as_scankeys + nkey;
+		AttrNumber attno = scankey->sk_attno;
+		Datum d1, d2;
+		bool  n1, n2;
+		int compare;
+
+		d1 = slot_getattr(s1, attno, &n1);
+		d2 = slot_getattr(s2, attno, &n2);
+
+		if (n1 && !n2)
+			return (scankey->sk_flags & SK_BT_NULLS_FIRST) ? -1 :  1;
+		if (!n1 && n2)
+			return (scankey->sk_flags & SK_BT_NULLS_FIRST) ?  1 : -1;
+
+		compare = FunctionCall2(&scankey->sk_func, d1, d2);
+		if (compare != 0)
+			return (scankey->sk_flags & SK_BT_DESC) ? -compare : compare;
 	}
+	return 0;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b7cf0b8..b8d5a80 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -51,7 +51,8 @@ static void set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					   RangeTblEntry *rte);
 static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte);
-static void set_dummy_rel_pathlist(RelOptInfo *rel);
+static List *accumulate_append_subpath(List *subpaths, Path *path);
+static void set_dummy_rel_pathlist(PlannerInfo *root, RelOptInfo *rel);
 static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					  Index rti, RangeTblEntry *rte);
 static void set_function_pathlist(PlannerInfo *root, RelOptInfo *rel,
@@ -222,7 +223,7 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	if (rel->reloptkind == RELOPT_BASEREL &&
 		relation_excluded_by_constraints(root, rel, rte))
 	{
-		set_dummy_rel_pathlist(rel);
+		set_dummy_rel_pathlist(root, rel);
 		return;
 	}
 
@@ -283,12 +284,13 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 						Index rti, RangeTblEntry *rte)
 {
 	int			parentRTindex = rti;
-	List	   *subpaths = NIL;
+	List	   *best_subpaths = NIL;	/* list of paths */
+	List	   *all_pathkeys = NIL;		/* union of all pathkeys in sub paths */
 	double		parent_rows;
 	double		parent_size;
 	double	   *parent_attrsizes;
 	int			nattrs;
-	ListCell   *l;
+	ListCell   *l, *ll, *lll;
 
 	/*
 	 * Initialize to compute size estimates for whole append relation.
@@ -366,7 +368,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 			 * Restriction reduces to constant FALSE or constant NULL after
 			 * substitution, so this child need not be scanned.
 			 */
-			set_dummy_rel_pathlist(childrel);
+			set_dummy_rel_pathlist(root, childrel);
 			continue;
 		}
 		childquals = make_ands_implicit((Expr *) childqual);
@@ -381,7 +383,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 			 * appendrel.  Mark it with a dummy cheapest-path though, in case
 			 * best_appendrel_indexscan() looks at it later.
 			 */
-			set_dummy_rel_pathlist(childrel);
+			set_dummy_rel_pathlist(root, childrel);
 			continue;
 		}
 
@@ -394,14 +396,22 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 								   appinfo);
 
 		/*
-		 * We have to make child entries in the EquivalenceClass data
-		 * structures as well.
+		 * If the parent has useful pathkeys, we can obtain the data for each
+		 * child in sorted order and then perform a final merge pass in the
+		 * Append node.  In order to construct a suitable sort or index scan
+		 * on a given child, the EquivalenceClass data structures for the
+		 * parent will need suitably-adjusted EquivalenceMember entries for
+		 * that child.  We will also need these child entries if we're trying
+		 * to "push down" an inner index-scan to each child rel.  However, a
+		 * separate test for that case isn't necessary here, because we only
+		 * care about pushing down eclass joins, and if there is such a join
+		 * then there will be useful pathkeys, because we could also handle the
+		 * same join as a merge join.
 		 */
-		if (rel->has_eclass_joins)
-		{
+		if (has_useful_pathkeys(root, rel))
 			add_child_rel_equivalences(root, appinfo, rel, childrel);
+		if (rel->has_eclass_joins)
 			childrel->has_eclass_joins = true;
-		}
 
 		/*
 		 * Note: we could compute appropriate attr_needed data for the child's
@@ -413,21 +423,34 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 
 		/*
 		 * Compute the child's access paths, and add the cheapest one to the
-		 * Append path we are constructing for the parent.
-		 *
-		 * It's possible that the child is itself an appendrel, in which case
-		 * we can "cut out the middleman" and just add its child paths to our
-		 * own list.  (We don't try to do this earlier because we need to
-		 * apply both levels of transformation to the quals.)
+ 		 * list of cheap paths (best_paths) and the pathkeys of all the paths
+ 		 * to the union, all_pathkeys.
 		 */
 		set_rel_pathlist(root, childrel, childRTindex, childRTE);
 
 		childpath = childrel->cheapest_total_path;
-		if (IsA(childpath, AppendPath))
-			subpaths = list_concat(subpaths,
-							list_copy(((AppendPath *) childpath)->subpaths));
-		else
-			subpaths = lappend(subpaths, childpath);
+ 		best_subpaths = accumulate_append_subpath(best_subpaths, childpath);
+
+		/* gather up all the pathkeys of all sub-paths
+		 * FIXME -- this is O(n^2)!
+		 */
+		foreach(ll, childrel->pathlist)
+		{
+			ListCell *lll;
+			Path *childpath = (Path *) lfirst(ll);
+			bool found = false;
+			
+			foreach (lll, all_pathkeys)
+			{
+				List *existing_pathkeys = (List *)lfirst(lll);
+ 
+				if (compare_pathkeys(existing_pathkeys,
+									 childpath->pathkeys) == PATHKEYS_EQUAL)
+					found = true;
+			}
+			if (!found)
+				all_pathkeys = lappend(all_pathkeys, childpath->pathkeys);
+		}
 
 		/*
 		 * Accumulate size information from each child.
@@ -483,14 +506,89 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	pfree(parent_attrsizes);
 
 	/*
-	 * Finally, build Append path and install it as the only access path for
-	 * the parent rel.	(Note: this is correct even if we have zero or one
-	 * live subpath due to constraint exclusion.)
+	 * First, build unordered Append path for the parent rel.	(Note: this is
+	 * correct even if we have zero or one live subpath due to constraint
+	 * exclusion.)
 	 */
-	add_path(rel, (Path *) create_append_path(rel, subpaths));
+	add_path(rel, (Path *) create_append_path(root, rel, best_subpaths, NIL));
 
-	/* Select cheapest path (pretty easy in this case...) */
-	set_cheapest(rel);
+	/*
+	 * Next, attempt to add ordered append paths based on the collected list
+	 * of child pathkeys.
+	 */
+	foreach (ll, all_pathkeys)
+	{
+		List *pathkeys = (List *)lfirst(ll);
+		List *startup_subpaths = NIL; /* subpaths for minimal startup cost append path */
+		List *total_subpaths = NIL;   /* subpaths for minimal total cost append path */
+
+		/* populate both subpaths lists based on this pathkeys list */
+		foreach(lll, root->append_rel_list)
+		{
+			AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lll);
+			RelOptInfo *childrel;
+			Path	   *cheapest_startup,
+					   *cheapest_total;
+
+			/* append_rel_list contains all append rels; ignore others */
+			if (appinfo->parent_relid != parentRTindex)
+				continue;
+
+			childrel = find_base_rel(root, appinfo->child_relid);
+			Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL);
+
+			/* First the cheapest startup cost */
+			cheapest_startup =
+				get_cheapest_path_for_pathkeys(childrel->pathlist, 
+											   pathkeys,
+											   TOTAL_COST);
+			cheapest_total =
+				get_cheapest_path_for_pathkeys(childrel->pathlist, 
+											   pathkeys,
+											   STARTUP_COST);
+
+			/*
+			 * If we can't find any plans with the right order just add the
+			 * cheapest total plan to both paths; we'll have to sort it.
+			 */
+			if (cheapest_startup == NULL)
+			{
+				Assert(cheapest_total == NULL);
+				cheapest_startup = childrel->cheapest_total_path;
+				cheapest_total = childrel->cheapest_total_path;
+			}
+
+			/* Accumulate paths. */
+	 		total_subpaths =
+				accumulate_append_subpath(total_subpaths, cheapest_startup);
+	 		startup_subpaths =
+				accumulate_append_subpath(startup_subpaths, cheapest_total);
+		}
+
+		add_path(rel, (Path *) create_append_path(root, rel, startup_subpaths, pathkeys));
+		add_path(rel, (Path *) create_append_path(root, rel, total_subpaths, pathkeys));
+	}
+
+	/* Select cheapest path */
+  	set_cheapest(rel);
+}
+
+/*
+ * It's possible that the child is itself an appendrel, in which case
+ * we can "cut out the middleman" and just add its child paths to our
+ * own list.  (We don't try to do this earlier because we need to
+ * apply both levels of transformation to the quals.)
+ */
+static List *
+accumulate_append_subpath(List *subpaths, Path *path)
+{
+	if (IsA(path, AppendPath))
+	{
+		AppendPath *apath = (AppendPath *) path;
+		return list_concat(subpaths, list_copy(apath->subpaths));
+	}
+	else
+		return lappend(subpaths, path);
 }
 
 /*
@@ -501,13 +599,13 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
  * AppendPath with no members (see also IS_DUMMY_PATH macro).
  */
 static void
-set_dummy_rel_pathlist(RelOptInfo *rel)
+set_dummy_rel_pathlist(PlannerInfo *root, RelOptInfo *rel)
 {
 	/* Set dummy size estimates --- we leave attr_widths[] as zeroes */
 	rel->rows = 0;
 	rel->width = 0;
 
-	add_path(rel, (Path *) create_append_path(rel, NIL));
+	add_path(rel, (Path *) create_append_path(root, rel, NIL, NIL));
 
 	/* Select cheapest path (pretty easy in this case...) */
 	set_cheapest(rel);
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index a20ed5f..e44e960 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -1611,8 +1611,8 @@ exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
  *	  Search for EC members that reference (only) the parent_rel, and
  *	  add transformed members referencing the child_rel.
  *
- * We only need to do this for ECs that could generate join conditions,
- * since the child members are only used for creating inner-indexscan paths.
+ * Note that this function won't be called at all unless we have at least some
+ * reason to believe that the EC members it generates will be useful.
  *
  * parent_rel and child_rel could be derived from appinfo, but since the
  * caller has already computed them, we might as well just pass them in.
@@ -1631,10 +1631,14 @@ add_child_rel_equivalences(PlannerInfo *root,
 		ListCell   *lc2;
 
 		/*
-		 * Won't generate joinclauses if const or single-member (the latter
-		 * test covers the volatile case too)
+		 * If this EC contains a constant, then it's not useful for sorting
+		 * or driving an inner index-scan, so we skip generating child EMs.
+		 *
+		 * If this EC contains a volatile expression, then generating child
+		 * EMs would be downright dangerous.  We rely on a volatile EC having
+		 * only one EM.
 		 */
-		if (cur_ec->ec_has_const || list_length(cur_ec->ec_members) <= 1)
+		if (cur_ec->ec_has_const || cur_ec->ec_has_volatile)
 			continue;
 
 		/* No point in searching if parent rel not mentioned in eclass */
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index ab3b9cd..a333b87 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -955,7 +955,7 @@ best_appendrel_indexscan(PlannerInfo *root, RelOptInfo *rel,
 		return NULL;
 
 	/* Form and return the completed Append path. */
-	return (Path *) create_append_path(rel, append_paths);
+	return (Path *) create_append_path(root, rel, append_paths, NIL);
 }
 
 /*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 197c49d..d9a2c7d 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -945,7 +945,7 @@ mark_dummy_rel(RelOptInfo *rel)
 	rel->pathlist = NIL;
 
 	/* Set up the dummy path */
-	add_path(rel, (Path *) create_append_path(rel, NIL));
+	add_path(rel, (Path *) create_append_path(NULL, rel, NIL, NIL));
 
 	/* Set or update cheapest_total_path */
 	set_cheapest(rel);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fa7b29f..23efbad 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -24,6 +24,7 @@
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
+#include "optimizer/paths.h"
 #include "optimizer/plancat.h"
 #include "optimizer/planmain.h"
 #include "optimizer/predtest.h"
@@ -609,11 +610,19 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
+		Plan	   *subplan;
 
-		subplans = lappend(subplans, create_plan_recurse(root, subpath));
+		subplan = create_plan_recurse(root, subpath);
+		if (!pathkeys_contained_in(best_path->path.pathkeys, subpath->pathkeys))
+			subplan = (Plan *)make_sort_from_pathkeys(root,
+													  subplan,
+													  best_path->path.pathkeys,
+													  -1.0);
+
+		subplans = lappend(subplans, subplan);
 	}
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(root, subplans, tlist, best_path->path.pathkeys);
 
 	return (Plan *) plan;
 }
@@ -2784,7 +2793,8 @@ make_worktablescan(List *qptlist,
 }
 
 Append *
-make_append(List *appendplans, List *tlist)
+make_append(PlannerInfo *root, List *appendplans,
+				   List *tlist, List *pathkeys)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -2792,9 +2802,11 @@ make_append(List *appendplans, List *tlist)
 	ListCell   *subnode;
 
 	/*
-	 * Compute cost as sum of subplan costs.  We charge nothing extra for the
-	 * Append itself, which perhaps is too optimistic, but since it doesn't do
-	 * any selection or projection, it is a pretty cheap node.
+	 * Compute cost as sum of subplan costs. We charge nothing extra for a
+	 * plain Append itself, which perhaps is too optimistic, but since it
+	 * doesn't do any selection or projection, it is a pretty cheap node. In
+	 * the case of an ordered append we construct an equivalent bounded Sort
+	 * node and steal the cost calculations from it.
 	 */
 	plan->startup_cost = 0;
 	plan->total_cost = 0;
@@ -2804,8 +2816,10 @@ make_append(List *appendplans, List *tlist)
 	{
 		Plan	   *subplan = (Plan *) lfirst(subnode);
 
-		if (subnode == list_head(appendplans))	/* first node? */
-			plan->startup_cost = subplan->startup_cost;
+ 		/* If it's ordered then the startup cost is the sum of the startup
+ 		 * costs, otherwise it's the startup cost of just the first plan */
+ 		if (pathkeys || subnode == list_head(appendplans))
+ 			plan->startup_cost += subplan->startup_cost;
 		plan->total_cost += subplan->total_cost;
 		plan->plan_rows += subplan->plan_rows;
 		total_size += subplan->plan_width * subplan->plan_rows;
@@ -2821,6 +2835,30 @@ make_append(List *appendplans, List *tlist)
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
 
+	if (!pathkeys)
+		node->isOrdered = false;
+	else 
+	{
+		/* generate a throwaway sort node to find the sort functions */
+		Sort *tmp = make_sort_from_pathkeys(root, (Plan*)node, pathkeys, list_length(appendplans));
+		
+		node->isOrdered = true;
+		
+		node->numCols 		= tmp->numCols;
+		node->sortColIdx 	= tmp->sortColIdx;
+		node->sortOperators = tmp->sortOperators;
+		node->nullsFirst 	= tmp->nullsFirst;
+
+		/* a limited sort is the same kind of work (bounded heap sort) as an
+		 * ordered append with the bound set to the number of plans, so we just
+		 * use that to calculate the total cost. The startup cost is just the
+		 * sum of the startup costs of the nodes plus a bit to build the heap
+		 * (similar to processing that many rows in the bounded sort case).
+		 */
+		plan->total_cost = tmp->plan.total_cost - (enable_sort ? 0 : disable_cost);
+		plan->startup_cost += plan->total_cost / plan->plan_rows * list_length(appendplans);
+	}
+
 	return node;
 }
 
@@ -3182,7 +3220,12 @@ make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
 			{
 				EquivalenceMember *em = (EquivalenceMember *) lfirst(j);
 
-				if (em->em_is_const || em->em_is_child)
+				/*
+				 * We shouldn't be trying to sort by an equivalence class that
+				 * contains a constant, so no need to consider such cases any
+				 * further.
+				 */
+				if (em->em_is_const)
 					continue;
 
 				tle = tlist_member((Node *) em->em_expr, tlist);
@@ -3218,8 +3261,14 @@ make_sort_from_pathkeys(PlannerInfo *root, Plan *lefttree, List *pathkeys,
 					List	   *exprvars;
 					ListCell   *k;
 
-					if (em->em_is_const || em->em_is_child)
+					/*
+					 * We shouldn't be trying to sort by an equivalence class
+					 * that contains a constant, so no need to consider such
+					 * cases any further.
+					 */
+					if (em->em_is_const)
 						continue;
+
 					sortexpr = em->em_expr;
 					exprvars = pull_var_clause((Node *) sortexpr,
 											   PVC_INCLUDE_PLACEHOLDERS);
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index f904258..b7039b1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -449,7 +449,7 @@ generate_union_plan(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	plan = (Plan *) make_append(planlist, tlist);
+	plan = (Plan *) make_append(root, planlist, tlist, NIL);
 
 	/*
 	 * For UNION ALL, we just need the Append plan.  For UNION, need to add
@@ -540,7 +540,7 @@ generate_nonunion_plan(SetOperationStmt *op, PlannerInfo *root,
 	/*
 	 * Append the child results together.
 	 */
-	plan = (Plan *) make_append(planlist, tlist);
+	plan = (Plan *) make_append(root, planlist, tlist, NIL);
 
 	/* Identify the grouping semantics */
 	groupList = generate_setop_grouplist(op, tlist);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f8aa745..d80e52c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -635,31 +635,74 @@ create_tidscan_path(PlannerInfo *root, RelOptInfo *rel, List *tidquals)
 /*
  * create_append_path
  *	  Creates a path corresponding to an Append plan, returning the
- *	  pathnode.
+ *	  pathnode. 
+ *
+ *    Note that we must support create_append_path(null, rel, nil, nil) which
+ *    is used to create dummy plans.
  */
 AppendPath *
-create_append_path(RelOptInfo *rel, List *subpaths)
+create_append_path(PlannerInfo *root, RelOptInfo *rel, List *subpaths, List *pathkeys)
 {
 	AppendPath *pathnode = makeNode(AppendPath);
 	ListCell   *l;
+	Cost total_cost=0, startup_cost=0;
+	Cost this_total_cost, this_startup_cost;
+	int npaths = 0;
 
 	pathnode->path.pathtype = T_Append;
 	pathnode->path.parent = rel;
-	pathnode->path.pathkeys = NIL;		/* result is always considered
-										 * unsorted */
+	pathnode->path.pathkeys = pathkeys;
 	pathnode->subpaths = subpaths;
 
-	pathnode->path.startup_cost = 0;
-	pathnode->path.total_cost = 0;
 	foreach(l, subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(l);
 
-		if (l == list_head(subpaths))	/* first node? */
-			pathnode->path.startup_cost = subpath->startup_cost;
-		pathnode->path.total_cost += subpath->total_cost;
+		if (pathkeys_contained_in(pathkeys, subpath->pathkeys))  
+		{
+			this_startup_cost = subpath->startup_cost;
+			this_total_cost   = subpath->total_cost;
+		}
+		else
+		{
+			Path sort_path; /* dummy for result of cost_sort */
+			cost_sort(&sort_path,
+					  root,
+					  pathkeys,
+					  subpath->total_cost, 
+					  subpath->parent->tuples, 
+					  subpath->parent->width, 
+					  -1.0);
+			this_total_cost   = sort_path.total_cost;
+			this_startup_cost = sort_path.startup_cost;
+		}
+
+		npaths++;
+		total_cost += this_total_cost;
+		/* If it's unsorted the startup cost is just the first subpath's
+		 * startup cost, otherwise it's the sum of all startup costs */
+		if (pathkeys != NIL || l == list_head(subpaths))
+			startup_cost += this_startup_cost;
+	}
+
+	/* The cost of merging using a heapsort */
+	if (pathkeys != NIL)
+	{
+		Path sort_path;
+		cost_sort(&sort_path,
+				  root,
+				  pathkeys,
+				  total_cost,
+				  rel->rows,
+				  rel->width, 
+				  npaths);
+		total_cost = sort_path.total_cost - (enable_sort ? 0 : disable_cost);
+		startup_cost += total_cost / rel->rows * npaths;
 	}
 
+	pathnode->path.total_cost = total_cost;
+	pathnode->path.startup_cost = startup_cost;
+	
 	return pathnode;
 }
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8a1f2ee..b274efc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1047,6 +1047,13 @@ typedef struct AppendState
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
 	int			as_nplans;
 	int			as_whichplan;
+
+	bool		as_is_ordered;
+	int			as_nkeys;
+	ScanKey		as_scankeys;		/* array of length as_nkeys */
+	TupleTableSlot **as_slots;		/* array of length as_nplans */
+	int			*as_heap;	/* array of length as_nplans */
+	int			as_heap_size;	/* how many slots are present in the heap */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2f99f4..060d6a3 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -180,6 +180,14 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	bool		isOrdered;
+
+	/* used only if the append is an ordered append */
+
+	int		numCols;	/* number of sort-key columns */
+	AttrNumber *sortColIdx;		/* their indexes in the target list */
+	Oid	   *sortOperators;	/* OIDs of operators to sort them by */
+	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
 } Append;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 91f4c5c..11dd2ec 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -542,10 +542,11 @@ typedef struct EquivalenceClass
  *
  * em_is_child signifies that this element was built by transposing a member
  * for an inheritance parent relation to represent the corresponding expression
- * on an inheritance child.  The element should be ignored for all purposes
- * except constructing inner-indexscan paths for the child relation.  (Other
- * types of join are driven from transposed joininfo-list entries.)  Note
- * that the EC's ec_relids field does NOT include the child relation.
+ * on an inheritance child.  These elements are used for constructing
+ * inner-indexscan paths for the child relation (other types of join are
+ * driven from transposed joininfo-list entries) and for constructing sort
+ * ordered Append output.  Note that the EC's ec_relids field does NOT include
+ * the child relation.
  *
  * em_datatype is usually the same as exprType(em_expr), but can be
  * different when dealing with a binary-compatible opfamily; in particular
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 5e0ebe0..e5219d9 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -46,7 +46,10 @@ extern BitmapOrPath *create_bitmap_or_path(PlannerInfo *root,
 					  List *bitmapquals);
 extern TidPath *create_tidscan_path(PlannerInfo *root, RelOptInfo *rel,
 					List *tidquals);
-extern AppendPath *create_append_path(RelOptInfo *rel, List *subpaths);
+extern AppendPath *create_append_path(PlannerInfo *root,
+									  RelOptInfo *rel, 
+									  List *subpaths,
+									  List *pathkeys);
 extern ResultPath *create_result_path(List *quals);
 extern MaterialPath *create_material_path(RelOptInfo *rel, Path *subpath);
 extern UniquePath *create_unique_path(PlannerInfo *root, RelOptInfo *rel,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 5bb0e09..e11b1fd 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -43,7 +43,8 @@ extern Node *fix_indexqual_operand(Node *node, IndexOptInfo *index);
 extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
 				  Index scanrelid, Plan *subplan,
 				  List *subrtable, List *subrowmark);
-extern Append *make_append(List *appendplans, List *tlist);
+extern Append *make_append(PlannerInfo *root, List *appendplans,
+					 List *tlist, List *pathkeys);
 extern RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree, Plan *righttree, int wtParam,
 					 List *distinctList, long numGroups);
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#22)
Re: Path question

Robert Haas <robertmhaas@gmail.com> writes:

Another awkwardness of this patch is that it makes
create_append_path() and consequently set_dummy_rel_pathlist() take an
additional "root" argument. While there's nothing terribly
unreasonable about this on its face, it's only necessary so that
create_append_path() can call cost_sort(), which takes "root" but
doesn't actually use it. I'm not sure whether it's better to leave
this as-is or to remove the root argument from cost_sort().

Right offhand the cleanest answer to that seems to be to leave
create_append_path alone, and make a separate function named something
like create_ordered_append_path that handles the case where cost_sort
might be needed. I rather wonder if we don't want two separate
execution-time node types anyway, since what Append does seems
significantly different from Merge (and MergeAppend would be just a
misnomer).

I have to run off for a doctors appointment, will continue looking at
this patch when I get back.

regards, tom lane

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#23)
Re: Path question

I wrote:

I rather wonder if we don't want two separate
execution-time node types anyway, since what Append does seems
significantly different from Merge (and MergeAppend would be just a
misnomer).

I've been working on this patch, and have gotten the executor side split
out as a new node type. That adds something like 600 lines of
boilerplate code in various places, but I think it's well worthwhile to
get rid of the spaghetti-code effect of retail checks to see which kind
of Append we're dealing with. (Greg didn't catch all the places that
needed to distinguish, anyway.)

I did run into a problem with my plan to call the new node type "Merge":
the planner is already using "MergePath" as the name for the Path struct
that is precursor to a MergeJoin. For the moment I'm calling the new
node type MergeAppend, but as mentioned I feel that that's a bit of a
misnomer.

The only good alternative that I can think of is to rename MergePath to
MergeJoinPath (and then for consistency probably also HashPath to
HashJoinPath and NestPath to NestLoopPath). While that wouldn't touch
a huge number of files, it seems to carry some risk of breaking pending
patches, and anyway those are names that go back to Berkeley days so
people are used to 'em.

Anybody have a strong feeling about what to call these things?
At the moment I'm leaning to sticking with MergeAppend, but if we
decide to rename it it'd probably be better to do so before committing.

regards, tom lane

#25Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#24)
Re: Path question

On Thu, Oct 14, 2010 at 11:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

I rather wonder if we don't want two separate
execution-time node types anyway, since what Append does seems
significantly different from Merge (and MergeAppend would be just a
misnomer).

I've been working on this patch, and have gotten the executor side split
out as a new node type.  That adds something like 600 lines of
boilerplate code in various places, but I think it's well worthwhile to
get rid of the spaghetti-code effect of retail checks to see which kind
of Append we're dealing with.  (Greg didn't catch all the places that
needed to distinguish, anyway.)

I did run into a problem with my plan to call the new node type "Merge":
the planner is already using "MergePath" as the name for the Path struct
that is precursor to a MergeJoin.  For the moment I'm calling the new
node type MergeAppend, but as mentioned I feel that that's a bit of a
misnomer.

The only good alternative that I can think of is to rename MergePath to
MergeJoinPath (and then for consistency probably also HashPath to
HashJoinPath and NestPath to NestLoopPath).  While that wouldn't touch
a huge number of files, it seems to carry some risk of breaking pending
patches, and anyway those are names that go back to Berkeley days so
people are used to 'em.

Anybody have a strong feeling about what to call these things?
At the moment I'm leaning to sticking with MergeAppend, but if we
decide to rename it it'd probably be better to do so before committing.

I don't like the idea of renaming the join nodes. Both the code churn
and the possibility of confusing long-time users seem undesirable.

Let's just stick with MergeAppend.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#25)
Re: Path question

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Oct 14, 2010 at 11:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Anybody have a strong feeling about what to call these things?
At the moment I'm leaning to sticking with MergeAppend, but if we
decide to rename it it'd probably be better to do so before committing.

I don't like the idea of renaming the join nodes. Both the code churn
and the possibility of confusing long-time users seem undesirable.

Yeah, especially if MergePath would still be there but now meaning
something different.

The other possible line of attack is to call the new node type something
else than either Merge or MergeAppend. Robert and I batted around a few
thoughts off-list last night, but none of them seemed any better than
MergeAppend.

regards, tom lane

#27Greg Stark
gsstark@mit.edu
In reply to: Tom Lane (#24)
Re: Path question

On Thu, Oct 14, 2010 at 8:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I did run into a problem with my plan to call the new node type "Merge":
the planner is already using "MergePath" as the name for the Path struct
that is precursor to a MergeJoin.  For the moment I'm calling the new
node type MergeAppend, but as mentioned I feel that that's a bit of a
misnomer.

On the plus side the resulting node does have a lot in common with
Append nodes and a lot of places that do something special with Append
nodes will have to do the same things with the new node, so having a
similar name might help people remember that when they're adding their
special code for Append.

At the time I went back and forth on whether to have a separate node.
I tried to do it and had the impression that there were a lot more
places that would need to treat the two similarly than places that
needed to treat the two differently. I'm curious to see how you do it
cleanly.

The only other name I batted around at the time was OrderedAppend
which only alters the other half of the name, so no real help.

--
greg

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#22)
Re: Path question

Robert Haas <robertmhaas@gmail.com> writes:

So I tried out the logic described in this email and, with a few
modifications, it seemed to work. Updated patch attached, any review
appreciated.

Applied with revisions.

3. TGL: "Speaking of sorting, it's not entirely clear to me how the
patch ensures that all the child plans produce the necessary sort keys
as output columns, and especially not how it ensures that they all get
produced in the *same* output columns. This might accidentally manage
to work because of the "throwaway" call to make_sort_from_pathkeys(),
but at the very least that's a misleading comment." I'm not sure what
needs to be done about this; I'm going to look at this further.

I spent some time looking at this complaint, and I'm still not sure
what needs to be done about it. Experimentation reveals that the
neither the throwaway call to make_sort_from_pathkeys() nor any other
call to that function is creating the relevant target list entries.

Yeah, this was in fact pretty broken.  make_sort_from_pathkeys *does*
create the relevant tlist entries, but they were getting lost again
because they were attached to a Result node that went into the bit
bucket along with the "throwaway" Sort node.  Much worse, it was hit or
miss whether the tlist entries would be present in the child plan nodes
--- as coded, this would be the case only for children that were
explicitly sorted to meet the merge's pathkey requirements.  If they
weren't there, any sorting on resjunk values would misbehave.  I was
able to develop a test case that exposed this by using expression
indexes, so that went into the regression tests.

I fixed this by refactoring make_sort_from_pathkeys so that the tlist
fixup logic was separately accessible.

There were assorted other bugs too. I think it's all good now, but
we'll find out when beta starts ...

regards, tom lane