Propagate pathkeys from CTEs up to the outer query

Started by Richard Guoalmost 2 years ago7 messages

guofenglinux@gmail.com

almost 2 years ago

1 attachment(s)

In [1]/messages/by-id/482957.1700192299@sss.pgh.pa.us we've reached a conclusion that for a MATERIALIZED CTE it's okay
to 'allow our statistics or guesses for the sub-query to subsequently
influence what the upper planner does'. Commit f7816aec23 exposes
column statistics to the upper planner. In the light of that, here is a
patch that exposes info about the ordering of the CTE result to the
upper planner.

This patch was initially posted in that same thread and has received
some comments from Tom in [2]/messages/by-id/1339607.1700502358@sss.pgh.pa.us. Due to the presence of multiple patches
in that thread, it has led to confusion. So fork a new thread here
specifically dedicated to discussing the patch about exposing pathkeys
from CTEs to the upper planner.

Comments/thoughts?

[1]: /messages/by-id/482957.1700192299@sss.pgh.pa.us
[2]: /messages/by-id/1339607.1700502358@sss.pgh.pa.us

Thanks
Richard

Attachments:

v3-0001-Propagate-pathkeys-from-CTEs-up-to-the-outer-query.patchapplication/octet-stream; name=v3-0001-Propagate-pathkeys-from-CTEs-up-to-the-outer-query.patchDownload

From 6e33c6aa030a88ae4d0eacd73bd8a238ac416aef Mon Sep 17 00:00:00 2001
From: Richard Guo <guofenglinux@gmail.com>
Date: Mon, 20 Nov 2023 10:08:28 +0800
Subject: [PATCH v3] Propagate pathkeys from CTEs up to the outer query

---
 src/backend/optimizer/path/allpaths.c     | 17 ++++++++++++++++-
 src/backend/optimizer/plan/planner.c      |  1 +
 src/backend/optimizer/plan/subselect.c    | 16 ++++++++++------
 src/backend/optimizer/prep/prepjointree.c |  1 +
 src/backend/optimizer/util/pathnode.c     |  5 +++--
 src/include/nodes/pathnodes.h             |  5 +++++
 src/include/optimizer/pathnode.h          |  2 +-
 src/test/regress/expected/with.out        | 17 +++++++++++++++++
 src/test/regress/sql/with.sql             |  7 +++++++
 9 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 84c4de488a..d4791d128e 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2882,6 +2882,8 @@ set_cte_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	ListCell   *lc;
 	int			plan_id;
 	Relids		required_outer;
+	Path	   *ctepath;
+	List	   *pathkeys;
 
 	/*
 	 * Find the referenced CTE, and locate the plan previously made for it.
@@ -2921,6 +2923,19 @@ set_cte_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Mark rel with estimated output rows, width, etc */
 	set_cte_size_estimates(root, rel, cteplan->plan_rows);
 
+	/*
+	 * Locate the best path previously made for the referenced CTE.  We need to
+	 * know its pathkeys.
+	 */
+	Assert(list_length(cteroot->cte_plan_ids) == list_length(cteroot->cte_paths));
+	ctepath = (Path *) list_nth(cteroot->cte_paths, ndx);
+
+	/* Convert the pathkeys to outer representation */
+	pathkeys = convert_subquery_pathkeys(root,
+										 rel,
+										 ctepath->pathkeys,
+										 cteplan->targetlist);
+
 	/*
 	 * We don't support pushing join clauses into the quals of a CTE scan, but
 	 * it could still have required parameterization due to LATERAL refs in
@@ -2929,7 +2944,7 @@ set_cte_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	required_outer = rel->lateral_relids;
 
 	/* Generate appropriate path */
-	add_path(rel, create_ctescan_path(root, rel, required_outer));
+	add_path(rel, create_ctescan_path(root, rel, pathkeys, required_outer));
 }
 
 /*
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 2e2458b128..b9a2dbb88c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -643,6 +643,7 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
 	root->planner_cxt = CurrentMemoryContext;
 	root->init_plans = NIL;
 	root->cte_plan_ids = NIL;
+	root->cte_paths = NIL;
 	root->multiexpr_params = NIL;
 	root->join_domains = NIL;
 	root->eq_classes = NIL;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3115d79ad9..ca73694da6 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -885,16 +885,17 @@ hash_ok_operator(OpExpr *expr)
  * unreferenced SELECT), "inline" it to create a regular sub-SELECT-in-FROM,
  * or convert it to an initplan.
  *
- * A side effect is to fill in root->cte_plan_ids with a list that
- * parallels root->parse->cteList and provides the subplan ID for
- * each CTE's initplan, or a dummy ID (-1) if we didn't make an initplan.
+ * A side effect is to fill in root->cte_plan_ids and root->cte_paths with
+ * lists that parallel root->parse->cteList and provide the subplan ID and
+ * best path for each CTE, or a dummy ID (-1) and a dummy Path (NULL) if we
+ * didn't make a subplan.
  */
 void
 SS_process_ctes(PlannerInfo *root)
 {
 	ListCell   *lc;
 
-	Assert(root->cte_plan_ids == NIL);
+	Assert(root->cte_plan_ids == NIL && root->cte_paths == NIL);
 
 	foreach(lc, root->parse->cteList)
 	{
@@ -913,8 +914,9 @@ SS_process_ctes(PlannerInfo *root)
 		 */
 		if (cte->cterefcount == 0 && cmdType == CMD_SELECT)
 		{
-			/* Make a dummy entry in cte_plan_ids */
+			/* Make a dummy entry in cte_plan_ids and cte_paths */
 			root->cte_plan_ids = lappend_int(root->cte_plan_ids, -1);
+			root->cte_paths = lappend(root->cte_paths, NULL);
 			continue;
 		}
 
@@ -960,8 +962,9 @@ SS_process_ctes(PlannerInfo *root)
 			!contain_volatile_functions(cte->ctequery))
 		{
 			inline_cte(root, cte);
-			/* Make a dummy entry in cte_plan_ids */
+			/* Make a dummy entry in cte_plan_ids and cte_paths */
 			root->cte_plan_ids = lappend_int(root->cte_plan_ids, -1);
+			root->cte_paths = lappend(root->cte_paths, NULL);
 			continue;
 		}
 
@@ -1051,6 +1054,7 @@ SS_process_ctes(PlannerInfo *root)
 		root->init_plans = lappend(root->init_plans, splan);
 
 		root->cte_plan_ids = lappend_int(root->cte_plan_ids, splan->plan_id);
+		root->cte_paths = lappend(root->cte_paths, best_path);
 
 		/* Label the subplan for EXPLAIN purposes */
 		splan->plan_name = psprintf("CTE %s", cte->ctename);
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index aa83dd3636..a572c22556 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -990,6 +990,7 @@ pull_up_simple_subquery(PlannerInfo *root, Node *jtnode, RangeTblEntry *rte,
 	subroot->planner_cxt = CurrentMemoryContext;
 	subroot->init_plans = NIL;
 	subroot->cte_plan_ids = NIL;
+	subroot->cte_paths = NIL;
 	subroot->multiexpr_params = NIL;
 	subroot->join_domains = NIL;
 	subroot->eq_classes = NIL;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8dbf790e89..b23422f9aa 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2113,7 +2113,8 @@ create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
  *	  returning the pathnode.
  */
 Path *
-create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
+create_ctescan_path(PlannerInfo *root, RelOptInfo *rel,
+					List *pathkeys, Relids required_outer)
 {
 	Path	   *pathnode = makeNode(Path);
 
@@ -2125,7 +2126,7 @@ create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 	pathnode->parallel_aware = false;
 	pathnode->parallel_safe = rel->consider_parallel;
 	pathnode->parallel_workers = 0;
-	pathnode->pathkeys = NIL;	/* XXX for now, result is always unordered */
+	pathnode->pathkeys = pathkeys;
 
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
 
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 534692bee1..5fdb136557 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -301,6 +301,11 @@ struct PlannerInfo
 	 */
 	List	   *cte_plan_ids;
 
+	/*
+	 * per-CTE-item list of Paths (or NULL if no Path was made for that CTE)
+	 */
+	List	   *cte_paths;
+
 	/* List of Lists of Params for MULTIEXPR subquery outputs */
 	List	   *multiexpr_params;
 
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index c43d97b48a..837bf6ffdc 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -115,7 +115,7 @@ extern Path *create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
 extern Path *create_tablefuncscan_path(PlannerInfo *root, RelOptInfo *rel,
 									   Relids required_outer);
 extern Path *create_ctescan_path(PlannerInfo *root, RelOptInfo *rel,
-								 Relids required_outer);
+								 List *pathkeys, Relids required_outer);
 extern Path *create_namedtuplestorescan_path(PlannerInfo *root, RelOptInfo *rel,
 											 Relids required_outer);
 extern Path *create_resultscan_path(PlannerInfo *root, RelOptInfo *rel,
diff --git a/src/test/regress/expected/with.out b/src/test/regress/expected/with.out
index 7d796ea69c..bbb77e6bbb 100644
--- a/src/test/regress/expected/with.out
+++ b/src/test/regress/expected/with.out
@@ -672,6 +672,23 @@ select count(*) from tenk1 a
                Index Cond: (unique1 = x.unique1)
 (10 rows)
 
+-- test that pathkeys from a materialized CTE are propagated up to the
+-- outer query
+explain (costs off)
+with x as materialized (select unique1 from tenk1 b order by unique1)
+select count(*) from tenk1 a
+  where unique1 in (select * from x);
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Aggregate
+   CTE x
+     ->  Index Only Scan using tenk1_unique1 on tenk1 b
+   ->  Merge Semi Join
+         Merge Cond: (a.unique1 = x.unique1)
+         ->  Index Only Scan using tenk1_unique1 on tenk1 a
+         ->  CTE Scan on x
+(7 rows)
+
 -- SEARCH clause
 create temp table graph0( f int, t int, label text );
 insert into graph0 values
diff --git a/src/test/regress/sql/with.sql b/src/test/regress/sql/with.sql
index f8a213e357..3b904d89b7 100644
--- a/src/test/regress/sql/with.sql
+++ b/src/test/regress/sql/with.sql
@@ -359,6 +359,13 @@ with x as materialized (insert into tenk1 default values returning unique1)
 select count(*) from tenk1 a
   where unique1 in (select * from x);
 
+-- test that pathkeys from a materialized CTE are propagated up to the
+-- outer query
+explain (costs off)
+with x as materialized (select unique1 from tenk1 b order by unique1)
+select count(*) from tenk1 a
+  where unique1 in (select * from x);
+
 -- SEARCH clause
 
 create temp table graph0( f int, t int, label text );
-- 
2.31.0

Andrei Lepikhov

a.lepikhov@postgrespro.ru

almost 2 years ago

In reply to: Richard Guo (#1)

Re: Propagate pathkeys from CTEs up to the outer query

On 29/1/2024 10:18, Richard Guo wrote:

In [1] we've reached a conclusion that for a MATERIALIZED CTE it's okay
to 'allow our statistics or guesses for the sub-query to subsequently
influence what the upper planner does'. Commit f7816aec23 exposes
column statistics to the upper planner. In the light of that, here is a
patch that exposes info about the ordering of the CTE result to the
upper planner.

This patch was initially posted in that same thread and has received
some comments from Tom in [2]. Due to the presence of multiple patches
in that thread, it has led to confusion. So fork a new thread here
specifically dedicated to discussing the patch about exposing pathkeys
from CTEs to the upper planner.

I like this approach. It looks good initially, but such features need
more opinions/views/time to analyse corner cases.
It goes alongside my current backburner - pull parameterisation through
the GROUP-BY and the query block fence up to the JOIN searching code of
the parent query.

--
regards,
Andrei Lepikhov
Postgres Professional

Tom Lane

tgl@sss.pgh.pa.us

almost 2 years ago

In reply to: Richard Guo (#1)

Re: Propagate pathkeys from CTEs up to the outer query

Richard Guo <guofenglinux@gmail.com> writes:

This patch was initially posted in that same thread and has received
some comments from Tom in [2]. Due to the presence of multiple patches
in that thread, it has led to confusion. So fork a new thread here
specifically dedicated to discussing the patch about exposing pathkeys
from CTEs to the upper planner.

I got around to looking at this finally. I was a bit surprised by
your choice of data structure. You made a per-CTE-item cte_paths
list paralleling cte_plan_ids, but what I had had in mind was a
per-subplan list of paths paralleling glob->subplans and subroots.
This would mean that the code for ordinary SubqueryScans would
also need to fill in that list, but surely that's a trivial cost
compared to everything else we do to prepare a subplan. I don't
think that we have any immediate need to remember that info for
an ordinary SubqueryScan, but it seems plausible that we will
in future. Also, I'm not sure that a Path is fully interpretable
without the associated PlannerInfo (subroot), so keeping it
beside the list of subroots seems more future-proof than dissociating
it from that. This approach would also be more amenable to postponing
creation of the subplans, as we speculated about earlier. (I have
no near-term desire to actually do that, but maybe someday it will
happen.)

regards, tom lane

Richard Guo

guofenglinux@gmail.com

almost 2 years ago

In reply to: Tom Lane (#3)

1 attachment(s)

Re: Propagate pathkeys from CTEs up to the outer query

On Tue, Mar 26, 2024 at 1:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I got around to looking at this finally. I was a bit surprised by
your choice of data structure. You made a per-CTE-item cte_paths
list paralleling cte_plan_ids, but what I had had in mind was a
per-subplan list of paths paralleling glob->subplans and subroots.
This would mean that the code for ordinary SubqueryScans would
also need to fill in that list, but surely that's a trivial cost
compared to everything else we do to prepare a subplan. I don't
think that we have any immediate need to remember that info for
an ordinary SubqueryScan, but it seems plausible that we will
in future. Also, I'm not sure that a Path is fully interpretable
without the associated PlannerInfo (subroot), so keeping it
beside the list of subroots seems more future-proof than dissociating
it from that. This approach would also be more amenable to postponing
creation of the subplans, as we speculated about earlier. (I have
no near-term desire to actually do that, but maybe someday it will
happen.)

I agree with your points. Previously I was thinking that CTEs were the
only scenario where we needed to remember the best path and only
required the best path's pathkeys. However, considering potential
future use cases as you mentioned, I concur that having a per-subplan
list of paths would be more future-proof. Please see attached v4 patch.

Thanks
Richard

Attachments:

v4-0001-Propagate-pathkeys-from-CTEs-up-to-the-outer-query.patchapplication/octet-stream; name=v4-0001-Propagate-pathkeys-from-CTEs-up-to-the-outer-query.patchDownload

From b53c42fcc9b3608623a22d4366a8472321e2ac39 Mon Sep 17 00:00:00 2001
From: Richard Guo <guofenglinux@gmail.com>
Date: Mon, 20 Nov 2023 10:08:28 +0800
Subject: [PATCH v4] Propagate pathkeys from CTEs up to the outer query

---
 src/backend/optimizer/path/allpaths.c  | 17 ++++++++++++++++-
 src/backend/optimizer/plan/planner.c   |  1 +
 src/backend/optimizer/plan/subselect.c | 12 +++++++++---
 src/backend/optimizer/util/pathnode.c  |  5 +++--
 src/include/nodes/pathnodes.h          |  3 +++
 src/include/optimizer/pathnode.h       |  2 +-
 src/test/regress/expected/with.out     | 17 +++++++++++++++++
 src/test/regress/sql/with.sql          |  7 +++++++
 8 files changed, 57 insertions(+), 7 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0b98f0856e..e87f37d219 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2873,6 +2873,8 @@ static void
 set_cte_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 {
 	Plan	   *cteplan;
+	Path	   *ctepath;
+	List	   *pathkeys;
 	PlannerInfo *cteroot;
 	Index		levelsup;
 	int			ndx;
@@ -2918,6 +2920,19 @@ set_cte_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Mark rel with estimated output rows, width, etc */
 	set_cte_size_estimates(root, rel, cteplan->plan_rows);
 
+	/*
+	 * Locate the best path previously made for the referenced CTE.  We need to
+	 * know its pathkeys.
+	 */
+	Assert(list_length(root->glob->subpaths) == list_length(root->glob->subplans));
+	ctepath = (Path *) list_nth(root->glob->subpaths, plan_id - 1);
+
+	/* Convert the pathkeys to outer representation */
+	pathkeys = convert_subquery_pathkeys(root,
+										 rel,
+										 ctepath->pathkeys,
+										 cteplan->targetlist);
+
 	/*
 	 * We don't support pushing join clauses into the quals of a CTE scan, but
 	 * it could still have required parameterization due to LATERAL refs in
@@ -2926,7 +2941,7 @@ set_cte_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	required_outer = rel->lateral_relids;
 
 	/* Generate appropriate path */
-	add_path(rel, create_ctescan_path(root, rel, required_outer));
+	add_path(rel, create_ctescan_path(root, rel, pathkeys, required_outer));
 }
 
 /*
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6d08cc8cdd..3b60a8c5be 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -307,6 +307,7 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	glob->boundParams = boundParams;
 	glob->subplans = NIL;
 	glob->subroots = NIL;
+	glob->subpaths = NIL;
 	glob->rewindPlanIDs = NULL;
 	glob->finalrtable = NIL;
 	glob->finalrteperminfos = NIL;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index d6954a7e86..27ad3fc2f8 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -539,10 +539,12 @@ build_subplan(PlannerInfo *root, Plan *plan, PlannerInfo *subroot,
 	}
 
 	/*
-	 * Add the subplan and its PlannerInfo to the global lists.
+	 * Add the subplan and its PlannerInfo, as well as a dummy Path entry, to
+	 * the global lists.
 	 */
 	root->glob->subplans = lappend(root->glob->subplans, plan);
 	root->glob->subroots = lappend(root->glob->subroots, subroot);
+	root->glob->subpaths = lappend(root->glob->subpaths, NULL);
 	splan->plan_id = list_length(root->glob->subplans);
 
 	if (isInitPlan)
@@ -1029,10 +1031,12 @@ SS_process_ctes(PlannerInfo *root)
 		splan->setParam = list_make1_int(paramid);
 
 		/*
-		 * Add the subplan and its PlannerInfo to the global lists.
+		 * Add the subplan and its PlannerInfo, as well as the best path, to
+		 * the global lists.
 		 */
 		root->glob->subplans = lappend(root->glob->subplans, plan);
 		root->glob->subroots = lappend(root->glob->subroots, subroot);
+		root->glob->subpaths = lappend(root->glob->subpaths, best_path);
 		splan->plan_id = list_length(root->glob->subplans);
 
 		root->init_plans = lappend(root->init_plans, splan);
@@ -3021,10 +3025,12 @@ SS_make_initplan_from_plan(PlannerInfo *root,
 	SubPlan    *node;
 
 	/*
-	 * Add the subplan and its PlannerInfo to the global lists.
+	 * Add the subplan and its PlannerInfo, as well as a dummy Path entry, to
+	 * the global lists.
 	 */
 	root->glob->subplans = lappend(root->glob->subplans, plan);
 	root->glob->subroots = lappend(root->glob->subroots, subroot);
+	root->glob->subpaths = lappend(root->glob->subpaths, NULL);
 
 	/*
 	 * Create a SubPlan node and add it to the outer list of InitPlans. Note
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index c29ca5a0da..e4a5ae83e4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2121,7 +2121,8 @@ create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
  *	  returning the pathnode.
  */
 Path *
-create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
+create_ctescan_path(PlannerInfo *root, RelOptInfo *rel,
+					List *pathkeys, Relids required_outer)
 {
 	Path	   *pathnode = makeNode(Path);
 
@@ -2133,7 +2134,7 @@ create_ctescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 	pathnode->parallel_aware = false;
 	pathnode->parallel_safe = rel->consider_parallel;
 	pathnode->parallel_workers = 0;
-	pathnode->pathkeys = NIL;	/* XXX for now, result is always unordered */
+	pathnode->pathkeys = pathkeys;
 
 	cost_ctescan(pathnode, root, rel, pathnode->param_info);
 
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 08dbee002f..6611b1e0aa 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -107,6 +107,9 @@ typedef struct PlannerGlobal
 	/* PlannerInfos for SubPlan nodes */
 	List	   *subroots pg_node_attr(read_write_ignore);
 
+	/* best Paths for SubPlan nodes */
+	List	   *subpaths;
+
 	/* indices of subplans that require REWIND */
 	Bitmapset  *rewindPlanIDs;
 
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 99c2f955aa..7cc481b8c2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -115,7 +115,7 @@ extern Path *create_valuesscan_path(PlannerInfo *root, RelOptInfo *rel,
 extern Path *create_tablefuncscan_path(PlannerInfo *root, RelOptInfo *rel,
 									   Relids required_outer);
 extern Path *create_ctescan_path(PlannerInfo *root, RelOptInfo *rel,
-								 Relids required_outer);
+								 List *pathkeys, Relids required_outer);
 extern Path *create_namedtuplestorescan_path(PlannerInfo *root, RelOptInfo *rel,
 											 Relids required_outer);
 extern Path *create_resultscan_path(PlannerInfo *root, RelOptInfo *rel,
diff --git a/src/test/regress/expected/with.out b/src/test/regress/expected/with.out
index 7bef181d78..b4f3121751 100644
--- a/src/test/regress/expected/with.out
+++ b/src/test/regress/expected/with.out
@@ -672,6 +672,23 @@ select count(*) from tenk1 a
                Index Cond: (unique1 = x.unique1)
 (10 rows)
 
+-- test that pathkeys from a materialized CTE are propagated up to the
+-- outer query
+explain (costs off)
+with x as materialized (select unique1 from tenk1 b order by unique1)
+select count(*) from tenk1 a
+  where unique1 in (select * from x);
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Aggregate
+   CTE x
+     ->  Index Only Scan using tenk1_unique1 on tenk1 b
+   ->  Merge Semi Join
+         Merge Cond: (a.unique1 = x.unique1)
+         ->  Index Only Scan using tenk1_unique1 on tenk1 a
+         ->  CTE Scan on x
+(7 rows)
+
 -- SEARCH clause
 create temp table graph0( f int, t int, label text );
 insert into graph0 values
diff --git a/src/test/regress/sql/with.sql b/src/test/regress/sql/with.sql
index 037bc0a511..3fb0b33fce 100644
--- a/src/test/regress/sql/with.sql
+++ b/src/test/regress/sql/with.sql
@@ -359,6 +359,13 @@ with x as materialized (insert into tenk1 default values returning unique1)
 select count(*) from tenk1 a
   where unique1 in (select * from x);
 
+-- test that pathkeys from a materialized CTE are propagated up to the
+-- outer query
+explain (costs off)
+with x as materialized (select unique1 from tenk1 b order by unique1)
+select count(*) from tenk1 a
+  where unique1 in (select * from x);
+
 -- SEARCH clause
 
 create temp table graph0( f int, t int, label text );
-- 
2.31.0

Tom Lane

tgl@sss.pgh.pa.us

almost 2 years ago

In reply to: Richard Guo (#4)

Re: Propagate pathkeys from CTEs up to the outer query

Richard Guo <guofenglinux@gmail.com> writes:

I agree with your points. Previously I was thinking that CTEs were the
only scenario where we needed to remember the best path and only
required the best path's pathkeys. However, considering potential
future use cases as you mentioned, I concur that having a per-subplan
list of paths would be more future-proof. Please see attached v4 patch.

Hm, well, you didn't actually fill in the paths for the other
subqueries. I agree that it's not worth doing so in
SS_make_initplan_from_plan, but a comment explaining that decision
seems in order. Also, there's nothing stopping us from saving the
path for subplans made in build_subplan, except adding a parameter
to pass them down. So I did that, and made a couple other cosmetic
changes, and pushed it.

One thing I noticed while testing is that while your regression-test
query successfully propagates the pathkeys:

regression=# explain with x as materialized (select unique1 from tenk1 b order by unique1)
select count(*) from tenk1 a
where unique1 in (select * from x);
QUERY PLAN
----------------------------------------------------------------------------------------------------
Aggregate (cost=915.55..915.56 rows=1 width=8)
CTE x
-> Index Only Scan using tenk1_unique1 on tenk1 b (cost=0.29..270.29 rows=10000 width=4)
-> Merge Semi Join (cost=0.31..620.26 rows=10000 width=0)
Merge Cond: (a.unique1 = x.unique1)
-> Index Only Scan using tenk1_unique1 on tenk1 a (cost=0.29..270.29 rows=10000 width=4)
-> CTE Scan on x (cost=0.00..200.00 rows=10000 width=4)
(7 rows)

this variant doesn't:

regression=# explain with x as materialized (select unique1 from tenk1 b)
select count(*) from tenk1 a
where unique1 in (select * from x);
QUERY PLAN
----------------------------------------------------------------------------------------------------
Aggregate (cost=1028.07..1028.08 rows=1 width=8)
CTE x
-> Index Only Scan using tenk1_unique1 on tenk1 b (cost=0.29..270.29 rows=10000 width=4)
-> Hash Semi Join (cost=325.28..732.78 rows=10000 width=0)
Hash Cond: (a.unique1 = x.unique1)
-> Index Only Scan using tenk1_unique1 on tenk1 a (cost=0.29..270.29 rows=10000 width=4)
-> Hash (cost=200.00..200.00 rows=10000 width=4)
-> CTE Scan on x (cost=0.00..200.00 rows=10000 width=4)
(8 rows)

That's not the fault of anything we did here; the IndexOnlyScan path
in the subquery is in fact not marked with any pathkeys, even though
clearly its result is sorted. I believe that's an intentional
decision from way way back, that pathkeys only correspond to orderings
that are of interest in the current query level. "select unique1 from
tenk1 b order by unique1" has an interest in ordering by unique1,
but "select unique1 from tenk1 b" does not, so it's choosing that
path strictly according to cost. Not generating pathkeys in such a
query saves a few cycles and ensures that we won't improperly prefer
a path on the basis of pathkeys if it hasn't got a cost advantage.
So I'm quite hesitant to muck with that old decision, especially in
the waning days of a development cycle, but the results do feel a
little strange here.

regards, tom lane

Richard Guo

guofenglinux@gmail.com

almost 2 years ago

In reply to: Tom Lane (#5)

Re: Propagate pathkeys from CTEs up to the outer query

On Wed, Mar 27, 2024 at 1:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Richard Guo <guofenglinux@gmail.com> writes:

I agree with your points. Previously I was thinking that CTEs were the
only scenario where we needed to remember the best path and only
required the best path's pathkeys. However, considering potential
future use cases as you mentioned, I concur that having a per-subplan
list of paths would be more future-proof. Please see attached v4 patch.

Hm, well, you didn't actually fill in the paths for the other
subqueries. I agree that it's not worth doing so in
SS_make_initplan_from_plan, but a comment explaining that decision
seems in order. Also, there's nothing stopping us from saving the
path for subplans made in build_subplan, except adding a parameter
to pass them down. So I did that, and made a couple other cosmetic
changes, and pushed it.

Thanks for the adjustments and pushing!

That's not the fault of anything we did here; the IndexOnlyScan path
in the subquery is in fact not marked with any pathkeys, even though
clearly its result is sorted. I believe that's an intentional
decision from way way back, that pathkeys only correspond to orderings
that are of interest in the current query level. "select unique1 from
tenk1 b order by unique1" has an interest in ordering by unique1,
but "select unique1 from tenk1 b" does not, so it's choosing that
path strictly according to cost. Not generating pathkeys in such a
query saves a few cycles and ensures that we won't improperly prefer
a path on the basis of pathkeys if it hasn't got a cost advantage.
So I'm quite hesitant to muck with that old decision, especially in
the waning days of a development cycle, but the results do feel a
little strange here.

Yeah, I also noticed this while writing the test case. That's why I
added 'order by unique1' explicitly in the CTE subquery. This also
happens to subquery RTEs, such as

explain (costs off)
select * from (select unique1 from tenk1 offset 0) order by unique1;
QUERY PLAN
----------------------------------------------------
Sort
Sort Key: tenk1.unique1
-> Index Only Scan using tenk1_unique1 on tenk1
(3 rows)

I agree that mucking with the old decision might not be a good idea. In
addition, for a MATERIALIZED CTE, generating pathkeys according to the
outer query's ordering requirements breaks the idea of optimization
fence: the outer query should not affect the plan chosen for the CTE
query.

Thanks
Richard

David Rowley

dgrowleyml@gmail.com

almost 2 years ago

In reply to: Tom Lane (#5)

Re: Propagate pathkeys from CTEs up to the outer query

On Wed, 27 Mar 2024 at 06:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:

That's not the fault of anything we did here; the IndexOnlyScan path
in the subquery is in fact not marked with any pathkeys, even though
clearly its result is sorted. I believe that's an intentional
decision from way way back, that pathkeys only correspond to orderings
that are of interest in the current query level. "select unique1 from
tenk1 b order by unique1" has an interest in ordering by unique1,
but "select unique1 from tenk1 b" does not, so it's choosing that
path strictly according to cost. Not generating pathkeys in such a
query saves a few cycles and ensures that we won't improperly prefer
a path on the basis of pathkeys if it hasn't got a cost advantage.
So I'm quite hesitant to muck with that old decision, especially in
the waning days of a development cycle, but the results do feel a
little strange here.

I agree that add_path() might start making questionable choices if we
were to include unrelated pathkeys in a path. However, I don't think
it would be a bad idea to give a subquery a bit more context about
where it's running and which orders might be useful for the outer
query.

What's been reported in [1]/messages/by-id/242fc7c6-a8aa-2daf-ac4c-0a231e2619c1@gmail.com I think could be solved by giving the
planner some way to tell subqueries what pathkeys are useful to the
outer query.

David

[1]: /messages/by-id/242fc7c6-a8aa-2daf-ac4c-0a231e2619c1@gmail.com