Considering fractional paths in Append node

Started by Nikita Malakhov · about 1 year ago · 25 messages
#1Nikita Malakhov
hukutoc@gmail.com

Hi hackers!

A colleague of mine, Andrei Lepikhov, has found interesting behavior
in path cost calculation for the Append node: when evaluating the cheapest
path, it does not take fractional path costs into account.

We've prepared a patch that makes the add_paths_to_append_rel function
consider non-parameterized fractional path costs.
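
The core of the change is small: when root->tuple_fraction is set,
add_paths_to_append_rel() picks each child's cheapest non-parameterized
fractional path instead of its cheapest startup path, roughly:

    if (root->tuple_fraction > 0.0)
        cheapest_path = get_cheapest_fractional_path_ext(childrel,
                                                         root->tuple_fraction,
                                                         true);
    else
        cheapest_path = childrel->cheapest_startup_path;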

The effect is easily seen in one of the standard PG tests:
Vanilla (current master):
explain analyze
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
union all
(values(1)) limit 1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..219.55 rows=1 width=4) (actual time=6.309..6.312 rows=1 loops=1)
   ->  Append  (cost=0.00..2415.09 rows=11 width=4) (actual time=6.308..6.310 rows=1 loops=1)
         ->  Nested Loop  (cost=0.00..2415.03 rows=10 width=4) (actual time=6.307..6.308 rows=1 loops=1)
               Join Filter: (t1.tenthous = t2.tenthous)
               Rows Removed by Join Filter: 4210
               ->  Seq Scan on tenk1 t1  (cost=0.00..445.00 rows=10000 width=8) (actual time=0.004..0.057 rows=422 loops=1)
               ->  Materialize  (cost=0.00..470.05 rows=10 width=4) (actual time=0.000..0.014 rows=10 loops=422)
                     Storage: Memory  Maximum Storage: 17kB
                     ->  Seq Scan on tenk2 t2  (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..5.535 rows=10 loops=1)
                           Filter: (thousand = 0)
                           Rows Removed by Filter: 9990
         ->  Result  (cost=0.00..0.01 rows=1 width=4) (never executed)
 Planning Time: 0.324 ms
 Execution Time: 6.336 ms
(14 rows)

Patched, the same test:
explain analyze
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
union all
(values(1)) limit 1;
                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..126.00 rows=1 width=4) (actual time=0.105..0.106 rows=1 loops=1)
   ->  Append  (cost=0.29..1383.12 rows=11 width=4) (actual time=0.104..0.105 rows=1 loops=1)
         ->  Nested Loop  (cost=0.29..1383.05 rows=10 width=4) (actual time=0.104..0.104 rows=1 loops=1)
               ->  Seq Scan on tenk2 t2  (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..0.076 rows=1 loops=1)
                     Filter: (thousand = 0)
                     Rows Removed by Filter: 421
               ->  Index Scan using tenk1_thous_tenthous on tenk1 t1  (cost=0.29..91.30 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1)
                     Index Cond: (tenthous = t2.tenthous)
         ->  Result  (cost=0.00..0.01 rows=1 width=4) (never executed)
 Planning Time: 0.334 ms
 Execution Time: 0.130 ms
(11 rows)

Hope this optimization could be useful.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#2Nikita Malakhov
hukutoc@gmail.com
In reply to: Nikita Malakhov (#1)
1 attachment(s)
Re: Considering fractional paths in Append node

Hi hackers!

Sorry, I forgot to attach the patch itself. Please check it out.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

Attachments:

0001_append_limit_v1.patch (application/octet-stream)
From 7448db6787c51f19c34a1c0f3efdcf288315d3f1 Mon Sep 17 00:00:00 2001
From: Nikita Malakhov <n.malakhov@postgrespro.ru>
Date: Sun, 13 Oct 2024 10:08:17 +0300
Subject: [PATCH] add_paths_to_append_rel() has to consider fractional paths
 while calculating the cheapest one.

When add_paths_to_append_rel evaluates path costs
if tuple_fraction has valid value (it has fractional paths)
fractional paths costs should be taken into account while
searching for the cheapest one.

Author: Andrei Lepikhov <a.lepikhov@postgrespro.ru>
---
 src/backend/optimizer/path/allpaths.c | 15 +++++++++++++--
 src/backend/optimizer/plan/planner.c  | 24 ++++++++++++++++++++++--
 src/include/optimizer/planner.h       |  4 ++++
 3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 172edb643a..9976907d3a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1364,9 +1364,20 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		 */
 		if (rel->consider_startup && childrel->cheapest_startup_path != NULL)
 		{
+			Path	   *cheapest_path;
+
+			if (root->tuple_fraction > 0.0)
+			{
+				cheapest_path =
+					get_cheapest_fractional_path_ext(childrel,
+													 root->tuple_fraction, true);
+			}
+			else
+				cheapest_path = childrel->cheapest_startup_path;
+
 			/* cheapest_startup_path must not be a parameterized path. */
-			Assert(childrel->cheapest_startup_path->param_info == NULL);
-			accumulate_append_subpath(childrel->cheapest_startup_path,
+			Assert(cheapest_path->param_info == NULL);
+			accumulate_append_subpath(cheapest_path,
 									  &startup_subpaths,
 									  NULL);
 		}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 0f423e9684..3c81614141 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6517,16 +6517,17 @@ make_sort_input_target(PlannerInfo *root,
 }
 
 /*
- * get_cheapest_fractional_path
+ * get_cheapest_fractional_path_ext
  *	  Find the cheapest path for retrieving a specified fraction of all
  *	  the tuples expected to be returned by the given relation.
+ *	  If nonparameterized is set, return only non-parameterized paths
  *
  * We interpret tuple_fraction the same way as grouping_planner.
  *
  * We assume set_cheapest() has been run on the given rel.
  */
 Path *
-get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction)
+get_cheapest_fractional_path_ext(RelOptInfo *rel, double tuple_fraction, bool nonparameterized)
 {
 	Path	   *best_path = rel->cheapest_total_path;
 	ListCell   *l;
@@ -6543,6 +6544,10 @@ get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction)
 	{
 		Path	   *path = (Path *) lfirst(l);
 
+		/* Skip parameterized paths if requested */
+		if (nonparameterized && path->param_info)
+			continue;
+
 		if (path == rel->cheapest_total_path ||
 			compare_fractional_path_costs(best_path, path, tuple_fraction) <= 0)
 			continue;
@@ -6553,6 +6558,21 @@ get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction)
 	return best_path;
 }
 
+/*
+ * get_cheapest_fractional_path
+ *	  Find the cheapest path for retrieving a specified fraction of all
+ *	  the tuples expected to be returned by the given relation.
+ *
+ * We interpret tuple_fraction the same way as grouping_planner.
+ *
+ * We assume set_cheapest() has been run on the given rel.
+ */
+Path *
+get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction)
+{
+	return get_cheapest_fractional_path_ext(rel, tuple_fraction, false);
+}
+
 /*
  * adjust_paths_for_srfs
  *		Fix up the Paths of the given upperrel to handle tSRFs properly.
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 5aeff21b96..6d033d367f 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,10 @@ extern void mark_partial_aggref(Aggref *agg, AggSplit aggsplit);
 extern Path *get_cheapest_fractional_path(RelOptInfo *rel,
 										  double tuple_fraction);
 
+extern Path *get_cheapest_fractional_path_ext(RelOptInfo *rel,
+										  double tuple_fraction,
+										  bool nonparameterized);
+
 extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 #endif							/* PLANNER_H */
-- 
2.25.1

#3Andy Fan
zhihuifan1213@163.com
In reply to: Nikita Malakhov (#2)
Re: Considering fractional paths in Append node

Nikita Malakhov <hukutoc@gmail.com> writes:

Hello Nikita,

Hi hackers!

Sorry, I forgot to attach the patch itself. Please check it out.

Could you check if [1] is related to this subject? I think the hard part
would be which tuple_fraction to use in add_paths_to_append_rel, since
root->tuple_fraction is on the subquery level, while add_paths_to_append_rel
works only on the RelOptInfo level.

[1]: /messages/by-id/CAApHDvry0nSV62kAOH3iccvfPhGPLN0Q97+=b1RsDPXDz3=CiQ@mail.gmail.com

--
Best Regards
Andy Fan

#4Andrei Lepikhov
lepihov@gmail.com
In reply to: Andy Fan (#3)
Re: Considering fractional paths in Append node

On 10/17/24 07:05, Andy Fan wrote:

Nikita Malakhov <hukutoc@gmail.com> writes:

Hello Nikita,

Hi hackers!

Sorry, I forgot to attach the patch itself. Please check it out.

Could you check if [1] is related to this subject? I think the hard part
would be which tuple_fraction to use in add_paths_to_append_rel, since
root->tuple_fraction is on the subquery level, while add_paths_to_append_rel
works only on the RelOptInfo level.

[1]
/messages/by-id/CAApHDvry0nSV62kAOH3iccvfPhGPLN0Q97+=b1RsDPXDz3=CiQ@mail.gmail.com

Yes, this thread is connected to the code Nikita has proposed.
It is mostly related to partitioned cases, of course.
I'm not sure about partitionwise joins - that looks overcomplicated for
the moment. We just see cases where SeqScan is preferred to IndexScan.
One logical way is to add an alternative fractional path to the Append
node; then, if at the top level a Limit, IncrementalSort, or another
fraction-friendly node requests only a small part of the data, the
optimiser will have the option to choose the fractional path.
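
For illustration, the targeted scenario might look like the following sketch
(using the pht1 partitioned table from the regression suite and assuming an
index on the join column c):

    CREATE INDEX pht1_c_idx ON pht1 (c);
    -- With a small LIMIT, the Append may pick fractional subpaths
    -- (parameterised index scans per partition) instead of the
    -- cheapest-total hash joins.
    EXPLAIN (COSTS OFF)
    SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;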

--
regards, Andrei Lepikhov

#5Nikita Malakhov
hukutoc@gmail.com
In reply to: Andrei Lepikhov (#4)
Re: Considering fractional paths in Append node

Hi,

Andy, your words make sense, but it seems to me that in
add_paths_to_append_rel we have no options other than the tuple_fraction
from root and the rows (if any) in the paths we take into account -
please correct me if I am wrong.

Thank you!

Also, on top of that I have an idea of pruning unnecessary partitions
in generate_orderedappend_paths() when we have a valid LIMIT value.
I'm currently checking whether it works correctly in multiple cases,
so I'll send it after we deal with this issue.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#6Andy Fan
zhihuifan1213@163.com
In reply to: Nikita Malakhov (#5)
Re: Considering fractional paths in Append node

Nikita Malakhov <hukutoc@gmail.com> writes:

Hi,

Andy, your words make sense, but it seems to me that in add_paths_to_append_rel
we have no options other than the tuple_fraction from root and the rows (if any)
in the paths we take into account - please correct me if I am wrong.

One option might be to apply your logic only if we can prove that the
tuple_fraction from root is the same as the RelOptInfo-level tuple_fraction,
similar to what I did before. But that turned out to be too complex.

Thank you!

Also, on top of that I have an idea of pruning unnecessary partitions
in generate_orderedappend_paths() when we have a valid LIMIT value.
I'm currently checking whether it works correctly in multiple cases,
so I'll send it after we deal with this issue.

--
Best Regards
Andy Fan

#7Andy Fan
zhihuifan1213@163.com
In reply to: Nikita Malakhov (#1)
Re: Considering fractional paths in Append node

Nikita Malakhov <hukutoc@gmail.com> writes:

The effect is easily seen in one of the standard PG tests:
Vanilla (current master):
explain analyze
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
union all
(values(1)) limit 1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..219.55 rows=1 width=4) (actual time=6.309..6.312 rows=1 loops=1)
   ->  Append  (cost=0.00..2415.09 rows=11 width=4) (actual time=6.308..6.310 rows=1 loops=1)
         ->  Nested Loop  (cost=0.00..2415.03 rows=10 width=4) (actual time=6.307..6.308 rows=1 loops=1)
               Join Filter: (t1.tenthous = t2.tenthous)
               Rows Removed by Join Filter: 4210
               ->  Seq Scan on tenk1 t1  (cost=0.00..445.00 rows=10000 width=8) (actual time=0.004..0.057 rows=422 loops=1)
               ->  Materialize  (cost=0.00..470.05 rows=10 width=4) (actual time=0.000..0.014 rows=10 loops=422)
                     Storage: Memory  Maximum Storage: 17kB
                     ->  Seq Scan on tenk2 t2  (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..5.535 rows=10 loops=1)
                           Filter: (thousand = 0)
                           Rows Removed by Filter: 9990
         ->  Result  (cost=0.00..0.01 rows=1 width=4) (never executed)
 Planning Time: 0.324 ms
 Execution Time: 6.336 ms
(14 rows)

Patched, the same test:
explain analyze
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
union all
(values(1)) limit 1;
                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..126.00 rows=1 width=4) (actual time=0.105..0.106 rows=1 loops=1)
   ->  Append  (cost=0.29..1383.12 rows=11 width=4) (actual time=0.104..0.105 rows=1 loops=1)
         ->  Nested Loop  (cost=0.29..1383.05 rows=10 width=4) (actual time=0.104..0.104 rows=1 loops=1)
               ->  Seq Scan on tenk2 t2  (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..0.076 rows=1 loops=1)
                     Filter: (thousand = 0)
                     Rows Removed by Filter: 421
               ->  Index Scan using tenk1_thous_tenthous on tenk1 t1  (cost=0.29..91.30 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1)
                     Index Cond: (tenthous = t2.tenthous)
         ->  Result  (cost=0.00..0.01 rows=1 width=4) (never executed)
 Planning Time: 0.334 ms
 Execution Time: 0.130 ms
(11 rows)

This is a nice result. After some more thought, I feel the startup
cost calculation for seq scans is a more important issue to address.

Bad plan: Append (cost=0.00..2415.09 ..) shows us the "startup cost" is 0.
Good plan: Append (cost=0.29..1383.12 ..) shows us the "startup cost" is 0.29.

The major reason for this is that we calculate the "startup cost" for
"SeqScan" and "IndexScan" with different guidance. For the "IndexScan",
the startup cost is "the cost to retrieve the first tuple"; however, for
"SeqScan" it is not, as we can see: the startup cost for the query
"SELECT * FROM tenk2 WHERE thousand = 0" is 0.

In my understanding, "startup cost" *already* means the cost to retrieve
the first tuple, but at [1], Tom said:

"""
I think that things might work out better if we redefined the startup
cost as "estimated cost to retrieve the first tuple", rather than its
current very-squishy definition as "cost to initialize the scan".
"""

The above statement confuses me. If we take the startup cost as the cost
to retrieve the first tuple, we can do the quick hack below:

@@ -355,8 +355,8 @@ cost_seqscan(Path *path, PlannerInfo *root,
 	}
 
 	path->disabled_nodes = enable_seqscan ? 0 : 1;
-	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
+	path->startup_cost = startup_cost + (cpu_run_cost + disk_run_cost) * (1 - path->rows / baserel->tuples);
 }
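
To make the effect of this formula concrete (rough arithmetic, using the
estimates from the plans in this thread): for the scan of tenk2 with
thousand = 0 we have a total cost of about 470 and rows=10 out of ~10000
tuples, so

    startup_cost ≈ 0 + 470 * (1 - 10/10000) ≈ 469.53

which matches the startup cost shown for "Seq Scan on tenk2 t2" in the
plans below.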

We get plan:

regression=# explain
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0 union all select 1 limit 1
;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Limit  (cost=470.12..514.00 rows=1 width=4)
   ->  Append  (cost=470.12..952.79 rows=11 width=4)
         ->  Hash Join  (cost=470.12..952.73 rows=10 width=4)
               Hash Cond: (t1.tenthous = t2.tenthous)
               ->  Seq Scan on tenk1 t1  (cost=0.00..445.00 rows=10000 width=8)
               ->  Hash  (cost=470.00..470.00 rows=10 width=4)
                     ->  Seq Scan on tenk2 t2  (cost=469.53..470.00 rows=10 width=4)
                           Filter: (thousand = 0)
         ->  Result  (cost=0.00..0.01 rows=1 width=4)
(9 rows)

set enable_hashjoin to off;

regression=# explain
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0 union all select 1 limit 1
;
                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Limit  (cost=469.81..542.66 rows=1 width=4)
   ->  Append  (cost=469.81..1271.12 rows=11 width=4)
         ->  Nested Loop  (cost=469.81..1271.05 rows=10 width=4)
               ->  Seq Scan on tenk2 t2  (cost=469.53..470.00 rows=10 width=4)
                     Filter: (thousand = 0)
               ->  Index Scan using tenk1_thous_tenthous on tenk1 t1  (cost=0.29..80.09 rows=1 width=8)
                     Index Cond: (tenthous = t2.tenthous)
         ->  Result  (cost=0.00..0.01 rows=1 width=4)
(8 rows)

It looks like we still have some other work to do, but we can see that the
desired plan now has a cost closer to the estimated best plan than before.

[1]: /messages/by-id/3783591.1721327902@sss.pgh.pa.us
--
Best Regards
Andy Fan

#8Andrei Lepikhov
lepihov@gmail.com
In reply to: Andy Fan (#7)
Re: Considering fractional paths in Append node

On 10/18/24 07:54, Andy Fan wrote:

Nikita Malakhov <hukutoc@gmail.com> writes:

The effect is easily seen in one of the standard PG tests:

"""
I think that things might work out better if we redefined the startup
cost as "estimated cost to retrieve the first tuple", rather than its
current very-squishy definition as "cost to initialize the scan".
"""
The above statement confuses me. If we take the startup cost as the cost
to retrieve the first tuple, we can do the quick hack below:

A promising way to go. Of course, even in that case IndexScan usually gives
way to SeqScan (because of the additional heap fetch), and only
IndexOnlyScan may overcome it. Moreover, the SeqScan first-tuple cost is a
contentious issue - who knows how many tuples it will filter out before the
first tuple is produced?
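
(A back-of-the-envelope illustration with the tenk2 example above: with 10
matching rows spread uniformly over ~10000 tuples, the scan is expected to
read about 10001/11 ≈ 900 tuples before hitting the first match; the patched
plan above happened to remove 421 rows before returning its first tuple.)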

It looks like we still have some other work to do, but we can see that the
desired plan now has a cost closer to the estimated best plan than before.

But our patch is about something different: by adding one more Append
strategy (and, as I realised recently, a promising MergeAppend one too), we
give an upper fraction-friendly node a chance to decide which plan is
better. Right now, AFAIK, only the LIMIT node can profit from that (maybe
IncrementalSort too, if we include MergeAppend). But it may open a door to
improving other nodes as well.

--
regards, Andrei Lepikhov

#9Nikita Malakhov
hukutoc@gmail.com
In reply to: Andrei Lepikhov (#8)
Re: Considering fractional paths in Append node

Hi!

Andy, one quick question - what do you think about using root->limit_tuples
as guidance on how many rows we have to consider in the plan costs?

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#10Andy Fan
zhihuifan1213@163.com
In reply to: Nikita Malakhov (#9)
Re: Considering fractional paths in Append node

Nikita Malakhov <hukutoc@gmail.com> writes:

Hi Nikita,

Hi!

Andy, one quick question - what do you think about using root->limit_tuples
as guidance on how many rows we have to consider in the plan costs?

In my experience, committers probably dislike the idea of "ignoring the
difference of tuple_fraction between the subquery level and the RelOptInfo
level", and they also dislike a complex solution like what I did in [1].
Maybe you can try to figure out a complete and simple solution that makes
everyone happy, but I have no idea about this right now.

[1]: /messages/by-id/CAApHDvry0nSV62kAOH3iccvfPhGPLN0Q97+=b1RsDPXDz3=CiQ@mail.gmail.com

--
Best Regards
Andy Fan

#11Nikita Malakhov
hukutoc@gmail.com
In reply to: Andy Fan (#10)
Re: Considering fractional paths in Append node

Hi,

Andy, thank you, I've checked this thread out along with run-time partition
pruning.
I've spent some time looking at the tuple_fraction field usage and would
disagree with you on this topic - it is already used at the RelOptInfo level
later on, in generate_orderedappend_paths(). I mean the following piece:

    if (root->tuple_fraction > 0)
    {
        double      path_fraction = (1.0 / root->tuple_fraction);
        Path        cheapest_consider_fraction;

        cheapest_fractional =
            get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
                                                      pathkeys, NULL,
                                                      path_fraction);
        ...

So it does not seem incorrect to use its value for a single relation in a
subquery - I agree that we do not have an accurate estimation at this level,
but we could use the one we already have.
I've also tried hard to find an example where this patch could break
something, but without success.

--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#12Andy Fan
zhihuifan1213@163.com
In reply to: Nikita Malakhov (#11)
Re: Considering fractional paths in Append node

Nikita Malakhov <hukutoc@gmail.com> writes:

Hi,

Andy, thank you, I've checked this thread out along with run-time
partition pruning.

I'm not sure about the relationship between this topic and run-time
partition pruning...

I've spent some time looking at the tuple_fraction field usage and would disagree
with you on this topic - it is already used at the RelOptInfo level later on, in
generate_orderedappend_paths().

It looks like you are right that root->tuple_fraction has been used at the
RelOptInfo level in generate_orderedappend_paths(). But we have also tried
not to use it at the RelOptInfo level, for example in set_subquery_pathlist.
See:

"""
/*
* We can safely pass the outer tuple_fraction down to the subquery if the
* outer level has no joining, aggregation, or sorting to do. Otherwise
* we'd better tell the subquery to plan for full retrieval. (XXX This
* could probably be made more intelligent ...)
*/
"""

I'm not sure that "more intelligent" would mean just using it directly.

So I'm not saying we can't do this, just that the facts are:
(a) root->tuple_fraction is not exactly the same as the RelOptInfo's
tuple_fraction.
(b) We have used root->tuple_fraction at the RelOptInfo level in some cases
and also tried not to use it in some other cases (and only use it under
certain conditions, similar to what I did before).

It looks like different committers have different opinions on this.

--
Best Regards
Andy Fan

#13Nikita Malakhov
hukutoc@gmail.com
In reply to: Andy Fan (#12)
1 attachment(s)
Re: Considering fractional paths in Append node

Hi!

I've checked set_subquery_pathlist(), and I would rather say that it has
a different purpose, so I'd better not compare these functions directly.

generate_orderedappend_paths() was introduced 6 years ago, and the commit
that considers tuple_fraction, authored by Tomas Vondra, was suggested by
Tom Lane himself. Also, generate_orderedappend_paths() is called directly
from add_paths_to_append_rel(), so I don't see any reason not to use
tuple_fraction for the same purpose as it is used in
generate_orderedappend_paths().

Maybe let's wait and see what other reviewers say?

Thank you for your opinion and interest!

I've corrected the failing test and created a patch entry at the Commitfest:
https://commitfest.postgresql.org/51/5361/

Please check.

On Tue, Oct 29, 2024 at 2:38 AM Andy Fan <zhihuifan1213@163.com> wrote:

Nikita Malakhov <hukutoc@gmail.com> writes:

Hi,

Andy, thank you, I've checked this thread out along with run-time
partition pruning.

I'm not sure about the relationship between this topic and run-time
partition pruning...

I've spent some time looking at the tuple_fraction field usage and would
disagree with you on this topic - it is already used at the RelOptInfo level
later on, in generate_orderedappend_paths().

It looks like you are right that root->tuple_fraction has been used at the
RelOptInfo level in generate_orderedappend_paths(). But we have also tried
not to use it at the RelOptInfo level, for example in set_subquery_pathlist.
See:

"""
/*
* We can safely pass the outer tuple_fraction down to the subquery if the
* outer level has no joining, aggregation, or sorting to do. Otherwise
* we'd better tell the subquery to plan for full retrieval. (XXX This
* could probably be made more intelligent ...)
*/
"""

I'm not sure that "more intelligent" would mean just using it directly.

So I'm not saying we can't do this, just that the facts are:
(a) root->tuple_fraction is not exactly the same as the RelOptInfo's
tuple_fraction.
(b) We have used root->tuple_fraction at the RelOptInfo level in some cases
and also tried not to use it in some other cases (and only use it under
certain conditions, similar to what I did before).

It looks like different committers have different opinions on this.

--
Best Regards
Andy Fan

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

Attachments:

v2-0001-append-limit-v1.patch (application/octet-stream)
From eb7e3c6247d9844db2a95f804144f370ddd758a8 Mon Sep 17 00:00:00 2001
From: Nikita Malakhov <n.malakhov@postgrespro.ru>
Date: Thu, 31 Oct 2024 21:56:28 +0300
Subject: [PATCH] Make add_paths_to_append_rel() to consider fractional paths
 while calculating the cheapest one.

When add_paths_to_append_rel evaluates path costs
if tuple_fraction has valid value (it has fractional paths)
fractional paths costs should be taken into account while
searching for the cheapest one, analogous to the
generate_orderedappend_paths() case root->tuple_fraction > 0.

Author: Andrei Lepikhov <a.lepikhov@postgrespro.ru>
---
 src/backend/optimizer/path/allpaths.c | 15 +++++++++++++--
 src/backend/optimizer/plan/planner.c  | 24 ++++++++++++++++++++++--
 src/include/optimizer/planner.h       |  4 ++++
 src/test/regress/expected/union.out   | 15 +++++++--------
 4 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 172edb643a..9976907d3a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1364,9 +1364,20 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		 */
 		if (rel->consider_startup && childrel->cheapest_startup_path != NULL)
 		{
+			Path	   *cheapest_path;
+
+			if (root->tuple_fraction > 0.0)
+			{
+				cheapest_path =
+					get_cheapest_fractional_path_ext(childrel,
+													 root->tuple_fraction, true);
+			}
+			else
+				cheapest_path = childrel->cheapest_startup_path;
+
 			/* cheapest_startup_path must not be a parameterized path. */
-			Assert(childrel->cheapest_startup_path->param_info == NULL);
-			accumulate_append_subpath(childrel->cheapest_startup_path,
+			Assert(cheapest_path->param_info == NULL);
+			accumulate_append_subpath(cheapest_path,
 									  &startup_subpaths,
 									  NULL);
 		}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 0f423e9684..3c81614141 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6517,16 +6517,17 @@ make_sort_input_target(PlannerInfo *root,
 }
 
 /*
- * get_cheapest_fractional_path
+ * get_cheapest_fractional_path_ext
  *	  Find the cheapest path for retrieving a specified fraction of all
  *	  the tuples expected to be returned by the given relation.
+ *	  If nonparameterized is set, return only non-parameterized paths
  *
  * We interpret tuple_fraction the same way as grouping_planner.
  *
  * We assume set_cheapest() has been run on the given rel.
  */
 Path *
-get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction)
+get_cheapest_fractional_path_ext(RelOptInfo *rel, double tuple_fraction, bool nonparameterized)
 {
 	Path	   *best_path = rel->cheapest_total_path;
 	ListCell   *l;
@@ -6543,6 +6544,10 @@ get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction)
 	{
 		Path	   *path = (Path *) lfirst(l);
 
+		/* Skip parameterized paths if requested */
+		if (nonparameterized && path->param_info)
+			continue;
+
 		if (path == rel->cheapest_total_path ||
 			compare_fractional_path_costs(best_path, path, tuple_fraction) <= 0)
 			continue;
@@ -6553,6 +6558,21 @@ get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction)
 	return best_path;
 }
 
+/*
+ * get_cheapest_fractional_path
+ *	  Find the cheapest path for retrieving a specified fraction of all
+ *	  the tuples expected to be returned by the given relation.
+ *
+ * We interpret tuple_fraction the same way as grouping_planner.
+ *
+ * We assume set_cheapest() has been run on the given rel.
+ */
+Path *
+get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction)
+{
+	return get_cheapest_fractional_path_ext(rel, tuple_fraction, false);
+}
+
 /*
  * adjust_paths_for_srfs
  *		Fix up the Paths of the given upperrel to handle tSRFs properly.
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 5aeff21b96..6d033d367f 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -57,6 +57,10 @@ extern void mark_partial_aggref(Aggref *agg, AggSplit aggsplit);
 extern Path *get_cheapest_fractional_path(RelOptInfo *rel,
 										  double tuple_fraction);
 
+extern Path *get_cheapest_fractional_path_ext(RelOptInfo *rel,
+										  double tuple_fraction,
+										  bool nonparameterized);
+
 extern Expr *preprocess_phv_expression(PlannerInfo *root, Expr *expr);
 
 #endif							/* PLANNER_H */
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index c73631a9a1..62eb4239ad 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1461,18 +1461,17 @@ select t1.unique1 from tenk1 t1
 inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
    union all
 (values(1)) limit 1;
-                       QUERY PLAN                       
---------------------------------------------------------
+                             QUERY PLAN                              
+---------------------------------------------------------------------
  Limit
    ->  Append
          ->  Nested Loop
-               Join Filter: (t1.tenthous = t2.tenthous)
-               ->  Seq Scan on tenk1 t1
-               ->  Materialize
-                     ->  Seq Scan on tenk2 t2
-                           Filter: (thousand = 0)
+               ->  Seq Scan on tenk2 t2
+                     Filter: (thousand = 0)
+               ->  Index Scan using tenk1_thous_tenthous on tenk1 t1
+                     Index Cond: (tenthous = t2.tenthous)
          ->  Result
-(9 rows)
+(8 rows)
 
 -- Ensure there is no problem if cheapest_startup_path is NULL
 explain (costs off)
-- 
2.25.1

#14Andrei Lepikhov
lepihov@gmail.com
In reply to: Nikita Malakhov (#13)
Re: Considering fractional paths in Append node

On 11/2/24 01:18, Nikita Malakhov wrote:

I've corrected the failing test and created a patch entry at the Commitfest:
https://commitfest.postgresql.org/51/5361/

I have played around with this feature, which looks promising for such a
tiny change. It provides a 'bottom boundary' recommendation for the
appended subpaths participating in the 'fractional branch' of paths.
As far as I can see, it works consistently with the plans created for
plain tables filled with similar data.
Regarding the proposal to change the SeqScan logic, IMO Andy is right.
But it is a separate improvement, because it wouldn't work in the case of
LIMIT 10 or 100, as the newly added regression tests demonstrate.
I think this feature gives a sensible profit for partitionwise paths.
Pushing this knowledge down into subpaths could also help postgres_fdw
reduce network traffic.

--
regards, Andrei Lepikhov

#15Andrei Lepikhov
lepihov@gmail.com
In reply to: Andrei Lepikhov (#14)
1 attachment(s)
Re: Considering fractional paths in Append node

On 12/6/24 13:48, Andrei Lepikhov wrote:

On 11/2/24 01:18, Nikita Malakhov wrote:

I've corrected the failing test and created a patch entry at the Commitfest:
https://commitfest.postgresql.org/51/5361/

I have played around with this feature, which looks promising for such a
tiny change. It provides a 'bottom boundary' recommendation for the
appended subpaths participating in the 'fractional branch' of paths.
As far as I can see, it works consistently with the plans created for
plain tables filled with similar data.
Regarding the proposal to change the SeqScan logic, IMO Andy is right.
But it is a separate improvement, because it wouldn't work in the case of
LIMIT 10 or 100, as the newly added regression tests demonstrate.
I think this feature gives a sensible profit for partitionwise paths.
Pushing this knowledge down into subpaths could also help postgres_fdw
reduce network traffic.

See the attached patch: regression tests added; *_ext function removed -
I think we wouldn't back-patch it into minor releases.

--
regards, Andrei Lepikhov

Attachments:

v1-0001-Teach-Append-to-consider-tuple_fraction-when-accu.patch (text/x-patch; charset=UTF-8)
From bd2187403ce44b0bf72c9297552049abc6644469 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Thu, 5 Dec 2024 15:15:49 +0700
Subject: [PATCH v1] Teach Append to consider tuple_fraction when accumulating
 subpaths.

This change is dedicated to more active usage of IndexScan and parameterised
NestLoop paths in partitioned cases under an Append node, as it already works
with plain tables. As newly added regression tests demonstrate, it should
provide more smartness to the partitionwise technique.

Having an indication of how many tuples are needed, it may be more meaningful
to use the 'fractional branch' subpaths of the Append path list, which are more
optimal for this specific number of tuples. Planning on a higher level, if
the optimiser needs all the tuples, it will choose non-fractional paths.
In the case when, during execution, Append needs to return fewer tuples than
declared by tuple_fraction, it would not be harmful to use the 'intermediate'
variant of paths. However, it will earn a considerable profit if a sensible set
of tuples is selected.

The change of the existing regression test demonstrates the positive outcome of
this feature - instead of scanning the whole table, the optimiser prefers
to use a parameterised scan, being aware of the only single tuple the join has
to produce to perform the query.

Discussion: https://www.postgresql.org/message-id/flat/CAN-LCVPxnWB39CUBTgOQ9O7Dd8DrA_tpT1EY3LNVnUuvAX1NjA%40mail.gmail.com
---
 src/backend/optimizer/path/allpaths.c        |  18 ++-
 src/backend/optimizer/plan/planner.c         |   4 +
 src/test/regress/expected/partition_join.out | 116 +++++++++++++++++++
 src/test/regress/expected/union.out          |  15 ++-
 src/test/regress/sql/partition_join.sql      |  21 ++++
 5 files changed, 164 insertions(+), 10 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 172edb643a..4ac8d8e69b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1364,9 +1364,23 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		 */
 		if (rel->consider_startup && childrel->cheapest_startup_path != NULL)
 		{
+			Path	   *cheapest_path;
+
+			/*
+			 * Having an indication on how much tuples the query should provide
+			 * the optimiser tries to choose the path, optimal for that specific
+			 * number of tuples.
+			 */
+			if (root->tuple_fraction > 0.0)
+				cheapest_path =
+					get_cheapest_fractional_path(childrel,
+													 root->tuple_fraction);
+			else
+				cheapest_path = childrel->cheapest_startup_path;
+
 			/* cheapest_startup_path must not be a parameterized path. */
-			Assert(childrel->cheapest_startup_path->param_info == NULL);
-			accumulate_append_subpath(childrel->cheapest_startup_path,
+			Assert(cheapest_path->param_info == NULL);
+			accumulate_append_subpath(cheapest_path,
 									  &startup_subpaths,
 									  NULL);
 		}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762e..0eed566185 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6590,6 +6590,10 @@ get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction)
 	{
 		Path	   *path = (Path *) lfirst(l);
 
+		/* Skip parameterized paths if requested */
+		if (path->param_info)
+			continue;
+
 		if (path == rel->cheapest_total_path ||
 			compare_fractional_path_costs(best_path, path, tuple_fraction) <= 0)
 			continue;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 108f9ecb44..3f7543074f 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -5225,6 +5225,122 @@ SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id DE
                      Index Cond: (id = x_2.id)
 (11 rows)
 
+--
+-- Test Append's fractional paths
+--
+CREATE INDEX pht1_c_idx ON pht1(c);
+-- SeqScan might be the best choice if we need one single tuple
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1;
+                    QUERY PLAN                    
+--------------------------------------------------
+ Limit
+   ->  Append
+         ->  Nested Loop
+               Join Filter: (p1_1.c = p2_1.c)
+               ->  Seq Scan on pht1_p1 p1_1
+               ->  Materialize
+                     ->  Seq Scan on pht1_p1 p2_1
+         ->  Nested Loop
+               Join Filter: (p1_2.c = p2_2.c)
+               ->  Seq Scan on pht1_p2 p1_2
+               ->  Materialize
+                     ->  Seq Scan on pht1_p2 p2_2
+         ->  Nested Loop
+               Join Filter: (p1_3.c = p2_3.c)
+               ->  Seq Scan on pht1_p3 p1_3
+               ->  Materialize
+                     ->  Seq Scan on pht1_p3 p2_3
+(17 rows)
+
+-- Increase number of tuples requested and an IndexScan will be chosen
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Limit
+   ->  Append
+         ->  Nested Loop
+               ->  Seq Scan on pht1_p1 p1_1
+               ->  Memoize
+                     Cache Key: p1_1.c
+                     Cache Mode: logical
+                     ->  Index Scan using pht1_p1_c_idx on pht1_p1 p2_1
+                           Index Cond: (c = p1_1.c)
+         ->  Nested Loop
+               ->  Seq Scan on pht1_p2 p1_2
+               ->  Memoize
+                     Cache Key: p1_2.c
+                     Cache Mode: logical
+                     ->  Index Scan using pht1_p2_c_idx on pht1_p2 p2_2
+                           Index Cond: (c = p1_2.c)
+         ->  Nested Loop
+               ->  Seq Scan on pht1_p3 p1_3
+               ->  Memoize
+                     Cache Key: p1_3.c
+                     Cache Mode: logical
+                     ->  Index Scan using pht1_p3_c_idx on pht1_p3 p2_3
+                           Index Cond: (c = p1_3.c)
+(23 rows)
+
+-- If the almost all the data should be fetched - prefer SeqScan
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1000;
+                    QUERY PLAN                    
+--------------------------------------------------
+ Limit
+   ->  Append
+         ->  Hash Join
+               Hash Cond: (p1_1.c = p2_1.c)
+               ->  Seq Scan on pht1_p1 p1_1
+               ->  Hash
+                     ->  Seq Scan on pht1_p1 p2_1
+         ->  Hash Join
+               Hash Cond: (p1_2.c = p2_2.c)
+               ->  Seq Scan on pht1_p2 p1_2
+               ->  Hash
+                     ->  Seq Scan on pht1_p2 p2_2
+         ->  Hash Join
+               Hash Cond: (p1_3.c = p2_3.c)
+               ->  Seq Scan on pht1_p3 p1_3
+               ->  Hash
+                     ->  Seq Scan on pht1_p3 p2_3
+(17 rows)
+
+SET max_parallel_workers_per_gather = 1;
+SET debug_parallel_query = on;
+-- Partial paths also should be smart enough to employ limits
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+                                  QUERY PLAN                                  
+------------------------------------------------------------------------------
+ Gather
+   Workers Planned: 1
+   Single Copy: true
+   ->  Limit
+         ->  Append
+               ->  Nested Loop
+                     ->  Seq Scan on pht1_p1 p1_1
+                     ->  Memoize
+                           Cache Key: p1_1.c
+                           Cache Mode: logical
+                           ->  Index Scan using pht1_p1_c_idx on pht1_p1 p2_1
+                                 Index Cond: (c = p1_1.c)
+               ->  Nested Loop
+                     ->  Seq Scan on pht1_p2 p1_2
+                     ->  Memoize
+                           Cache Key: p1_2.c
+                           Cache Mode: logical
+                           ->  Index Scan using pht1_p2_c_idx on pht1_p2 p2_2
+                                 Index Cond: (c = p1_2.c)
+               ->  Nested Loop
+                     ->  Seq Scan on pht1_p3 p1_3
+                     ->  Memoize
+                           Cache Key: p1_3.c
+                           Cache Mode: logical
+                           ->  Index Scan using pht1_p3_c_idx on pht1_p3 p2_3
+                                 Index Cond: (c = p1_3.c)
+(26 rows)
+
+RESET debug_parallel_query;
+-- Remove indexes from the partitioned table and its partitions
+DROP INDEX pht1_c_idx CASCADE;
 -- cleanup
 DROP TABLE fract_t;
 RESET max_parallel_workers_per_gather;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index c73631a9a1..62eb4239ad 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1461,18 +1461,17 @@ select t1.unique1 from tenk1 t1
 inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
    union all
 (values(1)) limit 1;
-                       QUERY PLAN                       
---------------------------------------------------------
+                             QUERY PLAN                              
+---------------------------------------------------------------------
  Limit
    ->  Append
          ->  Nested Loop
-               Join Filter: (t1.tenthous = t2.tenthous)
-               ->  Seq Scan on tenk1 t1
-               ->  Materialize
-                     ->  Seq Scan on tenk2 t2
-                           Filter: (thousand = 0)
+               ->  Seq Scan on tenk2 t2
+                     Filter: (thousand = 0)
+               ->  Index Scan using tenk1_thous_tenthous on tenk1 t1
+                     Index Cond: (tenthous = t2.tenthous)
          ->  Result
-(9 rows)
+(8 rows)
 
 -- Ensure there is no problem if cheapest_startup_path is NULL
 explain (costs off)
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index e84b65f444..1410369550 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -1221,6 +1221,27 @@ SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id AS
 EXPLAIN (COSTS OFF)
 SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id DESC LIMIT 10;
 
+--
+-- Test Append's fractional paths
+--
+
+CREATE INDEX pht1_c_idx ON pht1(c);
+-- SeqScan might be the best choice if we need one single tuple
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1;
+-- Increase number of tuples requested and an IndexScan will be chosen
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+-- If the almost all the data should be fetched - prefer SeqScan
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1000;
+
+SET max_parallel_workers_per_gather = 1;
+SET debug_parallel_query = on;
+-- Partial paths also should be smart enough to employ limits
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+RESET debug_parallel_query;
+
+-- Remove indexes from the partitioned table and its partitions
+DROP INDEX pht1_c_idx CASCADE;
+
 -- cleanup
 DROP TABLE fract_t;
 
-- 
2.39.5

#16Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andy Fan (#7)
Re: Considering fractional paths in Append node

Hi, Andy!

On Fri, Oct 18, 2024 at 3:55 AM Andy Fan <zhihuifan1213@163.com> wrote:

Nikita Malakhov <hukutoc@gmail.com> writes:

The effect is easily seen in one of the standard PG tests:
Vanilla (current master):
explain analyze
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
union all
(values(1)) limit 1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..219.55 rows=1 width=4) (actual time=6.309..6.312 rows=1 loops=1)
   ->  Append  (cost=0.00..2415.09 rows=11 width=4) (actual time=6.308..6.310 rows=1 loops=1)
         ->  Nested Loop  (cost=0.00..2415.03 rows=10 width=4) (actual time=6.307..6.308 rows=1 loops=1)
               Join Filter: (t1.tenthous = t2.tenthous)
               Rows Removed by Join Filter: 4210
               ->  Seq Scan on tenk1 t1  (cost=0.00..445.00 rows=10000 width=8) (actual time=0.004..0.057 rows=422 loops=1)
               ->  Materialize  (cost=0.00..470.05 rows=10 width=4) (actual time=0.000..0.014 rows=10 loops=422)
                     Storage: Memory  Maximum Storage: 17kB
                     ->  Seq Scan on tenk2 t2  (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..5.535 rows=10 loops=1)
                           Filter: (thousand = 0)
                           Rows Removed by Filter: 9990
         ->  Result  (cost=0.00..0.01 rows=1 width=4) (never executed)
 Planning Time: 0.324 ms
 Execution Time: 6.336 ms
(14 rows)

Patched, the same test:
explain analyze
select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
union all
(values(1)) limit 1;
                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..126.00 rows=1 width=4) (actual time=0.105..0.106 rows=1 loops=1)
   ->  Append  (cost=0.29..1383.12 rows=11 width=4) (actual time=0.104..0.105 rows=1 loops=1)
         ->  Nested Loop  (cost=0.29..1383.05 rows=10 width=4) (actual time=0.104..0.104 rows=1 loops=1)
               ->  Seq Scan on tenk2 t2  (cost=0.00..470.00 rows=10 width=4) (actual time=0.076..0.076 rows=1 loops=1)
                     Filter: (thousand = 0)
                     Rows Removed by Filter: 421
               ->  Index Scan using tenk1_thous_tenthous on tenk1 t1  (cost=0.29..91.30 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1)
                     Index Cond: (tenthous = t2.tenthous)
         ->  Result  (cost=0.00..0.01 rows=1 width=4) (never executed)
 Planning Time: 0.334 ms
 Execution Time: 0.130 ms
(11 rows)

This is a nice result. After some more thought, I feel the startup
cost calculation for seq scans is a more important issue to address.

Bad plan: Append (cost=0.00..2415.09 ..) shows us the "startup cost" is 0.
Good plan: Append (cost=0.29..1383.12 ..) shows us the "startup cost" is 0.29.

The major reason for this is that we calculate the "startup cost" for
"SeqScan" and "IndexScan" with different guidance. For the "IndexScan",
the startup cost is "the cost to retrieve the first tuple"; however, for
"SeqScan" it is not, as we can see: the startup cost for the query
"SELECT * FROM tenk2 WHERE thousand = 0" is 0.

In my understanding, "startup cost" *already* means the cost to retrieve
the first tuple, but at [1], Tom said:

"""
I think that things might work out better if we redefined the startup
cost as "estimated cost to retrieve the first tuple", rather than its
current very-squishy definition as "cost to initialize the scan".
"""

The above statement confuses me. If we take the startup cost as the cost
to retrieve the first tuple, we can do the quick hack below:

@@ -355,8 +355,8 @@ cost_seqscan(Path *path, PlannerInfo *root,
 	}
 
 	path->disabled_nodes = enable_seqscan ? 0 : 1;
-	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
+	path->startup_cost = startup_cost + (cpu_run_cost + disk_run_cost) * (1 - path->rows / baserel->tuples);
 }

You can check that I already proposed a similar change years ago:
/messages/by-id/CAPpHfdvfDAXPXhFQT3Ww=kZ4tpyAsD07_oz8-fh0=p6eckEBKQ@mail.gmail.com
It appears that in the current model startup_cost is not the cost of the
first tuple. It is meant to be the cost of the preparation work, while the
extraction of tuples is assumed to be distributed uniformly across
total_cost - startup_cost.
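
Roughly speaking, the model interpolates the cost of fetching the first K of
N tuples as

    cost(K) ≈ startup_cost + (total_cost - startup_cost) * K/N

which matches how compare_fractional_path_costs() and the LIMIT cost
estimate treat a path.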

------
Regards,
Alexander Korotkov
Supabase

#17Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrei Lepikhov (#15)
Re: Considering fractional paths in Append node

Hi, Andrei!

On Fri, Dec 6, 2024 at 10:13 AM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 12/6/24 13:48, Andrei Lepikhov wrote:

On 11/2/24 01:18, Nikita Malakhov wrote:

I've corrected the failing test and created a patch entry at the Commitfest:
https://commitfest.postgresql.org/51/5361/

I have played around with this feature, which looks promising for such a
tiny change. It provides a 'bottom boundary' recommendation for the
appended subpaths participating in the 'fractional branch' of paths.
As far as I can see, it works consistently with the plans created for
plain tables filled with similar data.
Regarding the proposal to change the SeqScan logic, IMO Andy is right.
But it is a separate improvement, because it wouldn't work in the case of
LIMIT 10 or 100, as the newly added regression tests demonstrate.
I think this feature gives a sensible profit for partitionwise paths.
Pushing this knowledge down into subpaths could also help postgres_fdw
reduce network traffic.

See the attached patch: regression tests added; *_ext function removed -
I think we wouldn't back-patch it into minor releases.

Thank you for revising this patch. I have a couple of questions for you.
1. You revise get_cheapest_fractional_path() to reject parameterized
paths in all cases. Are you sure this wouldn't affect other callers of
this function?
2. As the usage of root->tuple_fraction at the RelOptInfo level has been
criticized, do you think we could limit this to some simple cases? For
instance, check that the RelOptInfo is the final result relation for the
given root.

Links.
/messages/by-id/871q0fvbje.fsf@163.com

------
Regards,
Alexander Korotkov
Supabase

#18Andrei Lepikhov
lepihov@gmail.com
In reply to: Alexander Korotkov (#17)
1 attachment(s)
Re: Considering fractional paths in Append node

On 2/3/2025 09:53, Alexander Korotkov wrote:

Hi, Andrei!

On Fri, Dec 6, 2024 at 10:13 AM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 12/6/24 13:48, Andrei Lepikhov wrote:

On 11/2/24 01:18, Nikita Malakhov wrote:

I've corrected the failing test and created a patch entry at the Commitfest:
https://commitfest.postgresql.org/51/5361/

I have played around with this feature, which looks promising for such a
tiny change. It provides a 'bottom boundary' recommendation for the
appended subpaths participating in the 'fractional branch' of paths.
As far as I can see, it works consistently with the plans created for
plain tables filled with similar data.
Regarding the proposal to change the SeqScan logic, IMO Andy is right.
But it is a separate improvement, because it wouldn't work in the case of
LIMIT 10 or 100, as the newly added regression tests demonstrate.
I think this feature gives a sensible profit for partitionwise paths.
Pushing this knowledge down into subpaths could also help postgres_fdw
reduce network traffic.

See the attached patch: regression tests added; *_ext function removed -
I think we wouldn't back-patch it into minor releases.

Thank you for revising this patch. I have a couple of questions for you.
1. You revise get_cheapest_fractional_path() to reject parameterized
paths in all cases. Are you sure this wouldn't affect other callers of
this function?

Yes, in my opinion, it shouldn't affect any caller for now. As you can
see, the routine is used for upper rels only, and such RelOptInfos can't
contain parameterised paths at the moment. I already attempted to change
the parameterisation code to let queries pull parameterisation from
subqueries and subplans, but it is too invasive. It seems this logic will
stay as it is for a long time.
I added a comment to explain this logic.
What's more, since all subpaths of an Append must have the same
parameterisation, we simplify the case and just don't explore this option.

2. As the usage of root->tuple_fraction at the RelOptInfo level has been
criticized, do you think we could limit this to some simple cases? For
instance, check that the RelOptInfo is the final result relation for the
given root.

I believe that using tuple_fraction is not an issue. Instead, it serves
as a flag that allows the upper-level optimisation to consider
additional options. The upper-level optimiser has more variants to
examine and will select the optimal path based on the knowledge
available at that level. Therefore, we're not introducing a mistake
here; we're simply adding extra work in a narrow case. However, with a
purely bottom-up planning process, I don't see how we could avoid this
additional workload.

--
regards, Andrei Lepikhov

Attachments:

v2-0001-Teach-Append-to-consider-tuple_fraction-when-accu.patch (text/plain; charset=UTF-8)
From 4de8b1f9d8f22472c8811aad0f5853c436e79e4f Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Thu, 5 Dec 2024 15:15:49 +0700
Subject: [PATCH v2] Teach Append to consider tuple_fraction when accumulating
 subpaths.

This change is dedicated to more active usage of IndexScan and parameterised
NestLoop paths in partitioned cases under an Append node, as it already works
with plain tables. As newly added regression tests demonstrate, it should
provide more smartness to the partitionwise technique.

Having an indication of how many tuples are needed, it may be more meaningful
to use the 'fractional branch' subpaths of the Append path list, which are more
optimal for this specific number of tuples. Planning on a higher level, if
the optimiser needs all the tuples, it will choose non-fractional paths.
In the case when, during execution, Append needs to return fewer tuples than
declared by tuple_fraction, it would not be harmful to use the 'intermediate'
variant of paths. However, it will earn a considerable profit if a sensible set
of tuples is selected.

The change of the existing regression test demonstrates the positive outcome of
this feature - instead of scanning the whole table, the optimiser prefers
to use a parameterised scan, being aware of the only single tuple the join has
to produce to perform the query.

Discussion: https://www.postgresql.org/message-id/flat/CAN-LCVPxnWB39CUBTgOQ9O7Dd8DrA_tpT1EY3LNVnUuvAX1NjA%40mail.gmail.com
---
 src/backend/optimizer/path/allpaths.c        |  18 ++-
 src/backend/optimizer/plan/planner.c         |   7 ++
 src/test/regress/expected/partition_join.out | 116 +++++++++++++++++++
 src/test/regress/expected/union.out          |  15 ++-
 src/test/regress/sql/partition_join.sql      |  21 ++++
 5 files changed, 167 insertions(+), 10 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b5bc9b602e2..d01acaf0b14 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1371,9 +1371,23 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		 */
 		if (rel->consider_startup && childrel->cheapest_startup_path != NULL)
 		{
+			Path	   *cheapest_path;
+
+			/*
+			 * Having an indication on how much tuples the query should provide
+			 * the optimiser tries to choose the path, optimal for that specific
+			 * number of tuples.
+			 */
+			if (root->tuple_fraction > 0.0)
+				cheapest_path =
+					get_cheapest_fractional_path(childrel,
+													 root->tuple_fraction);
+			else
+				cheapest_path = childrel->cheapest_startup_path;
+
 			/* cheapest_startup_path must not be a parameterized path. */
-			Assert(childrel->cheapest_startup_path->param_info == NULL);
-			accumulate_append_subpath(childrel->cheapest_startup_path,
+			Assert(cheapest_path->param_info == NULL);
+			accumulate_append_subpath(cheapest_path,
 									  &startup_subpaths,
 									  NULL);
 		}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 36ee6dd43de..1a3f2670208 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6414,6 +6414,10 @@ make_sort_input_target(PlannerInfo *root,
  *	  Find the cheapest path for retrieving a specified fraction of all
  *	  the tuples expected to be returned by the given relation.
  *
+ * Do not consider parameterized paths; the caller might need a path as an
+ * append subpath and could become limited by the treatment of similar
+ * parameterization of all the subpaths. See 5b7b551 for details.
+ *
  * We interpret tuple_fraction the same way as grouping_planner.
  *
  * We assume set_cheapest() has been run on the given rel.
@@ -6436,6 +6440,9 @@ get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction)
 	{
 		Path	   *path = (Path *) lfirst(l);
 
+		if (path->param_info)
+			continue;
+
 		if (path == rel->cheapest_total_path ||
 			compare_fractional_path_costs(best_path, path, tuple_fraction) <= 0)
 			continue;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 34b963ce6fe..fb17eb37b43 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -5260,6 +5260,122 @@ SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id DE
                      Index Cond: (id = x_2.id)
 (11 rows)
 
+--
+-- Test Append's fractional paths
+--
+CREATE INDEX pht1_c_idx ON pht1(c);
+-- SeqScan might be the best choice if we need one single tuple
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1;
+                    QUERY PLAN                    
+--------------------------------------------------
+ Limit
+   ->  Append
+         ->  Nested Loop
+               Join Filter: (p1_1.c = p2_1.c)
+               ->  Seq Scan on pht1_p1 p1_1
+               ->  Materialize
+                     ->  Seq Scan on pht1_p1 p2_1
+         ->  Nested Loop
+               Join Filter: (p1_2.c = p2_2.c)
+               ->  Seq Scan on pht1_p2 p1_2
+               ->  Materialize
+                     ->  Seq Scan on pht1_p2 p2_2
+         ->  Nested Loop
+               Join Filter: (p1_3.c = p2_3.c)
+               ->  Seq Scan on pht1_p3 p1_3
+               ->  Materialize
+                     ->  Seq Scan on pht1_p3 p2_3
+(17 rows)
+
+-- Increase number of tuples requested and an IndexScan will be chosen
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Limit
+   ->  Append
+         ->  Nested Loop
+               ->  Seq Scan on pht1_p1 p1_1
+               ->  Memoize
+                     Cache Key: p1_1.c
+                     Cache Mode: logical
+                     ->  Index Scan using pht1_p1_c_idx on pht1_p1 p2_1
+                           Index Cond: (c = p1_1.c)
+         ->  Nested Loop
+               ->  Seq Scan on pht1_p2 p1_2
+               ->  Memoize
+                     Cache Key: p1_2.c
+                     Cache Mode: logical
+                     ->  Index Scan using pht1_p2_c_idx on pht1_p2 p2_2
+                           Index Cond: (c = p1_2.c)
+         ->  Nested Loop
+               ->  Seq Scan on pht1_p3 p1_3
+               ->  Memoize
+                     Cache Key: p1_3.c
+                     Cache Mode: logical
+                     ->  Index Scan using pht1_p3_c_idx on pht1_p3 p2_3
+                           Index Cond: (c = p1_3.c)
+(23 rows)
+
+-- If the almost all the data should be fetched - prefer SeqScan
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1000;
+                    QUERY PLAN                    
+--------------------------------------------------
+ Limit
+   ->  Append
+         ->  Hash Join
+               Hash Cond: (p1_1.c = p2_1.c)
+               ->  Seq Scan on pht1_p1 p1_1
+               ->  Hash
+                     ->  Seq Scan on pht1_p1 p2_1
+         ->  Hash Join
+               Hash Cond: (p1_2.c = p2_2.c)
+               ->  Seq Scan on pht1_p2 p1_2
+               ->  Hash
+                     ->  Seq Scan on pht1_p2 p2_2
+         ->  Hash Join
+               Hash Cond: (p1_3.c = p2_3.c)
+               ->  Seq Scan on pht1_p3 p1_3
+               ->  Hash
+                     ->  Seq Scan on pht1_p3 p2_3
+(17 rows)
+
+SET max_parallel_workers_per_gather = 1;
+SET debug_parallel_query = on;
+-- Partial paths also should be smart enough to employ limits
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+                                  QUERY PLAN                                  
+------------------------------------------------------------------------------
+ Gather
+   Workers Planned: 1
+   Single Copy: true
+   ->  Limit
+         ->  Append
+               ->  Nested Loop
+                     ->  Seq Scan on pht1_p1 p1_1
+                     ->  Memoize
+                           Cache Key: p1_1.c
+                           Cache Mode: logical
+                           ->  Index Scan using pht1_p1_c_idx on pht1_p1 p2_1
+                                 Index Cond: (c = p1_1.c)
+               ->  Nested Loop
+                     ->  Seq Scan on pht1_p2 p1_2
+                     ->  Memoize
+                           Cache Key: p1_2.c
+                           Cache Mode: logical
+                           ->  Index Scan using pht1_p2_c_idx on pht1_p2 p2_2
+                                 Index Cond: (c = p1_2.c)
+               ->  Nested Loop
+                     ->  Seq Scan on pht1_p3 p1_3
+                     ->  Memoize
+                           Cache Key: p1_3.c
+                           Cache Mode: logical
+                           ->  Index Scan using pht1_p3_c_idx on pht1_p3 p2_3
+                                 Index Cond: (c = p1_3.c)
+(26 rows)
+
+RESET debug_parallel_query;
+-- Remove indexes from the partitioned table and its partitions
+DROP INDEX pht1_c_idx CASCADE;
 -- cleanup
 DROP TABLE fract_t;
 RESET max_parallel_workers_per_gather;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index caa8fe70a05..96962817ed4 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1472,18 +1472,17 @@ select t1.unique1 from tenk1 t1
 inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
    union all
 (values(1)) limit 1;
-                       QUERY PLAN                       
---------------------------------------------------------
+                             QUERY PLAN                              
+---------------------------------------------------------------------
  Limit
    ->  Append
          ->  Nested Loop
-               Join Filter: (t1.tenthous = t2.tenthous)
-               ->  Seq Scan on tenk1 t1
-               ->  Materialize
-                     ->  Seq Scan on tenk2 t2
-                           Filter: (thousand = 0)
+               ->  Seq Scan on tenk2 t2
+                     Filter: (thousand = 0)
+               ->  Index Scan using tenk1_thous_tenthous on tenk1 t1
+                     Index Cond: (tenthous = t2.tenthous)
          ->  Result
-(9 rows)
+(8 rows)
 
 -- Ensure there is no problem if cheapest_startup_path is NULL
 explain (costs off)
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index 26b8e3d063f..515d94878ab 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -1225,6 +1225,27 @@ SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id AS
 EXPLAIN (COSTS OFF)
 SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id DESC LIMIT 10;
 
+--
+-- Test Append's fractional paths
+--
+
+CREATE INDEX pht1_c_idx ON pht1(c);
+-- SeqScan might be the best choice if we need one single tuple
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1;
+-- Increase number of tuples requested and an IndexScan will be chosen
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+-- If the almost all the data should be fetched - prefer SeqScan
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1000;
+
+SET max_parallel_workers_per_gather = 1;
+SET debug_parallel_query = on;
+-- Partial paths also should be smart enough to employ limits
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+RESET debug_parallel_query;
+
+-- Remove indexes from the partitioned table and its partitions
+DROP INDEX pht1_c_idx CASCADE;
+
 -- cleanup
 DROP TABLE fract_t;
 
-- 
2.48.1

#19Alena Rybakina
a.rybakina@postgrespro.ru
In reply to: Andrei Lepikhov (#18)
Re: Considering fractional paths in Append node

Hi! Thank you for your work on this subject.

I agree with your code, but one phrase in the commit message was confusing
for me:

"This change is dedicated to more active usage of IndexScan and
parameterised
NestLoop paths in partitioned cases under an Append node, as it already
works
with plain tables."

As I understood it, this thread is about the optimization that allows the
planner to consider an Index Scan when only a limited number of rows is
needed, and the commit message mentions that later on. To be honest, I
didn't completely understand the sentence above.

--
Regards,
Alena Rybakina
Postgres Professional

#20Alena Rybakina
a.rybakina@postgrespro.ru
In reply to: Alena Rybakina (#19)
Re: Considering fractional paths in Append node

On 03.03.2025 14:17, Alena Rybakina wrote:

Hi! Thank you for your work on this subject.

I agree with your code, but one phrase in the commit message was confusing
for me:

"This change is dedicated to more active usage of IndexScan and
parameterised
NestLoop paths in partitioned cases under an Append node, as it
already works
with plain tables."

As I understood it, this thread is about the optimization that allows the
planner to consider an Index Scan when only a limited number of rows is
needed, and the commit message mentions that later on. To be honest, I
didn't completely understand the sentence above.

Sorry for the noise; I understood it, and I think everything is fine there.

--
Regards,
Alena Rybakina
Postgres Professional

#21Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrei Lepikhov (#18)
Re: Considering fractional paths in Append node

On Mon, Mar 3, 2025 at 1:04 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 2/3/2025 09:53, Alexander Korotkov wrote:

Hi, Andrei!

On Fri, Dec 6, 2024 at 10:13 AM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 12/6/24 13:48, Andrei Lepikhov wrote:

On 11/2/24 01:18, Nikita Malakhov wrote:

I've corrected failing test and created a patch at Commitfest:
https://commitfest.postgresql.org/51/5361/ <https://
commitfest.postgresql.org/51/5361/>

I have played around with this feature, which looks promising for such a
tiny change. It provides a 'bottom boundary' recommendation for
appending subpaths, participating in the 'fractional branch' of paths.
As I see it works consistently with the plans, created for plain tables
filled with similar data.
According to the proposal to change SeqScan logic, IMO, Andy is right.
But it is a separate improvement because it wouldn't work in the case of
LIMIT 10 or 100, as the newly added regression tests demonstrate.
I think this feature gives sensible profit for partitionwise paths.
Pushing this knowledge into subpaths could help postgres_fdw to reduce
network traffic.

See the attached patch: regression tests added; *_ext function removed -
I think we wouldn't back-patch it into minor releases.

Thank you for revising this patch. I've a couple of questions for you.
1. You revise get_cheapest_fractional_path() to reject parametrized
paths in all the cases. Are you sure this wouldn't affect other
callers of this function?

Yes, in my opinion, it may not affect any caller for now. As you can
see, the routine is used for upper-rels only. Such RelOptInfos can't
contain parameterised paths for now. I already attempted to change the
parameterisation code and let queries pull parameterisation from
subqueries and subplans, but it is too invasive. It seems this logic
will stay as it is for a long time.
I added a comment to explain this logic.
What's more, since all subpaths of an Append must have the same
parameterisation, we simplify the case and just don't discover this option.

Makes sense. I tried adding Assert(!path->param_info) to
get_cheapest_fractional_path() without the patch, and the tests passed.

2. As usage of root->tuple_fraction RelOptInfo it has been criticized,
do you think we could limit this to some simple cases? For instance,
check that RelOptInfo is the final result relation for given root.

I believe that using tuple_fraction is not an issue. Instead, it serves
as a flag that allows the upper-level optimisation to consider
additional options. The upper-level optimiser has more variants to
examine and will select the optimal path based on the knowledge
available at that level. Therefore, we're not introducing a mistake
here; we're simply adding extra work in the narrow case. However, having
only the bottom-up planning process, I don't see how we could avoid this
additional workload.

Yes, but if we can assume root->tuple_fraction applies to the result of the
Append, it's strange that we apply the same tuple fraction to all the child
rels. The later rels are less likely to be used at all and perhaps should
have a smaller tuple_fraction.

------
Regards,
Alexander Korotkov
Supabase

#22Andrei Lepikhov
lepihov@gmail.com
In reply to: Alexander Korotkov (#21)
Re: Considering fractional paths in Append node

On 5/3/2025 03:27, Alexander Korotkov wrote:

On Mon, Mar 3, 2025 at 1:04 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

2. As usage of root->tuple_fraction RelOptInfo it has been criticized,
do you think we could limit this to some simple cases? For instance,
check that RelOptInfo is the final result relation for given root.

I believe that using tuple_fraction is not an issue. Instead, it serves
as a flag that allows the upper-level optimisation to consider
additional options. The upper-level optimiser has more variants to
examine and will select the optimal path based on the knowledge
available at that level. Therefore, we're not introducing a mistake
here; we're simply adding extra work in the narrow case. However, having
only the bottom-up planning process, I don't see how we could avoid this
additional workload.

Yes, but if we can assume root->tuple_fraction applies to the result of the
Append, it's strange that we apply the same tuple fraction to all the child
rels. The later rels are less likely to be used at all and perhaps should
have a smaller tuple_fraction.

Of course, it may happen. But I'm not sure it is a common rule.
Using LIMIT, we usually select data according to specific clauses.
Imagine we need the top 100 ranked goods. Appending partitions of goods, we
will use the index on the 'rating' column. But who knows how the top-rated
goods are spread across the partitions? Maybe a single partition contains
all of them? So, we need to select the 100 top-rated goods from each
partition. Hence, applying the same limit to each partition seems
reasonable, right?
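
For illustration, here is a minimal SQL sketch of that scenario (the table,
the partition layout and the data are made up; only the 'rating' column
comes from the text above). The point is that any single partition may have
to deliver up to 100 rows, so costing each child for roughly that fraction
seems sensible:

-- Hypothetical partitioned "goods" table with an index on rating
CREATE TABLE goods (id int, category int, rating float8)
    PARTITION BY HASH (category);
CREATE TABLE goods_p0 PARTITION OF goods
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE goods_p1 PARTITION OF goods
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);
CREATE INDEX goods_rating_idx ON goods (rating);
INSERT INTO goods
    SELECT i, i % 10, random() FROM generate_series(1, 100000) i;
ANALYZE goods;

-- Top-100 goods over the whole table; the same LIMIT-driven fraction
-- applies to every child scan under the (Merge)Append.
EXPLAIN (COSTS OFF)
SELECT * FROM goods ORDER BY rating DESC LIMIT 100;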

--
regards, Andrei Lepikhov

#23Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrei Lepikhov (#22)
Re: Considering fractional paths in Append node

On Wed, Mar 5, 2025 at 8:32 AM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 5/3/2025 03:27, Alexander Korotkov wrote:

On Mon, Mar 3, 2025 at 1:04 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

2. As usage of root->tuple_fraction RelOptInfo it has been criticized,
do you think we could limit this to some simple cases? For instance,
check that RelOptInfo is the final result relation for given root.

I believe that using tuple_fraction is not an issue. Instead, it serves
as a flag that allows the upper-level optimisation to consider
additional options. The upper-level optimiser has more variants to
examine and will select the optimal path based on the knowledge
available at that level. Therefore, we're not introducing a mistake
here; we're simply adding extra work in the narrow case. However, having
only the bottom-up planning process, I don't see how we could avoid this
additional workload.

Yes, but if we can assume root->tuple_fraction applies to the result of the
Append, it's strange that we apply the same tuple fraction to all the child
rels. The later rels are less likely to be used at all and perhaps should
have a smaller tuple_fraction.

Of course, it may happen. But I'm not sure it is a common rule.
Using LIMIT, we usually select data according to specific clauses.
Imagine we need the top 100 ranked goods. Appending partitions of goods, we
will use the index on the 'rating' column. But who knows how the top-rated
goods are spread across the partitions? Maybe a single partition contains
all of them? So, we need to select the 100 top-rated goods from each
partition. Hence, applying the same limit to each partition seems
reasonable, right?

OK, I didn't notice that add_paths_to_append_rel() is used for MergeAppend
as well. I thought again about regular Append. If we can get the required
number of rows from the first few child relations, the error in the tuple
fraction shouldn't influence the plans much, and the other child relations
wouldn't be used at all. But if we can't, our selectivity estimates are
unlikely to be accurate enough to predict exactly which child relations are
going to be used. So, using the root tuple fraction for every child relation
would be safe, and this approach makes sense to me.
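
To make the first case concrete, here is a toy example (table names are made
up): once the first child of a plain Append satisfies the LIMIT, the
remaining subplans are expected to show up as "(never executed)" in EXPLAIN
ANALYZE, so an imprecise tuple fraction for them costs nothing at run time.

CREATE TABLE app_a (x int);
CREATE TABLE app_b (x int);
INSERT INTO app_a SELECT generate_series(1, 1000);
INSERT INTO app_b SELECT generate_series(1, 1000);
ANALYZE app_a;
ANALYZE app_b;

-- The scan of app_a should provide the five rows; the scan of app_b
-- should then be reported as never executed.
EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF)
SELECT x FROM app_a
UNION ALL
SELECT x FROM app_b
LIMIT 5;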

------
Regards,
Alexander Korotkov
Supabase

#24Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#23)
1 attachment(s)
Re: Considering fractional paths in Append node

On Wed, Mar 5, 2025 at 1:20 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Wed, Mar 5, 2025 at 8:32 AM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 5/3/2025 03:27, Alexander Korotkov wrote:

On Mon, Mar 3, 2025 at 1:04 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

2. As usage of root->tuple_fraction RelOptInfo it has been criticized,
do you think we could limit this to some simple cases? For instance,
check that RelOptInfo is the final result relation for given root.

I believe that using tuple_fraction is not an issue. Instead, it serves
as a flag that allows the upper-level optimisation to consider
additional options. The upper-level optimiser has more variants to
examine and will select the optimal path based on the knowledge
available at that level. Therefore, we're not introducing a mistake
here; we're simply adding extra work in the narrow case. However, having
only the bottom-up planning process, I don't see how we could avoid this
additional workload.

Yes, but if we can assume root->tuple_fraction applies to the result of the
Append, it's strange that we apply the same tuple fraction to all the child
rels. The later rels are less likely to be used at all and perhaps should
have a smaller tuple_fraction.

Of course, it may happen. But I'm not sure it is a common rule.
Using LIMIT, we usually select data according to specific clauses.
Imagine we need the top 100 ranked goods. Appending partitions of goods, we
will use the index on the 'rating' column. But who knows how the top-rated
goods are spread across the partitions? Maybe a single partition contains
all of them? So, we need to select the 100 top-rated goods from each
partition. Hence, applying the same limit to each partition seems
reasonable, right?

OK, I didn't notice that add_paths_to_append_rel() is used for MergeAppend
as well. I thought again about regular Append. If we can get the required
number of rows from the first few child relations, the error in the tuple
fraction shouldn't influence the plans much, and the other child relations
wouldn't be used at all. But if we can't, our selectivity estimates are
unlikely to be accurate enough to predict exactly which child relations are
going to be used. So, using the root tuple fraction for every child relation
would be safe, and this approach makes sense to me.

I've slightly revised the commit message and comments. I'm going to push
this if there are no objections.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v3-0001-Teach-Append-to-consider-tuple_fraction-when-accu.patchapplication/x-patch; name=v3-0001-Teach-Append-to-consider-tuple_fraction-when-accu.patchDownload
From c9c0efa98b74d76912845339cb5b70dba3c8cb17 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Thu, 5 Dec 2024 15:15:49 +0700
Subject: [PATCH v3] Teach Append to consider tuple_fraction when accumulating
 subpaths.

This change is dedicated to more active usage of IndexScan and parameterized
NestLoop paths in partitioned cases under an Append node, as it already works
with plain tables.  As newly added regression tests demonstrate, it should
provide more smartness to the partitionwise technique.

With an indication of how many tuples are needed, it may be more meaningful
to use the 'fractional branch' subpaths of the Append path list, which are
more optimal for this specific number of tuples.  Planning on a higher level,
if the optimizer needs all the tuples, it will choose non-fractional paths.
In the case when, during execution, Append needs to return fewer tuples than
declared by tuple_fraction, it would not be harmful to use the 'intermediate'
variant of paths.  However, it will earn a considerable profit if a sensible
set of tuples is selected.

The change of the existing regression test demonstrates the positive outcome
of this feature: instead of scanning the whole table, the optimizer prefers
to use a parameterized scan, being aware of the only single tuple the join
has to produce to perform the query.

Discussion: https://www.postgresql.org/message-id/flat/CAN-LCVPxnWB39CUBTgOQ9O7Dd8DrA_tpT1EY3LNVnUuvAX1NjA%40mail.gmail.com
Author: Nikita Malakhov <hukutoc@gmail.com>
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Andy Fan <zhihuifan1213@163.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
---
 src/backend/optimizer/path/allpaths.c        |  18 ++-
 src/backend/optimizer/plan/planner.c         |   8 ++
 src/test/regress/expected/partition_join.out | 116 +++++++++++++++++++
 src/test/regress/expected/union.out          |  15 ++-
 src/test/regress/sql/partition_join.sql      |  21 ++++
 5 files changed, 168 insertions(+), 10 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b5bc9b602e2..9e7f01a9684 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1371,9 +1371,23 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		 */
 		if (rel->consider_startup && childrel->cheapest_startup_path != NULL)
 		{
+			Path	   *cheapest_path;
+
+			/*
+			 * With an indication of how many tuples the query should provide,
+			 * the optimizer tries to choose the path optimal for that
+			 * specific number of tuples.
+			 */
+			if (root->tuple_fraction > 0.0)
+				cheapest_path =
+					get_cheapest_fractional_path(childrel,
+													 root->tuple_fraction);
+			else
+				cheapest_path = childrel->cheapest_startup_path;
+
 			/* cheapest_startup_path must not be a parameterized path. */
-			Assert(childrel->cheapest_startup_path->param_info == NULL);
-			accumulate_append_subpath(childrel->cheapest_startup_path,
+			Assert(cheapest_path->param_info == NULL);
+			accumulate_append_subpath(cheapest_path,
 									  &startup_subpaths,
 									  NULL);
 		}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 36ee6dd43de..014e80c30e6 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6414,6 +6414,11 @@ make_sort_input_target(PlannerInfo *root,
  *	  Find the cheapest path for retrieving a specified fraction of all
  *	  the tuples expected to be returned by the given relation.
  *
+ * Do not consider parameterized paths.  If the caller needs a path for upper
+ * rel, it can't have parameterized paths.  If the caller needs an append
+ * subpath, it could become limited by the treatment of similar
+ * parameterization of all the subpaths.
+ *
  * We interpret tuple_fraction the same way as grouping_planner.
  *
  * We assume set_cheapest() has been run on the given rel.
@@ -6436,6 +6441,9 @@ get_cheapest_fractional_path(RelOptInfo *rel, double tuple_fraction)
 	{
 		Path	   *path = (Path *) lfirst(l);
 
+		if (path->param_info)
+			continue;
+
 		if (path == rel->cheapest_total_path ||
 			compare_fractional_path_costs(best_path, path, tuple_fraction) <= 0)
 			continue;
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 34b963ce6fe..938cedd79ad 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -5260,6 +5260,122 @@ SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id DE
                      Index Cond: (id = x_2.id)
 (11 rows)
 
+--
+-- Test Append's fractional paths
+--
+CREATE INDEX pht1_c_idx ON pht1(c);
+-- SeqScan might be the best choice if we need one single tuple
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1;
+                    QUERY PLAN                    
+--------------------------------------------------
+ Limit
+   ->  Append
+         ->  Nested Loop
+               Join Filter: (p1_1.c = p2_1.c)
+               ->  Seq Scan on pht1_p1 p1_1
+               ->  Materialize
+                     ->  Seq Scan on pht1_p1 p2_1
+         ->  Nested Loop
+               Join Filter: (p1_2.c = p2_2.c)
+               ->  Seq Scan on pht1_p2 p1_2
+               ->  Materialize
+                     ->  Seq Scan on pht1_p2 p2_2
+         ->  Nested Loop
+               Join Filter: (p1_3.c = p2_3.c)
+               ->  Seq Scan on pht1_p3 p1_3
+               ->  Materialize
+                     ->  Seq Scan on pht1_p3 p2_3
+(17 rows)
+
+-- Increase number of tuples requested and an IndexScan will be chosen
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Limit
+   ->  Append
+         ->  Nested Loop
+               ->  Seq Scan on pht1_p1 p1_1
+               ->  Memoize
+                     Cache Key: p1_1.c
+                     Cache Mode: logical
+                     ->  Index Scan using pht1_p1_c_idx on pht1_p1 p2_1
+                           Index Cond: (c = p1_1.c)
+         ->  Nested Loop
+               ->  Seq Scan on pht1_p2 p1_2
+               ->  Memoize
+                     Cache Key: p1_2.c
+                     Cache Mode: logical
+                     ->  Index Scan using pht1_p2_c_idx on pht1_p2 p2_2
+                           Index Cond: (c = p1_2.c)
+         ->  Nested Loop
+               ->  Seq Scan on pht1_p3 p1_3
+               ->  Memoize
+                     Cache Key: p1_3.c
+                     Cache Mode: logical
+                     ->  Index Scan using pht1_p3_c_idx on pht1_p3 p2_3
+                           Index Cond: (c = p1_3.c)
+(23 rows)
+
+-- If almost all the data should be fetched - prefer SeqScan
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1000;
+                    QUERY PLAN                    
+--------------------------------------------------
+ Limit
+   ->  Append
+         ->  Hash Join
+               Hash Cond: (p1_1.c = p2_1.c)
+               ->  Seq Scan on pht1_p1 p1_1
+               ->  Hash
+                     ->  Seq Scan on pht1_p1 p2_1
+         ->  Hash Join
+               Hash Cond: (p1_2.c = p2_2.c)
+               ->  Seq Scan on pht1_p2 p1_2
+               ->  Hash
+                     ->  Seq Scan on pht1_p2 p2_2
+         ->  Hash Join
+               Hash Cond: (p1_3.c = p2_3.c)
+               ->  Seq Scan on pht1_p3 p1_3
+               ->  Hash
+                     ->  Seq Scan on pht1_p3 p2_3
+(17 rows)
+
+SET max_parallel_workers_per_gather = 1;
+SET debug_parallel_query = on;
+-- Partial paths should also be smart enough to employ limits
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+                                  QUERY PLAN                                  
+------------------------------------------------------------------------------
+ Gather
+   Workers Planned: 1
+   Single Copy: true
+   ->  Limit
+         ->  Append
+               ->  Nested Loop
+                     ->  Seq Scan on pht1_p1 p1_1
+                     ->  Memoize
+                           Cache Key: p1_1.c
+                           Cache Mode: logical
+                           ->  Index Scan using pht1_p1_c_idx on pht1_p1 p2_1
+                                 Index Cond: (c = p1_1.c)
+               ->  Nested Loop
+                     ->  Seq Scan on pht1_p2 p1_2
+                     ->  Memoize
+                           Cache Key: p1_2.c
+                           Cache Mode: logical
+                           ->  Index Scan using pht1_p2_c_idx on pht1_p2 p2_2
+                                 Index Cond: (c = p1_2.c)
+               ->  Nested Loop
+                     ->  Seq Scan on pht1_p3 p1_3
+                     ->  Memoize
+                           Cache Key: p1_3.c
+                           Cache Mode: logical
+                           ->  Index Scan using pht1_p3_c_idx on pht1_p3 p2_3
+                                 Index Cond: (c = p1_3.c)
+(26 rows)
+
+RESET debug_parallel_query;
+-- Remove indexes from the partitioned table and its partitions
+DROP INDEX pht1_c_idx CASCADE;
 -- cleanup
 DROP TABLE fract_t;
 RESET max_parallel_workers_per_gather;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index caa8fe70a05..96962817ed4 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1472,18 +1472,17 @@ select t1.unique1 from tenk1 t1
 inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
    union all
 (values(1)) limit 1;
-                       QUERY PLAN                       
---------------------------------------------------------
+                             QUERY PLAN                              
+---------------------------------------------------------------------
  Limit
    ->  Append
          ->  Nested Loop
-               Join Filter: (t1.tenthous = t2.tenthous)
-               ->  Seq Scan on tenk1 t1
-               ->  Materialize
-                     ->  Seq Scan on tenk2 t2
-                           Filter: (thousand = 0)
+               ->  Seq Scan on tenk2 t2
+                     Filter: (thousand = 0)
+               ->  Index Scan using tenk1_thous_tenthous on tenk1 t1
+                     Index Cond: (tenthous = t2.tenthous)
          ->  Result
-(9 rows)
+(8 rows)
 
 -- Ensure there is no problem if cheapest_startup_path is NULL
 explain (costs off)
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index 26b8e3d063f..b76c5451001 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -1225,6 +1225,27 @@ SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id AS
 EXPLAIN (COSTS OFF)
 SELECT x.id, y.id FROM fract_t x LEFT JOIN fract_t y USING (id) ORDER BY x.id DESC LIMIT 10;
 
+--
+-- Test Append's fractional paths
+--
+
+CREATE INDEX pht1_c_idx ON pht1(c);
+-- SeqScan might be the best choice if we need one single tuple
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1;
+-- Increase number of tuples requested and an IndexScan will be chosen
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+-- If almost all the data should be fetched - prefer SeqScan
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1000;
+
+SET max_parallel_workers_per_gather = 1;
+SET debug_parallel_query = on;
+-- Partial paths should also be smart enough to employ limits
+EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
+RESET debug_parallel_query;
+
+-- Remove indexes from the partitioned table and its partitions
+DROP INDEX pht1_c_idx CASCADE;
+
 -- cleanup
 DROP TABLE fract_t;
 
-- 
2.39.5 (Apple Git-154)

#25Nikita Malakhov
hukutoc@gmail.com
In reply to: Alexander Korotkov (#24)
Re: Considering fractional paths in Append node

Hi!

No objections. Alexander, thank you!

--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/