Consider explicit incremental sort for Append and MergeAppend
For ordered Append or MergeAppend, it seems that incremental sort is
currently not considered when injecting an explicit sort into subpaths
that are not sufficiently ordered. For instance:
set enable_seqscan to off;
explain (costs off)
select hundred as x, two as y from tenk1
union all
select thousand as x, tenthous as y from tenk1
order by x, y;
QUERY PLAN
-------------------------------------------------------------------
Merge Append
Sort Key: tenk1.hundred, tenk1.two
-> Sort
Sort Key: tenk1.hundred, tenk1.two
-> Index Scan using tenk1_hundred on tenk1
-> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
(6 rows)
Similar to what we do in 828e94c9d, I think we can also consider using
explicit incremental sort for ordered Append or MergeAppend. Here is
a patch doing that. With this patch, the plan above changes to:
QUERY PLAN
-------------------------------------------------------------------
Merge Append
Sort Key: tenk1.hundred, tenk1.two
-> Incremental Sort
Sort Key: tenk1.hundred, tenk1.two
Presorted Key: tenk1.hundred
-> Index Scan using tenk1_hundred on tenk1
-> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
(7 rows)
This targets v19.
Thanks
Richard
Attachments:
v1-0001-Consider-explicit-incremental-sort-for-Append-and.patchapplication/octet-stream; name=v1-0001-Consider-explicit-incremental-sort-for-Append-and.patchDownload
From 2f4098a52f6fd7439aebe4447eaeddc40f0e741f Mon Sep 17 00:00:00 2001
From: Richard Guo <guofenglinux@gmail.com>
Date: Fri, 9 May 2025 16:44:20 +0900
Subject: [PATCH v1] Consider explicit incremental sort for Append and
MergeAppend
For an ordered Append or MergeAppend, we need to inject an explicit
sort into any subpath that is not already well enough ordered.
Currently, only explicit full sorts are considered; incremental sorts
are not yet taken into account.
In this patch, for subpaths of an ordered Append or MergeAppend, we
choose to use explicit incremental sort if it is enabled and there are
presorted keys.
The rationale is based on the assumption that incremental sort is
always faster than full sort when there are presorted keys, a premise
that has been applied in various parts of the code. In addition, the
current cost model tends to favor incremental sort as being cheaper
than full sort in the presence of presorted keys, making it reasonable
not to consider full sort in such cases.
No backpatch as this could result in plan changes.
---
src/backend/optimizer/path/costsize.c | 54 ++++++++++----
src/backend/optimizer/plan/createplan.c | 66 +++++++++++++++--
src/backend/optimizer/util/pathnode.c | 73 ++++++++++++-------
src/include/optimizer/cost.h | 2 +-
.../regress/expected/incremental_sort.out | 40 ++++++++++
src/test/regress/expected/inherit.out | 10 ++-
src/test/regress/sql/incremental_sort.sql | 24 ++++++
7 files changed, 217 insertions(+), 52 deletions(-)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 3d44815ed5a..1f04a2c182c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2247,7 +2247,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
* Determines and returns the cost of an Append node.
*/
void
-cost_append(AppendPath *apath)
+cost_append(AppendPath *apath, PlannerInfo *root)
{
ListCell *l;
@@ -2309,26 +2309,52 @@ cost_append(AppendPath *apath)
foreach(l, apath->subpaths)
{
Path *subpath = (Path *) lfirst(l);
- Path sort_path; /* dummy for result of cost_sort */
+ int presorted_keys;
+ Path sort_path; /* dummy for result of
+ * cost_sort/cost_incremental_sort */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (!pathkeys_count_contained_in(pathkeys, subpath->pathkeys,
+ &presorted_keys))
{
/*
* We'll need to insert a Sort node, so include costs for
- * that. We can use the parent's LIMIT if any, since we
+ * that. We choose to use incremental sort if it is
+ * enabled and there are presorted keys; otherwise we use
+ * full sort.
+ *
+ * We can use the parent's LIMIT if any, since we
* certainly won't pull more than that many tuples from
* any child.
*/
- cost_sort(&sort_path,
- NULL, /* doesn't currently need root */
- pathkeys,
- subpath->disabled_nodes,
- subpath->total_cost,
- subpath->rows,
- subpath->pathtarget->width,
- 0.0,
- work_mem,
- apath->limit_tuples);
+ if (enable_incremental_sort && presorted_keys > 0)
+ {
+ cost_incremental_sort(&sort_path,
+ root,
+ pathkeys,
+ presorted_keys,
+ subpath->disabled_nodes,
+ subpath->startup_cost,
+ subpath->total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ apath->limit_tuples);
+ }
+ else
+ {
+ cost_sort(&sort_path,
+ root,
+ pathkeys,
+ subpath->disabled_nodes,
+ subpath->total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ apath->limit_tuples);
+ }
+
subpath = &sort_path;
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 4ad30b7627e..4a18b08e3a4 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1318,6 +1318,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int presorted_keys;
/*
* Compute sort column info, and adjust subplan's tlist as needed.
@@ -1353,14 +1354,38 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (!pathkeys_count_contained_in(pathkeys, subpath->pathkeys,
+ &presorted_keys))
{
- Sort *sort = make_sort(subplan, numsortkeys,
+ Plan *sort_plan;
+
+ /*
+ * We choose to use incremental sort if it is enabled and
+ * there are presorted keys; otherwise we use full sort.
+ */
+ if (enable_incremental_sort && presorted_keys > 0)
+ {
+ sort_plan = (Plan *)
+ make_incrementalsort(subplan, numsortkeys, presorted_keys,
sortColIdx, sortOperators,
collations, nullsFirst);
- label_sort_with_costsize(root, sort, best_path->limit_tuples);
- subplan = (Plan *) sort;
+ label_incrementalsort_with_costsize(root,
+ (IncrementalSort *) sort_plan,
+ pathkeys,
+ best_path->limit_tuples);
+ }
+ else
+ {
+ sort_plan = (Plan *) make_sort(subplan, numsortkeys,
+ sortColIdx, sortOperators,
+ collations, nullsFirst);
+
+ label_sort_with_costsize(root, (Sort *) sort_plan,
+ best_path->limit_tuples);
+ }
+
+ subplan = sort_plan;
}
}
@@ -1491,6 +1516,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int presorted_keys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
@@ -1525,14 +1551,38 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (!pathkeys_count_contained_in(pathkeys, subpath->pathkeys,
+ &presorted_keys))
{
- Sort *sort = make_sort(subplan, numsortkeys,
+ Plan *sort_plan;
+
+ /*
+ * We choose to use incremental sort if it is enabled and there
+ * are presorted keys; otherwise we use full sort.
+ */
+ if (enable_incremental_sort && presorted_keys > 0)
+ {
+ sort_plan = (Plan *)
+ make_incrementalsort(subplan, numsortkeys, presorted_keys,
sortColIdx, sortOperators,
collations, nullsFirst);
- label_sort_with_costsize(root, sort, best_path->limit_tuples);
- subplan = (Plan *) sort;
+ label_incrementalsort_with_costsize(root,
+ (IncrementalSort *) sort_plan,
+ pathkeys,
+ best_path->limit_tuples);
+ }
+ else
+ {
+ sort_plan = (Plan *) make_sort(subplan, numsortkeys,
+ sortColIdx, sortOperators,
+ collations, nullsFirst);
+
+ label_sort_with_costsize(root, (Sort *) sort_plan,
+ best_path->limit_tuples);
+ }
+
+ subplan = sort_plan;
}
subplans = lappend(subplans, subplan);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e0192d4a491..9cc602788ea 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1404,12 +1404,12 @@ create_append_path(PlannerInfo *root,
pathnode->path.total_cost = child->total_cost;
}
else
- cost_append(pathnode);
+ cost_append(pathnode, root);
/* Must do this last, else cost_append complains */
pathnode->path.pathkeys = child->pathkeys;
}
else
- cost_append(pathnode);
+ cost_append(pathnode, root);
/* If the caller provided a row estimate, override the computed value. */
if (rows >= 0)
@@ -1515,6 +1515,9 @@ create_merge_append_path(PlannerInfo *root,
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int presorted_keys;
+ Path sort_path; /* dummy for result of
+ * cost_sort/cost_incremental_sort */
/* All child paths should be unparameterized */
Assert(bms_is_empty(PATH_REQ_OUTER(subpath)));
@@ -1523,32 +1526,52 @@ create_merge_append_path(PlannerInfo *root,
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (!pathkeys_count_contained_in(pathkeys, subpath->pathkeys,
+ &presorted_keys))
{
- /* Subpath is adequately ordered, we won't need to sort it */
- input_disabled_nodes += subpath->disabled_nodes;
- input_startup_cost += subpath->startup_cost;
- input_total_cost += subpath->total_cost;
- }
- else
- {
- /* We'll need to insert a Sort node, so include cost for that */
- Path sort_path; /* dummy for result of cost_sort */
+ /*
+ * We'll need to insert a Sort node, so include costs for that. We
+ * choose to use incremental sort if it is enabled and there are
+ * presorted keys; otherwise we use full sort.
+ *
+ * We can use the parent's LIMIT if any, since we certainly won't
+ * pull more than that many tuples from any child.
+ */
+ if (enable_incremental_sort && presorted_keys > 0)
+ {
+ cost_incremental_sort(&sort_path,
+ root,
+ pathkeys,
+ presorted_keys,
+ subpath->disabled_nodes,
+ subpath->startup_cost,
+ subpath->total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ pathnode->limit_tuples);
+ }
+ else
+ {
+ cost_sort(&sort_path,
+ root,
+ pathkeys,
+ subpath->disabled_nodes,
+ subpath->total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ pathnode->limit_tuples);
+ }
- cost_sort(&sort_path,
- root,
- pathkeys,
- subpath->disabled_nodes,
- subpath->total_cost,
- subpath->rows,
- subpath->pathtarget->width,
- 0.0,
- work_mem,
- pathnode->limit_tuples);
- input_disabled_nodes += sort_path.disabled_nodes;
- input_startup_cost += sort_path.startup_cost;
- input_total_cost += sort_path.total_cost;
+ subpath = &sort_path;
}
+
+ input_disabled_nodes += subpath->disabled_nodes;
+ input_startup_cost += subpath->startup_cost;
+ input_total_cost += subpath->total_cost;
}
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d397fe27dc1..b523bcda8f3 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -118,7 +118,7 @@ extern void cost_incremental_sort(Path *path,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
-extern void cost_append(AppendPath *apath);
+extern void cost_append(AppendPath *apath, PlannerInfo *root);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
int input_disabled_nodes,
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index b00219643b9..5a1dd9fc022 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -1722,3 +1722,43 @@ order by t1.four, t1.two limit 1;
-> Seq Scan on tenk1 t2
(12 rows)
+--
+-- Test incremental sort for Append/MergeAppend
+--
+create table prt_tbl (a int, b int) partition by range (a);
+create table prt_tbl_1 partition of prt_tbl for values from (0) to (100);
+create table prt_tbl_2 partition of prt_tbl for values from (100) to (200);
+insert into prt_tbl select i%200, i from generate_series(1,1000)i;
+create index on prt_tbl_1(a);
+create index on prt_tbl_2(a, b);
+analyze prt_tbl;
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+-- Ensure we get an incremental sort for the subpath of Append
+explain (costs off) select * from prt_tbl order by a, b;
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ -> Incremental Sort
+ Sort Key: prt_tbl_1.a, prt_tbl_1.b
+ Presorted Key: prt_tbl_1.a
+ -> Index Scan using prt_tbl_1_a_idx on prt_tbl_1
+ -> Index Only Scan using prt_tbl_2_a_b_idx on prt_tbl_2
+(6 rows)
+
+-- Ensure we get an incremental sort for the subpath of MergeAppend
+explain (costs off) select * from prt_tbl_1 union all select * from prt_tbl_2 order by a, b;
+ QUERY PLAN
+------------------------------------------------------------
+ Merge Append
+ Sort Key: prt_tbl_1.a, prt_tbl_1.b
+ -> Incremental Sort
+ Sort Key: prt_tbl_1.a, prt_tbl_1.b
+ Presorted Key: prt_tbl_1.a
+ -> Index Scan using prt_tbl_1_a_idx on prt_tbl_1
+ -> Index Only Scan using prt_tbl_2_a_b_idx on prt_tbl_2
+(7 rows)
+
+reset enable_bitmapscan;
+reset enable_seqscan;
+drop table prt_tbl;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index f9b0c415cfd..2045daf9778 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1898,10 +1898,11 @@ ORDER BY thousand, tenthous;
Merge Append
Sort Key: tenk1.thousand, tenk1.tenthous
-> Index Only Scan using tenk1_thous_tenthous on tenk1
- -> Sort
+ -> Incremental Sort
Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
-> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
-(6 rows)
+(7 rows)
explain (costs off)
SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
@@ -1982,10 +1983,11 @@ ORDER BY x, y;
Merge Append
Sort Key: a.thousand, a.tenthous
-> Index Only Scan using tenk1_thous_tenthous on tenk1 a
- -> Sort
+ -> Incremental Sort
Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
-> Index Only Scan using tenk1_unique2 on tenk1 b
-(6 rows)
+(7 rows)
-- exercise rescan code path via a repeatedly-evaluated subquery
explain (costs off)
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
index f1f8fae5654..bbe658a7588 100644
--- a/src/test/regress/sql/incremental_sort.sql
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -298,3 +298,27 @@ explain (costs off)
select * from
(select * from tenk1 order by four) t1 join tenk1 t2 on t1.four = t2.four and t1.two = t2.two
order by t1.four, t1.two limit 1;
+
+--
+-- Test incremental sort for Append/MergeAppend
+--
+create table prt_tbl (a int, b int) partition by range (a);
+create table prt_tbl_1 partition of prt_tbl for values from (0) to (100);
+create table prt_tbl_2 partition of prt_tbl for values from (100) to (200);
+insert into prt_tbl select i%200, i from generate_series(1,1000)i;
+create index on prt_tbl_1(a);
+create index on prt_tbl_2(a, b);
+analyze prt_tbl;
+
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+
+-- Ensure we get an incremental sort for the subpath of Append
+explain (costs off) select * from prt_tbl order by a, b;
+
+-- Ensure we get an incremental sort for the subpath of MergeAppend
+explain (costs off) select * from prt_tbl_1 union all select * from prt_tbl_2 order by a, b;
+
+reset enable_bitmapscan;
+reset enable_seqscan;
+drop table prt_tbl;
--
2.43.0
On 12/5/2025 11:29, Richard Guo wrote:
For ordered Append or MergeAppend, it seems that incremental sort is
currently not considered when injecting an explicit sort into subpaths
that are not sufficiently ordered. For instance:
Thanks for doing this job.
I have reviewed your patch and want to put here some thoughts:
0. The patch looks simple enough to be safe. I passed through the code
and found no issues except comments (see thought No.1). I will be okay
if you commit it.
1. I'm not very happy with the fact that it strengthens the cost_append
connection with create_append_plan. At least, there should be
cross-reference comments to let developers know if they change something
inside one of these functions.
2. IncrementalSort is not always more effective - it depends on the
column's number of groups. In my experience, a non-cost-based decision
one day meets the problematic case, and the people who stick with it are
much more confused than in the case when planner decision connected to
the costings - they trust the cost model or the cost model tuned by GUCs.
3. The functions label_incrementalsort_with_costsize and
label_sort_with_costsize are not ideal architectural decisions.
Attempting to improve sort / incremental sort cost functions, I am
always stuck in the absence of some necessary data from the sorting path
and RelOptInfo at this stage.
As an alternative, you may check the approach of [1]/messages/by-id/25d6a2cd161673d51373b7e07e6d9dd6@postgrespro.ru, where we decide
how to adjust a subpath to MergeAppend needs inside
generate_orderedappend_paths using a cost-based approach.
Also, would you have a chance to look into the [1,2]? It seems like a
further improvement, bringing a bit closer optimality of appended path
choice to single-table scan choice.
[1]: /messages/by-id/25d6a2cd161673d51373b7e07e6d9dd6@postgrespro.ru
/messages/by-id/25d6a2cd161673d51373b7e07e6d9dd6@postgrespro.ru
[2]: /messages/by-id/f0206ef2-6b5a-4d07-8770-cfa7cd30f685@gmail.com
/messages/by-id/f0206ef2-6b5a-4d07-8770-cfa7cd30f685@gmail.com
--
regards, Andrei Lepikhov
On Thu, May 15, 2025 at 9:03 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
2. IncrementalSort is not always more effective - it depends on the
column's number of groups. In my experience, a non-cost-based decision
one day meets the problematic case, and the people who stick with it are
much more confused than in the case when planner decision connected to
the costings - they trust the cost model or the cost model tuned by GUCs.
I wonder if this could be fixed in nodeIncrementalSort.c. I think it's
a problem to rely on planner estimates because planner estimates of
the # of groups are quite unreliable. But at runtime, we can notice
whether the groups are small or large and adjust the algorithm
accordingly. In fact, it looks like we already have some logic for
that:
/*
* Sorting many small groups with tuplesort is inefficient. In order to
* cope with this problem we don't start a new group until the current one
* contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
* means we can't assume small groups of tuples all have the same prefix keys.)
* When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
* for the new group as soon as we've met our bound to avoid fetching more
* tuples than we absolutely have to fetch.
*/
#define DEFAULT_MIN_GROUP_SIZE 32
But maybe this logic needs to be further refined somehow. I can't
shake the feeling that it's going to be really hard to have every
place that uses incremental sort decide whether to use an incremental
sort or a regular sort -- we should try to get to a place where it's
safe to use an incremental sort when it's possible without worrying
that it might actually be slower.
Or maybe that's not possible for some reason, but it is not clear to
me what that reason might be.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, May 19, 2025 at 10:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, May 15, 2025 at 9:03 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
2. IncrementalSort is not always more effective - it depends on the
column's number of groups. In my experience, a non-cost-based decision
one day meets the problematic case, and the people who stick with it are
much more confused than in the case when planner decision connected to
the costings - they trust the cost model or the cost model tuned by GUCs.
I wonder if this could be fixed in nodeIncrementalSort.c. I think it's
a problem to rely on planner estimates because planner estimates of
the # of groups are quite unreliable. But at runtime, we can notice
whether the groups are small or large and adjust the algorithm
accordingly. In fact, it looks like we already have some logic for
that:/*
* Sorting many small groups with tuplesort is inefficient. In order to
* cope with this problem we don't start a new group until the current one
* contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
* means we can't assume small groups of tuples all have the same prefix keys.)
* When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
* for the new group as soon as we've met our bound to avoid fetching more
* tuples than we absolutely have to fetch.
*/
#define DEFAULT_MIN_GROUP_SIZE 32But maybe this logic needs to be further refined somehow. I can't
shake the feeling that it's going to be really hard to have every
place that uses incremental sort decide whether to use an incremental
sort or a regular sort -- we should try to get to a place where it's
safe to use an incremental sort when it's possible without worrying
that it might actually be slower.
Agreed. In some cases, we currently don't have the infrastructure to
consider both incremental sort and full sort and compare their costs —
for example, when inserting explicit sorts for mergejoins, or, as in
this case, for Append/MergeAppend.
Besides, I think the planner currently assumes that incremental sort
is always faster than full sort when there are presorted keys, and
this premise has been applied in various parts of the code. I checked
all the callers of create_incremental_sort_path, and they all follow
the logic that if there are presorted keys, only incremental sort is
considered.
Also, to understand how incremental sort performs in the worst case, I
ran the following test.
create table ab (a int not null, b int not null);
insert into ab select 0,x from generate_Series(1,999000)x union all
select x%1000+1,0 from generate_Series(999001,1000000)x;
create index on ab (a);
vacuum analyze ab;
In this table, group 0 has 999000 rows, and the remaining groups
1-1000 have just 1 row each.
-- incremental sort
explain (analyze, costs off) select * from ab order by a,b;
QUERY PLAN
------------------------------------------------------------------------------------------------
Incremental Sort (actual time=585.093..714.589 rows=1000000.00 loops=1)
Sort Key: a, b
Presorted Key: a
Full-sort Groups: 33 Sort Method: quicksort Average Memory: 26kB
Peak Memory: 26kB
Pre-sorted Groups: 1 Sort Method: external merge Average Disk:
17680kB Peak Disk: 17680kB
Buffers: shared hit=5273, temp read=2210 written=2222
-> Index Scan using ab_a_idx on ab (actual time=0.192..330.289
rows=1000000.00 loops=1)
Index Searches: 1
Buffers: shared hit=5273
Planning Time: 0.354 ms
Execution Time: 759.693 ms
(11 rows)
-- full sort
explain (analyze, costs off) select * from ab order by a,b;
QUERY PLAN
--------------------------------------------------------------------------------------------
Sort (actual time=570.755..831.825 rows=1000000.00 loops=1)
Sort Key: a, b
Sort Method: external merge Disk: 17704kB
Buffers: shared hit=5273, temp read=2213 written=2225
-> Index Scan using ab_a_idx on ab (actual time=0.187..327.701
rows=1000000.00 loops=1)
Index Searches: 1
Buffers: shared hit=5273
Planning Time: 0.304 ms
Execution Time: 877.112 ms
(9 rows)
So with the same subpath, incremental sort still outperforms full sort
even if there is a big skew in the number of rows per pre-sorted group
(which is a bit surprising to me).
So, I think it's generally safe to use incremental sort whenever
possible. There might be some corner cases where incremental sort is
slower than full sort, and I think it would be best to address those
in nodeIncrementalSort.c, as Robert suggested.
Thanks
Richard
On Thu, May 22, 2025 at 6:07 PM Richard Guo <guofenglinux@gmail.com> wrote:
So, I think it's generally safe to use incremental sort whenever
possible. There might be some corner cases where incremental sort is
slower than full sort, and I think it would be best to address those
in nodeIncrementalSort.c, as Robert suggested.
Here's the latest rebase. I'm planning to push it soon, barring any
objections.
Thanks
Richard
Attachments:
v2-0001-Consider-explicit-incremental-sort-for-Append-and.patchapplication/octet-stream; name=v2-0001-Consider-explicit-incremental-sort-for-Append-and.patchDownload
From dfd92b209db122ce6f16dc97cd25919643b3aa8e Mon Sep 17 00:00:00 2001
From: Richard Guo <guofenglinux@gmail.com>
Date: Fri, 9 May 2025 16:44:20 +0900
Subject: [PATCH v2] Consider explicit incremental sort for Append and
MergeAppend
For an ordered Append or MergeAppend, we need to inject an explicit
sort into any subpath that is not already well enough ordered.
Currently, only explicit full sorts are considered; incremental sorts
are not yet taken into account.
In this patch, for subpaths of an ordered Append or MergeAppend, we
choose to use explicit incremental sort if it is enabled and there are
presorted keys.
The rationale is based on the assumption that incremental sort is
always faster than full sort when there are presorted keys, a premise
that has been applied in various parts of the code. In addition, the
current cost model tends to favor incremental sort as being cheaper
than full sort in the presence of presorted keys, making it reasonable
not to consider full sort in such cases.
No backpatch as this could result in plan changes.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4_V7a2enTR+T3pOY_YZ-FU8ZsFYym2swOz4jNMqmSgyuw@mail.gmail.com
---
src/backend/optimizer/path/costsize.c | 54 ++++++++++----
src/backend/optimizer/plan/createplan.c | 66 +++++++++++++++--
src/backend/optimizer/util/pathnode.c | 73 ++++++++++++-------
src/include/optimizer/cost.h | 2 +-
.../regress/expected/incremental_sort.out | 40 ++++++++++
src/test/regress/expected/inherit.out | 10 ++-
src/test/regress/sql/incremental_sort.sql | 24 ++++++
7 files changed, 217 insertions(+), 52 deletions(-)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 3d44815ed5a..1f04a2c182c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2247,7 +2247,7 @@ append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
* Determines and returns the cost of an Append node.
*/
void
-cost_append(AppendPath *apath)
+cost_append(AppendPath *apath, PlannerInfo *root)
{
ListCell *l;
@@ -2309,26 +2309,52 @@ cost_append(AppendPath *apath)
foreach(l, apath->subpaths)
{
Path *subpath = (Path *) lfirst(l);
- Path sort_path; /* dummy for result of cost_sort */
+ int presorted_keys;
+ Path sort_path; /* dummy for result of
+ * cost_sort/cost_incremental_sort */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (!pathkeys_count_contained_in(pathkeys, subpath->pathkeys,
+ &presorted_keys))
{
/*
* We'll need to insert a Sort node, so include costs for
- * that. We can use the parent's LIMIT if any, since we
+ * that. We choose to use incremental sort if it is
+ * enabled and there are presorted keys; otherwise we use
+ * full sort.
+ *
+ * We can use the parent's LIMIT if any, since we
* certainly won't pull more than that many tuples from
* any child.
*/
- cost_sort(&sort_path,
- NULL, /* doesn't currently need root */
- pathkeys,
- subpath->disabled_nodes,
- subpath->total_cost,
- subpath->rows,
- subpath->pathtarget->width,
- 0.0,
- work_mem,
- apath->limit_tuples);
+ if (enable_incremental_sort && presorted_keys > 0)
+ {
+ cost_incremental_sort(&sort_path,
+ root,
+ pathkeys,
+ presorted_keys,
+ subpath->disabled_nodes,
+ subpath->startup_cost,
+ subpath->total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ apath->limit_tuples);
+ }
+ else
+ {
+ cost_sort(&sort_path,
+ root,
+ pathkeys,
+ subpath->disabled_nodes,
+ subpath->total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ apath->limit_tuples);
+ }
+
subpath = &sort_path;
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0b61aef962c..8a9f1d7a943 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1318,6 +1318,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int presorted_keys;
/*
* Compute sort column info, and adjust subplan's tlist as needed.
@@ -1353,14 +1354,38 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (!pathkeys_count_contained_in(pathkeys, subpath->pathkeys,
+ &presorted_keys))
{
- Sort *sort = make_sort(subplan, numsortkeys,
+ Plan *sort_plan;
+
+ /*
+ * We choose to use incremental sort if it is enabled and
+ * there are presorted keys; otherwise we use full sort.
+ */
+ if (enable_incremental_sort && presorted_keys > 0)
+ {
+ sort_plan = (Plan *)
+ make_incrementalsort(subplan, numsortkeys, presorted_keys,
sortColIdx, sortOperators,
collations, nullsFirst);
- label_sort_with_costsize(root, sort, best_path->limit_tuples);
- subplan = (Plan *) sort;
+ label_incrementalsort_with_costsize(root,
+ (IncrementalSort *) sort_plan,
+ pathkeys,
+ best_path->limit_tuples);
+ }
+ else
+ {
+ sort_plan = (Plan *) make_sort(subplan, numsortkeys,
+ sortColIdx, sortOperators,
+ collations, nullsFirst);
+
+ label_sort_with_costsize(root, (Sort *) sort_plan,
+ best_path->limit_tuples);
+ }
+
+ subplan = sort_plan;
}
}
@@ -1491,6 +1516,7 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
Oid *sortOperators;
Oid *collations;
bool *nullsFirst;
+ int presorted_keys;
/* Build the child plan */
/* Must insist that all children return the same tlist */
@@ -1525,14 +1551,38 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
numsortkeys * sizeof(bool)) == 0);
/* Now, insert a Sort node if subplan isn't sufficiently ordered */
- if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (!pathkeys_count_contained_in(pathkeys, subpath->pathkeys,
+ &presorted_keys))
{
- Sort *sort = make_sort(subplan, numsortkeys,
+ Plan *sort_plan;
+
+ /*
+ * We choose to use incremental sort if it is enabled and there
+ * are presorted keys; otherwise we use full sort.
+ */
+ if (enable_incremental_sort && presorted_keys > 0)
+ {
+ sort_plan = (Plan *)
+ make_incrementalsort(subplan, numsortkeys, presorted_keys,
sortColIdx, sortOperators,
collations, nullsFirst);
- label_sort_with_costsize(root, sort, best_path->limit_tuples);
- subplan = (Plan *) sort;
+ label_incrementalsort_with_costsize(root,
+ (IncrementalSort *) sort_plan,
+ pathkeys,
+ best_path->limit_tuples);
+ }
+ else
+ {
+ sort_plan = (Plan *) make_sort(subplan, numsortkeys,
+ sortColIdx, sortOperators,
+ collations, nullsFirst);
+
+ label_sort_with_costsize(root, (Sort *) sort_plan,
+ best_path->limit_tuples);
+ }
+
+ subplan = sort_plan;
}
subplans = lappend(subplans, subplan);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e0192d4a491..9cc602788ea 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1404,12 +1404,12 @@ create_append_path(PlannerInfo *root,
pathnode->path.total_cost = child->total_cost;
}
else
- cost_append(pathnode);
+ cost_append(pathnode, root);
/* Must do this last, else cost_append complains */
pathnode->path.pathkeys = child->pathkeys;
}
else
- cost_append(pathnode);
+ cost_append(pathnode, root);
/* If the caller provided a row estimate, override the computed value. */
if (rows >= 0)
@@ -1515,6 +1515,9 @@ create_merge_append_path(PlannerInfo *root,
foreach(l, subpaths)
{
Path *subpath = (Path *) lfirst(l);
+ int presorted_keys;
+ Path sort_path; /* dummy for result of
+ * cost_sort/cost_incremental_sort */
/* All child paths should be unparameterized */
Assert(bms_is_empty(PATH_REQ_OUTER(subpath)));
@@ -1523,32 +1526,52 @@ create_merge_append_path(PlannerInfo *root,
pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
subpath->parallel_safe;
- if (pathkeys_contained_in(pathkeys, subpath->pathkeys))
+ if (!pathkeys_count_contained_in(pathkeys, subpath->pathkeys,
+ &presorted_keys))
{
- /* Subpath is adequately ordered, we won't need to sort it */
- input_disabled_nodes += subpath->disabled_nodes;
- input_startup_cost += subpath->startup_cost;
- input_total_cost += subpath->total_cost;
- }
- else
- {
- /* We'll need to insert a Sort node, so include cost for that */
- Path sort_path; /* dummy for result of cost_sort */
+ /*
+ * We'll need to insert a Sort node, so include costs for that. We
+ * choose to use incremental sort if it is enabled and there are
+ * presorted keys; otherwise we use full sort.
+ *
+ * We can use the parent's LIMIT if any, since we certainly won't
+ * pull more than that many tuples from any child.
+ */
+ if (enable_incremental_sort && presorted_keys > 0)
+ {
+ cost_incremental_sort(&sort_path,
+ root,
+ pathkeys,
+ presorted_keys,
+ subpath->disabled_nodes,
+ subpath->startup_cost,
+ subpath->total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ pathnode->limit_tuples);
+ }
+ else
+ {
+ cost_sort(&sort_path,
+ root,
+ pathkeys,
+ subpath->disabled_nodes,
+ subpath->total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ pathnode->limit_tuples);
+ }
- cost_sort(&sort_path,
- root,
- pathkeys,
- subpath->disabled_nodes,
- subpath->total_cost,
- subpath->rows,
- subpath->pathtarget->width,
- 0.0,
- work_mem,
- pathnode->limit_tuples);
- input_disabled_nodes += sort_path.disabled_nodes;
- input_startup_cost += sort_path.startup_cost;
- input_total_cost += sort_path.total_cost;
+ subpath = &sort_path;
}
+
+ input_disabled_nodes += subpath->disabled_nodes;
+ input_startup_cost += subpath->startup_cost;
+ input_total_cost += subpath->total_cost;
}
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index d397fe27dc1..b523bcda8f3 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -118,7 +118,7 @@ extern void cost_incremental_sort(Path *path,
Cost input_startup_cost, Cost input_total_cost,
double input_tuples, int width, Cost comparison_cost, int sort_mem,
double limit_tuples);
-extern void cost_append(AppendPath *apath);
+extern void cost_append(AppendPath *apath, PlannerInfo *root);
extern void cost_merge_append(Path *path, PlannerInfo *root,
List *pathkeys, int n_streams,
int input_disabled_nodes,
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index b00219643b9..5a1dd9fc022 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -1722,3 +1722,43 @@ order by t1.four, t1.two limit 1;
-> Seq Scan on tenk1 t2
(12 rows)
+--
+-- Test incremental sort for Append/MergeAppend
+--
+create table prt_tbl (a int, b int) partition by range (a);
+create table prt_tbl_1 partition of prt_tbl for values from (0) to (100);
+create table prt_tbl_2 partition of prt_tbl for values from (100) to (200);
+insert into prt_tbl select i%200, i from generate_series(1,1000)i;
+create index on prt_tbl_1(a);
+create index on prt_tbl_2(a, b);
+analyze prt_tbl;
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+-- Ensure we get an incremental sort for the subpath of Append
+explain (costs off) select * from prt_tbl order by a, b;
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ -> Incremental Sort
+ Sort Key: prt_tbl_1.a, prt_tbl_1.b
+ Presorted Key: prt_tbl_1.a
+ -> Index Scan using prt_tbl_1_a_idx on prt_tbl_1
+ -> Index Only Scan using prt_tbl_2_a_b_idx on prt_tbl_2
+(6 rows)
+
+-- Ensure we get an incremental sort for the subpath of MergeAppend
+explain (costs off) select * from prt_tbl_1 union all select * from prt_tbl_2 order by a, b;
+ QUERY PLAN
+------------------------------------------------------------
+ Merge Append
+ Sort Key: prt_tbl_1.a, prt_tbl_1.b
+ -> Incremental Sort
+ Sort Key: prt_tbl_1.a, prt_tbl_1.b
+ Presorted Key: prt_tbl_1.a
+ -> Index Scan using prt_tbl_1_a_idx on prt_tbl_1
+ -> Index Only Scan using prt_tbl_2_a_b_idx on prt_tbl_2
+(7 rows)
+
+reset enable_bitmapscan;
+reset enable_seqscan;
+drop table prt_tbl;
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 78dead65325..5b5055babdc 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1898,10 +1898,11 @@ ORDER BY thousand, tenthous;
Merge Append
Sort Key: tenk1.thousand, tenk1.tenthous
-> Index Only Scan using tenk1_thous_tenthous on tenk1
- -> Sort
+ -> Incremental Sort
Sort Key: tenk1_1.thousand, tenk1_1.thousand
+ Presorted Key: tenk1_1.thousand
-> Index Only Scan using tenk1_thous_tenthous on tenk1 tenk1_1
-(6 rows)
+(7 rows)
explain (costs off)
SELECT thousand, tenthous, thousand+tenthous AS x FROM tenk1
@@ -1982,10 +1983,11 @@ ORDER BY x, y;
Merge Append
Sort Key: a.thousand, a.tenthous
-> Index Only Scan using tenk1_thous_tenthous on tenk1 a
- -> Sort
+ -> Incremental Sort
Sort Key: b.unique2, b.unique2
+ Presorted Key: b.unique2
-> Index Only Scan using tenk1_unique2 on tenk1 b
-(6 rows)
+(7 rows)
-- exercise rescan code path via a repeatedly-evaluated subquery
explain (costs off)
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
index f1f8fae5654..bbe658a7588 100644
--- a/src/test/regress/sql/incremental_sort.sql
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -298,3 +298,27 @@ explain (costs off)
select * from
(select * from tenk1 order by four) t1 join tenk1 t2 on t1.four = t2.four and t1.two = t2.two
order by t1.four, t1.two limit 1;
+
+--
+-- Test incremental sort for Append/MergeAppend
+--
+create table prt_tbl (a int, b int) partition by range (a);
+create table prt_tbl_1 partition of prt_tbl for values from (0) to (100);
+create table prt_tbl_2 partition of prt_tbl for values from (100) to (200);
+insert into prt_tbl select i%200, i from generate_series(1,1000)i;
+create index on prt_tbl_1(a);
+create index on prt_tbl_2(a, b);
+analyze prt_tbl;
+
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+
+-- Ensure we get an incremental sort for the subpath of Append
+explain (costs off) select * from prt_tbl order by a, b;
+
+-- Ensure we get an incremental sort for the subpath of MergeAppend
+explain (costs off) select * from prt_tbl_1 union all select * from prt_tbl_2 order by a, b;
+
+reset enable_bitmapscan;
+reset enable_seqscan;
+drop table prt_tbl;
--
2.43.0
On 19/5/2025 15:21, Robert Haas wrote:
On Thu, May 15, 2025 at 9:03 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
2. IncrementalSort is not always more effective - it depends on the
column's number of groups. In my experience, a non-cost-based decision
one day meets the problematic case, and the people who stick with it are
much more confused than in the case when planner decision connected to
the costings - they trust the cost model or the cost model tuned by GUCs.I wonder if this could be fixed in nodeIncrementalSort.c. I think it's
a problem to rely on planner estimates because planner estimates of
the # of groups are quite unreliable. But at runtime, we can notice
whether the groups are small or large and adjust the algorithm
accordingly. In fact, it looks like we already have some logic for
that:
Yes, the cost estimation issue is the source of my concern. Postgres
estimation of the number of groups is still not ideal [1]/messages/by-id/ba0edc53-4b1f-4c67-92d1-29aeddb36a18@gmail.com. The cost sort
model doesn't take into account the number of columns [2]/messages/by-id/8742aaa8-9519-4a1f-91bd-364aec65f5cf@gmail.com - perhaps
something else.
Therefore, if IncrementalSort performs worse in some cases, this change
may result in an increase in user reports.
However, comparing Sort and IncrementalSort (see the attachment), I
recall that IncrementalSort surpasses plain Sort in both corner cases:
tiny groups (beating Sort in terms of memory consumption and smaller
sorting sets) and massive groups (abbreviated keys reduce the number of
compared columns). And it is not easy to invent the worst case where the
Sort node wins.
The only aspect I can't stop thinking about is the cost balance:
sometimes the planner prefers Sort+SeqScan to IncrementalSort+IndexScan
(See explains in the attachment). A non-cost-based choice of
IncrementalSort may result in an Append path with a higher cost, which
can displace it from the path list and lead to another, less effective
choice, such as a less effective IndexScan.
However, I can't support this opinion with any real examples.
In toto, IMO it makes sense to commit this feature now and see how it
behaves.
[1]: /messages/by-id/ba0edc53-4b1f-4c67-92d1-29aeddb36a18@gmail.com
/messages/by-id/ba0edc53-4b1f-4c67-92d1-29aeddb36a18@gmail.com
[2]: /messages/by-id/8742aaa8-9519-4a1f-91bd-364aec65f5cf@gmail.com
/messages/by-id/8742aaa8-9519-4a1f-91bd-364aec65f5cf@gmail.com
--
regards, Andrei Lepikhov