Why don't we consider explicit Incremental Sort?
While looking at label_sort_with_costsize() due to another issue, I
noticed that we build explicit Sort nodes but not explicit Incremental
Sort nodes. I wonder why this is the case. It seems to me that
Incremental Sorts are preferable in some cases. For example:
create table t (a int, b int);
create index on t (a);
set enable_seqscan to off;
explain (costs off)
select * from t t1 join t t2 on t1.a = t2.a and t1.b = t2.b;
QUERY PLAN
-------------------------------------------------
Merge Join
Merge Cond: ((t1.a = t2.a) AND (t1.b = t2.b))
-> Sort
Sort Key: t1.a, t1.b
-> Index Scan using t_a_idx on t t1
-> Sort
Sort Key: t2.a, t2.b
-> Index Scan using t_a_idx on t t2
(8 rows)
For the outer path of a mergejoin, I think we can leverage Incremental
Sort to save effort. For the inner path, though, we cannot do this
because Incremental Sort does not support mark/restore.
It could be argued that what if a mergejoin with an Incremental Sort
as the outer path is selected as the inner path of another mergejoin?
Well, I don't think this is a problem, because mergejoin itself does
not support mark/restore either, and we will add a Material node on
top of it anyway in this case (see final_cost_mergejoin).
label_sort_with_costsize is also called in create_append_plan,
create_merge_append_plan and create_unique_plan. In these cases, we
may also consider using Incremental Sort. But I haven't looked into
these cases.
Any thoughts?
Thanks
Richard
Hi,
On 9/9/24 11:39, Richard Guo wrote:
While looking at label_sort_with_costsize() due to another issue, I
noticed that we build explicit Sort nodes but not explicit Incremental
Sort nodes. I wonder why this is the case. It seems to me that
Incremental Sorts are preferable in some cases. For example:
I think we intentionally added incremental sort ... incrementally ;-)
I don't recall the reasoning exactly, but I think we realized the
incremental sort costing can be a bit shaky (and AFAIK we saw a couple
cases of that reported). So we added it to places where the reasoning
was it wouldn't be a problem and the incremental sort would be a clear
win, e.g. thanks to the "cheap startup" thing.
create table t (a int, b int);
create index on t (a);
set enable_seqscan to off;
explain (costs off)
select * from t t1 join t t2 on t1.a = t2.a and t1.b = t2.b;
QUERY PLAN
-------------------------------------------------
Merge Join
Merge Cond: ((t1.a = t2.a) AND (t1.b = t2.b))
-> Sort
Sort Key: t1.a, t1.b
-> Index Scan using t_a_idx on t t1
-> Sort
Sort Key: t2.a, t2.b
-> Index Scan using t_a_idx on t t2
(8 rows)

For the outer path of a mergejoin, I think we can leverage Incremental
Sort to save effort. For the inner path, though, we cannot do this
because Incremental Sort does not support mark/restore.

It could be argued that what if a mergejoin with an Incremental Sort
as the outer path is selected as the inner path of another mergejoin?
Well, I don't think this is a problem, because mergejoin itself does
not support mark/restore either, and we will add a Material node on
top of it anyway in this case (see final_cost_mergejoin).

label_sort_with_costsize is also called in create_append_plan,
create_merge_append_plan and create_unique_plan. In these cases, we
may also consider using Incremental Sort. But I haven't looked into
these cases.

Any thoughts?
I think one challenge with this case is that create_mergejoin_plan
creates these Sort plans explicitly - it's not "pathified" so it doesn't
go through the usual cost comparison etc. And it certainly is not the
case that incremental sort would win always, so we'd need to replicate
the cost comparison logic too.
I have not thought about this particular case too much, but how likely
is it that mergejoin will win for this plan in practice? If we consider
a query that only needs a fraction of rows (say, thanks to LIMIT),
aren't we likely to pick a nested loop (which can do incremental sort
for the outer plan)? For joins that need to run to completion it might
be a win, but there's also the higher risk of a poor costing.
I'm not saying it's not worth exploring, I'm just trying to recall reasons
why it might not work. Also I don't think it's fundamentally impossible
to do mark/restore for incremental sort - I just haven't tried doing it
because it wasn't necessary. In the worst case we could simply add a
Materialize node on top, no?
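For illustration, the mark/restore capability a Materialize node provides can be sketched as a small buffering wrapper over any row stream. This is only a toy Python model of the idea, not nodeMaterial.c; all names here are made up:

```python
class Materialize:
    """Toy mark/restore wrapper: buffers rows pulled from any iterator
    so a merge join could rewind its inner side to a marked position,
    even when the underlying node (e.g. an incremental sort) cannot."""

    def __init__(self, source):
        self.source = iter(source)
        self.buf = []       # all rows fetched from the source so far
        self.pos = 0        # current read position within buf
        self.marked = 0     # position saved by mark()

    def next(self):
        if self.pos == len(self.buf):
            self.buf.append(next(self.source))  # pull a fresh row
        row = self.buf[self.pos]
        self.pos += 1
        return row

    def mark(self):
        self.marked = self.pos

    def restore(self):
        self.pos = self.marked


# A merge join could wrap a non-mark/restore-capable input like this
# and replay a run of duplicate join keys from the buffer:
inner = Materialize(iter([1, 2, 2, 3]))
first = inner.next()   # 1
inner.mark()
dup1 = inner.next()    # 2
inner.restore()
dup2 = inner.next()    # 2 again, replayed from the buffer
```

The buffering obviously costs memory (or spill), which is exactly the trade-off the planner would have to cost against a plain full sort.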
regards
--
Tomas Vondra
On Mon, Sep 9, 2024 at 6:22 PM Tomas Vondra <tomas@vondra.me> wrote:
I think we intentionally added incremental sort ... incrementally ;-)
Haha, right.
I think one challenge with this case is that create_mergejoin_plan
creates these Sort plans explicitly - it's not "pathified" so it doesn't
go through the usual cost comparison etc. And it certainly is not the
case that incremental sort would win always, so we'd need to replicate
the cost comparison logic too.

I have not thought about this particular case too much, but how likely
is it that mergejoin will win for this plan in practice? If we consider
a query that only needs a fraction of rows (say, thanks to LIMIT),
aren't we likely to pick a nested loop (which can do incremental sort
for the outer plan)? For joins that need to run to completion it might
be a win, but there's also the higher risk of a poor costing.
Yeah, currently mergejoin path always assumes that full sort would be
used on top of the outer path and inner path if they are not already
ordered appropriately. So initial_cost_mergejoin directly charges the
cost of full sort into outer/inner path's cost, without going through
the usual cost comparison with incremental sort.
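Just to build intuition on why a presorted prefix can help here, a toy comparison-count model (purely illustrative; these are not the cost_sort/cost_incremental_sort formulas, and the per-group overhead constant is invented):

```python
import math


def full_sort_cost(n):
    # Toy model: one big comparison sort of n rows, ~ n * log2(n).
    return n * math.log2(max(n, 2))


def incremental_sort_cost(n, n_groups, per_group_overhead=50.0):
    # Toy model: with a presorted prefix the input splits into
    # n_groups groups sorted independently, each paying some fixed
    # startup overhead (batch setup, tuplesort reset, etc.).
    g = n / n_groups
    return n_groups * (g * math.log2(max(g, 2)) + per_group_overhead)


n = 1_000_000
# Many reasonably sized groups: the per-group sorts are much cheaper
# in total than one big sort.
assert incremental_sort_cost(n, 100) < full_sort_cost(n)
# Degenerate case, one row per group: the per-group overhead dominates
# and the incremental sort loses.
assert incremental_sort_cost(n, n) > full_sort_cost(n)
```

The crossover depends entirely on group sizes and the overhead constant, which is why a real cost comparison (rather than always assuming a full sort) matters.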
It seems to me that some parts of our code assume that, for a given
input path that is partially ordered, an incremental sort is always
preferable to a full sort (see commit 4a29eabd1). Am I getting this
correctly? If this is the case, then I think using the following
outer path for the merge join:
-> Incremental Sort
Sort Key: t1.a, t1.b
Presorted Key: t1.a
-> Index Scan using t1_a_idx on t1
... would be an immediate improvement over the current path, which is:
-> Sort
Sort Key: t1.a, t1.b
-> Index Scan using t1_a_idx on t1
The shaky cost estimates for incremental sort that you mentioned are
indeed a concern. Fortunately we have the enable_incremental_sort GUC
already. As in many other parts of the code (such as in the function
add_paths_to_grouping_rel), we can always disable incremental sort to
fall back to a full sort if needed.
I'm not saying it's not worth exploring, I'm just trying to recall reasons
why it might not work. Also I don't think it's fundamentally impossible
to do mark/restore for incremental sort - I just haven't tried doing it
because it wasn't necessary. In the worst case we could simply add a
Materialize node on top, no?
Yeah, perhaps we could support mark/restore for incremental sort
someday. This would allow us to apply incremental sort to the inner
path of a merge join too. But if we apply a Material node on top of
the inner due to the lack of mark/restore support for incremental
sort, then we will need to compare two paths:
partially sorted path -> incremental sort -> material
VS.
partially sorted path -> full sort
I think it's hard to tell which is cheaper without a cost comparison,
which we do not have for now.
To help with the discussion, I drafted a WIP patch as attached, which
chooses an incremental sort over a full sort on the given outer path
of a mergejoin whenever possible. There is one ensuing plan diff in
the regression tests (not included in the patch). It seems that some
query in the tests begins to use incremental sort for mergejoin.
Thanks
Richard
Attachments:
v1-0001-Consider-explicit-incremental-sort-for-mergejoins.patch (application/octet-stream)
From eeec8b83542a722876231ceb14e4f2fffb1fc001 Mon Sep 17 00:00:00 2001
From: Richard Guo <guofenglinux@gmail.com>
Date: Mon, 9 Sep 2024 19:57:53 +0900
Subject: [PATCH v1] Consider explicit incremental sort for mergejoins
---
src/backend/optimizer/path/costsize.c | 60 +++++++++++++++----
src/backend/optimizer/plan/createplan.c | 80 ++++++++++++++++++++++---
2 files changed, 120 insertions(+), 20 deletions(-)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index e1523d15df..210fe5e754 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -3532,7 +3532,8 @@ final_cost_nestloop(PlannerInfo *root, NestPath *path,
* join quals here, except for obtaining the scan selectivity estimate which
* is really essential (but fortunately, use of caching keeps the cost of
* getting that down to something reasonable).
- * We also assume that cost_sort is cheap enough to use here.
+ * We also assume that cost_sort/cost_incremental_sort is cheap enough to use
+ * here.
*
* 'workspace' is to be filled with startup_cost, total_cost, and perhaps
* other data to be used by final_cost_mergejoin
@@ -3569,7 +3570,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
outerendsel,
innerstartsel,
innerendsel;
- Path sort_path; /* dummy for result of cost_sort */
+ Path sort_path; /* dummy for result of
+ * cost_sort/cost_incremental_sort */
/* Protect some assumptions below that rowcounts aren't zero */
if (outer_path_rows <= 0)
@@ -3682,16 +3684,50 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
if (outersortkeys) /* do we need to sort outer? */
{
- cost_sort(&sort_path,
- root,
- outersortkeys,
- outer_path->disabled_nodes,
- outer_path->total_cost,
- outer_path_rows,
- outer_path->pathtarget->width,
- 0.0,
- work_mem,
- -1.0);
+ bool use_incremental_sort = false;
+ int presorted_keys;
+
+ if (enable_incremental_sort)
+ {
+ bool is_sorted PG_USED_FOR_ASSERTS_ONLY;
+
+ is_sorted = pathkeys_count_contained_in(outersortkeys,
+ outer_path->pathkeys,
+ &presorted_keys);
+ Assert(!is_sorted);
+
+ if (presorted_keys > 0)
+ use_incremental_sort = true;
+ }
+
+ if (!use_incremental_sort)
+ {
+ cost_sort(&sort_path,
+ root,
+ outersortkeys,
+ outer_path->disabled_nodes,
+ outer_path->total_cost,
+ outer_path_rows,
+ outer_path->pathtarget->width,
+ 0.0,
+ work_mem,
+ -1.0);
+ }
+ else
+ {
+ cost_incremental_sort(&sort_path,
+ root,
+ outersortkeys,
+ presorted_keys,
+ outer_path->disabled_nodes,
+ outer_path->startup_cost,
+ outer_path->total_cost,
+ outer_path_rows,
+ outer_path->pathtarget->width,
+ 0.0,
+ work_mem,
+ -1.0);
+ }
disabled_nodes += sort_path.disabled_nodes;
startup_cost += sort_path.startup_cost;
startup_cost += (sort_path.total_cost - sort_path.startup_cost)
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index bb45ef318f..bff48d266f 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -179,6 +179,8 @@ static void copy_generic_path_info(Plan *dest, Path *src);
static void copy_plan_costsize(Plan *dest, Plan *src);
static void label_sort_with_costsize(PlannerInfo *root, Sort *plan,
double limit_tuples);
+static void label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
+ List *pathkeys, double limit_tuples);
static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
TableSampleClause *tsc);
@@ -4523,12 +4525,47 @@ create_mergejoin_plan(PlannerInfo *root,
if (best_path->outersortkeys)
{
Relids outer_relids = outer_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(outer_plan,
+ Plan *sort_plan;
+ bool use_incremental_sort = false;
+ int presorted_keys;
+
+ if (enable_incremental_sort)
+ {
+ bool is_sorted PG_USED_FOR_ASSERTS_ONLY;
+
+ is_sorted = pathkeys_count_contained_in(best_path->outersortkeys,
+ outer_path->pathkeys,
+ &presorted_keys);
+ Assert(!is_sorted);
+
+ if (presorted_keys > 0)
+ use_incremental_sort = true;
+ }
+
+ if (!use_incremental_sort)
+ {
+ sort_plan = (Plan *)
+ make_sort_from_pathkeys(outer_plan,
+ best_path->outersortkeys,
+ outer_relids);
+
+ label_sort_with_costsize(root, (Sort *) sort_plan, -1.0);
+ }
+ else
+ {
+ sort_plan = (Plan *)
+ make_incrementalsort_from_pathkeys(outer_plan,
best_path->outersortkeys,
- outer_relids);
+ outer_relids,
+ presorted_keys);
- label_sort_with_costsize(root, sort, -1.0);
- outer_plan = (Plan *) sort;
+ label_incrementalsort_with_costsize(root,
+ (IncrementalSort *) sort_plan,
+ best_path->outersortkeys,
+ -1.0);
+ }
+
+ outer_plan = sort_plan;
outerpathkeys = best_path->outersortkeys;
}
else
@@ -5447,10 +5484,6 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
- /*
- * This function shouldn't have to deal with IncrementalSort plans because
- * they are only created from corresponding Path nodes.
- */
Assert(IsA(plan, Sort));
cost_sort(&sort_path, root, NIL,
@@ -5469,6 +5502,37 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
plan->plan.parallel_safe = lefttree->parallel_safe;
}
+/*
+ * Same as label_sort_with_costsize, but labels the IncrementalSort node
+ * instead.
+ */
+static void
+label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
+ List *pathkeys, double limit_tuples)
+{
+ Plan *lefttree = plan->sort.plan.lefttree;
+ Path sort_path; /* dummy for result of cost_incremental_sort */
+
+ Assert(IsA(plan, IncrementalSort));
+
+ cost_incremental_sort(&sort_path, root, pathkeys,
+ plan->nPresortedCols,
+ plan->sort.plan.disabled_nodes,
+ lefttree->startup_cost,
+ lefttree->total_cost,
+ lefttree->plan_rows,
+ lefttree->plan_width,
+ 0.0,
+ work_mem,
+ limit_tuples);
+ plan->sort.plan.startup_cost = sort_path.startup_cost;
+ plan->sort.plan.total_cost = sort_path.total_cost;
+ plan->sort.plan.plan_rows = lefttree->plan_rows;
+ plan->sort.plan.plan_width = lefttree->plan_width;
+ plan->sort.plan.parallel_aware = false;
+ plan->sort.plan.parallel_safe = lefttree->parallel_safe;
+}
+
/*
* bitmap_subplan_mark_shared
* Set isshared flag in bitmap subplan so that it will be created in
--
2.43.0
On Mon, Sep 9, 2024 at 6:22 PM Tomas Vondra <tomas@vondra.me> wrote:
I have not thought about this particular case too much, but how likely
is it that mergejoin will win for this plan in practice? If we consider
a query that only needs a fraction of rows (say, thanks to LIMIT),
aren't we likely to pick a nested loop (which can do incremental sort
for the outer plan)? For joins that need to run to completion it might
be a win, but there's also the higher risk of a poor costing.
I think one situation where mergejoin tends to outperform hashjoin and
nestloop is when ORDER BY clauses are present. For example, for the
query below, mergejoin runs much faster than hashjoin and nestloop,
and mergejoin with incremental sort is even faster than mergejoin with
full sort.
create table t (a int, b int);
insert into t select random(1,100), random(1,100) from
generate_series(1,100000);
analyze t;
-- on patched
explain (analyze, costs off, timing off)
select * from (select * from t order by a) t1 join t t2 on t1.a = t2.a
and t1.b = t2.b order by t1.a, t1.b;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Merge Join (actual rows=1100372 loops=1)
Merge Cond: ((t.a = t2.a) AND (t.b = t2.b))
-> Incremental Sort (actual rows=100000 loops=1)
Sort Key: t.a, t.b
Presorted Key: t.a
Full-sort Groups: 100 Sort Method: quicksort Average Memory: 26kB Peak Memory: 26kB
Pre-sorted Groups: 100 Sort Method: quicksort Average Memory: 74kB Peak Memory: 74kB
-> Sort (actual rows=100000 loops=1)
Sort Key: t.a
Sort Method: external merge Disk: 1768kB
-> Seq Scan on t (actual rows=100000 loops=1)
-> Sort (actual rows=1100367 loops=1)
Sort Key: t2.a, t2.b
Sort Method: external sort Disk: 2160kB
-> Seq Scan on t t2 (actual rows=100000 loops=1)
Planning Time: 0.912 ms
Execution Time: 854.502 ms
(17 rows)
-- disable incremental sort
set enable_incremental_sort to off;
explain (analyze, costs off, timing off)
select * from (select * from t order by a) t1 join t t2 on t1.a = t2.a
and t1.b = t2.b order by t1.a, t1.b;
QUERY PLAN
--------------------------------------------------------------
Merge Join (actual rows=1100372 loops=1)
Merge Cond: ((t.a = t2.a) AND (t.b = t2.b))
-> Sort (actual rows=100000 loops=1)
Sort Key: t.a, t.b
Sort Method: external merge Disk: 1768kB
-> Sort (actual rows=100000 loops=1)
Sort Key: t.a
Sort Method: external merge Disk: 1768kB
-> Seq Scan on t (actual rows=100000 loops=1)
-> Sort (actual rows=1100367 loops=1)
Sort Key: t2.a, t2.b
Sort Method: external sort Disk: 2160kB
-> Seq Scan on t t2 (actual rows=100000 loops=1)
Planning Time: 1.451 ms
Execution Time: 958.607 ms
(15 rows)
-- with hashjoin
set enable_mergejoin to off;
explain (analyze, costs off, timing off)
select * from (select * from t order by a) t1 join t t2 on t1.a = t2.a
and t1.b = t2.b order by t1.a, t1.b;
QUERY PLAN
--------------------------------------------------------------------
Sort (actual rows=1100372 loops=1)
Sort Key: t.a, t.b
Sort Method: external merge Disk: 28000kB
-> Hash Join (actual rows=1100372 loops=1)
Hash Cond: ((t2.a = t.a) AND (t2.b = t.b))
-> Seq Scan on t t2 (actual rows=100000 loops=1)
-> Hash (actual rows=100000 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 4931kB
-> Sort (actual rows=100000 loops=1)
Sort Key: t.a
Sort Method: external merge Disk: 1768kB
-> Seq Scan on t (actual rows=100000 loops=1)
Planning Time: 1.998 ms
Execution Time: 2469.954 ms
(14 rows)
-- with nestloop, my small machine seems to run forever.
Thanks
Richard
On 9/13/24 06:04, Richard Guo wrote:
On Mon, Sep 9, 2024 at 6:22 PM Tomas Vondra <tomas@vondra.me> wrote:
I think we intentionally added incremental sort ... incrementally ;-)
Haha, right.
I think one challenge with this case is that create_mergejoin_plan
creates these Sort plans explicitly - it's not "pathified" so it doesn't
go through the usual cost comparison etc. And it certainly is not the
case that incremental sort would win always, so we'd need to replicate
the cost comparison logic too.

I have not thought about this particular case too much, but how likely
is it that mergejoin will win for this plan in practice? If we consider
a query that only needs a fraction of rows (say, thanks to LIMIT),
aren't we likely to pick a nested loop (which can do incremental sort
for the outer plan)? For joins that need to run to completion it might
be a win, but there's also the higher risk of a poor costing.

Yeah, currently mergejoin path always assumes that full sort would be
used on top of the outer path and inner path if they are not already
ordered appropriately. So initial_cost_mergejoin directly charges the
cost of full sort into outer/inner path's cost, without going through
the usual cost comparison with incremental sort.

It seems to me that some parts of our code assume that, for a given
input path that is partially ordered, an incremental sort is always
preferable to a full sort (see commit 4a29eabd1). Am I getting this
correctly?
I don't think we're making such an assumption. I don't know which of the
many places modified by 4a29eabd1 you have in mind, but IIRC we always
consider both full and incremental sort.
If this is the case, then I think using the following
outer path for the merge join:

-> Incremental Sort
Sort Key: t1.a, t1.b
Presorted Key: t1.a
-> Index Scan using t1_a_idx on t1

... would be an immediate improvement over the current path, which is:
-> Sort
Sort Key: t1.a, t1.b
-> Index Scan using t1_a_idx on t1

The shaky cost estimates for incremental sort that you mentioned are
indeed a concern. Fortunately we have the enable_incremental_sort GUC
already. As in many other parts of the code (such as in the function
add_paths_to_grouping_rel), we can always disable incremental sort to
fall back to a full sort if needed.
I think our goal should be to not rely on the enable_incremental_sort
GUC as a defense very much. It's a very blunt instrument, that often
forces users to pick whether they prefer to improve one query while
harming some other queries. I personally consider these enable_ GUC more
a development tool than something suitable for users.
I'm not saying it's not worth exploring, I'm just trying to recall reasons
why it might not work. Also I don't think it's fundamentally impossible
to do mark/restore for incremental sort - I just haven't tried doing it
because it wasn't necessary. In the worst case we could simply add a
Materialize node on top, no?

Yeah, perhaps we could support mark/restore for incremental sort
someday. This would allow us to apply incremental sort to the inner
path of a merge join too. But if we apply a Material node on top of
the inner due to the lack of mark/restore support for incremental
sort, then we will need to compare two paths:

partially sorted path -> incremental sort -> material
VS.
partially sorted path -> full sort
I think it's hard to tell which is cheaper without a cost comparison,
which we do not have for now.
How is this different from the incremental sort costing we already have?
To help with the discussion, I drafted a WIP patch as attached, which
chooses an incremental sort over a full sort on the given outer path
of a mergejoin whenever possible. There is one ensuing plan diff in
the regression tests (not included in the patch). It seems that some
query in the tests begins to use incremental sort for mergejoin.
I'm not against the patch in principle, but I haven't thought about the
costing and risk very much. If I have time I'll try to do some more
experiments to see how it behaves, but no promises.
regards
--
Tomas Vondra
On 9/13/24 13:18, Richard Guo wrote:
On Mon, Sep 9, 2024 at 6:22 PM Tomas Vondra <tomas@vondra.me> wrote:
I have not thought about this particular case too much, but how likely
is it that mergejoin will win for this plan in practice? If we consider
a query that only needs a fraction of rows (say, thanks to LIMIT),
aren't we likely to pick a nested loop (which can do incremental sort
for the outer plan)? For joins that need to run to completion it might
be a win, but there's also the higher risk of a poor costing.

I think one situation where mergejoin tends to outperform hashjoin and
nestloop is when ORDER BY clauses are present. For example, for the
query below, mergejoin runs much faster than hashjoin and nestloop,
and mergejoin with incremental sort is even faster than mergejoin with
full sort.
Sure, an incremental sort can improve things if things go well, no doubt
about that. But how significant can the improvement be, especially if we
need to sort everything? In your example it's ~15%, which is nice, and
maybe the benefits can be much larger if the incremental sort can do
everything in memory, without flushing to disk.
But what about the risks? What will happen if the groups are not this
uniformly and independently sized, and stuff like that? Maybe it'll be
costed well enough, I haven't tried.
I don't recall the exact reasoning for why we didn't add incremental
sort in various places, I'd have to dig into the old threads, or
something. But I believe thinking about these risks was part of it -
trying to handle places where the potential benefits are much larger
than the risks.
As I wrote earlier, it's not my intent to discourage you from working on
this. Please do, I'm sure it can be improved.
regards
--
Tomas Vondra
On Fri, Sep 13, 2024 at 7:35 PM Tomas Vondra <tomas@vondra.me> wrote:
On 9/13/24 06:04, Richard Guo wrote:
It seems to me that some parts of our code assume that, for a given
input path that is partially ordered, an incremental sort is always
preferable to a full sort (see commit 4a29eabd1). Am I getting this
correctly?

I don't think we're making such an assumption. I don't know which of the
many places modified by 4a29eabd1 you have in mind, but IIRC we always
consider both full and incremental sort.
Hmm, I don't think it's true that we always consider both full and
incremental sort on the same path. It was true initially, but that’s
no longer the case.
I checked the 9 callers of create_incremental_sort_path, and they all
follow the logic that if there are presorted keys, only incremental
sort is considered. To quote a comment from one of the callers:
* ... We've no need to consider both
* sort and incremental sort on the same path. We assume that
* incremental sort is always faster when there are presorted
* keys.
I think this is also explained in the commit message of 4a29eabd1,
quoted here:
"
Previously we would generally create a sort path on the cheapest input
path (if that wasn't sorted already) and incremental sort paths on any
path which had presorted keys. This meant that if the cheapest input path
wasn't completely sorted but happened to have presorted keys, we would
create a full sort path *and* an incremental sort path on that input path.
Here we change this logic so that if there are presorted keys, we only
create an incremental sort path, and create sort paths only when a full
sort is required.
"
Thanks
Richard
On Fri, Sep 13, 2024 at 7:51 PM Tomas Vondra <tomas@vondra.me> wrote:
Sure, an incremental sort can improve things if things go well, no doubt
about that. But how significant can the improvement be, especially if we
need to sort everything? In your example it's ~15%, which is nice, and
maybe the benefits can be much larger if the incremental sort can do
everything in memory, without flushing to disk.

But what about the risks? What will happen if the groups are not this
uniformly and independently sized, and stuff like that? Maybe it'll be
costed well enough, I haven't tried.
I understand the concern and agree that there is a risk of regression
if there is a skew in the number of rows per pre-sorted group.
Actually there were discussions about this during the work on commit
4a29eabd1. Please see [1]. I will repeat David's demonstration and
rerun his query on the current master branch to see what happens.
create table ab (a int not null, b int not null);
insert into ab select 0,x from generate_Series(1,999000)x union all
select x%1000+1,0 from generate_Series(999001,1000000)x;
create index on ab (a);
analyze ab;
So this is a table with a very large skew: the 0 group has 999000
rows, and the remaining groups 1-1000 have just 1 row each.
-- on master
explain (analyze, timing off) select * from ab order by a,b;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Incremental Sort (cost=2767.38..109344.55 rows=1000000 width=8) (actual rows=1000000 loops=1)
Sort Key: a, b
Presorted Key: a
Full-sort Groups: 33 Sort Method: quicksort Average Memory: 26kB Peak Memory: 26kB
Pre-sorted Groups: 1 Sort Method: external merge Average Disk: 17680kB Peak Disk: 17680kB
-> Index Scan using ab_a_idx on ab (cost=0.42..22832.42 rows=1000000 width=8) (actual rows=1000000 loops=1)
Planning Time: 0.829 ms
Execution Time: 1028.892 ms
(8 rows)
-- manually disable incremental sort
set enable_incremental_sort to off;
explain (analyze, timing off) select * from ab order by a,b;
QUERY PLAN
------------------------------------------------------------------------------------------------
Sort (cost=127757.34..130257.34 rows=1000000 width=8) (actual rows=1000000 loops=1)
Sort Key: a, b
Sort Method: external merge Disk: 17704kB
-> Seq Scan on ab (cost=0.00..14425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
Planning Time: 0.814 ms
Execution Time: 765.198 ms
(6 rows)
Look, regression happens on current master!
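The mechanism behind this regression is easy to see with a toy work model: an incremental sort still has to fully sort each presorted group, so when one group holds nearly all the rows it does almost the same sorting work as a full sort, plus per-group overhead. Again a rough sketch with an invented overhead constant, not the real cost model:

```python
import math


def sort_work(n):
    # Toy comparison-count model for sorting n rows.
    return n * math.log2(max(n, 2))


def incremental_work(group_sizes, per_group_overhead=50.0):
    # Incremental sort sorts each presorted group in full; it only
    # avoids comparisons *across* group boundaries.
    return sum(sort_work(g) + per_group_overhead for g in group_sizes)


total = 1_000_000
skewed = [999_000] + [1] * 1_000    # the ab table above
uniform = [1_000] * 1_000           # same row count, even groups

# With heavy skew there is essentially nothing to save: the work is
# at least that of one big sort, plus the batching overhead.
assert incremental_work(skewed) > 0.99 * sort_work(total)
# With uniform groups the saving is large.
assert incremental_work(uniform) < 0.6 * sort_work(total)
```

The cost model sees the same effect only if the group-count estimate is accurate, which is exactly where the estimates get shaky.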
This is a question I’ve often pondered: each time we introduce a new
optimization, there are always potential cases where it could lead to
regressions. What should we do about this? I kind of agree with
David's opinion that, as stated in the commit message of 4a29eabd1:
"
That, of course, as with teaching the planner any new tricks,
means an increased likelihood that the planner will perform an incremental
sort when it's not the best method. Our standard escape hatch for these
cases is an enable_* GUC.
"
I know ideally we should not rely on these enable_* GUCs, but in
reality it seems that sometimes we do not have a better solution.
[1]: /messages/by-id/CAApHDvr1Sm+g9hbv4REOVuvQKeDWXcKUAhmbK5K+dfun0s9CvA@mail.gmail.com
Thanks
Richard
On 9/14/24 05:50, Richard Guo wrote:
On Fri, Sep 13, 2024 at 7:51 PM Tomas Vondra <tomas@vondra.me> wrote:
Sure, an incremental sort can improve things if things go well, no doubt
about that. But how significant can the improvement be, especially if we
need to sort everything? In your example it's ~15%, which is nice, and
maybe the benefits can be much larger if the incremental sort can do
everything in memory, without flushing to disk.

But what about the risks? What will happen if the groups are not this
uniformly and independently sized, and stuff like that? Maybe it'll be
costed well enough, I haven't tried.

I understand the concern and agree that there is a risk of regression
if there is a skew in the number of rows per pre-sorted group.

Actually there were discussions about this during the work on commit
4a29eabd1. Please see [1]. I will repeat David's demonstration and
rerun his query on the current master branch to see what happens.

create table ab (a int not null, b int not null);
insert into ab select 0,x from generate_Series(1,999000)x union all
select x%1000+1,0 from generate_Series(999001,1000000)x;

create index on ab (a);
analyze ab;
So this is a table with a very large skew: the 0 group has 999000
rows, and the remaining groups 1-1000 have just 1 row each.

-- on master
explain (analyze, timing off) select * from ab order by a,b;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Incremental Sort (cost=2767.38..109344.55 rows=1000000 width=8) (actual rows=1000000 loops=1)
Sort Key: a, b
Presorted Key: a
Full-sort Groups: 33 Sort Method: quicksort Average Memory: 26kB Peak Memory: 26kB
Pre-sorted Groups: 1 Sort Method: external merge Average Disk: 17680kB Peak Disk: 17680kB
-> Index Scan using ab_a_idx on ab (cost=0.42..22832.42 rows=1000000 width=8) (actual rows=1000000 loops=1)
Planning Time: 0.829 ms
Execution Time: 1028.892 ms
(8 rows)

-- manually disable incremental sort
set enable_incremental_sort to off;
explain (analyze, timing off) select * from ab order by a,b;
QUERY PLAN
------------------------------------------------------------------------------------------------
Sort (cost=127757.34..130257.34 rows=1000000 width=8) (actual rows=1000000 loops=1)
Sort Key: a, b
Sort Method: external merge Disk: 17704kB
-> Seq Scan on ab (cost=0.00..14425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
Planning Time: 0.814 ms
Execution Time: 765.198 ms
(6 rows)

Look, regression happens on current master!
This is a question I’ve often pondered: each time we introduce a new
optimization, there are always potential cases where it could lead to
regressions. What should we do about this? I kind of agree with
David's opinion that, as stated in the commit message of 4a29eabd1:
Right, or as Goetz Graefe said "choice is confusion."
The funny thing is it also matters when the alternative plans are
introduced. Had it been at the very beginning, we'd consider the
behavior (including choosing sub-optimal plans) the baseline, and it'd
be okay-ish. But when we're introducing those alternative paths later,
it's more likely to be seen as a "regression" ...
"
That, of course, as with teaching the planner any new tricks,
means an increased likelihood that the planner will perform an incremental
sort when it's not the best method. Our standard escape hatch for these
cases is an enable_* GUC.
"I know ideally we should not rely on these enable_* GUCs, but in
reality it seems that sometimes we do not have a better solution.
Right, the basic truth is there's no way to teach the optimizer to do
new stuff without a risk of regression. We simply can't do perfect
decisions based on incomplete information (which is what the statistics
are, intentionally). It is up to us to reason about the risks, and
ideally deal with that later at execution time.
I still don't think we should rely on GUCs too much for this.
regards
--
Tomas Vondra
On 9/14/24 05:37, Richard Guo wrote:
On Fri, Sep 13, 2024 at 7:35 PM Tomas Vondra <tomas@vondra.me> wrote:
On 9/13/24 06:04, Richard Guo wrote:
It seems to me that some parts of our code assume that, for a given
input path that is partially ordered, an incremental sort is always
preferable to a full sort (see commit 4a29eabd1). Am I getting this
correctly?

I don't think we're making such an assumption. I don't know which of the
many places modified by 4a29eabd1 you have in mind, but IIRC we always
consider both full and incremental sort.

Hmm, I don't think it's true that we always consider both full and
incremental sort on the same path. It was true initially, but that’s
no longer the case.

I checked the 9 callers of create_incremental_sort_path, and they all
follow the logic that if there are presorted keys, only incremental
sort is considered. To quote a comment from one of the callers:

* ... We've no need to consider both
* sort and incremental sort on the same path. We assume that
* incremental sort is always faster when there are presorted
* keys.

I think this is also explained in the commit message of 4a29eabd1,
quoted here:

"
Previously we would generally create a sort path on the cheapest input
path (if that wasn't sorted already) and incremental sort paths on any
path which had presorted keys. This meant that if the cheapest input path
wasn't completely sorted but happened to have presorted keys, we would
create a full sort path *and* an incremental sort path on that input path.
Here we change this logic so that if there are presorted keys, we only
create an incremental sort path, and create sort paths only when a full
sort is required.
"
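The rule described in that commit message can be sketched roughly as follows. This is plain Python with hypothetical names, not the planner's actual API; it only illustrates the "pick exactly one sort strategy" logic:

```python
def choose_sort_paths(required_keys, input_keys, enable_incremental_sort=True):
    """Sketch of the post-4a29eabd1 rule: count how many leading required
    sort keys the input path already provides, then create exactly one
    kind of sort path (or none)."""
    presorted = 0
    for req, got in zip(required_keys, input_keys):
        if req != got:
            break
        presorted += 1
    if presorted == len(required_keys):
        return "no sort needed"
    if enable_incremental_sort and presorted > 0:
        return "incremental sort"  # a full sort path is not created at all
    return "full sort"
```

The point under discussion is the middle branch: once any presorted keys exist (and the GUC is on), full sort is never considered on that path.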
Hmmm ... I wasn't involved in that discussion and I haven't studied the
thread now, but this seems a bit weird to me. If the presorted keys have
low cardinality, can't the full sort be faster?
Maybe it's not possible with the costing changes (removing the
penalization), in which case it would be fine to not consider the full
sort, because we'd just throw it away immediately. But if the full sort
could be costed as cheaper, I'd say that might be wrong.
regards
--
Tomas Vondra
On Sun, Sep 15, 2024 at 6:01 AM Tomas Vondra <tomas@vondra.me> wrote:
Hmmm ... I wasn't involved in that discussion and I haven't studied the
thread now, but this seems a bit weird to me. If the presorted keys have
low cardinality, can't the full sort be faster?

Maybe it's not possible with the costing changes (removing the
penalization), in which case it would be fine to not consider the full
sort, because we'd just throw it away immediately. But if the full sort
could be costed as cheaper, I'd say that might be wrong.
IIUC, we initially applied a 1.5x pessimism factor to the cost
estimates of incremental sort in an effort to avoid performance
regressions in cases with a large skew in the number of rows within
the presorted groups. OTOH, this also restricted our ability to
leverage incremental sort when it would be preferable.
I agree with you that sometimes the definition of 'regression' can
depend on when the alternative plans are introduced. Imagine if we
initially did not have the 1.5x pessimism factor and introduced it
later, it would be treated as a 'regression' because some queries that
could benefit from incremental sort would then have to resort to full
sort.
In commit 4a29eabd1, we removed the 1.5x pessimism factor to allow
incremental sort to have a fairer chance at being chosen against a
full sort. With this change, the cost model now tends to favor
incremental sort as being cheaper than full sort in the presence of
presorted keys, making it unnecessary to consider full sort in such
cases, because, as you mentioned, we'd just throw the full sort path
away immediately. So 4a29eabd1 also modified the logic so that if
there are presorted keys, we only create an incremental sort path.
As for the potential performance regressions caused by these changes,
4a29eabd1 opted to use enable_incremental_sort as an escape hatch.
I think the same theory applies to mergejoins too. We can leverage
incremental sort if it is enabled and there are presorted keys,
assuming that it is always faster than full sort, and use the GUC as
an escape hatch in the worst case.
So here is the v2 patch, which is almost the same as v1, but includes
changes in test cases and comments, and also includes a commit
message. I'll put it in the commitfest.
Thanks
Richard
Attachments:
v2-0001-Consider-explicit-incremental-sort-for-mergejoins.patch
From 426c0481ae5714301f775cc4c294fb4caf33a506 Mon Sep 17 00:00:00 2001
From: Richard Guo <guofenglinux@gmail.com>
Date: Mon, 9 Sep 2024 19:57:53 +0900
Subject: [PATCH v2] Consider explicit incremental sort for mergejoins
For a mergejoin, if the given outer path or inner path is not already
well enough ordered, we need to do an explicit sort. Currently, we
only consider explicit full sort and do not account for incremental
sort.
In this patch, for the outer path of a mergejoin, we choose to use
explicit incremental sort if it is enabled and there are presorted
keys. For the inner path, though, we cannot use incremental sort
because it does not support mark/restore at present.
The rationale is based on the assumption that incremental sort is
always faster than full sort when there are presorted keys, a premise
that has been applied in various parts of the code. This assumption
does not always hold, particularly in cases with a large skew in the
number of rows within the presorted groups. Nonetheless, the current
cost model tends to favor incremental sort as being cheaper than full
sort in the presence of presorted keys, making it reasonable not to
consider full sort in such cases.
One might ask what happens if a mergejoin with an incremental sort
as the outer path is selected as the inner path of another mergejoin.
However, this should not be a problem, because mergejoin itself does
not support mark/restore either, and we will add a Material node on
top of it anyway in this case (see final_cost_mergejoin).
There is one ensuing plan change in the regression tests, and we have
to modify that test case to ensure that it continues to test what it
is intended to.
No backpatch as this could result in plan changes.
Author: Richard Guo
Reviewed-by: Tomas Vondra
Discussion: https://postgr.es/m/CAMbWs49x425QrX7h=Ux05WEnt8GS757H-jOP3_xsX5t1FoUsZw@mail.gmail.com
---
src/backend/optimizer/path/costsize.c | 69 +++++++++++---
src/backend/optimizer/plan/createplan.c | 89 +++++++++++++++++--
src/test/regress/expected/aggregates.out | 34 ++++---
.../regress/expected/incremental_sort.out | 21 +++++
src/test/regress/sql/aggregates.sql | 6 +-
src/test/regress/sql/incremental_sort.sql | 6 ++
6 files changed, 184 insertions(+), 41 deletions(-)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index e1523d15df..c6e66e46f4 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -3532,7 +3532,8 @@ final_cost_nestloop(PlannerInfo *root, NestPath *path,
* join quals here, except for obtaining the scan selectivity estimate which
* is really essential (but fortunately, use of caching keeps the cost of
* getting that down to something reasonable).
- * We also assume that cost_sort is cheap enough to use here.
+ * We also assume that cost_sort/cost_incremental_sort is cheap enough to use
+ * here.
*
* 'workspace' is to be filled with startup_cost, total_cost, and perhaps
* other data to be used by final_cost_mergejoin
@@ -3569,7 +3570,8 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
outerendsel,
innerstartsel,
innerendsel;
- Path sort_path; /* dummy for result of cost_sort */
+ Path sort_path; /* dummy for result of
+ * cost_sort/cost_incremental_sort */
/* Protect some assumptions below that rowcounts aren't zero */
if (outer_path_rows <= 0)
@@ -3682,16 +3684,54 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
if (outersortkeys) /* do we need to sort outer? */
{
- cost_sort(&sort_path,
- root,
- outersortkeys,
- outer_path->disabled_nodes,
- outer_path->total_cost,
- outer_path_rows,
- outer_path->pathtarget->width,
- 0.0,
- work_mem,
- -1.0);
+ bool use_incremental_sort = false;
+ int presorted_keys;
+
+ /*
+ * We choose to use incremental sort if it is enabled and there are
+ * presorted keys; otherwise we use full sort.
+ */
+ if (enable_incremental_sort)
+ {
+ bool is_sorted PG_USED_FOR_ASSERTS_ONLY;
+
+ is_sorted = pathkeys_count_contained_in(outersortkeys,
+ outer_path->pathkeys,
+ &presorted_keys);
+ Assert(!is_sorted);
+
+ if (presorted_keys > 0)
+ use_incremental_sort = true;
+ }
+
+ if (!use_incremental_sort)
+ {
+ cost_sort(&sort_path,
+ root,
+ outersortkeys,
+ outer_path->disabled_nodes,
+ outer_path->total_cost,
+ outer_path_rows,
+ outer_path->pathtarget->width,
+ 0.0,
+ work_mem,
+ -1.0);
+ }
+ else
+ {
+ cost_incremental_sort(&sort_path,
+ root,
+ outersortkeys,
+ presorted_keys,
+ outer_path->disabled_nodes,
+ outer_path->startup_cost,
+ outer_path->total_cost,
+ outer_path_rows,
+ outer_path->pathtarget->width,
+ 0.0,
+ work_mem,
+ -1.0);
+ }
disabled_nodes += sort_path.disabled_nodes;
startup_cost += sort_path.startup_cost;
startup_cost += (sort_path.total_cost - sort_path.startup_cost)
@@ -3711,6 +3751,11 @@ initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
if (innersortkeys) /* do we need to sort inner? */
{
+ /*
+ * We do not consider incremental sort for inner path, because
+ * incremental sort does not support mark/restore.
+ */
+
cost_sort(&sort_path,
root,
innersortkeys,
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index bb45ef318f..0d195a07ff 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -179,6 +179,8 @@ static void copy_generic_path_info(Plan *dest, Path *src);
static void copy_plan_costsize(Plan *dest, Plan *src);
static void label_sort_with_costsize(PlannerInfo *root, Sort *plan,
double limit_tuples);
+static void label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
+ List *pathkeys, double limit_tuples);
static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
TableSampleClause *tsc);
@@ -4523,12 +4525,51 @@ create_mergejoin_plan(PlannerInfo *root,
if (best_path->outersortkeys)
{
Relids outer_relids = outer_path->parent->relids;
- Sort *sort = make_sort_from_pathkeys(outer_plan,
+ Plan *sort_plan;
+ bool use_incremental_sort = false;
+ int presorted_keys;
+
+ /*
+ * We choose to use incremental sort if it is enabled and there are
+ * presorted keys; otherwise we use full sort.
+ */
+ if (enable_incremental_sort)
+ {
+ bool is_sorted PG_USED_FOR_ASSERTS_ONLY;
+
+ is_sorted = pathkeys_count_contained_in(best_path->outersortkeys,
+ outer_path->pathkeys,
+ &presorted_keys);
+ Assert(!is_sorted);
+
+ if (presorted_keys > 0)
+ use_incremental_sort = true;
+ }
+
+ if (!use_incremental_sort)
+ {
+ sort_plan = (Plan *)
+ make_sort_from_pathkeys(outer_plan,
+ best_path->outersortkeys,
+ outer_relids);
+
+ label_sort_with_costsize(root, (Sort *) sort_plan, -1.0);
+ }
+ else
+ {
+ sort_plan = (Plan *)
+ make_incrementalsort_from_pathkeys(outer_plan,
best_path->outersortkeys,
- outer_relids);
+ outer_relids,
+ presorted_keys);
- label_sort_with_costsize(root, sort, -1.0);
- outer_plan = (Plan *) sort;
+ label_incrementalsort_with_costsize(root,
+ (IncrementalSort *) sort_plan,
+ best_path->outersortkeys,
+ -1.0);
+ }
+
+ outer_plan = sort_plan;
outerpathkeys = best_path->outersortkeys;
}
else
@@ -4536,6 +4577,11 @@ create_mergejoin_plan(PlannerInfo *root,
if (best_path->innersortkeys)
{
+ /*
+ * We do not consider incremental sort for inner path, because
+ * incremental sort does not support mark/restore.
+ */
+
Relids inner_relids = inner_path->parent->relids;
Sort *sort = make_sort_from_pathkeys(inner_plan,
best_path->innersortkeys,
@@ -5447,10 +5493,6 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
Plan *lefttree = plan->plan.lefttree;
Path sort_path; /* dummy for result of cost_sort */
- /*
- * This function shouldn't have to deal with IncrementalSort plans because
- * they are only created from corresponding Path nodes.
- */
Assert(IsA(plan, Sort));
cost_sort(&sort_path, root, NIL,
@@ -5469,6 +5511,37 @@ label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
plan->plan.parallel_safe = lefttree->parallel_safe;
}
+/*
+ * Same as label_sort_with_costsize, but labels the IncrementalSort node
+ * instead.
+ */
+static void
+label_incrementalsort_with_costsize(PlannerInfo *root, IncrementalSort *plan,
+ List *pathkeys, double limit_tuples)
+{
+ Plan *lefttree = plan->sort.plan.lefttree;
+ Path sort_path; /* dummy for result of cost_incremental_sort */
+
+ Assert(IsA(plan, IncrementalSort));
+
+ cost_incremental_sort(&sort_path, root, pathkeys,
+ plan->nPresortedCols,
+ plan->sort.plan.disabled_nodes,
+ lefttree->startup_cost,
+ lefttree->total_cost,
+ lefttree->plan_rows,
+ lefttree->plan_width,
+ 0.0,
+ work_mem,
+ limit_tuples);
+ plan->sort.plan.startup_cost = sort_path.startup_cost;
+ plan->sort.plan.total_cost = sort_path.total_cost;
+ plan->sort.plan.plan_rows = lefttree->plan_rows;
+ plan->sort.plan.plan_width = lefttree->plan_width;
+ plan->sort.plan.parallel_aware = false;
+ plan->sort.plan.parallel_safe = lefttree->parallel_safe;
+}
+
/*
* bitmap_subplan_mark_shared
* Set isshared flag in bitmap subplan so that it will be created in
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 8ac13b562c..d0b438b182 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2832,29 +2832,27 @@ GROUP BY w, x, z, y;
-> Index Scan using btg_x_y_idx on btg
(6 rows)
--- Utilize the ordering of merge join to avoid a full Sort operation
+-- Utilize the ordering of merge join to avoid a Sort operation
SET enable_hashjoin = off;
SET enable_nestloop = off;
EXPLAIN (COSTS OFF)
SELECT count(*)
- FROM btg t1 JOIN btg t2 ON t1.z = t2.z AND t1.w = t2.w AND t1.x = t2.x
- GROUP BY t1.x, t1.y, t1.z, t1.w;
- QUERY PLAN
--------------------------------------------------------------------------------
+ FROM btg t1 JOIN btg t2 ON t1.w = t2.w AND t1.x = t2.x AND t1.z = t2.z
+ GROUP BY t1.w, t1.z, t1.x;
+ QUERY PLAN
+-------------------------------------------------------------------------
GroupAggregate
- Group Key: t1.z, t1.w, t1.x, t1.y
- -> Incremental Sort
- Sort Key: t1.z, t1.w, t1.x, t1.y
- Presorted Key: t1.z, t1.w, t1.x
- -> Merge Join
- Merge Cond: ((t1.z = t2.z) AND (t1.w = t2.w) AND (t1.x = t2.x))
- -> Sort
- Sort Key: t1.z, t1.w, t1.x
- -> Index Scan using btg_x_y_idx on btg t1
- -> Sort
- Sort Key: t2.z, t2.w, t2.x
- -> Index Scan using btg_x_y_idx on btg t2
-(13 rows)
+ Group Key: t1.x, t1.w, t1.z
+ -> Merge Join
+ Merge Cond: ((t1.x = t2.x) AND (t1.w = t2.w) AND (t1.z = t2.z))
+ -> Incremental Sort
+ Sort Key: t1.x, t1.w, t1.z
+ Presorted Key: t1.x
+ -> Index Scan using btg_x_y_idx on btg t1
+ -> Sort
+ Sort Key: t2.x, t2.w, t2.z
+ -> Index Scan using btg_x_y_idx on btg t2
+(11 rows)
RESET enable_nestloop;
RESET enable_hashjoin;
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
index 79f0d37a87..c561b62b2d 100644
--- a/src/test/regress/expected/incremental_sort.out
+++ b/src/test/regress/expected/incremental_sort.out
@@ -1701,3 +1701,24 @@ explain (costs off) select a, b, a <-> point(5, 5) dist from point_table order b
Order By: (a <-> '(5,5)'::point)
(6 rows)
+-- Ensure we get an incremental sort on the outer side of the mergejoin
+explain (costs off)
+select * from
+ (select * from tenk1 order by four) t1 join tenk1 t2 on t1.four = t2.four and t1.two = t2.two
+order by t1.four, t1.two limit 1;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Limit
+ -> Merge Join
+ Merge Cond: ((tenk1.four = t2.four) AND (tenk1.two = t2.two))
+ -> Incremental Sort
+ Sort Key: tenk1.four, tenk1.two
+ Presorted Key: tenk1.four
+ -> Sort
+ Sort Key: tenk1.four
+ -> Seq Scan on tenk1
+ -> Sort
+ Sort Key: t2.four, t2.two
+ -> Seq Scan on tenk1 t2
+(12 rows)
+
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index ca6d1bcfb7..b9ebc9337d 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -1223,13 +1223,13 @@ EXPLAIN (COSTS OFF) SELECT count(*)
FROM (SELECT * FROM btg ORDER BY x, y, w, z) AS q1
GROUP BY w, x, z, y;
--- Utilize the ordering of merge join to avoid a full Sort operation
+-- Utilize the ordering of merge join to avoid a Sort operation
SET enable_hashjoin = off;
SET enable_nestloop = off;
EXPLAIN (COSTS OFF)
SELECT count(*)
- FROM btg t1 JOIN btg t2 ON t1.z = t2.z AND t1.w = t2.w AND t1.x = t2.x
- GROUP BY t1.x, t1.y, t1.z, t1.w;
+ FROM btg t1 JOIN btg t2 ON t1.w = t2.w AND t1.x = t2.x AND t1.z = t2.z
+ GROUP BY t1.w, t1.z, t1.x;
RESET enable_nestloop;
RESET enable_hashjoin;
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
index ab471bdfff..98b20e17e1 100644
--- a/src/test/regress/sql/incremental_sort.sql
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -292,3 +292,9 @@ create index point_table_a_idx on point_table using gist(a);
-- Ensure we get an incremental sort plan for both of the following queries
explain (costs off) select a, b, a <-> point(5, 5) dist from point_table order by dist, b limit 1;
explain (costs off) select a, b, a <-> point(5, 5) dist from point_table order by dist, b desc limit 1;
+
+-- Ensure we get an incremental sort on the outer side of the mergejoin
+explain (costs off)
+select * from
+ (select * from tenk1 order by four) t1 join tenk1 t2 on t1.four = t2.four and t1.two = t2.two
+order by t1.four, t1.two limit 1;
--
2.43.0
On Fri, 20 Sept 2024 at 20:48, Richard Guo <guofenglinux@gmail.com> wrote:
I agree with you that sometimes the definition of 'regression' can
depend on when the alternative plans are introduced. Imagine if we
initially did not have the 1.5x pessimism factor and introduced it
later, it would be treated as a 'regression' because some queries that
could benefit from incremental sort would then have to resort to full
sort.
I think this is a good way of looking at it. I think it's a bad idea
when introducing new planner/executor abilities to penalise the costs
for that ability. It might make the committer sleep better at night
after committing some new feature, but it's just not a future-proof
endeavour. We should aim for our cost models to be as close to the
truth as possible. As soon as you start introducing "just in case"
penalties, you're telling lies. Lies don't work out well, especially
when compounded with other lies, which is exactly what you get when
layering Paths atop of Paths with just-in-case penalties added at each
level.
I was in this position with Memoize. The problem there is that when we
don't have any stats, we assume the n_distinct is 200. Memoize can
look quite attractive with such a small n_distinct estimate. I
invented SELFLAG_USED_DEFAULT to allow Memoize to steer clear when the
n_distinct estimate used the hard-coded default. I think that's an ok
thing to do as otherwise it could work out very badly if Nested Loop
-> Memoize was used instead of, say a Hash Join on a join problem with
millions of rows, all of them with distinct join values.
So here is the v2 patch, which is almost the same as v1, but includes
changes in test cases and comments, and also includes a commit
message. I'll put it in the commitfest.
Just looking at the commit message:
The rationale is based on the assumption that incremental sort is
always faster than full sort when there are presorted keys, a premise
that has been applied in various parts of the code. This assumption
does not always hold, particularly in cases with a large skew in the
number of rows within the presorted groups.
My understanding is that the worst case for incremental sort is the
same as sort, i.e. only 1 presorted group, which is the same effort to
sort. Is there something I'm missing?
David
On Sun, Sep 22, 2024 at 1:38 PM David Rowley <dgrowleyml@gmail.com> wrote:
Just looking at the commit message:
The rationale is based on the assumption that incremental sort is
always faster than full sort when there are presorted keys, a premise
that has been applied in various parts of the code. This assumption
does not always hold, particularly in cases with a large skew in the
number of rows within the presorted groups.

My understanding is that the worst case for incremental sort is the
same as sort, i.e. only 1 presorted group, which is the same effort to
sort. Is there something I'm missing?
I was thinking that when there’s a large skew in the number of rows
per pre-sorted group, incremental sort might underperform full sort.
As mentioned in the comments for cost_incremental_sort, it has to
detect the sort groups, plus it needs to perform tuplesort_reset after
each group. However, I doubt these factors would have a substantial
impact on the performance of incremental sort. So maybe in the worst
skewed groups case, incremental sort may still perform similarly to
full sort.
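The per-group mechanism described above can be illustrated with a small sketch (plain Python, not the executor code): rows arrive presorted on a prefix of the sort key, each group sharing that prefix is sorted independently, and the sort state is reset between groups.

```python
import itertools

def incremental_sort(rows, n_presorted):
    """Sketch of incremental sort: 'rows' is a list of tuples already
    sorted on their first n_presorted columns. Detect each presorted
    group, sort it on the full key, and emit it; the per-group sort
    stands in for the tuplesort_reset-per-group overhead discussed."""
    out = []
    for _, group in itertools.groupby(rows, key=lambda r: r[:n_presorted]):
        out.extend(sorted(group))  # only this group is held and sorted
    return out
```

With one huge group (or n_presorted = 0) this degenerates to a single sort of all rows, which matches the point that the worst case should take roughly the same effort as a plain full sort.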
Another less obvious reason is that in cases of skewed groups, we may
significantly underestimate the cost of incremental sort. This could
result in choosing a more expensive subpath under the sort. Such as
the example in [1], we end up with IndexScan->IncrementalSort rather
than SeqScan->FullSort, while the former plan is more expensive to
execute. However, this point does not affect this patch, because for
a mergejoin, we only consider outerrel's cheapest-total-cost when we
assume that an explicit sort atop is needed, i.e., we do not have a
chance to select which subpath to use in this case.
[1]: /messages/by-id/CAMbWs49+CXsrgbq0LD1at-5jR=AHHN0YtDy9YvgXAsMfndZe-w@mail.gmail.com
Thanks
Richard
On 9/23/24 05:03, Richard Guo wrote:
On Sun, Sep 22, 2024 at 1:38 PM David Rowley <dgrowleyml@gmail.com> wrote:
Just looking at the commit message:
The rationale is based on the assumption that incremental sort is
always faster than full sort when there are presorted keys, a premise
that has been applied in various parts of the code. This assumption
does not always hold, particularly in cases with a large skew in the
number of rows within the presorted groups.

My understanding is that the worst case for incremental sort is the
same as sort, i.e. only 1 presorted group, which is the same effort to
sort. Is there something I'm missing?

I was thinking that when there’s a large skew in the number of rows
per pre-sorted group, incremental sort might underperform full sort.
As mentioned in the comments for cost_incremental_sort, it has to
detect the sort groups, plus it needs to perform tuplesort_reset after
each group. However, I doubt these factors would have a substantial
impact on the performance of incremental sort. So maybe in the worst
skewed groups case, incremental sort may still perform similarly to
full sort.

Another less obvious reason is that in cases of skewed groups, we may
significantly underestimate the cost of incremental sort. This could
result in choosing a more expensive subpath under the sort. Such as
the example in [1], we end up with IndexScan->IncrementalSort rather
than SeqScan->FullSort, while the former plan is more expensive to
execute. However, this point does not affect this patch, because for
a mergejoin, we only consider outerrel's cheapest-total-cost when we
assume that an explicit sort atop is needed, i.e., we do not have a
chance to select which subpath to use in this case.

[1] /messages/by-id/CAMbWs49+CXsrgbq0LD1at-5jR=AHHN0YtDy9YvgXAsMfndZe-w@mail.gmail.com
You don't need any skew. Consider this perfectly uniform dataset:
create table t (a int, b int);
insert into t select 100000 * random(), 100 * random()
from generate_series(1,1000000) s(i);
create index on t (a);
vacuum analyze;
checkpoint;
explain analyze select * from t order by a, b;
QUERY PLAN
-----------------------------------------------------------------
Sort (cost=127757.34..130257.34 rows=1000000 width=8)
(actual time=186.288..275.813 rows=1000000 loops=1)
Sort Key: a, b
Sort Method: external merge Disk: 17704kB
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=8)
(actual time=0.005..35.989 rows=1000000 loops=1)
Planning Time: 0.075 ms
Execution Time: 301.143 ms
(6 rows)
set enable_incremental_sort = on;
explain analyze select * from t order by a, b;
QUERY PLAN
-----------------------------------------------------------------
Incremental Sort (cost=1.03..68908.13 rows=1000000 width=8)
(actual time=0.102..497.143 rows=1000000 loops=1)
Sort Key: a, b
Presorted Key: a
Full-sort Groups: 27039 Sort Method: quicksort
Average Memory: 25kB Peak Memory: 25kB
-> Index Scan using t_a_idx on t
(cost=0.42..37244.25 rows=1000000 width=8)
(actual time=0.050..376.403 rows=1000000 loops=1)
Planning Time: 0.057 ms
Execution Time: 519.301 ms
(7 rows)
Sure, the table is very small, but the example is not crazy. In fact,
this is the "nicest" example for estimation - it's perfectly random, no
correlations, no skew.
regards
--
Tomas Vondra
On Tue, 24 Sept 2024 at 02:01, Tomas Vondra <tomas@vondra.me> wrote:
On 9/23/24 05:03, Richard Guo wrote:
On Sun, Sep 22, 2024 at 1:38 PM David Rowley <dgrowleyml@gmail.com> wrote:
Just looking at the commit message:
The rationale is based on the assumption that incremental sort is
always faster than full sort when there are presorted keys, a premise
that has been applied in various parts of the code. This assumption
does not always hold, particularly in cases with a large skew in the
number of rows within the presorted groups.

My understanding is that the worst case for incremental sort is the
same as sort, i.e. only 1 presorted group, which is the same effort to
sort. Is there something I'm missing?

I was thinking that when there’s a large skew in the number of rows
per pre-sorted group, incremental sort might underperform full sort.
As mentioned in the comments for cost_incremental_sort, it has to
detect the sort groups, plus it needs to perform tuplesort_reset after
each group. However, I doubt these factors would have a substantial
impact on the performance of incremental sort. So maybe in the worst
skewed groups case, incremental sort may still perform similarly to
full sort.

Another less obvious reason is that in cases of skewed groups, we may
significantly underestimate the cost of incremental sort. This could
result in choosing a more expensive subpath under the sort. Such as
the example in [1], we end up with IndexScan->IncrementalSort rather
than SeqScan->FullSort, while the former plan is more expensive to
execute. However, this point does not affect this patch, because for
a mergejoin, we only consider outerrel's cheapest-total-cost when we
assume that an explicit sort atop is needed, i.e., we do not have a
chance to select which subpath to use in this case.

[1] /messages/by-id/CAMbWs49+CXsrgbq0LD1at-5jR=AHHN0YtDy9YvgXAsMfndZe-w@mail.gmail.com
You don't need any skew. Consider this perfectly uniform dataset:
Sort (cost=127757.34..130257.34 rows=1000000 width=8)
(actual time=186.288..275.813 rows=1000000 loops=1)
Sort Key: a, b
Sort Method: external merge Disk: 17704kB
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=8)
(actual time=0.005..35.989 rows=1000000 loops=1)
Planning Time: 0.075 ms
Execution Time: 301.143 ms

set enable_incremental_sort = on;
Incremental Sort (cost=1.03..68908.13 rows=1000000 width=8)
(actual time=0.102..497.143 rows=1000000 loops=1)
Sort Key: a, b
Presorted Key: a
Full-sort Groups: 27039 Sort Method: quicksort
Average Memory: 25kB Peak Memory: 25kB
-> Index Scan using t_a_idx on t
(cost=0.42..37244.25 rows=1000000 width=8)
(actual time=0.050..376.403 rows=1000000 loops=1)
Planning Time: 0.057 ms
Execution Time: 519.301 ms
Ok, let's first look at the total Seq Scan cost of the first EXPLAIN.
14425.00 units and 35.989 milliseconds to execute. That's about 400.81
units per millisecond. The Index Scan is only being charged 98.94
units per millisecond of execution. If the Index Scan was costed the
same per unit as the Seq Scan, the total Index Scan cost would be
150868 which is already more than the Seq Scan plan without even
adding the Incremental Sort costs on. To me, that seems to be an
inaccuracy either with the Seq Scan costings coming out too expensive
or Index Scan coming out too cheap.
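The per-unit arithmetic above can be recomputed directly from the figures quoted in the two EXPLAIN outputs (a quick sanity check, using the cost and timing numbers as given):

```python
# Seq Scan: 14425.00 cost units over 35.989 ms of actual execution
seq_units_per_ms = 14425.00 / 35.989       # ~400.8 units/ms

# Index Scan: 37244.25 cost units over 376.403 ms of actual execution
idx_units_per_ms = 37244.25 / 376.403      # ~98.9 units/ms

# If the Index Scan were charged at the Seq Scan's rate, its cost
# would be about 150868, already above the whole Sort plan's 127757.34
rescaled_idx_cost = 376.403 * seq_units_per_ms
```

So the roughly 4x gap in units-per-millisecond between the two scan types is what drives the observation, independent of the Incremental Sort costing itself.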
If you think that the Incremental Sort plan shouldn't be chosen
because the Index Scan costing came out too cheap (or the Seq Scan
costing was too expensive) then I disagree. Applying some penalty to
one node type because some other node type isn't costed accurately is
just not a practice we should be doing. Instead, we should be trying
our best to cost each node type as accurately as possible. If you
think there's some inaccuracy with Incremental Sort, then let's look
into that. If you want to re-add the penalty because Index Scan
costing isn't good, let's see if we can fix Index Scan costings.
David
On Mon, Sep 23, 2024 at 10:01 PM Tomas Vondra <tomas@vondra.me> wrote:
You don't need any skew. Consider this perfectly uniform dataset:
create table t (a int, b int);
insert into t select 100000 * random(), 100 * random()
from generate_series(1,1000000) s(i);
create index on t (a);
vacuum analyze;
checkpoint;

explain analyze select * from t order by a, b;
I think if we want to compare the performance of incremental sort vs.
full sort on this dataset, we need to ensure that other variables are
kept constant, ie we need to ensure that both methods use the same
subpath, whether it's Index Scan or Seq Scan.
This is especially true in the scenario addressed by this patch, as
for a mergejoin, we only consider outerrel's cheapest_total_path when
we assume that an explicit sort atop is needed. That is to say, the
subpath has already been chosen when it comes to determine whether to
use incremental sort or full sort.
According to this theory, here is what I got on this same dataset:
-- incremental sort
explain analyze select * from t order by a, b;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Incremental Sort (cost=1.02..68838.65 rows=1000000 width=8) (actual
time=1.292..1564.265 rows=1000000 loops=1)
Sort Key: a, b
Presorted Key: a
Full-sort Groups: 27022 Sort Method: quicksort Average Memory:
25kB Peak Memory: 25kB
-> Index Scan using t_a_idx on t (cost=0.42..37244.20
rows=1000000 width=8) (actual time=0.408..1018.785 rows=1000000
loops=1)
Planning Time: 0.998 ms
Execution Time: 1602.861 ms
(7 rows)
-- full sort
set enable_incremental_sort to off;
set enable_seqscan to off;
explain analyze select * from t order by a, b;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=150576.54..153076.54 rows=1000000 width=8) (actual time=1760.090..1927.598 rows=1000000 loops=1)
   Sort Key: a, b
   Sort Method: external merge  Disk: 17704kB
   ->  Index Scan using t_a_idx on t  (cost=0.42..37244.20 rows=1000000 width=8) (actual time=0.531..1010.931 rows=1000000 loops=1)
 Planning Time: 0.980 ms
 Execution Time: 1970.287 ms
(6 rows)
So incremental sort outperforms full sort on this dataset.
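As a side note, the reason incremental sort can win here is purely algorithmic: the index already delivers rows ordered by a, so only the b values within each a-group remain to be sorted. The toy model below (plain Python, not PostgreSQL's tuplesort, with the table scaled down for speed) shows that both strategies produce identical output while the incremental one only ever sorts tiny runs:

```python
import random
from itertools import groupby
from operator import itemgetter

random.seed(1)
# Scaled-down stand-in for the 1M-row uniform dataset above.
rows = [(random.randrange(10_000), random.randrange(100))
        for _ in range(100_000)]

# Full sort: one big sort on (a, b).
full_sorted = sorted(rows)

# Incremental sort: input arrives ordered by a (as from the index scan);
# finish each a-group by sorting on b alone.
by_a = sorted(rows, key=itemgetter(0))
incremental, max_group = [], 0
for _, grp in groupby(by_a, key=itemgetter(0)):
    grp = sorted(grp, key=itemgetter(1))
    max_group = max(max_group, len(grp))
    incremental.extend(grp)

assert incremental == full_sorted   # same order, far smaller sort runs
print("largest per-group sort:", max_group, "rows out of", len(rows))
```

The largest run the incremental variant ever materializes is a few dozen rows, which is also why its EXPLAIN above reports a 25kB quicksort rather than an external merge.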
Thanks
Richard
On 9/24/24 00:21, David Rowley wrote:
On Tue, 24 Sept 2024 at 02:01, Tomas Vondra <tomas@vondra.me> wrote:
On 9/23/24 05:03, Richard Guo wrote:
On Sun, Sep 22, 2024 at 1:38 PM David Rowley <dgrowleyml@gmail.com> wrote:
Just looking at the commit message:
The rationale is based on the assumption that incremental sort is
always faster than full sort when there are presorted keys, a premise
that has been applied in various parts of the code. This assumption
does not always hold, particularly in cases with a large skew in the
number of rows within the presorted groups.

My understanding is that the worst case for incremental sort is the
same as sort, i.e. only 1 presorted group, which is the same effort to
sort. Is there something I'm missing?

I was thinking that when there's a large skew in the number of rows
per pre-sorted group, incremental sort might underperform full sort.
As mentioned in the comments for cost_incremental_sort, it has to
detect the sort groups, plus it needs to perform tuplesort_reset after
each group. However, I doubt these factors would have a substantial
impact on the performance of incremental sort. So maybe in the worst
skewed groups case, incremental sort may still perform similarly to
full sort.

Another less obvious reason is that in cases of skewed groups, we may
significantly underestimate the cost of incremental sort. This could
result in choosing a more expensive subpath under the sort. Such as
the example in [1], we end up with IndexScan->IncrementalSort rather
than SeqScan->FullSort, while the former plan is more expensive to
execute. However, this point does not affect this patch, because for
a mergejoin, we only consider outerrel's cheapest-total-cost when we
assume that an explicit sort atop is needed, i.e., we do not have a
chance to select which subpath to use in this case.

[1] /messages/by-id/CAMbWs49+CXsrgbq0LD1at-5jR=AHHN0YtDy9YvgXAsMfndZe-w@mail.gmail.com
You don't need any skew. Consider this perfectly uniform dataset:
Sort (cost=127757.34..130257.34 rows=1000000 width=8)
(actual time=186.288..275.813 rows=1000000 loops=1)
Sort Key: a, b
Sort Method: external merge Disk: 17704kB
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=8)
(actual time=0.005..35.989 rows=1000000 loops=1)
Planning Time: 0.075 ms
Execution Time: 301.143 ms

set enable_incremental_sort = on;
Incremental Sort (cost=1.03..68908.13 rows=1000000 width=8)
(actual time=0.102..497.143 rows=1000000 loops=1)
Sort Key: a, b
Presorted Key: a
Full-sort Groups: 27039 Sort Method: quicksort
Average Memory: 25kB Peak Memory: 25kB
-> Index Scan using t_a_idx on t
(cost=0.42..37244.25 rows=1000000 width=8)
(actual time=0.050..376.403 rows=1000000 loops=1)
Planning Time: 0.057 ms
Execution Time: 519.301 ms

Ok, let's first look at the total Seq Scan cost of the first EXPLAIN.
14425.00 units and 35.989 milliseconds to execute. That's about 400.81
units per millisecond. The Index Scan is only being charged 98.94
units per millisecond of execution. If the Index Scan was costed the
same per unit as the Seq Scan, the total Index Scan cost would be
150868 which is already more than the Seq Scan plan without even
adding the Incremental Sort costs on. To me, that seems to be an
inaccuracy either with the Seq Scan costings coming out too expensive
or Index Scan coming out too cheap.

If you think that the Incremental Sort plan shouldn't be chosen
because the Index Scan costing came out too cheap (or the Seq Scan
costing was too expensive) then I disagree. Applying some penalty to
one node type because some other node type isn't costed accurately is
just not a practice we should be doing. Instead, we should be trying
our best to cost each node type as accurately as possible. If you
think there's some inaccuracy with Incremental Sort, then let's look
into that. If you want to re-add the penalty because Index Scan
costing isn't good, let's see if we can fix Index Scan costings.
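The arithmetic above can be checked directly from the two quoted EXPLAIN outputs (a quick sketch; plain division on the quoted figures, nothing planner-internal):

```python
# Seq Scan: 14425.00 cost units over 35.989 ms of actual execution time.
seq_units_per_ms = 14425.00 / 35.989            # ~400.81 units/ms
# Index Scan: 37244.25 cost units over 376.403 ms.
idx_units_per_ms = 37244.25 / 376.403           # ~98.94 units/ms
# Charging the Index Scan at the Seq Scan's units-per-millisecond rate:
idx_cost_at_seq_rate = 376.403 * seq_units_per_ms   # ~150868 units

print(f"{seq_units_per_ms:.2f} {idx_units_per_ms:.2f} "
      f"{idx_cost_at_seq_rate:.0f}")
```

That hypothetical 150868-unit Index Scan already exceeds the whole Sort + Seq Scan plan's 130257.34 total before any Incremental Sort cost is added, which is the inconsistency David is pointing at.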
You're right, this wasn't a good example. I tried to come up with
something quickly, and I didn't realize the extra time comes from the
other node in the plan, not the sort :-(
I vaguely remember there were a couple reports about slow queries
involving incremental sort, but I didn't find any that would show
incremental sort itself being slower. So maybe I'm wrong ...
IIRC the concerns were more about planning - what happens when the
multi-column ndistinct estimates (which are quite shaky) are wrong, etc.
Or how it interacts with LIMIT with hidden correlations, etc.
Sure, those are not problems of incremental sort, it just makes it
easier to hit some of these issues. Of course, that doesn't mean the
1.5x penalty was a particularly good solution.
regards
--
Tomas Vondra
On Sun, Sep 22, 2024 at 1:38 PM David Rowley <dgrowleyml@gmail.com> wrote:
Just looking at the commit message:
The rationale is based on the assumption that incremental sort is
always faster than full sort when there are presorted keys, a premise
that has been applied in various parts of the code. This assumption
does not always hold, particularly in cases with a large skew in the
number of rows within the presorted groups.

My understanding is that the worst case for incremental sort is the
same as sort, i.e. only 1 presorted group, which is the same effort to
sort. Is there something I'm missing?
I've pushed this patch after tweaking this part in the commit message.
Thank you both for your reviews.
Thanks
Richard
On 10/10/24 09:18, Richard Guo wrote:
On Sun, Sep 22, 2024 at 1:38 PM David Rowley <dgrowleyml@gmail.com> wrote:
I've pushed this patch after tweaking this part in the commit message.
Thank you both for your reviews.
My apologies for the late review, but IMO there are some minor weaknesses.
While working on improvements to the cost_sort model [1], adding the
number of columns to be sorted and the number of comparisons to be
made (derived from the stadistinct of the first column) into the
model, I found that one test in aggregates.sql doesn't like
IncrementalSort:
'Utilise the ordering of merge join to avoid a Sort operation'
After digging into it a bit, I realised that although the idea that
IncrementalSort is better than plain Sort is generally correct, the
assumption that it is better all the time seems wrong.
For example, an IncrementalSort that follows an index can end up with
a sort order whose leading columns have few distinct values, causing
more comparisons (remember, we still sort the tuples inside each
group); or it can use a multi-column index when only the first column
is needed, adding an extra Sort on many other columns fetched from
the heap, etc.
Of course, the cases I describe are highly skewed, and Sort + SeqScan
will basically resolve the issue. But I think the planner should
consider both possible sort strategies rather than heuristically
giving a chance only to IncrementalSort. At the very least, an
IndexScan can beat a SeqScan because of the index clause selectivity.
[1]: /messages/by-id/8742aaa8-9519-4a1f-91bd-364aec65f5cf@gmail.com
--
regards, Andrei Lepikhov