An improvement on parallel DISTINCT
While reviewing Heikki's Omit-junk-columns patchset [1], I noticed that
root->upper_targets[] is used to set target for partial_distinct_rel,
which is not great because root->upper_targets[] is not supposed to be
used by the core code. The comment in grouping_planner() says:
* Save the various upper-rel PathTargets we just computed into
* root->upper_targets[]. The core code doesn't use this, but it
* provides a convenient place for extensions to get at the info.
Then while fixing this issue, I noticed an opportunity for improvement
in how we generate Gather/GatherMerge paths for the two-phase DISTINCT.
The Gather/GatherMerge paths are added by generate_gather_paths(), which
does not consider ordering that might be useful above the GatherMerge
node. This can be improved by using generate_useful_gather_paths()
instead. With this change I can see query plan improvement from the
regression test "select_distinct.sql". For instance,
-- Test parallel DISTINCT
SET parallel_tuple_cost=0;
SET parallel_setup_cost=0;
SET min_parallel_table_scan_size=0;
SET max_parallel_workers_per_gather=2;
-- Ensure we get a parallel plan
EXPLAIN (costs off)
SELECT DISTINCT four FROM tenk1;
-- on master
EXPLAIN (costs off)
SELECT DISTINCT four FROM tenk1;
QUERY PLAN
----------------------------------------------------
Unique
-> Sort
Sort Key: four
-> Gather
Workers Planned: 2
-> HashAggregate
Group Key: four
-> Parallel Seq Scan on tenk1
(8 rows)
-- on patched
EXPLAIN (costs off)
SELECT DISTINCT four FROM tenk1;
QUERY PLAN
----------------------------------------------------
Unique
-> Gather Merge
Workers Planned: 2
-> Sort
Sort Key: four
-> HashAggregate
Group Key: four
-> Parallel Seq Scan on tenk1
(8 rows)
I believe the second plan is better.
Attached is a patch that includes this change and also eliminates the
usage of root->upper_targets[] in the core code. It also makes some
tweaks to the comments.
Any thoughts?
[1]: /messages/by-id/2ca5865b-4693-40e5-8f78-f3b45d5378fb@iki.fi
Thanks
Richard
Attachments:
v1-0001-Improve-parallel-DISTINCT.patch (application/octet-stream)
From 8f7fcf0ffc12407f48c12761a3b4686239d5716d Mon Sep 17 00:00:00 2001
From: Richard Guo <guofenglinux@gmail.com>
Date: Tue, 26 Dec 2023 17:30:36 +0800
Subject: [PATCH v1] Improve parallel DISTINCT
Commit 22c4e88ebf introduced parallel DISTINCT, which is implemented by
adding unique/aggregate paths atop the input rel's partial paths, then
adding Gather/GatherMerge paths on top of those, and finally adding a
unique/aggregate path to produce the final result.
The Gather/GatherMerge paths are added by generate_gather_paths(), which
does not consider ordering that might be useful above the GatherMerge
node. This can be improved by using generate_useful_gather_paths()
instead. As the plan change in the regression tests shows, this
improves the query plans.
In passing, this patch also avoids the use of root->upper_targets to set
target for partial_distinct_rel, because root->upper_targets is not
supposed to be used by the core code.
---
src/backend/optimizer/path/allpaths.c | 8 ++++----
src/backend/optimizer/plan/planner.c | 13 ++++++++++---
src/test/regress/expected/select_distinct.out | 8 ++++----
3 files changed, 18 insertions(+), 11 deletions(-)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 67921a0826..b84bf7ee8b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3053,10 +3053,10 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
*
* If we're generating paths for a scan or join relation, override_rows will
* be false, and we'll just use the relation's size estimate. When we're
- * being called for a partially-grouped path, though, we need to override
- * the rowcount estimate. (It's not clear that the particular value we're
- * using here is actually best, but the underlying rel has no estimate so
- * we must do something.)
+ * being called for a partially-grouped or partially-distinct path, though, we
+ * need to override the rowcount estimate. (It's not clear that the
+ * particular value we're using here is actually best, but the underlying rel
+ * has no estimate so we must do something.)
*/
void
generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6f45efde21..c4698eb951 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1644,8 +1644,8 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
*/
root->upper_targets[UPPERREL_FINAL] = final_target;
root->upper_targets[UPPERREL_ORDERED] = final_target;
- root->upper_targets[UPPERREL_PARTIAL_DISTINCT] = sort_input_target;
root->upper_targets[UPPERREL_DISTINCT] = sort_input_target;
+ root->upper_targets[UPPERREL_PARTIAL_DISTINCT] = sort_input_target;
root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;
@@ -4751,7 +4751,6 @@ create_partial_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
partial_distinct_rel = fetch_upper_rel(root, UPPERREL_PARTIAL_DISTINCT,
NULL);
- partial_distinct_rel->reltarget = root->upper_targets[UPPERREL_PARTIAL_DISTINCT];
partial_distinct_rel->consider_parallel = input_rel->consider_parallel;
/*
@@ -4872,7 +4871,15 @@ create_partial_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
if (partial_distinct_rel->partial_pathlist != NIL)
{
- generate_gather_paths(root, partial_distinct_rel, true);
+ /*
+ * Set target for partial_distinct_rel as generate_useful_gather_paths
+ * requires that the input rel has a valid reltarget.
+ */
+ partial_distinct_rel->reltarget = cheapest_partial_path->pathtarget;
+
+ /* Consider generating Gather or Gather Merge paths */
+ generate_useful_gather_paths(root, partial_distinct_rel, true);
+
set_cheapest(partial_distinct_rel);
/*
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index 9d44ea8056..f32d39f76c 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -235,10 +235,10 @@ SELECT DISTINCT four FROM tenk1;
QUERY PLAN
----------------------------------------------------
Unique
- -> Sort
- Sort Key: four
- -> Gather
- Workers Planned: 2
+ -> Gather Merge
+ Workers Planned: 2
+ -> Sort
+ Sort Key: four
-> HashAggregate
Group Key: four
-> Parallel Seq Scan on tenk1
--
2.31.0
On Wed, 27 Dec 2023 at 00:23, Richard Guo <guofenglinux@gmail.com> wrote:
-- on master
EXPLAIN (costs off)
SELECT DISTINCT four FROM tenk1;
QUERY PLAN
----------------------------------------------------
Unique
-> Sort
Sort Key: four
-> Gather
Workers Planned: 2
-> HashAggregate
Group Key: four
-> Parallel Seq Scan on tenk1
(8 rows)
-- on patched
EXPLAIN (costs off)
SELECT DISTINCT four FROM tenk1;
QUERY PLAN
----------------------------------------------------
Unique
-> Gather Merge
Workers Planned: 2
-> Sort
Sort Key: four
-> HashAggregate
Group Key: four
-> Parallel Seq Scan on tenk1
(8 rows)
I believe the second plan is better.
I wonder if this change is worthwhile. The sort is only required at
all because the planner opted to use HashAggregate in phase 1, which
outputs its rows unordered. If phase 1 were done by Group Aggregate,
then no sorting would be needed. The only reason the planner didn't
Hash Aggregate for phase 2 is the order in which we generate the
distinct paths, plus STD_FUZZ_FACTOR.
Look at the costs of the above plan:
Unique (cost=397.24..397.28 rows=4 width=4)
If I set enable_sort=0, then I get a cheaper plan:
HashAggregate (cost=397.14..397.18 rows=4 width=4)
If we add more rows then the cost of sorting will grow faster than the
cost of hash aggregate due to the O(N log2 N) part of our sort
costing.
If I drop the index on tenk1(hundred), I only need to go to the
"hundred" column to have it switch to Hash Aggregate on the 2nd phase.
This is because the larger number of distinct groups puts the costs of
the Group Aggregate and Hash Aggregate paths more than STD_FUZZ_FACTOR
apart.
Adjusting the STD_FUZZ_FACTOR with the following means Hash Aggregate
is used for both phases.
-#define STD_FUZZ_FACTOR 1.01
+#define STD_FUZZ_FACTOR 1.0000001
In light of this, do you still think it's worthwhile making this change?
For me, I think all it's going to result in is extra planner work
without any performance gains.
Attached is a patch that includes this change and also eliminates the
usage of root->upper_targets[] in the core code. It also makes some
tweaks for the comment.
We should fix that. We can consider it independently from the other
change you're proposing.
David
On Fri, Feb 2, 2024 at 11:26 AM David Rowley <dgrowleyml@gmail.com> wrote:
In light of this, do you still think it's worthwhile making this change?
For me, I think all it's going to result in is extra planner work
without any performance gains.
Hmm, with the query below, I can see that the new plan is cheaper than
the old plan, and the cost difference exceeds STD_FUZZ_FACTOR.
create table t (a int, b int);
insert into t select i%100000, i from generate_series(1,10000000)i;
analyze t;
-- on master
explain (costs on) select distinct a from t order by a limit 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Limit (cost=120188.50..120188.51 rows=1 width=4)
-> Sort (cost=120188.50..120436.95 rows=99379 width=4)
Sort Key: a
-> HashAggregate (cost=118697.82..119691.61 rows=99379 width=4)
Group Key: a
-> Gather (cost=97331.33..118200.92 rows=198758 width=4)
Workers Planned: 2
-> HashAggregate (cost=96331.33..97325.12 rows=99379 width=4)
Group Key: a
-> Parallel Seq Scan on t (cost=0.00..85914.67 rows=4166667 width=4)
(10 rows)
-- on patched
explain (costs on) select distinct a from t order by a limit 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Limit (cost=106573.93..106574.17 rows=1 width=4)
-> Unique (cost=106573.93..130260.88 rows=99379 width=4)
-> Gather Merge (cost=106573.93..129763.98 rows=198758 width=4)
Workers Planned: 2
-> Sort (cost=105573.91..105822.35 rows=99379 width=4)
Sort Key: a
-> HashAggregate (cost=96331.33..97325.12 rows=99379 width=4)
Group Key: a
-> Parallel Seq Scan on t (cost=0.00..85914.67 rows=4166667 width=4)
(9 rows)
It seems that including a LIMIT clause can potentially favor the new
plan.
Thanks
Richard
On Fri, 2 Feb 2024 at 20:47, Richard Guo <guofenglinux@gmail.com> wrote:
On Fri, Feb 2, 2024 at 11:26 AM David Rowley <dgrowleyml@gmail.com> wrote:
In light of this, do you still think it's worthwhile making this change?
For me, I think all it's going to result in is extra planner work
without any performance gains.
Hmm, with the query below, I can see that the new plan is cheaper than
the old plan, and the cost difference exceeds STD_FUZZ_FACTOR.
create table t (a int, b int);
insert into t select i%100000, i from generate_series(1,10000000)i;
analyze t;
explain (costs on) select distinct a from t order by a limit 1;
OK, a LIMIT clause... I didn't think of that. Given the test results
below, I'm pretty convinced we should make the change.
Performance testing on an AMD 3990x with work_mem=4MB and hash_mem_multiplier=2.
$ cat bench.sql
select distinct a from t order by a limit 1;
$ pgbench -n -T 60 -f bench.sql postgres
-- Master
max_parallel_workers_per_gather=2;
latency average = 470.310 ms
latency average = 468.673 ms
latency average = 469.463 ms
max_parallel_workers_per_gather=4;
latency average = 346.012 ms
latency average = 346.662 ms
latency average = 347.591 ms
max_parallel_workers_per_gather=8; + alter table t set (parallel_workers=8);
latency average = 300.298 ms
latency average = 300.029 ms
latency average = 300.314 ms
-- Patched
max_parallel_workers_per_gather=2;
latency average = 424.176 ms
latency average = 431.870 ms
latency average = 431.870 ms (9.36% faster than master)
max_parallel_workers_per_gather=4;
latency average = 279.837 ms
latency average = 280.893 ms
latency average = 281.518 ms (23.51% faster than master)
max_parallel_workers_per_gather=8; + alter table t set (parallel_workers=8);
latency average = 178.585 ms
latency average = 178.780 ms
latency average = 179.768 ms (67.68% faster than master)
So the gains increase with more parallel workers due to pushing more
work to the workers. Amdahl's law approves of this.
I'll push the patch shortly.
David
On Fri, 2 Feb 2024 at 23:39, David Rowley <dgrowleyml@gmail.com> wrote:
I'll push the patch shortly.
I've pushed the partial path sort part.
Now for the other stuff you had. I didn't really like this part:
+ /*
+ * Set target for partial_distinct_rel as generate_useful_gather_paths
+ * requires that the input rel has a valid reltarget.
+ */
+ partial_distinct_rel->reltarget = cheapest_partial_path->pathtarget;
I think we should just make it work the same way as
create_grouping_paths(), where grouping_target is passed as a
parameter.
I've done it that way in the attached.
David
Attachments:
fixup_pathtarget_for_partial_distinct_rel.patch (text/plain; charset=US-ASCII)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 84c4de488a..d404fbf262 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3053,10 +3053,10 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
*
* If we're generating paths for a scan or join relation, override_rows will
* be false, and we'll just use the relation's size estimate. When we're
- * being called for a partially-grouped path, though, we need to override
- * the rowcount estimate. (It's not clear that the particular value we're
- * using here is actually best, but the underlying rel has no estimate so
- * we must do something.)
+ * being called for a partially-grouped or partially-distinct path, though, we
+ * need to override the rowcount estimate. (It's not clear that the
+ * particular value we're using here is actually best, but the underlying rel
+ * has no estimate so we must do something.)
*/
void
generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index acc324122f..be4e182869 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -190,10 +190,12 @@ static void create_one_window_path(PlannerInfo *root,
WindowFuncLists *wflists,
List *activeWindows);
static RelOptInfo *create_distinct_paths(PlannerInfo *root,
- RelOptInfo *input_rel);
+ RelOptInfo *input_rel,
+ PathTarget *target);
static void create_partial_distinct_paths(PlannerInfo *root,
RelOptInfo *input_rel,
- RelOptInfo *final_distinct_rel);
+ RelOptInfo *final_distinct_rel,
+ PathTarget *target);
static RelOptInfo *create_final_distinct_paths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *distinct_rel);
@@ -1644,8 +1646,8 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
*/
root->upper_targets[UPPERREL_FINAL] = final_target;
root->upper_targets[UPPERREL_ORDERED] = final_target;
- root->upper_targets[UPPERREL_PARTIAL_DISTINCT] = sort_input_target;
root->upper_targets[UPPERREL_DISTINCT] = sort_input_target;
+ root->upper_targets[UPPERREL_PARTIAL_DISTINCT] = sort_input_target;
root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;
@@ -1695,7 +1697,8 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
if (parse->distinctClause)
{
current_rel = create_distinct_paths(root,
- current_rel);
+ current_rel,
+ sort_input_target);
}
} /* end of if (setOperations) */
@@ -4568,12 +4571,14 @@ create_one_window_path(PlannerInfo *root,
* Build a new upperrel containing Paths for SELECT DISTINCT evaluation.
*
* input_rel: contains the source-data Paths
+ * target: the pathtarget for the result Paths to compute
*
* Note: input paths should already compute the desired pathtarget, since
* Sort/Unique won't project anything.
*/
static RelOptInfo *
-create_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel)
+create_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
+ PathTarget *target)
{
RelOptInfo *distinct_rel;
@@ -4601,7 +4606,7 @@ create_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel)
create_final_distinct_paths(root, input_rel, distinct_rel);
/* now build distinct paths based on input_rel's partial_pathlist */
- create_partial_distinct_paths(root, input_rel, distinct_rel);
+ create_partial_distinct_paths(root, input_rel, distinct_rel, target);
/* Give a helpful error if we failed to create any paths */
if (distinct_rel->pathlist == NIL)
@@ -4643,7 +4648,8 @@ create_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel)
*/
static void
create_partial_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
- RelOptInfo *final_distinct_rel)
+ RelOptInfo *final_distinct_rel,
+ PathTarget *target)
{
RelOptInfo *partial_distinct_rel;
Query *parse;
@@ -4664,7 +4670,7 @@ create_partial_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
partial_distinct_rel = fetch_upper_rel(root, UPPERREL_PARTIAL_DISTINCT,
NULL);
- partial_distinct_rel->reltarget = root->upper_targets[UPPERREL_PARTIAL_DISTINCT];
+ partial_distinct_rel->reltarget = target;
partial_distinct_rel->consider_parallel = input_rel->consider_parallel;
/*
On Fri, Feb 2, 2024 at 6:39 PM David Rowley <dgrowleyml@gmail.com> wrote:
So the gains increase with more parallel workers due to pushing more
work to the worker. Amdahl's law approves of this.
I'll push the patch shortly.
Thanks for the detailed testing and pushing the patch!
Thanks
Richard
On Fri, Feb 2, 2024 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
Now for the other stuff you had. I didn't really like this part:
+ /*
+  * Set target for partial_distinct_rel as generate_useful_gather_paths
+  * requires that the input rel has a valid reltarget.
+  */
+ partial_distinct_rel->reltarget = cheapest_partial_path->pathtarget;
I think we should just make it work the same way as
create_grouping_paths(), where grouping_target is passed as a
parameter.
I've done it that way in the attached.
The change looks good to me.
BTW, I kind of doubt that 'create_partial_distinct_paths' is a proper
function name given what it actually does. It not only generates
distinct paths based on input_rel's partial paths, but also adds
Gather/GatherMerge on top of these partially distinct paths, followed by
a final unique/aggregate path to ensure uniqueness of the final result.
So maybe 'create_parallel_distinct_paths' or something like that would
be better?
I asked because I noticed that in create_partial_grouping_paths(), we
only generate partially aggregated paths, and any subsequent
FinalizeAggregate step is called in the caller.
Thanks
Richard
On Mon, 5 Feb 2024 at 14:42, Richard Guo <guofenglinux@gmail.com> wrote:
On Fri, Feb 2, 2024 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
I think we should just make it work the same way as
create_grouping_paths(), where grouping_target is passed as a
parameter.
I've done it that way in the attached.
The change looks good to me.
I pushed the PathTarget changes.
BTW, I kind of doubt that 'create_partial_distinct_paths' is a proper
function name given what it actually does.
I didn't make any changes here. I don't think messing with this is
worth the trouble.
David