Asymmetric partition-wise JOIN
Hello,
PostgreSQL optimizer right now considers join pairs on only
non-partition - non-partition or
partition-leaf - partition-leaf relations. On the other hands, it is
harmless and makes sense to
consider a join pair on non-partition - partition-leaf.
See the example below. ptable is partitioned by hash, and contains 10M
rows. ftable is not
partitioned and contains 50 rows. Most of ptable::fkey shall not have
matched rows in this
join.
create table ptable (fkey int, dist text) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3,
remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3,
remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3,
remainder 2);
insert into ptable (select x % 10000, md5(x::text) from
generate_series(1,10000000) x);
create table ftable (pkey int primary key, memo text);
insert into ftable (select x, 'ftable__#' || x::text from
generate_series(1,50) x);
vacuum analyze;
postgres=# explain analyze select count(*) from ptable p, ftable f
where p.fkey = f.pkey;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=266393.38..266393.39 rows=1 width=8) (actual
time=2333.193..2333.194 rows=1 loops=1)
-> Hash Join (cost=2.12..260143.38 rows=2500000 width=0) (actual
time=0.056..2330.079 rows=50000 loops=1)
Hash Cond: (p.fkey = f.pkey)
-> Append (cost=0.00..233335.00 rows=10000000 width=4)
(actual time=0.012..1617.268 rows=10000000 loops=1)
-> Seq Scan on ptable_p0 p (cost=0.00..61101.96
rows=3332796 width=4) (actual time=0.011..351.137 rows=3332796
loops=1)
-> Seq Scan on ptable_p1 p_1 (cost=0.00..61106.25
rows=3333025 width=4) (actual time=0.005..272.925 rows=3333025
loops=1)
-> Seq Scan on ptable_p2 p_2 (cost=0.00..61126.79
rows=3334179 width=4) (actual time=0.006..416.141 rows=3334179
loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4) (actual
time=0.033..0.034 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f (cost=0.00..1.50 rows=50
width=4) (actual time=0.004..0.017 rows=50 loops=1)
Planning Time: 0.286 ms
Execution Time: 2333.264 ms
(12 rows)
We can manually rewrite this query as follows:
postgres=# explain analyze select count(*) from (
select * from ptable_p0 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p1 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
Because Append does not process tuples that shall have no matched
tuples in ftable,
this query has cheaper cost and short query execution time.
(2333ms --> 1396ms)
postgres=# explain analyze select count(*) from (
select * from ptable_p0 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p1 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=210478.25..210478.26 rows=1 width=8) (actual
time=1396.024..1396.024 rows=1 loops=1)
-> Append (cost=2.12..210353.14 rows=50042 width=0) (actual
time=0.058..1393.008 rows=50000 loops=1)
-> Subquery Scan on "*SELECT* 1" (cost=2.12..70023.66
rows=16726 width=0) (actual time=0.057..573.197 rows=16789 loops=1)
-> Hash Join (cost=2.12..69856.40 rows=16726
width=72) (actual time=0.056..571.718 rows=16789 loops=1)
Hash Cond: (p.fkey = f.pkey)
-> Seq Scan on ptable_p0 p (cost=0.00..61101.96
rows=3332796 width=4) (actual time=0.009..255.791 rows=3332796
loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.034..0.035 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f (cost=0.00..1.50
rows=50 width=4) (actual time=0.004..0.019 rows=50 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=2.12..70027.43
rows=16617 width=0) (actual time=0.036..409.712 rows=16578 loops=1)
-> Hash Join (cost=2.12..69861.26 rows=16617
width=72) (actual time=0.036..408.626 rows=16578 loops=1)
Hash Cond: (p_1.fkey = f_1.pkey)
-> Seq Scan on ptable_p1 p_1
(cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..181.422
rows=3333025 loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.020..0.020 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f_1
(cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50
loops=1)
-> Subquery Scan on "*SELECT* 3" (cost=2.12..70051.84
rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1)
-> Hash Join (cost=2.12..69884.85 rows=16699
width=72) (actual time=0.025..406.048 rows=16633 loops=1)
Hash Cond: (p_2.fkey = f_2.pkey)
-> Seq Scan on ptable_p2 p_2
(cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015
rows=3334179 loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.014..0.014 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f_2
(cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50
loops=1)
Planning Time: 0.614 ms
Execution Time: 1396.131 ms
(25 rows)
How about your opinions for this kind of asymmetric partition-wise
JOIN support by the optimizer?
I think we can harmlessly push-down inner-join and left-join if
partition-leaf is left side.
Probably, we need to implement two key functionalities.
1. Construction of RelOpInfo for join on non-partition table and
partition-leafs for each pairs.
Instead of JoinPaths, this logic adds AppendPath that takes
asymmetric partition-wise join
paths as sub-paths. Other optimization logic is equivalent as we
are currently doing.
2. Allow to share the hash-table built from table scan distributed to
individual partition leafs.
In the above example, SeqScan on ftable and relevant Hash path
will make identical hash-
table for the upcoming hash-join. If sibling paths have equivalent
results, it is reasonable to
reuse it.
Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>
Hello,
Even though nobody has respond the thread, I tried to make a prototype of
the asymmetric partition-wise join support.
This feature tries to join non-partitioned and partitioned relation
before append.
See the example below:
create table ptable (dist int, a int, b int) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3,
remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3,
remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3,
remainder 2);
create table t1 (aid int, label text);
create table t2 (bid int, label text);
insert into ptable (select x, (1000*random())::int,
(1000*random())::int from generate_series(1,1000000) x);
insert into t1 (select x, md5(x::text) from generate_series(1,50) x);
insert into t2 (select x, md5(x::text) from generate_series(1,50) x);
vacuum analyze ptable;
vacuum analyze t1;
vacuum analyze t2;
ptable.a has values between 0 and 1000, and t1.aid has values between 1 and 50.
Therefore, tables join on ptable and t1 by a=aid can reduce almost 95% rows.
On the other hands, t1 is not partitioned and join-keys are not partition keys.
So, Append must process million rows first, then HashJoin processes
the rows read
from the partitioned table, and 95% of them are eventually dropped.
On the other words, 95% of jobs by Append are waste of time and CPU cycles.
postgres=# explain select * from ptable, t1 where a = aid;
QUERY PLAN
------------------------------------------------------------------------------
Hash Join (cost=2.12..24658.62 rows=49950 width=49)
Hash Cond: (ptable_p0.a = t1.aid)
-> Append (cost=0.00..20407.00 rows=1000000 width=12)
-> Seq Scan on ptable_p0 (cost=0.00..5134.63 rows=333263 width=12)
-> Seq Scan on ptable_p1 (cost=0.00..5137.97 rows=333497 width=12)
-> Seq Scan on ptable_p2 (cost=0.00..5134.40 rows=333240 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
(8 rows)
The asymmetric partitionwise join allows to join non-partitioned tables and
partitioned tables prior to Append.
postgres=# set enable_partitionwise_join = on;
SET
postgres=# explain select * from ptable, t1 where a = aid;
QUERY PLAN
------------------------------------------------------------------------------
Append (cost=2.12..19912.62 rows=49950 width=49)
-> Hash Join (cost=2.12..6552.96 rows=16647 width=49)
Hash Cond: (ptable_p0.a = t1.aid)
-> Seq Scan on ptable_p0 (cost=0.00..5134.63 rows=333263 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
-> Hash Join (cost=2.12..6557.29 rows=16658 width=49)
Hash Cond: (ptable_p1.a = t1.aid)
-> Seq Scan on ptable_p1 (cost=0.00..5137.97 rows=333497 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
-> Hash Join (cost=2.12..6552.62 rows=16645 width=49)
Hash Cond: (ptable_p2.a = t1.aid)
-> Seq Scan on ptable_p2 (cost=0.00..5134.40 rows=333240 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
(16 rows)
We can consider the table join ptable X t1 above is equivalent to:
(ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result, however, rows are already reduced by HashJoin
in the individual leaf of Append, so CPU-cycles consumed by Append node can
be cheaper.
On the other hands, it has a downside because t1 must be read 3 times and
hash table also must be built 3 times. It increases the expected cost,
so planner
may not choose the asymmetric partition-wise join plan.
One idea I have is, sibling HashJoin shares a hash table that was built once
by any of the sibling Hash plan. Right now, it is not implemented yet.
How about your thought for this feature?
Best regards,
2019年8月12日(月) 15:03 Kohei KaiGai <kaigai@heterodb.com>:
Hello,
PostgreSQL optimizer right now considers join pairs on only
non-partition - non-partition or
partition-leaf - partition-leaf relations. On the other hands, it is
harmless and makes sense to
consider a join pair on non-partition - partition-leaf.See the example below. ptable is partitioned by hash, and contains 10M
rows. ftable is not
partitioned and contains 50 rows. Most of ptable::fkey shall not have
matched rows in this
join.create table ptable (fkey int, dist text) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3,
remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3,
remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3,
remainder 2);
insert into ptable (select x % 10000, md5(x::text) from
generate_series(1,10000000) x);create table ftable (pkey int primary key, memo text);
insert into ftable (select x, 'ftable__#' || x::text from
generate_series(1,50) x);
vacuum analyze;postgres=# explain analyze select count(*) from ptable p, ftable f
where p.fkey = f.pkey;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=266393.38..266393.39 rows=1 width=8) (actual
time=2333.193..2333.194 rows=1 loops=1)
-> Hash Join (cost=2.12..260143.38 rows=2500000 width=0) (actual
time=0.056..2330.079 rows=50000 loops=1)
Hash Cond: (p.fkey = f.pkey)
-> Append (cost=0.00..233335.00 rows=10000000 width=4)
(actual time=0.012..1617.268 rows=10000000 loops=1)
-> Seq Scan on ptable_p0 p (cost=0.00..61101.96
rows=3332796 width=4) (actual time=0.011..351.137 rows=3332796
loops=1)
-> Seq Scan on ptable_p1 p_1 (cost=0.00..61106.25
rows=3333025 width=4) (actual time=0.005..272.925 rows=3333025
loops=1)
-> Seq Scan on ptable_p2 p_2 (cost=0.00..61126.79
rows=3334179 width=4) (actual time=0.006..416.141 rows=3334179
loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4) (actual
time=0.033..0.034 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f (cost=0.00..1.50 rows=50
width=4) (actual time=0.004..0.017 rows=50 loops=1)
Planning Time: 0.286 ms
Execution Time: 2333.264 ms
(12 rows)We can manually rewrite this query as follows:
postgres=# explain analyze select count(*) from (
select * from ptable_p0 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p1 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;Because Append does not process tuples that shall have no matched
tuples in ftable,
this query has cheaper cost and short query execution time.
(2333ms --> 1396ms)postgres=# explain analyze select count(*) from (
select * from ptable_p0 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p1 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=210478.25..210478.26 rows=1 width=8) (actual
time=1396.024..1396.024 rows=1 loops=1)
-> Append (cost=2.12..210353.14 rows=50042 width=0) (actual
time=0.058..1393.008 rows=50000 loops=1)
-> Subquery Scan on "*SELECT* 1" (cost=2.12..70023.66
rows=16726 width=0) (actual time=0.057..573.197 rows=16789 loops=1)
-> Hash Join (cost=2.12..69856.40 rows=16726
width=72) (actual time=0.056..571.718 rows=16789 loops=1)
Hash Cond: (p.fkey = f.pkey)
-> Seq Scan on ptable_p0 p (cost=0.00..61101.96
rows=3332796 width=4) (actual time=0.009..255.791 rows=3332796
loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.034..0.035 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f (cost=0.00..1.50
rows=50 width=4) (actual time=0.004..0.019 rows=50 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=2.12..70027.43
rows=16617 width=0) (actual time=0.036..409.712 rows=16578 loops=1)
-> Hash Join (cost=2.12..69861.26 rows=16617
width=72) (actual time=0.036..408.626 rows=16578 loops=1)
Hash Cond: (p_1.fkey = f_1.pkey)
-> Seq Scan on ptable_p1 p_1
(cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..181.422
rows=3333025 loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.020..0.020 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f_1
(cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50
loops=1)
-> Subquery Scan on "*SELECT* 3" (cost=2.12..70051.84
rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1)
-> Hash Join (cost=2.12..69884.85 rows=16699
width=72) (actual time=0.025..406.048 rows=16633 loops=1)
Hash Cond: (p_2.fkey = f_2.pkey)
-> Seq Scan on ptable_p2 p_2
(cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015
rows=3334179 loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.014..0.014 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f_2
(cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50
loops=1)
Planning Time: 0.614 ms
Execution Time: 1396.131 ms
(25 rows)How about your opinions for this kind of asymmetric partition-wise
JOIN support by the optimizer?
I think we can harmlessly push-down inner-join and left-join if
partition-leaf is left side.Probably, we need to implement two key functionalities.
1. Construction of RelOpInfo for join on non-partition table and
partition-leafs for each pairs.
Instead of JoinPaths, this logic adds AppendPath that takes
asymmetric partition-wise join
paths as sub-paths. Other optimization logic is equivalent as we
are currently doing.
2. Allow to share the hash-table built from table scan distributed to
individual partition leafs.
In the above example, SeqScan on ftable and relevant Hash path
will make identical hash-
table for the upcoming hash-join. If sibling paths have equivalent
results, it is reasonable to
reuse it.Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>
Attachments:
pgsql13-asymmetric-partitionwise-join.v1.patchapplication/octet-stream; name=pgsql13-asymmetric-partitionwise-join.v1.patchDownload
src/backend/optimizer/path/allpaths.c | 9 +-
src/backend/optimizer/path/joinpath.c | 9 ++
src/backend/optimizer/path/joinrels.c | 129 ++++++++++++++++++++++++
src/backend/optimizer/plan/planner.c | 6 +-
src/backend/optimizer/util/appendinfo.c | 12 ++-
src/backend/optimizer/util/relnode.c | 14 +--
src/include/optimizer/paths.h | 10 +-
src/test/regress/expected/partition_join.out | 144 +++++++++++++++++++++++++++
src/test/regress/sql/partition_join.sql | 64 ++++++++++++
9 files changed, 376 insertions(+), 21 deletions(-)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index db3a68a..69c3cb2 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1275,7 +1275,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
}
/* Add paths to the append relation. */
- add_paths_to_append_rel(root, rel, live_childrels);
+ add_paths_to_append_rel(root, rel, live_childrels, NIL);
}
@@ -1292,7 +1292,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
*/
void
add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
- List *live_childrels)
+ List *live_childrels,
+ List *original_partitioned_rels)
{
List *subpaths = NIL;
bool subpaths_valid = true;
@@ -1304,7 +1305,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *all_child_pathkeys = NIL;
List *all_child_outers = NIL;
ListCell *l;
- List *partitioned_rels = NIL;
+ List *partitioned_rels = original_partitioned_rels;
double partial_rows = -1;
/* If appropriate, consider parallel append */
@@ -3717,7 +3718,7 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
}
/* Build additional paths for this rel from child-join paths. */
- add_paths_to_append_rel(root, rel, live_children);
+ add_paths_to_append_rel(root, rel, live_children, NIL);
list_free(live_children);
}
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index dc28b56..9444758 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -324,6 +324,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
jointype, &extra);
+
+ /*
+ * 7. If outer relation is delivered from partition-tables, consider
+ * distributing inner relation to every partition-leaf prior to
+ * append these leafs.
+ */
+ try_asymmetric_partitionwise_join(root, joinrel,
+ outerrel, innerrel,
+ jointype, &extra);
}
/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 6a480ab..6c35d35 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
#include "miscadmin.h"
#include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
#include "optimizer/joininfo.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
@@ -1532,6 +1533,134 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ AppendPath *append_path,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ foreach (lc, append_path->subpaths)
+ {
+ Path *child_path = lfirst(lc);
+ RelOptInfo *child_rel = child_path->parent;
+ Relids child_join_relids;
+ RelOptInfo *child_join_rel;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+ AppendRelInfo **appinfos;
+ int nappinfos;
+
+ child_join_relids = bms_union(child_rel->relids,
+ inner_rel->relids);
+ appinfos = find_appinfos_by_relids(root, child_join_relids,
+ &nappinfos);
+ child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+ child_rel->relids,
+ inner_rel->relids);
+ child_restrictlist = (List *)
+ adjust_appendrel_attrs(root, (Node *)extra->restrictlist,
+ nappinfos, appinfos);
+ pfree(appinfos);
+
+ child_join_rel = find_join_rel(root, child_join_relids);
+ if (!child_join_rel)
+ {
+ child_join_rel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+ if (!child_join_rel)
+ return NIL;
+ }
+ populate_joinrel_with_paths(root,
+ child_rel,
+ inner_rel,
+ child_join_rel,
+ child_sjinfo,
+ child_restrictlist);
+ /* Give up if asymmetric partition-wise join is not available */
+ if (child_join_rel->pathlist == NIL)
+ return NIL;
+ set_cheapest(child_join_rel);
+
+ result = lappend(result, child_join_rel);
+ }
+ return result;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ ListCell *lc;
+
+ if (!enable_partitionwise_join)
+ return;
+ if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+ return;
+ if (jointype != JOIN_INNER &&
+ jointype != JOIN_LEFT)
+ return;
+
+ foreach (lc, outer_rel->pathlist)
+ {
+ AppendPath *append_path = lfirst(lc);
+
+ /*
+ * MEMO: We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. It is not correct right now, however, a hook
+ * on add_path() to give additional decision for path removel allows
+ * to retain this kind of AppendPath, regardless of its cost.
+ */
+ if (IsA(append_path, AppendPath) &&
+ append_path->partitioned_rels != NIL)
+ {
+ List **join_rel_level_saved;
+ List *live_childrels = NIL;
+
+ join_rel_level_saved = root->join_rel_level;
+ PG_TRY();
+ {
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
+
+ live_childrels =
+ extract_asymmetric_partitionwise_subjoin(root,
+ joinrel,
+ append_path,
+ inner_rel,
+ jointype,
+ extra);
+ }
+ PG_CATCH();
+ {
+ root->join_rel_level = join_rel_level_saved;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ root->join_rel_level = join_rel_level_saved;
+
+ if (live_childrels != NIL)
+ add_paths_to_append_rel(root, joinrel, live_childrels,
+ append_path->partitioned_rels);
+ break;
+ }
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 17c5f08..c7a1fc9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7181,7 +7181,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
}
/* Build new paths for this relation by appending child paths. */
- add_paths_to_append_rel(root, rel, live_children);
+ add_paths_to_append_rel(root, rel, live_children, NIL);
}
/*
@@ -7334,7 +7334,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
Assert(partially_grouped_live_children != NIL);
add_paths_to_append_rel(root, partially_grouped_rel,
- partially_grouped_live_children);
+ partially_grouped_live_children, NIL);
/*
* We need call set_cheapest, since the finalization step will use the
@@ -7349,7 +7349,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
{
Assert(grouped_live_children != NIL);
- add_paths_to_append_rel(root, grouped_rel, grouped_live_children);
+ add_paths_to_append_rel(root, grouped_rel, grouped_live_children, NIL);
}
}
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index 16d3151..7496cb9 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -187,8 +187,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.nappinfos = nappinfos;
context.appinfos = appinfos;
- /* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/*
* Must be prepared to start with a Query or a bare expression tree.
@@ -709,11 +710,11 @@ AppendRelInfo **
find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
{
AppendRelInfo **appinfos;
+ int nrooms = bms_num_members(relids);
int cnt = 0;
int i;
- *nappinfos = bms_num_members(relids);
- appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * *nappinfos);
+ appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * nrooms);
i = -1;
while ((i = bms_next_member(relids, i)) >= 0)
@@ -721,9 +722,10 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- elog(ERROR, "child rel %d not found in append_rel_array", i);
+ continue;
appinfos[cnt++] = appinfo;
}
+ *nappinfos = cnt;
return appinfos;
}
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 8541538..c1a36b5 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -778,11 +778,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
joinrel->reloptkind = RELOPT_OTHER_JOINREL;
joinrel->relids = bms_union(outer_rel->relids, inner_rel->relids);
@@ -837,8 +834,11 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->nullable_partexprs = NULL;
joinrel->partitioned_child_rels = NIL;
- joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
- inner_rel->top_parent_relids);
+ joinrel->top_parent_relids =
+ bms_union(IS_OTHER_REL(outer_rel) ?
+ outer_rel->top_parent_relids : outer_rel->relids,
+ IS_OTHER_REL(inner_rel) ?
+ inner_rel->top_parent_relids : inner_rel->relids);
/* Compute information relevant to foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137..e70ebe7 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -109,7 +109,12 @@ extern void mark_dummy_rel(RelOptInfo *rel);
extern bool have_partkey_equi_join(RelOptInfo *joinrel,
RelOptInfo *rel1, RelOptInfo *rel2,
JoinType jointype, List *restrictlist);
-
+extern void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra);
/*
* equivclass.c
* routines for managing EquivalenceClasses
@@ -233,6 +238,7 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
EquivalenceClass *eclass, Oid opfamily,
int strategy, bool nulls_first);
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
- List *live_childrels);
+ List *live_childrels,
+ List *original_partitioned_rels);
#endif /* PATHS_H */
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 1296edc..d838713 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2003,3 +2003,147 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
Filter: (b = 0)
(16 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+VACUUM ANALYZE prt5;
+VACUUM ANALYZE t5_1;
+VACUUM ANALYZE t5_2;
+SET max_parallel_workers_per_gather = 0;
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ -> Seq Scan on prt5_p0
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p1.a = t5_1.aid)
+ -> Seq Scan on prt5_p1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p2.a = t5_1.aid)
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_p0.b = t5_2.bid)
+ -> Seq Scan on prt5_p0
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p1.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_p1.b = t5_2.bid)
+ -> Seq Scan on prt5_p1
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p2.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_p2.b = t5_2.bid)
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+(34 rows)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0
+ -> Seq Scan on prt5_p1
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Left Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0
+ -> Seq Scan on prt5_p1
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+RESET max_parallel_workers_per_gather;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index db9a6b4..8ad54b4 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -435,3 +435,67 @@ ANALYZE prt2;
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
+
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+
+VACUUM ANALYZE prt5;
+VACUUM ANALYZE t5_1;
+VACUUM ANALYZE t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+RESET max_parallel_workers_per_gather;
+
+
On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai@heterodb.com> wrote:
We can consider the table join ptable X t1 above is equivalent to:
(ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result, however, rows are already reduced by HashJoin
in the individual leaf of Append, so CPU-cycles consumed by Append node can
be cheaper.On the other hands, it has a downside because t1 must be read 3 times and
hash table also must be built 3 times. It increases the expected cost,
so planner
may not choose the asymmetric partition-wise join plan.
What if you include the partition constraint as a filter on t1? So you get:
ptable X t1 =
(ptable_p0 X (σ hash(dist)%4=0 (t1))) +
(ptable_p1 X (σ hash(dist)%4=1 (t1))) +
(ptable_p2 X (σ hash(dist)%4=2 (t1))) +
(ptable_p3 X (σ hash(dist)%4=3 (t1)))
Pros:
1. The hash tables will not contain unnecessary junk.
2. You'll get the right answer if t1 is on the outer side of an outer join.
3. If this runs underneath a Parallel Append and t1 is big enough
then workers will hopefully cooperate and do a synchronised scan of
t1.
4. The filter might enable a selective and efficient plan like an index scan.
Cons:
1. The filter might not enable a selective and efficient plan, and
therefore cause extra work.
(It's a little weird in this example because don't usually see hash
functions in WHERE clauses, but that could just as easily be dist
BETWEEN 1 AND 99 or any other partition constraint.)
One idea I have is, sibling HashJoin shares a hash table that was built once
by any of the sibling Hash plan. Right now, it is not implemented yet.
Yeah, I've thought a little bit about that in the context of Parallel
Repartition. I'm interested in combining intra-node partitioning
(where a single plan node repartitions data among workers on the fly)
with inter-node partitioning (like PWJ, where partitions are handled
by different parts of the plan, considered at planning time); you
finish up needing to have nodes in the plan that 'receive' tuples for
each partition, to match up with the PWJ plan structure. That's not
entirely unlike CTE references, and not entirely unlike your idea of
somehow sharing the same hash table. I ran into a number of problems
while thinking about that, which I should write about in another
thread.
--
Thomas Munro
https://enterprisedb.com
2019年8月24日(土) 7:02 Thomas Munro <thomas.munro@gmail.com>:
On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai@heterodb.com> wrote:
We can consider the table join ptable X t1 above is equivalent to:
(ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result, however, rows are already reduced by HashJoin
in the individual leaf of Append, so CPU-cycles consumed by Append node can
be cheaper.On the other hands, it has a downside because t1 must be read 3 times and
hash table also must be built 3 times. It increases the expected cost,
so planner
may not choose the asymmetric partition-wise join plan.What if you include the partition constraint as a filter on t1? So you get:
ptable X t1 =
(ptable_p0 X (σ hash(dist)%4=0 (t1))) +
(ptable_p1 X (σ hash(dist)%4=1 (t1))) +
(ptable_p2 X (σ hash(dist)%4=2 (t1))) +
(ptable_p3 X (σ hash(dist)%4=3 (t1)))Pros:
1. The hash tables will not contain unnecessary junk.
2. You'll get the right answer if t1 is on the outer side of an outer join.
3. If this runs underneath a Parallel Append and t1 is big enough
then workers will hopefully cooperate and do a synchronised scan of
t1.
4. The filter might enable a selective and efficient plan like an index scan.Cons:
1. The filter might not enable a selective and efficient plan, and
therefore cause extra work.(It's a little weird in this example because don't usually see hash
functions in WHERE clauses, but that could just as easily be dist
BETWEEN 1 AND 99 or any other partition constraint.)
It requires the join-key must include the partition key and also must be
equality-join, doesn't it?
If ptable and t1 are joined using ptable.dist = t1.foo, we can distribute
t1 for each leaf table with "WHERE hash(foo)%4 = xxx" according to
the partition bounds, indeed.
In case when some of partition leafs are pruned, it is exactly beneficial
because relevant rows to be referenced by the pruned child relations
are waste of memory.
On the other hands, it eventually consumes almost equivalent amount
of memory to load the inner relations, if no leafs are pruned, and if we
could extend the Hash-node to share the hash-table with sibling join-nodess.
One idea I have is, sibling HashJoin shares a hash table that was built once
by any of the sibling Hash plan. Right now, it is not implemented yet.Yeah, I've thought a little bit about that in the context of Parallel
Repartition. I'm interested in combining intra-node partitioning
(where a single plan node repartitions data among workers on the fly)
with inter-node partitioning (like PWJ, where partitions are handled
by different parts of the plan, considered at planning time); you
finish up needing to have nodes in the plan that 'receive' tuples for
each partition, to match up with the PWJ plan structure. That's not
entirely unlike CTE references, and not entirely unlike your idea of
somehow sharing the same hash table. I ran into a number of problems
while thinking about that, which I should write about in another
thread.
Hmm. Do you intend the inner-path may have different behavior according
to the partition bounds definition where the outer-path to be joined?
Let me investigate its pros & cons.
The reasons why I think the idea of sharing the same hash table is reasonable
in this scenario are:
1. We can easily extend the idea for parallel optimization. A hash table on DSM
segment, once built, can be shared by all the siblings in all the
parallel workers.
2. We can save the memory consumption regardless of the join-keys and
partition-keys, even if these are not involved in the query.
On the other hands, below are the downside. Potentially, combined use of
your idea may help these cases:
3. Distributed inner-relation cannot be outer side of XXX OUTER JOIN.
4. Hash table contains rows to be referenced by only pruned partition leafs.
Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>
On Sat, Aug 24, 2019 at 05:33:01PM +0900, Kohei KaiGai wrote:
On the other hands, it eventually consumes almost equivalent amount
of memory to load the inner relations, if no leafs are pruned, and if we
could extend the Hash-node to share the hash-table with sibling
join-nodess.
The patch crashes when running the regression tests, per the report of
the automatic patch tester. Could you look at that? I have moved the
patch to nexf CF, waiting on author.
--
Michael
Hello,
This crash was reproduced on our environment also.
It looks to me adjust_child_relids_multilevel() didn't expect a case
when supplied 'relids'
(partially) indicate normal and non-partitioned relation.
It tries to build a new 'parent_relids' that is a set of
appinfo->parent_relid related to the
supplied 'child_relids'. However, bits in child_relids that indicate
normal relations are
unintentionally dropped here. Then, adjust_child_relids_multilevel()
goes to an infinite
recursion until stack limitation.
The attached v2 fixed the problem, and regression test finished correctly.
Best regards,
2019年12月1日(日) 12:24 Michael Paquier <michael@paquier.xyz>:
On Sat, Aug 24, 2019 at 05:33:01PM +0900, Kohei KaiGai wrote:
On the other hands, it eventually consumes almost equivalent amount
of memory to load the inner relations, if no leafs are pruned, and if we
could extend the Hash-node to share the hash-table with sibling
join-nodess.The patch crashes when running the regression tests, per the report of
the automatic patch tester. Could you look at that? I have moved the
patch to nexf CF, waiting on author.
--
Michael
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>
Attachments:
pgsql13-asymmetric-partitionwise-join.v2.patchapplication/octet-stream; name=pgsql13-asymmetric-partitionwise-join.v2.patchDownload
src/backend/optimizer/path/allpaths.c | 9 +-
src/backend/optimizer/path/joinpath.c | 9 ++
src/backend/optimizer/path/joinrels.c | 129 ++++++++++++++++++++++++
src/backend/optimizer/plan/planner.c | 6 +-
src/backend/optimizer/util/appendinfo.c | 18 +++-
src/backend/optimizer/util/relnode.c | 14 +--
src/include/optimizer/paths.h | 10 +-
src/test/regress/expected/partition_join.out | 144 +++++++++++++++++++++++++++
src/test/regress/sql/partition_join.sql | 64 ++++++++++++
9 files changed, 382 insertions(+), 21 deletions(-)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index db3a68a51d..69c3cb2e8d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1275,7 +1275,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
}
/* Add paths to the append relation. */
- add_paths_to_append_rel(root, rel, live_childrels);
+ add_paths_to_append_rel(root, rel, live_childrels, NIL);
}
@@ -1292,7 +1292,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
*/
void
add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
- List *live_childrels)
+ List *live_childrels,
+ List *original_partitioned_rels)
{
List *subpaths = NIL;
bool subpaths_valid = true;
@@ -1304,7 +1305,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *all_child_pathkeys = NIL;
List *all_child_outers = NIL;
ListCell *l;
- List *partitioned_rels = NIL;
+ List *partitioned_rels = original_partitioned_rels;
double partial_rows = -1;
/* If appropriate, consider parallel append */
@@ -3717,7 +3718,7 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
}
/* Build additional paths for this rel from child-join paths. */
- add_paths_to_append_rel(root, rel, live_children);
+ add_paths_to_append_rel(root, rel, live_children, NIL);
list_free(live_children);
}
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index dc28b56e74..9444758a08 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -324,6 +324,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
jointype, &extra);
+
+ /*
+ * 7. If outer relation is delivered from partition-tables, consider
+ * distributing inner relation to every partition-leaf prior to
+ * append these leafs.
+ */
+ try_asymmetric_partitionwise_join(root, joinrel,
+ outerrel, innerrel,
+ jointype, &extra);
}
/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 6a480ab764..6c35d35651 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
#include "miscadmin.h"
#include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
#include "optimizer/joininfo.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
@@ -1532,6 +1533,134 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ AppendPath *append_path,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ foreach (lc, append_path->subpaths)
+ {
+ Path *child_path = lfirst(lc);
+ RelOptInfo *child_rel = child_path->parent;
+ Relids child_join_relids;
+ RelOptInfo *child_join_rel;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+ AppendRelInfo **appinfos;
+ int nappinfos;
+
+ child_join_relids = bms_union(child_rel->relids,
+ inner_rel->relids);
+ appinfos = find_appinfos_by_relids(root, child_join_relids,
+ &nappinfos);
+ child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+ child_rel->relids,
+ inner_rel->relids);
+ child_restrictlist = (List *)
+ adjust_appendrel_attrs(root, (Node *)extra->restrictlist,
+ nappinfos, appinfos);
+ pfree(appinfos);
+
+ child_join_rel = find_join_rel(root, child_join_relids);
+ if (!child_join_rel)
+ {
+ child_join_rel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+ if (!child_join_rel)
+ return NIL;
+ }
+ populate_joinrel_with_paths(root,
+ child_rel,
+ inner_rel,
+ child_join_rel,
+ child_sjinfo,
+ child_restrictlist);
+ /* Give up if asymmetric partition-wise join is not available */
+ if (child_join_rel->pathlist == NIL)
+ return NIL;
+ set_cheapest(child_join_rel);
+
+ result = lappend(result, child_join_rel);
+ }
+ return result;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ ListCell *lc;
+
+ if (!enable_partitionwise_join)
+ return;
+ if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+ return;
+ if (jointype != JOIN_INNER &&
+ jointype != JOIN_LEFT)
+ return;
+
+ foreach (lc, outer_rel->pathlist)
+ {
+ AppendPath *append_path = lfirst(lc);
+
+ /*
+ * MEMO: We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. It is not correct right now, however, a hook
+ * on add_path() to give additional decision for path removel allows
+ * to retain this kind of AppendPath, regardless of its cost.
+ */
+ if (IsA(append_path, AppendPath) &&
+ append_path->partitioned_rels != NIL)
+ {
+ List **join_rel_level_saved;
+ List *live_childrels = NIL;
+
+ join_rel_level_saved = root->join_rel_level;
+ PG_TRY();
+ {
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
+
+ live_childrels =
+ extract_asymmetric_partitionwise_subjoin(root,
+ joinrel,
+ append_path,
+ inner_rel,
+ jointype,
+ extra);
+ }
+ PG_CATCH();
+ {
+ root->join_rel_level = join_rel_level_saved;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ root->join_rel_level = join_rel_level_saved;
+
+ if (live_childrels != NIL)
+ add_paths_to_append_rel(root, joinrel, live_childrels,
+ append_path->partitioned_rels);
+ break;
+ }
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d63ebb7287..22424e787e 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7186,7 +7186,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
}
/* Build new paths for this relation by appending child paths. */
- add_paths_to_append_rel(root, rel, live_children);
+ add_paths_to_append_rel(root, rel, live_children, NIL);
}
/*
@@ -7339,7 +7339,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
Assert(partially_grouped_live_children != NIL);
add_paths_to_append_rel(root, partially_grouped_rel,
- partially_grouped_live_children);
+ partially_grouped_live_children, NIL);
/*
* We need call set_cheapest, since the finalization step will use the
@@ -7354,7 +7354,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
{
Assert(grouped_live_children != NIL);
- add_paths_to_append_rel(root, grouped_rel, grouped_live_children);
+ add_paths_to_append_rel(root, grouped_rel, grouped_live_children, NIL);
}
}
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index 1890f256de..67b7470dbb 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -201,8 +201,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.nappinfos = nappinfos;
context.appinfos = appinfos;
- /* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/*
* Must be prepared to start with a Query or a bare expression tree.
@@ -574,6 +575,7 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
AppendRelInfo **appinfos;
int nappinfos;
Relids parent_relids = NULL;
+ Relids normal_relids = NULL;
Relids result;
Relids tmp_result = NULL;
int cnt;
@@ -588,12 +590,17 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of the given child. */
+ normal_relids = bms_copy(child_relids);
for (cnt = 0; cnt < nappinfos; cnt++)
{
AppendRelInfo *appinfo = appinfos[cnt];
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
+ normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
}
+ parent_relids = bms_union(parent_relids, normal_relids);
+ if (normal_relids)
+ bms_free(normal_relids);
/* Recurse if immediate parent is not the top parent. */
if (!bms_equal(parent_relids, top_parent_relids))
@@ -723,11 +730,11 @@ AppendRelInfo **
find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
{
AppendRelInfo **appinfos;
+ int nrooms = bms_num_members(relids);
int cnt = 0;
int i;
- *nappinfos = bms_num_members(relids);
- appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * *nappinfos);
+ appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * nrooms);
i = -1;
while ((i = bms_next_member(relids, i)) >= 0)
@@ -735,9 +742,10 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- elog(ERROR, "child rel %d not found in append_rel_array", i);
+ continue;
appinfos[cnt++] = appinfo;
}
+ *nappinfos = cnt;
return appinfos;
}
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index c9eb447d07..2613e613c0 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -778,11 +778,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
joinrel->reloptkind = RELOPT_OTHER_JOINREL;
joinrel->relids = bms_union(outer_rel->relids, inner_rel->relids);
@@ -837,8 +834,11 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->nullable_partexprs = NULL;
joinrel->partitioned_child_rels = NIL;
- joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
- inner_rel->top_parent_relids);
+ joinrel->top_parent_relids =
+ bms_union(IS_OTHER_REL(outer_rel) ?
+ outer_rel->top_parent_relids : outer_rel->relids,
+ IS_OTHER_REL(inner_rel) ?
+ inner_rel->top_parent_relids : inner_rel->relids);
/* Compute information relevant to foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index c6c34630c2..07302a2b2d 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -109,7 +109,12 @@ extern void mark_dummy_rel(RelOptInfo *rel);
extern bool have_partkey_equi_join(RelOptInfo *joinrel,
RelOptInfo *rel1, RelOptInfo *rel2,
JoinType jointype, List *restrictlist);
-
+extern void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra);
/*
* equivclass.c
* routines for managing EquivalenceClasses
@@ -238,6 +243,7 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
EquivalenceClass *eclass, Oid opfamily,
int strategy, bool nulls_first);
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
- List *live_childrels);
+ List *live_childrels,
+ List *original_partitioned_rels);
#endif /* PATHS_H */
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index b3fbe47bde..da04359c91 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2082,3 +2082,147 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
Filter: (b = 0)
(16 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+VACUUM ANALYZE prt5;
+VACUUM ANALYZE t5_1;
+VACUUM ANALYZE t5_2;
+SET max_parallel_workers_per_gather = 0;
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_1.b = t5_2.bid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_2.b = t5_2.bid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_3.b = t5_2.bid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+(34 rows)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Left Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+RESET max_parallel_workers_per_gather;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index 575ba7b8d4..ebf9649508 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -453,3 +453,67 @@ ANALYZE prt2;
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
+
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+
+VACUUM ANALYZE prt5;
+VACUUM ANALYZE t5_1;
+VACUUM ANALYZE t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+RESET max_parallel_workers_per_gather;
+
+
Hi Thomas,
On 12/27/19 2:34 AM, Kohei KaiGai wrote:
This crash was reproduced on our environment also.
It looks to me adjust_child_relids_multilevel() didn't expect a case
when supplied 'relids'
(partially) indicate normal and non-partitioned relation.
It tries to build a new 'parent_relids' that is a set of
appinfo->parent_relid related to the
supplied 'child_relids'. However, bits in child_relids that indicate
normal relations are
unintentionally dropped here. Then, adjust_child_relids_multilevel()
goes to an infinite
recursion until stack limitation.The attached v2 fixed the problem, and regression test finished correctly.
Any thoughts on the new version of this patch?
Regards,
--
-David
david@pgmasters.net
On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
The attached v2 fixed the problem, and regression test finished correctly.
This patch no longer applies to HEAD, please submit an rebased version.
Marking the entry Waiting on Author in the meantime.
cheers ./daniel
On 12/27/19 12:34 PM, Kohei KaiGai wrote:
The attached v2 fixed the problem, and regression test finished correctly.
Using your patch I saw incorrect value of predicted rows at the top node
of the plan: "Append (cost=270.02..35165.37 rows=40004 width=16)"
Full explain of the query plan see in attachment -
explain_with_asymmetric.sql
if I disable enable_partitionwise_join then:
"Hash Join (cost=270.02..38855.25 rows=10001 width=16)"
Full explain - explain_no_asymmetric.sql
I thought that is the case of incorrect usage of cached values of
norm_selec, but it is a corner-case problem of the eqjoinsel() routine :
selectivity = 1/size_of_larger_relation; (selfuncs.c:2567)
tuples = selectivity * outer_tuples * inner_tuples; (costsize.c:4607)
i.e. number of tuples depends only on size of smaller relation.
It is not a bug of your patch but I think you need to know because it
may affect on planner decision.
===
P.S. Test case:
CREATE TABLE t0 (a serial, b int);
INSERT INTO t0 (b) (SELECT * FROM generate_series(1e4, 2e4) as g);
CREATE TABLE parts (a serial, b int) PARTITION BY HASH(a)
INSERT INTO parts (b) (SELECT * FROM generate_series(1, 1e6) as g);
--
regards,
Andrey Lepikhov
Postgres Professional
On 7/1/20 2:10 PM, Daniel Gustafsson wrote:
On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
The attached v2 fixed the problem, and regression test finished correctly.
This patch no longer applies to HEAD, please submit an rebased version.
Marking the entry Waiting on Author in the meantime.
Rebased version of the patch on current master (d259afa736).
I rebased it because it is a base of my experimental feature than we
don't break partitionwise join of a relation with foreign partition and
a local relation if we have info that remote server has foreign table
link to the local relation (by analogy with shippable extensions).
Maybe mark as 'Needs review'?
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
pgsql13-asymmetric-partitionwise-join.v3.patchtext/x-patch; charset=UTF-8; name=pgsql13-asymmetric-partitionwise-join.v3.patchDownload
From 8dda8c4ba29ed4b2a54f66746ebedd9ab0bfded9 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Date: Fri, 21 Aug 2020 10:38:59 +0500
Subject: [PATCH] Add one more planner strategy to JOIN with a partitioned
relation.
Try to join inner relation with each partition of outer relation
and append results.
This strategy has potential benefits because it allows partitionwise
join with an unpartitioned relation or with a relation that is
partitioned by another schema.
---
src/backend/optimizer/path/allpaths.c | 9 +-
src/backend/optimizer/path/joinpath.c | 9 ++
src/backend/optimizer/path/joinrels.c | 132 +++++++++++++++++
src/backend/optimizer/plan/planner.c | 6 +-
src/backend/optimizer/util/appendinfo.c | 18 ++-
src/backend/optimizer/util/relnode.c | 14 +-
src/include/optimizer/paths.h | 10 +-
src/test/regress/expected/partition_join.out | 145 +++++++++++++++++++
src/test/regress/sql/partition_join.sql | 63 ++++++++
9 files changed, 385 insertions(+), 21 deletions(-)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6da0dcd61c..4f110c5a2f 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1278,7 +1278,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
}
/* Add paths to the append relation. */
- add_paths_to_append_rel(root, rel, live_childrels);
+ add_paths_to_append_rel(root, rel, live_childrels, NIL);
}
@@ -1295,7 +1295,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
*/
void
add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
- List *live_childrels)
+ List *live_childrels,
+ List *original_partitioned_rels)
{
List *subpaths = NIL;
bool subpaths_valid = true;
@@ -1307,7 +1308,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *all_child_pathkeys = NIL;
List *all_child_outers = NIL;
ListCell *l;
- List *partitioned_rels = NIL;
+ List *partitioned_rels = original_partitioned_rels;
double partial_rows = -1;
/* If appropriate, consider parallel append */
@@ -3950,7 +3951,7 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
}
/* Build additional paths for this rel from child-join paths. */
- add_paths_to_append_rel(root, rel, live_children);
+ add_paths_to_append_rel(root, rel, live_children, NIL);
list_free(live_children);
}
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index db54a6ba2e..36464e31aa 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -324,6 +324,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
jointype, &extra);
+
+ /*
+ * 7. If outer relation is delivered from partition-tables, consider
+ * distributing inner relation to every partition-leaf prior to
+ * append these leafs.
+ */
+ try_asymmetric_partitionwise_join(root, joinrel,
+ outerrel, innerrel,
+ jointype, &extra);
}
/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 2d343cd293..4a7d0d0604 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
#include "miscadmin.h"
#include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
#include "optimizer/joininfo.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
@@ -1551,6 +1552,137 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ AppendPath *append_path,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ foreach (lc, append_path->subpaths)
+ {
+ Path *child_path = lfirst(lc);
+ RelOptInfo *child_rel = child_path->parent;
+ Relids child_join_relids;
+ RelOptInfo *child_join_rel;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+ AppendRelInfo **appinfos;
+ int nappinfos;
+
+ child_join_relids = bms_union(child_rel->relids,
+ inner_rel->relids);
+ appinfos = find_appinfos_by_relids(root, child_join_relids,
+ &nappinfos);
+ child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+ child_rel->relids,
+ inner_rel->relids);
+ child_restrictlist = (List *)
+ adjust_appendrel_attrs(root, (Node *)extra->restrictlist,
+ nappinfos, appinfos);
+ pfree(appinfos);
+
+ child_join_rel = find_join_rel(root, child_join_relids);
+ if (!child_join_rel)
+ {
+ child_join_rel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+ if (!child_join_rel)
+ return NIL;
+ }
+
+ populate_joinrel_with_paths(root,
+ child_rel,
+ inner_rel,
+ child_join_rel,
+ child_sjinfo,
+ child_restrictlist);
+
+ /* Give up if asymmetric partition-wise join is not available */
+ if (child_join_rel->pathlist == NIL)
+ return NIL;
+
+ set_cheapest(child_join_rel);
+ result = lappend(result, child_join_rel);
+ }
+ return result;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ ListCell *lc;
+
+ if (!enable_partitionwise_join)
+ return;
+
+ if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+ return;
+
+ if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+ return;
+
+ foreach (lc, outer_rel->pathlist)
+ {
+ AppendPath *append_path = lfirst(lc);
+
+ /*
+ * MEMO: We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. It is not correct right now, however, a hook
+ * on add_path() to give additional decision for path removel allows
+ * to retain this kind of AppendPath, regardless of its cost.
+ */
+ if (IsA(append_path, AppendPath) &&
+ append_path->partitioned_rels != NIL)
+ {
+ List **join_rel_level_saved;
+ List *live_childrels = NIL;
+
+ join_rel_level_saved = root->join_rel_level;
+ PG_TRY();
+ {
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
+
+ live_childrels =
+ extract_asymmetric_partitionwise_subjoin(root,
+ joinrel,
+ append_path,
+ inner_rel,
+ jointype,
+ extra);
+ }
+ PG_CATCH();
+ {
+ root->join_rel_level = join_rel_level_saved;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ root->join_rel_level = join_rel_level_saved;
+
+ if (live_childrels != NIL)
+ add_paths_to_append_rel(root, joinrel, live_childrels,
+ append_path->partitioned_rels);
+ break;
+ }
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b40a112c25..863fb79f03 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7536,7 +7536,7 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
}
/* Build new paths for this relation by appending child paths. */
- add_paths_to_append_rel(root, rel, live_children);
+ add_paths_to_append_rel(root, rel, live_children, NIL);
}
/*
@@ -7689,7 +7689,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
Assert(partially_grouped_live_children != NIL);
add_paths_to_append_rel(root, partially_grouped_rel,
- partially_grouped_live_children);
+ partially_grouped_live_children, NIL);
/*
* We need call set_cheapest, since the finalization step will use the
@@ -7704,7 +7704,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
{
Assert(grouped_live_children != NIL);
- add_paths_to_append_rel(root, grouped_rel, grouped_live_children);
+ add_paths_to_append_rel(root, grouped_rel, grouped_live_children, NIL);
}
}
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index d722063cf3..32230cf877 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -201,8 +201,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.nappinfos = nappinfos;
context.appinfos = appinfos;
- /* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/*
* Must be prepared to start with a Query or a bare expression tree.
@@ -579,6 +580,7 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
AppendRelInfo **appinfos;
int nappinfos;
Relids parent_relids = NULL;
+ Relids normal_relids = NULL;
Relids result;
Relids tmp_result = NULL;
int cnt;
@@ -593,12 +595,17 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of the given child. */
+ normal_relids = bms_copy(child_relids);
for (cnt = 0; cnt < nappinfos; cnt++)
{
AppendRelInfo *appinfo = appinfos[cnt];
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
+ normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
}
+ parent_relids = bms_union(parent_relids, normal_relids);
+ if (normal_relids)
+ bms_free(normal_relids);
/* Recurse if immediate parent is not the top parent. */
if (!bms_equal(parent_relids, top_parent_relids))
@@ -728,11 +735,11 @@ AppendRelInfo **
find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
{
AppendRelInfo **appinfos;
+ int nrooms = bms_num_members(relids);
int cnt = 0;
int i;
- *nappinfos = bms_num_members(relids);
- appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * *nappinfos);
+ appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * nrooms);
i = -1;
while ((i = bms_next_member(relids, i)) >= 0)
@@ -740,9 +747,10 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- elog(ERROR, "child rel %d not found in append_rel_array", i);
+ continue;
appinfos[cnt++] = appinfo;
}
+ *nappinfos = cnt;
return appinfos;
}
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index a203e6f1ff..7a52b14fff 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -791,11 +791,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
joinrel->reloptkind = RELOPT_OTHER_JOINREL;
joinrel->relids = bms_union(outer_rel->relids, inner_rel->relids);
@@ -852,8 +849,11 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->nullable_partexprs = NULL;
joinrel->partitioned_child_rels = NIL;
- joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
- inner_rel->top_parent_relids);
+ joinrel->top_parent_relids =
+ bms_union(IS_OTHER_REL(outer_rel) ?
+ outer_rel->top_parent_relids : outer_rel->relids,
+ IS_OTHER_REL(inner_rel) ?
+ inner_rel->top_parent_relids : inner_rel->relids);
/* Compute information relevant to foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 10b6e81079..9949ff1d5d 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -108,7 +108,12 @@ extern bool have_join_order_restriction(PlannerInfo *root,
extern bool have_dangerous_phv(PlannerInfo *root,
Relids outer_relids, Relids inner_params);
extern void mark_dummy_rel(RelOptInfo *rel);
-
+extern void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra);
/*
* equivclass.c
* routines for managing EquivalenceClasses
@@ -239,6 +244,7 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
EquivalenceClass *eclass, Oid opfamily,
int strategy, bool nulls_first);
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
- List *live_childrels);
+ List *live_childrels,
+ List *original_partitioned_rels);
#endif /* PATHS_H */
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 585e724375..01a0a17aac 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2330,6 +2330,151 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
375 | 0375 | 375 | 0375
(8 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+VACUUM ANALYZE prt5;
+VACUUM ANALYZE t5_1;
+VACUUM ANALYZE t5_2;
+SET max_parallel_workers_per_gather = 0;
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_1.b = t5_2.bid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_2.b = t5_2.bid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_3.b = t5_2.bid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+(34 rows)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Left Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index 73606c86e5..7b1c5cc30b 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -536,6 +536,69 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+
+VACUUM ANALYZE prt5;
+VACUUM ANALYZE t5_1;
+VACUUM ANALYZE t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
--
2.25.1
On 21 Aug 2020, at 08:02, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote:
On 7/1/20 2:10 PM, Daniel Gustafsson wrote:
On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
The attached v2 fixed the problem, and regression test finished correctly.This patch no longer applies to HEAD, please submit an rebased version.
Marking the entry Waiting on Author in the meantime.Rebased version of the patch on current master (d259afa736).
I rebased it because it is a base of my experimental feature than we don't break partitionwise join of a relation with foreign partition and a local relation if we have info that remote server has foreign table link to the local relation (by analogy with shippable extensions).
Maybe mark as 'Needs review'?
Thanks for the rebase, I've updated the commitfest entry to reflect that it
needs a round of review.
cheers ./daniel
On Sat, Aug 24, 2019 at 2:03 PM Kohei KaiGai <kaigai@heterodb.com> wrote:
2019年8月24日(土) 7:02 Thomas Munro <thomas.munro@gmail.com>:
On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai@heterodb.com> wrote:
We can consider the table join ptable X t1 above is equivalent to:
(ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result, however, rows are already reduced by HashJoin
in the individual leaf of Append, so CPU-cycles consumed by Append node can
be cheaper.On the other hands, it has a downside because t1 must be read 3 times and
hash table also must be built 3 times. It increases the expected cost,
so planner
may not choose the asymmetric partition-wise join plan.What if you include the partition constraint as a filter on t1? So you get:
ptable X t1 =
(ptable_p0 X (σ hash(dist)%4=0 (t1))) +
(ptable_p1 X (σ hash(dist)%4=1 (t1))) +
(ptable_p2 X (σ hash(dist)%4=2 (t1))) +
(ptable_p3 X (σ hash(dist)%4=3 (t1)))Pros:
1. The hash tables will not contain unnecessary junk.
2. You'll get the right answer if t1 is on the outer side of an outer join.
3. If this runs underneath a Parallel Append and t1 is big enough
then workers will hopefully cooperate and do a synchronised scan of
t1.
4. The filter might enable a selective and efficient plan like an index scan.Cons:
1. The filter might not enable a selective and efficient plan, and
therefore cause extra work.(It's a little weird in this example because don't usually see hash
functions in WHERE clauses, but that could just as easily be dist
BETWEEN 1 AND 99 or any other partition constraint.)It requires the join-key must include the partition key and also must be
equality-join, doesn't it?
If ptable and t1 are joined using ptable.dist = t1.foo, we can distribute
t1 for each leaf table with "WHERE hash(foo)%4 = xxx" according to
the partition bounds, indeed.In case when some of partition leafs are pruned, it is exactly beneficial
because relevant rows to be referenced by the pruned child relations
are waste of memory.On the other hands, it eventually consumes almost equivalent amount
of memory to load the inner relations, if no leafs are pruned, and if we
could extend the Hash-node to share the hash-table with sibling join-nodess.One idea I have is, sibling HashJoin shares a hash table that was built once
by any of the sibling Hash plan. Right now, it is not implemented yet.Yeah, I've thought a little bit about that in the context of Parallel
Repartition. I'm interested in combining intra-node partitioning
(where a single plan node repartitions data among workers on the fly)
with inter-node partitioning (like PWJ, where partitions are handled
by different parts of the plan, considered at planning time); you
finish up needing to have nodes in the plan that 'receive' tuples for
each partition, to match up with the PWJ plan structure. That's not
entirely unlike CTE references, and not entirely unlike your idea of
somehow sharing the same hash table. I ran into a number of problems
while thinking about that, which I should write about in another
thread.Hmm. Do you intend the inner-path may have different behavior according
to the partition bounds definition where the outer-path to be joined?
Let me investigate its pros & cons.The reasons why I think the idea of sharing the same hash table is reasonable
in this scenario are:
1. We can easily extend the idea for parallel optimization. A hash table on DSM
segment, once built, can be shared by all the siblings in all the
parallel workers.
2. We can save the memory consumption regardless of the join-keys and
partition-keys, even if these are not involved in the query.On the other hands, below are the downside. Potentially, combined use of
your idea may help these cases:
3. Distributed inner-relation cannot be outer side of XXX OUTER JOIN.
4. Hash table contains rows to be referenced by only pruned partition leafs.
+ many, for the sharable hash of the inner table of the join. IMHO,
this could be the most interesting and captivating thing about this feature.
But might be a complicated piece, is that still on the plan?
Regards,
Amul
On 21.08.2020 09:02, Andrey V. Lepikhov wrote:
On 7/1/20 2:10 PM, Daniel Gustafsson wrote:
On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
The attached v2 fixed the problem, and regression test finished
correctly.This patch no longer applies to HEAD, please submit an rebased version.
Marking the entry Waiting on Author in the meantime.Rebased version of the patch on current master (d259afa736).
I rebased it because it is a base of my experimental feature than we
don't break partitionwise join of a relation with foreign partition
and a local relation if we have info that remote server has foreign
table link to the local relation (by analogy with shippable extensions).Maybe mark as 'Needs review'?
Status update for a commitfest entry.
According to cfbot, the patch fails to apply. Could you please send a
rebased version?
This thread was inactive for quite some time. Is anyone going to
continue working on it?
I see some interest in the idea of sharable hash, but I don't see even a
prototype in this thread. So, probably, it is a matter of a separate
discussion.
Also, I took a look at the code. It looks like it needs some extra work.
I am not a big expert in this area, so I'm sorry if questions are obvious.
1. What would happen if this assumption is not met?
+ * MEMO: We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. It is not correct right now, however, a
hook
+ * on add_path() to give additional decision for path removel
allows
+ * to retain this kind of AppendPath, regardless of its cost.
2. Why do we wrap extract_asymmetric_partitionwise_subjoin() call into
PG_TRY/PG_CATCH? What errors do we expect?
3. It looks like a crutch. If it isn't, I'd like to see a better comment
about why "dynamic programming" is not applicable here.
And shouldn't we also handle a root->join_cur_level?
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
4. This change looks like it can lead to a memory leak for old code.
Maybe it is never the case, but again I think it worth a comment.
- /* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
5. extract_asymmetric_partitionwise_subjoin() lacks a comment
The new status of this patch is: Waiting on Author
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 09.11.2020 13:53, Anastasia Lubennikova wrote:
On 21.08.2020 09:02, Andrey V. Lepikhov wrote:
On 7/1/20 2:10 PM, Daniel Gustafsson wrote:
On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
The attached v2 fixed the problem, and regression test finished
correctly.This patch no longer applies to HEAD, please submit an rebased version.
Marking the entry Waiting on Author in the meantime.Rebased version of the patch on current master (d259afa736).
I rebased it because it is a base of my experimental feature than we
don't break partitionwise join of a relation with foreign partition
and a local relation if we have info that remote server has foreign
table link to the local relation (by analogy with shippable extensions).Maybe mark as 'Needs review'?
Status update for a commitfest entry.
According to cfbot, the patch fails to apply. Could you please send a
rebased version?This thread was inactive for quite some time. Is anyone going to
continue working on it?I see some interest in the idea of sharable hash, but I don't see even
a prototype in this thread. So, probably, it is a matter of a separate
discussion.Also, I took a look at the code. It looks like it needs some extra
work. I am not a big expert in this area, so I'm sorry if questions
are obvious.1. What would happen if this assumption is not met?
+ * MEMO: We assume this pathlist keeps at least one AppendPath that + * represents partitioned table-scan, symmetric or asymmetric + * partition-wise join. It is not correct right now, however, a hook + * on add_path() to give additional decision for path removel allows + * to retain this kind of AppendPath, regardless of its cost.2. Why do we wrap extract_asymmetric_partitionwise_subjoin() call into
PG_TRY/PG_CATCH? What errors do we expect?3. It looks like a crutch. If it isn't, I'd like to see a better
comment about why "dynamic programming" is not applicable here.
And shouldn't we also handle a root->join_cur_level?+ /* temporary disables "dynamic programming" algorithm */ + root->join_rel_level = NULL;4. This change looks like it can lead to a memory leak for old code.
Maybe it is never the case, but again I think it worth a comment.- /* If there's nothing to adjust, don't call this function. */ - Assert(nappinfos >= 1 && appinfos != NULL); + /* If there's nothing to adjust, just return a duplication */ + if (nappinfos == 0) + return copyObject(node);5. extract_asymmetric_partitionwise_subjoin() lacks a comment
The new status of this patch is: Waiting on Author
Status update for a commitfest entry.
This entry was inactive during this CF, so I've marked it as returned
with feedback. Feel free to resubmit an updated version to a future
commitfest.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:
This entry was inactive during this CF, so I've marked it as returned
with feedback. Feel free to resubmit an updated version to a future
commitfest.
Attached version is rebased on current master and fixes problems with
complex parameterized plans - 'reparameterize by child' feature.
Problems with reparameterization machinery can be demonstrated by TPC-H
benchmark.
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
0001-Asymmetric-partitionwise-join.patchtext/x-patch; charset=UTF-8; name=0001-Asymmetric-partitionwise-join.patchDownload
From 6a15a52bfb90659c51b3a918d48037c474ffe9dd Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Date: Fri, 2 Apr 2021 11:02:20 +0500
Subject: [PATCH] Asymmetric partitionwise join.
Teach optimizer to consider partitionwise join of non-partitioned
table with each partition of partitioned table.
This technique cause changes of 'reparameterize by child' machinery.
---
src/backend/optimizer/path/joinpath.c | 9 +
src/backend/optimizer/path/joinrels.c | 151 ++++++++++++++
src/backend/optimizer/util/appendinfo.c | 28 ++-
src/backend/optimizer/util/pathnode.c | 9 +-
src/backend/optimizer/util/relnode.c | 14 +-
src/include/optimizer/paths.h | 7 +-
src/test/regress/expected/partition_join.out | 209 +++++++++++++++++++
src/test/regress/sql/partition_join.sql | 99 +++++++++
8 files changed, 509 insertions(+), 17 deletions(-)
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index e9b6968b1d..6ba6d32ae4 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
jointype, &extra);
+
+ /*
+ * 7. If outer relation is delivered from partition-tables, consider
+ * distributing inner relation to every partition-leaf prior to
+ * append these leafs.
+ */
+ try_asymmetric_partitionwise_join(root, joinrel,
+ outerrel, innerrel,
+ jointype, &extra);
}
/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..6f900475bb 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
#include "miscadmin.h"
#include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
#include "optimizer/joininfo.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
@@ -1551,6 +1552,156 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs are impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ AppendPath *append_path,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ foreach (lc, append_path->subpaths)
+ {
+ Path *child_path = lfirst(lc);
+ RelOptInfo *child_rel = child_path->parent;
+ Relids child_join_relids;
+ RelOptInfo *child_join_rel;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+ AppendRelInfo **appinfos;
+ int nappinfos;
+
+ child_join_relids = bms_union(child_rel->relids,
+ inner_rel->relids);
+ appinfos = find_appinfos_by_relids(root, child_join_relids,
+ &nappinfos);
+ child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+ child_rel->relids,
+ inner_rel->relids);
+ child_restrictlist = (List *)
+ adjust_appendrel_attrs(root, (Node *)extra->restrictlist,
+ nappinfos, appinfos);
+ pfree(appinfos);
+
+ child_join_rel = find_join_rel(root, child_join_relids);
+ if (!child_join_rel)
+ {
+ child_join_rel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+ if (!child_join_rel)
+ {
+ /*
+ * If can't build JOIN between inner relation and one of the outer
+ * partitions - return immediately.
+ */
+ return NIL;
+ }
+ }
+ else
+ {
+ /*
+ * TODO:
+ * Can't imagine situation when join relation already exists. But in
+ * the 'partition_join' regression test it happens.
+ * It may be an indicator of possible problems.
+ */
+ }
+
+ populate_joinrel_with_paths(root,
+ child_rel,
+ inner_rel,
+ child_join_rel,
+ child_sjinfo,
+ child_restrictlist);
+
+ /* Give up if asymmetric partition-wise join is not available */
+ if (child_join_rel->pathlist == NIL)
+ return NIL;
+
+ set_cheapest(child_join_rel);
+ result = lappend(result, child_join_rel);
+ }
+ return result;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ ListCell *lc;
+
+ if (!enable_partitionwise_join)
+ return;
+
+ if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+ return;
+
+ if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+ return;
+
+ foreach (lc, outer_rel->pathlist)
+ {
+ AppendPath *append_path = lfirst(lc);
+
+ /*
+ * MEMO: We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. It is not correct right now, however, a hook
+ * on add_path() to give additional decision for path removel allows
+ * to retain this kind of AppendPath, regardless of its cost.
+ */
+ if (IsA(append_path, AppendPath))
+ {
+ List **join_rel_level_saved;
+ List *live_childrels = NIL;
+
+ join_rel_level_saved = root->join_rel_level;
+ PG_TRY();
+ {
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
+
+ live_childrels =
+ extract_asymmetric_partitionwise_subjoin(root,
+ joinrel,
+ append_path,
+ inner_rel,
+ jointype,
+ extra);
+ }
+ PG_CATCH();
+ {
+ root->join_rel_level = join_rel_level_saved;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ root->join_rel_level = join_rel_level_saved;
+
+ if (live_childrels != NIL)
+ add_paths_to_append_rel(root, joinrel, live_childrels);
+
+ break;
+ }
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index af46f581ac..be4624b619 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -202,7 +202,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.appinfos = appinfos;
/* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/* Should never be translating a Query tree. */
Assert(node == NULL || !IsA(node, Query));
@@ -494,8 +496,6 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
int nappinfos;
int cnt;
- Assert(bms_num_members(child_relids) == bms_num_members(top_parent_relids));
-
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of given child. */
@@ -506,8 +506,13 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
}
- /* Recurse if immediate parent is not the top parent. */
- if (!bms_equal(parent_relids, top_parent_relids))
+ /*
+ * Recurse if immediate parent is not the top parent. Keep in mind that in a
+ * case of asymmetric JOIN top_parent_relids can contain relids which aren't
+ * part of an append node.
+ */
+ if (!bms_equal(parent_relids, top_parent_relids) &&
+ !bms_is_subset(parent_relids, top_parent_relids))
node = adjust_appendrel_attrs_multilevel(root, node, parent_relids,
top_parent_relids);
@@ -565,6 +570,7 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
AppendRelInfo **appinfos;
int nappinfos;
Relids parent_relids = NULL;
+ Relids normal_relids = NULL;
Relids result;
Relids tmp_result = NULL;
int cnt;
@@ -579,12 +585,17 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of the given child. */
+ normal_relids = bms_copy(child_relids);
for (cnt = 0; cnt < nappinfos; cnt++)
{
AppendRelInfo *appinfo = appinfos[cnt];
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
+ normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
}
+ parent_relids = bms_union(parent_relids, normal_relids);
+ if (normal_relids)
+ bms_free(normal_relids);
/* Recurse if immediate parent is not the top parent. */
if (!bms_equal(parent_relids, top_parent_relids))
@@ -715,11 +726,11 @@ AppendRelInfo **
find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
{
AppendRelInfo **appinfos;
+ int nrooms = bms_num_members(relids);
int cnt = 0;
int i;
- *nappinfos = bms_num_members(relids);
- appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * *nappinfos);
+ appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * nrooms);
i = -1;
while ((i = bms_next_member(relids, i)) >= 0)
@@ -727,10 +738,11 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- elog(ERROR, "child rel %d not found in append_rel_array", i);
+ continue;
appinfos[cnt++] = appinfo;
}
+ *nappinfos = cnt;
return appinfos;
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b248b038e0..73cfb4748d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -4209,7 +4209,14 @@ do { \
MemoryContextSwitchTo(oldcontext);
}
- bms_free(required_outer);
+
+ /*
+ * If adjust_child_relids_multilevel don't do replacements it returns
+ * the original set, not a copy. It is possible in the case of asymmetric
+ * JOIN and child_rel->relids contains relids only of plane relations.
+ */
+ if (required_outer != old_ppi->ppi_req_outer)
+ bms_free(required_outer);
new_path->param_info = new_ppi;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e105a4d5f1..f98a82e725 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -790,11 +790,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
joinrel->reloptkind = RELOPT_OTHER_JOINREL;
joinrel->relids = bms_union(outer_rel->relids, inner_rel->relids);
@@ -851,8 +848,11 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
- joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
- inner_rel->top_parent_relids);
+ joinrel->top_parent_relids =
+ bms_union(IS_OTHER_REL(outer_rel) ?
+ outer_rel->top_parent_relids : outer_rel->relids,
+ IS_OTHER_REL(inner_rel) ?
+ inner_rel->top_parent_relids : inner_rel->relids);
/* Compute information relevant to foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 035d3e1206..d64aa37f80 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -109,7 +109,12 @@ extern bool have_join_order_restriction(PlannerInfo *root,
extern bool have_dangerous_phv(PlannerInfo *root,
Relids outer_relids, Relids inner_params);
extern void mark_dummy_rel(RelOptInfo *rel);
-
+extern void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra);
/*
* equivclass.c
* routines for managing EquivalenceClasses
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 27f7525b3e..2bc7ad70d9 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2320,6 +2320,215 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
375 | 0375 | 375 | 0375
(8 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+VACUUM ANALYZE prt5;
+VACUUM ANALYZE t5_1;
+VACUUM ANALYZE t5_2;
+SET max_parallel_workers_per_gather = 0;
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_1.b = t5_2.bid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_2.b = t5_2.bid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_3.b = t5_2.bid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+(34 rows)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Left Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+ANALYZE prta,prtb,e;
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+ QUERY PLAN
+-----------------------------------------------------------
+ Append
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Append
+ -> Nested Loop
+ Join Filter: (prta_1.id = e_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Index Scan using e1_id_idx on e1 e_1
+ Index Cond: (id = prtb_1.id)
+ -> Nested Loop
+ Join Filter: (prta_2.id = e_2.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+ -> Index Scan using e2_id_idx on e2 e_2
+ Index Cond: (id = prtb_2.id)
+(17 rows)
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index d97b5b69ff..c206fd736a 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -536,6 +536,105 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+
+VACUUM ANALYZE prt5;
+VACUUM ANALYZE t5_1;
+VACUUM ANALYZE t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
+
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+
+ANALYZE prta,prtb,e;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
--
2.25.1
On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:
This entry was inactive during this CF, so I've marked it as returned
with feedback. Feel free to resubmit an updated version to a future
commitfest.
I return the patch to commitfest. My current reason differs from reason
of origin author.
This patch can open a door for more complex optimizations in the
partitionwise join push-down technique.
I mean, we can push-down join not only of two partitioned tables with
the same partition schema, but a partitioned (sharded) table with an
arbitrary subplan that is provable independent of local resources.
Example:
CREATE TABLE p(a int) PARTITION BY HASH (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE p2 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE p3 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 2);
SELECT * FROM p, (SELECT * FROM generate_series(1,2) AS a) AS s
WHERE p.a=s.a;
Hash Join
Hash Cond: (p.a = a.a)
-> Append
-> Seq Scan on p1 p_1
-> Seq Scan on p2 p_2
-> Seq Scan on p3 p_3
-> Hash
-> Function Scan on generate_series a
But with asymmetric join feature we have the plan:
Append
-> Hash Join
Hash Cond: (p_1.a = a.a)
-> Seq Scan on p1 p_1
-> Hash
-> Function Scan on generate_series a
-> Hash Join
Hash Cond: (p_2.a = a.a)
-> Seq Scan on p2 p_2
-> Hash
-> Function Scan on generate_series a
-> Hash Join
Hash Cond: (p_3.a = a.a)
-> Seq Scan on p3 p_3
-> Hash
-> Function Scan on generate_series a
In the case of FDW-sharding it means that if we can prove that the inner
relation is independent from the execution server, we can push-down
these joins and execute it in parallel.
--
regards,
Andrey Lepikhov
Postgres Professional
Next version of the patch.
For searching any problems I forced this patch during 'make check'
tests. Some bugs were found and fixed.
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
v14-0001-Asymmetric-partitionwise-join.patchtext/plain; charset=UTF-8; name=v14-0001-Asymmetric-partitionwise-join.patch; x-mac-creator=0; x-mac-type=0Download
From 101614b504b0b17e201d2375c8af61cfc671e51d Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Date: Fri, 2 Apr 2021 11:02:20 +0500
Subject: [PATCH] Asymmetric partitionwise join.
Teach optimizer to consider partitionwise join of non-partitioned
table with each partition of partitioned table.
Disallow asymmetric machinery for joining of two partitioned (or appended)
relations because it could cause CPU and memory huge consumption
during reparameterization of NestLoop path.
---
src/backend/optimizer/path/joinpath.c | 9 +
src/backend/optimizer/path/joinrels.c | 184 +++++++++++
src/backend/optimizer/plan/setrefs.c | 13 +-
src/backend/optimizer/util/appendinfo.c | 37 ++-
src/backend/optimizer/util/pathnode.c | 9 +-
src/backend/optimizer/util/relnode.c | 19 +-
src/include/optimizer/paths.h | 7 +-
src/test/regress/expected/partition_join.out | 306 +++++++++++++++++++
src/test/regress/sql/partition_join.sql | 126 ++++++++
9 files changed, 682 insertions(+), 28 deletions(-)
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index b67b517770..7a1a7b2e86 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
jointype, &extra);
+
+ /*
+ * 7. If outer relation is delivered from partition-tables, consider
+ * distributing inner relation to every partition-leaf prior to
+ * append these leafs.
+ */
+ try_asymmetric_partitionwise_join(root, joinrel,
+ outerrel, innerrel,
+ jointype, &extra);
}
/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..e6bd2ea5fe 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
#include "miscadmin.h"
#include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
#include "optimizer/joininfo.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
@@ -1551,6 +1552,189 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs are impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ AppendPath *append_path,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ foreach (lc, append_path->subpaths)
+ {
+ Path *child_path = lfirst(lc);
+ RelOptInfo *child_rel = child_path->parent;
+ Relids child_join_relids;
+ Relids parent_relids;
+ RelOptInfo *child_join_rel;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+
+ child_join_relids = bms_union(child_rel->relids, inner_rel->relids);
+ parent_relids = bms_union(append_path->path.parent->relids,
+ inner_rel->relids);
+
+ child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+ child_rel->relids,
+ inner_rel->relids);
+ child_restrictlist = (List *)
+ adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist,
+ child_join_relids, parent_relids);
+
+ child_join_rel = find_join_rel(root, child_join_relids);
+ if (!child_join_rel)
+ {
+ child_join_rel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+ if (!child_join_rel)
+ {
+ /*
+ * If can't build JOIN between inner relation and one of the outer
+ * partitions - return immediately.
+ */
+ return NIL;
+ }
+ }
+ else
+ {
+ /*
+ * TODO:
+ * Can't imagine situation when join relation already exists. But in
+ * the 'partition_join' regression test it happens.
+ * It may be an indicator of possible problems.
+ */
+ }
+
+ populate_joinrel_with_paths(root,
+ child_rel,
+ inner_rel,
+ child_join_rel,
+ child_sjinfo,
+ child_restrictlist);
+
+ /* Give up if asymmetric partition-wise join is not available */
+ if (child_join_rel->pathlist == NIL)
+ return NIL;
+
+ set_cheapest(child_join_rel);
+ result = lappend(result, child_join_rel);
+ }
+ return result;
+}
+
+static bool
+is_asymmetric_join_capable(PlannerInfo *root,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype)
+{
+ ListCell *lc;
+
+ if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+ return false;
+
+ if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+ return false;
+
+ /* Disallow recursive usage of asymmertic join machinery */
+ if (root->join_rel_level == NULL)
+ return false;
+
+ /*
+ * Don't allow asymmetric JOIN of two append subplans.
+ * In the case of a parameterized NL join, a reparameterization procedure will
+ * lead to large memory allocations and a CPU consumption:
+ * each reparameterize will induce subpath duplication, creating new
+ * ParamPathInfo instance and increasing of ppilist up to number of partitions
+ * in the inner. Also, if we have many partitions, each bitmapset
+ * variable will large and many leaks of such variable (caused by relid
+ * replacement) will highly increase memory consumption.
+ * So, we deny such paths for now.
+ */
+ foreach(lc, inner_rel->pathlist)
+ {
+ if (IsA(lfirst(lc), AppendPath))
+ return false;
+ }
+
+ return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ ListCell *lc;
+
+ /*
+ * Try this kind of paths if we allow complex partitionwise joins and we know
+ * we can build this join safely.
+ */
+ if (!enable_partitionwise_join ||
+ !is_asymmetric_join_capable(root, outer_rel, inner_rel, jointype))
+ return;
+
+ foreach (lc, outer_rel->pathlist)
+ {
+ AppendPath *append_path = lfirst(lc);
+
+ /*
+ * MEMO: We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. It is not correct right now, however, a hook
+ * on add_path() to give additional decision for path removal allows
+ * to retain this kind of AppendPath, regardless of its cost.
+ */
+ if (IsA(append_path, AppendPath))
+ {
+ List **join_rel_level_saved;
+ List *live_childrels = NIL;
+
+ join_rel_level_saved = root->join_rel_level;
+ PG_TRY();
+ {
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
+
+ live_childrels =
+ extract_asymmetric_partitionwise_subjoin(root,
+ joinrel,
+ append_path,
+ inner_rel,
+ jointype,
+ extra);
+ }
+ PG_FINALLY();
+ {
+ root->join_rel_level = join_rel_level_saved;
+ }
+ PG_END_TRY();
+
+ if (live_childrels != NIL)
+ add_paths_to_append_rel(root, joinrel, live_childrels);
+
+ break;
+ }
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 61ccfd300b..dbe568aa29 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -287,18 +287,23 @@ set_plan_references(PlannerInfo *root, Plan *plan)
foreach(lc, root->append_rel_list)
{
AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+ AppendRelInfo *newappinfo;
+
+ /* flat copy is enough since all valuable fields are scalars */
+ newappinfo = (AppendRelInfo *) palloc(sizeof(AppendRelInfo));
+ memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));
/* adjust RT indexes */
- appinfo->parent_relid += rtoffset;
- appinfo->child_relid += rtoffset;
+ newappinfo->parent_relid += rtoffset;
+ newappinfo->child_relid += rtoffset;
/*
* Rather than adjust the translated_vars entries, just drop 'em.
* Neither the executor nor EXPLAIN currently need that data.
*/
- appinfo->translated_vars = NIL;
+ newappinfo->translated_vars = NIL;
- glob->appendRelations = lappend(glob->appendRelations, appinfo);
+ glob->appendRelations = lappend(glob->appendRelations, newappinfo);
}
/* Now fix the Plan tree */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index af46f581ac..ec8dee99b0 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -202,7 +202,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.appinfos = appinfos;
/* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/* Should never be translating a Query tree. */
Assert(node == NULL || !IsA(node, Query));
@@ -489,12 +491,10 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
Relids child_relids,
Relids top_parent_relids)
{
- AppendRelInfo **appinfos;
- Bitmapset *parent_relids = NULL;
- int nappinfos;
- int cnt;
-
- Assert(bms_num_members(child_relids) == bms_num_members(top_parent_relids));
+ AppendRelInfo **appinfos;
+ Relids parent_relids = NULL;
+ int nappinfos;
+ int cnt;
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
@@ -506,8 +506,13 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
}
- /* Recurse if immediate parent is not the top parent. */
- if (!bms_equal(parent_relids, top_parent_relids))
+ /*
+ * Recurse if immediate parent is not the top parent. Keep in mind that in a
+ * case of asymmetric JOIN top_parent_relids can contain relids which aren't
+ * part of an append node.
+ */
+ if (!bms_equal(parent_relids, top_parent_relids) &&
+ !bms_is_subset(parent_relids, top_parent_relids))
node = adjust_appendrel_attrs_multilevel(root, node, parent_relids,
top_parent_relids);
@@ -515,6 +520,7 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
node = adjust_appendrel_attrs(root, node, nappinfos, appinfos);
pfree(appinfos);
+ pfree(parent_relids);
return node;
}
@@ -565,6 +571,7 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
AppendRelInfo **appinfos;
int nappinfos;
Relids parent_relids = NULL;
+ Relids normal_relids = NULL;
Relids result;
Relids tmp_result = NULL;
int cnt;
@@ -579,12 +586,17 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of the given child. */
+ normal_relids = bms_copy(child_relids);
for (cnt = 0; cnt < nappinfos; cnt++)
{
AppendRelInfo *appinfo = appinfos[cnt];
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
+ normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
}
+ parent_relids = bms_union(parent_relids, normal_relids);
+ if (normal_relids)
+ bms_free(normal_relids);
/* Recurse if immediate parent is not the top parent. */
if (!bms_equal(parent_relids, top_parent_relids))
@@ -715,11 +727,11 @@ AppendRelInfo **
find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
{
AppendRelInfo **appinfos;
+ int nrooms = bms_num_members(relids);
int cnt = 0;
int i;
- *nappinfos = bms_num_members(relids);
- appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * *nappinfos);
+ appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * nrooms);
i = -1;
while ((i = bms_next_member(relids, i)) >= 0)
@@ -727,10 +739,11 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- elog(ERROR, "child rel %d not found in append_rel_array", i);
+ continue;
appinfos[cnt++] = appinfo;
}
+ *nappinfos = cnt;
return appinfos;
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b248b038e0..73cfb4748d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -4209,7 +4209,14 @@ do { \
MemoryContextSwitchTo(oldcontext);
}
- bms_free(required_outer);
+
+ /*
+ * If adjust_child_relids_multilevel don't do replacements it returns
+ * the original set, not a copy. It is possible in the case of asymmetric
+ * JOIN and child_rel->relids contains relids only of plane relations.
+ */
+ if (required_outer != old_ppi->ppi_req_outer)
+ bms_free(required_outer);
new_path->param_info = new_ppi;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e105a4d5f1..bb15fb2716 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -790,11 +790,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
joinrel->reloptkind = RELOPT_OTHER_JOINREL;
joinrel->relids = bms_union(outer_rel->relids, inner_rel->relids);
@@ -851,8 +848,11 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
- joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
- inner_rel->top_parent_relids);
+ joinrel->top_parent_relids =
+ bms_union(IS_OTHER_REL(outer_rel) ?
+ outer_rel->top_parent_relids : outer_rel->relids,
+ IS_OTHER_REL(inner_rel) ?
+ inner_rel->top_parent_relids : inner_rel->relids);
/* Compute information relevant to foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
@@ -2033,9 +2033,8 @@ build_child_join_reltarget(PlannerInfo *root,
{
/* Build the targetlist */
childrel->reltarget->exprs = (List *)
- adjust_appendrel_attrs(root,
- (Node *) parentrel->reltarget->exprs,
- nappinfos, appinfos);
+ adjust_appendrel_attrs_multilevel(root, (Node *)parentrel->reltarget->exprs,
+ childrel->relids, parentrel->relids);
/* Set the cost and width fields */
childrel->reltarget->cost.startup = parentrel->reltarget->cost.startup;
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index f1d111063c..a0106dc539 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -109,7 +109,12 @@ extern bool have_join_order_restriction(PlannerInfo *root,
extern bool have_dangerous_phv(PlannerInfo *root,
Relids outer_relids, Relids inner_params);
extern void mark_dummy_rel(RelOptInfo *rel);
-
+extern void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra);
/*
* equivclass.c
* routines for managing EquivalenceClasses
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 27f7525b3e..601f08662e 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2320,6 +2320,312 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
375 | 0375 | 375 | 0375
(8 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+SET max_parallel_workers_per_gather = 0;
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ -> Seq Scan on prt5_p0
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p1.a = t5_1.aid)
+ -> Seq Scan on prt5_p1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p2.a = t5_1.aid)
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Hash Join
+ Hash Cond: (prt5.a = prt6.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Seq Scan on prt6_p1 prt6_2
+ Filter: (alabel ~~ '%abc%'::text)
+(12 rows)
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+----------------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = sq1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ -> Seq Scan on prt6_p1 prt6_2
+ -> Hash Join
+ Hash Cond: (prt5_2.a = sq1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_4
+ -> Seq Scan on prt6_p1 prt6_5
+ -> Hash Join
+ Hash Cond: (prt5_3.a = sq1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_7
+ -> Seq Scan on prt6_p1 prt6_8
+(31 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_1.b = t5_2.bid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_2.b = t5_2.bid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_3.b = t5_2.bid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+(34 rows)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Left Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+ANALYZE prta,prtb,e;
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+ QUERY PLAN
+-----------------------------------------------------------
+ Append
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Append
+ -> Nested Loop
+ Join Filter: (prta_1.id = e_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Index Scan using e1_id_idx on e1 e_1
+ Index Cond: (id = prtb_1.id)
+ -> Nested Loop
+ Join Filter: (prta_2.id = e_2.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+ -> Index Scan using e2_id_idx on e2 e_2
+ Index Cond: (id = prtb_2.id)
+(17 rows)
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index d97b5b69ff..51b81ce3f8 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -536,6 +536,132 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
+
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+
+ANALYZE prta,prtb,e;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
--
2.31.1
Andrey Lepikhov писал 2021-05-27 07:27:
Next version of the patch.
For searching any problems I forced this patch during 'make check'
tests. Some bugs were found and fixed.
Hi.
I've tested this patch and haven't found issues, but I have some
comments.
src/backend/optimizer/path/joinrels.c:
1554
1555 /*
1556 * Build RelOptInfo on JOIN of each partition of the outer relation
and the inner
1557 * relation. Return List of such RelOptInfo's. Return NIL, if at
least one of
1558 * these JOINs are impossible to build.
1559 */
1560 static List *
1561 extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
1562
RelOptInfo *joinrel,
1563
AppendPath *append_path,
1564
RelOptInfo *inner_rel,
1565
JoinType jointype,
1566
JoinPathExtraData *extra)
1567 {
1568 List *result = NIL;
1569 ListCell *lc;
1570
1571 foreach (lc, append_path->subpaths)
1572 {
1573 Path *child_path = lfirst(lc);
1574 RelOptInfo *child_rel =
child_path->parent;
1575 Relids child_join_relids;
1576 Relids parent_relids;
1577 RelOptInfo *child_join_rel;
1578 SpecialJoinInfo *child_sjinfo;
1579 List *child_restrictlist;
Variable names - child_join_rel and child_join_relids seem to be
inconsistent with rest of the file (I see child_joinrelids in
try_partitionwise_join() and child_joinrel in try_partitionwise_join()
and get_matching_part_pairs()).
1595 child_join_rel = build_child_join_rel(root,
1596
child_rel,
1597
inner_rel,
1598
joinrel,
1599
child_restrictlist,
1600
child_sjinfo,
1601
jointype);
1602 if (!child_join_rel)
1603 {
1604 /*
1605 * If can't build JOIN between
inner relation and one of the outer
1606 * partitions - return immediately.
1607 */
1608 return NIL;
1609 }
When build_child_join_rel() can return NULL?
If I read code correctly, joinrel is created in the begining of
build_child_join_rel() with makeNode(), makeNode() wraps newNode() and
newNode() uses MemoryContextAllocZero()/MemoryContextAllocZeroAligned(),
which would error() on alloc() failure.
1637
1638 static bool
1639 is_asymmetric_join_capable(PlannerInfo *root,
1640 RelOptInfo
*outer_rel,
1641 RelOptInfo
*inner_rel,
1642 JoinType
jointype)
1643 {
Function misses a comment.
1656 /*
1657 * Don't allow asymmetric JOIN of two append subplans.
1658 * In the case of a parameterized NL join, a
reparameterization procedure will
1659 * lead to large memory allocations and a CPU consumption:
1660 * each reparameterize will induce subpath duplication,
creating new
1661 * ParamPathInfo instance and increasing of ppilist up to
number of partitions
1662 * in the inner. Also, if we have many partitions, each
bitmapset
1663 * variable will large and many leaks of such variable
(caused by relid
1664 * replacement) will highly increase memory consumption.
1665 * So, we deny such paths for now.
1666 */
Missing word:
each bitmapset variable will large => each bitmapset variable will be
large
1694 foreach (lc, outer_rel->pathlist)
1695 {
1696 AppendPath *append_path = lfirst(lc);
1697
1698 /*
1699 * MEMO: We assume this pathlist keeps at least one
AppendPath that
1700 * represents partitioned table-scan, symmetric or
asymmetric
1701 * partition-wise join. It is not correct right
now, however, a hook
1702 * on add_path() to give additional decision for
path removal allows
1703 * to retain this kind of AppendPath, regardless of
its cost.
1704 */
1705 if (IsA(append_path, AppendPath))
What hook do you refer to?
src/backend/optimizer/plan/setrefs.c:
282 /*
283 * Adjust RT indexes of AppendRelInfos and add to final
appendrels list.
284 * We assume the AppendRelInfos were built during planning
and don't need
285 * to be copied.
286 */
287 foreach(lc, root->append_rel_list)
288 {
289 AppendRelInfo *appinfo = lfirst_node(AppendRelInfo,
lc);
290 AppendRelInfo *newappinfo;
291
292 /* flat copy is enough since all valuable fields are
scalars */
293 newappinfo = (AppendRelInfo *)
palloc(sizeof(AppendRelInfo));
294 memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));
You've changed function to copy appinfo, so now comment is incorrect.
src/backend/optimizer/util/appendinfo.c:
588 /* Construct relids set for the immediate parent of the
given child. */
589 normal_relids = bms_copy(child_relids);
590 for (cnt = 0; cnt < nappinfos; cnt++)
591 {
592 AppendRelInfo *appinfo = appinfos[cnt];
593
594 parent_relids = bms_add_member(parent_relids,
appinfo->parent_relid);
595 normal_relids = bms_del_member(normal_relids,
appinfo->child_relid);
596 }
597 parent_relids = bms_union(parent_relids, normal_relids);
Do I understand correctly that now parent_relids also contains relids of
relations from 'global' inner relation, which we join to childs?
--
Best regards,
Alexander Pyhalov,
Postgres Professional
On 18/6/21 15:02, Alexander Pyhalov wrote:
Andrey Lepikhov писал 2021-05-27 07:27:
Next version of the patch.
For searching any problems I forced this patch during 'make check'
tests. Some bugs were found and fixed.Hi.
I've tested this patch and haven't found issues, but I have some comments.
Thank you for review!
Variable names - child_join_rel and child_join_relids seem to be
inconsistent with rest of the file
fixed
When build_child_join_rel() can return NULL?
Fixed
Missing word:
each bitmapset variable will large => each bitmapset variable will be large
Fixed
What hook do you refer to?
Removed> You've changed function to copy appinfo, so now comment is
incorrect.
Thanks, fixed> Do I understand correctly that now parent_relids also
contains relids of
relations from 'global' inner relation, which we join to childs?
Yes
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
v15-0001-Asymmetric-partitionwise-join.patchtext/plain; charset=UTF-8; name=v15-0001-Asymmetric-partitionwise-join.patch; x-mac-creator=0; x-mac-type=0Download
From d653b4f65486e2cefafda61811390c4095eae371 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Date: Fri, 2 Apr 2021 11:02:20 +0500
Subject: [PATCH] Asymmetric partitionwise join.
Teach optimizer to consider partitionwise join of non-partitioned
table with each partition of partitioned table.
Disallow asymmetric machinery for joining of two partitioned (or appended)
relations because it could cause CPU and memory huge consumption
during reparameterization of NestLoop path.
---
src/backend/optimizer/path/joinpath.c | 9 +
src/backend/optimizer/path/joinrels.c | 172 +++++++++++
src/backend/optimizer/plan/setrefs.c | 17 +-
src/backend/optimizer/util/appendinfo.c | 37 ++-
src/backend/optimizer/util/pathnode.c | 9 +-
src/backend/optimizer/util/relnode.c | 19 +-
src/include/optimizer/paths.h | 7 +-
src/test/regress/expected/partition_join.out | 306 +++++++++++++++++++
src/test/regress/sql/partition_join.sql | 126 ++++++++
9 files changed, 672 insertions(+), 30 deletions(-)
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index b67b517770..7a1a7b2e86 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
jointype, &extra);
+
+ /*
+ * 7. If outer relation is delivered from partition-tables, consider
+ * distributing inner relation to every partition-leaf prior to
+ * append these leafs.
+ */
+ try_asymmetric_partitionwise_join(root, joinrel,
+ outerrel, innerrel,
+ jointype, &extra);
}
/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..2839b7acc3 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
#include "miscadmin.h"
#include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
#include "optimizer/joininfo.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
@@ -1551,6 +1552,177 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs are impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ AppendPath *append_path,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ foreach (lc, append_path->subpaths)
+ {
+ Path *child_path = lfirst(lc);
+ RelOptInfo *child_rel = child_path->parent;
+ Relids child_joinrelids;
+ Relids parent_relids;
+ RelOptInfo *child_joinrel;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+
+ child_joinrelids = bms_union(child_rel->relids, inner_rel->relids);
+ parent_relids = bms_union(append_path->path.parent->relids,
+ inner_rel->relids);
+
+ child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+ child_rel->relids,
+ inner_rel->relids);
+ child_restrictlist = (List *)
+ adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist,
+ child_joinrelids, parent_relids);
+
+ child_joinrel = find_join_rel(root, child_joinrelids);
+ if (!child_joinrel)
+ child_joinrel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+ else
+ {
+ /*
+ * TODO:
+ * Can't imagine situation when join relation already exists. But in
+ * the 'partition_join' regression test it happens.
+ * It may be an indicator of possible problems.
+ */
+ }
+
+ populate_joinrel_with_paths(root,
+ child_rel,
+ inner_rel,
+ child_joinrel,
+ child_sjinfo,
+ child_restrictlist);
+
+ /* Give up if asymmetric partition-wise join is not available */
+ if (child_joinrel->pathlist == NIL)
+ return NIL;
+
+ set_cheapest(child_joinrel);
+ result = lappend(result, child_joinrel);
+ }
+ return result;
+}
+
+static bool
+is_asymmetric_join_capable(PlannerInfo *root,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype)
+{
+ ListCell *lc;
+
+ if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+ return false;
+
+ if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+ return false;
+
+ /* Disallow recursive usage of asymmertic join machinery */
+ if (root->join_rel_level == NULL)
+ return false;
+
+ /*
+ * Don't allow asymmetric JOIN of two append subplans.
+ * In the case of a parameterized NL join, a reparameterization procedure
+ * will lead to large memory allocations and a CPU consumption:
+ * each reparameterization will induce subpath duplication, creating new
+ * ParamPathInfo instance and increasing of ppilist up to number of
+ * partitions in the inner. Also, if we have many partitions, each bitmapset
+ * variable will be large and many leaks of such variable (caused by relid
+ * replacement) will highly increase memory consumption.
+ * So, we deny such paths for now.
+ */
+ foreach(lc, inner_rel->pathlist)
+ {
+ if (IsA(lfirst(lc), AppendPath))
+ return false;
+ }
+
+ return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ ListCell *lc;
+
+ /*
+ * Try this kind of paths if we allow complex partitionwise joins and we know
+ * we can build this join safely.
+ */
+ if (!enable_partitionwise_join ||
+ !is_asymmetric_join_capable(root, outer_rel, inner_rel, jointype))
+ return;
+
+ foreach (lc, outer_rel->pathlist)
+ {
+ AppendPath *append_path = lfirst(lc);
+
+ /*
+ * We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join.
+ */
+ if (IsA(append_path, AppendPath))
+ {
+ List **join_rel_level_saved;
+ List *live_childrels = NIL;
+
+ join_rel_level_saved = root->join_rel_level;
+ PG_TRY();
+ {
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
+
+ live_childrels =
+ extract_asymmetric_partitionwise_subjoin(root,
+ joinrel,
+ append_path,
+ inner_rel,
+ jointype,
+ extra);
+ }
+ PG_FINALLY();
+ {
+ root->join_rel_level = join_rel_level_saved;
+ }
+ PG_END_TRY();
+
+ if (live_childrels != NIL)
+ add_paths_to_append_rel(root, joinrel, live_childrels);
+
+ break;
+ }
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 61ccfd300b..834843ba33 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -281,24 +281,29 @@ set_plan_references(PlannerInfo *root, Plan *plan)
/*
* Adjust RT indexes of AppendRelInfos and add to final appendrels list.
- * We assume the AppendRelInfos were built during planning and don't need
- * to be copied.
+ * The AppendRelInfos are copied, because as a part of a subplan its could
+ * be visited many times in the case of asymmetric join.
*/
foreach(lc, root->append_rel_list)
{
AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+ AppendRelInfo *newappinfo;
+
+ /* flat copy is enough since all valuable fields are scalars */
+ newappinfo = (AppendRelInfo *) palloc(sizeof(AppendRelInfo));
+ memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));
/* adjust RT indexes */
- appinfo->parent_relid += rtoffset;
- appinfo->child_relid += rtoffset;
+ newappinfo->parent_relid += rtoffset;
+ newappinfo->child_relid += rtoffset;
/*
* Rather than adjust the translated_vars entries, just drop 'em.
* Neither the executor nor EXPLAIN currently need that data.
*/
- appinfo->translated_vars = NIL;
+ newappinfo->translated_vars = NIL;
- glob->appendRelations = lappend(glob->appendRelations, appinfo);
+ glob->appendRelations = lappend(glob->appendRelations, newappinfo);
}
/* Now fix the Plan tree */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index af46f581ac..ec8dee99b0 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -202,7 +202,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.appinfos = appinfos;
/* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/* Should never be translating a Query tree. */
Assert(node == NULL || !IsA(node, Query));
@@ -489,12 +491,10 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
Relids child_relids,
Relids top_parent_relids)
{
- AppendRelInfo **appinfos;
- Bitmapset *parent_relids = NULL;
- int nappinfos;
- int cnt;
-
- Assert(bms_num_members(child_relids) == bms_num_members(top_parent_relids));
+ AppendRelInfo **appinfos;
+ Relids parent_relids = NULL;
+ int nappinfos;
+ int cnt;
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
@@ -506,8 +506,13 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
}
- /* Recurse if immediate parent is not the top parent. */
- if (!bms_equal(parent_relids, top_parent_relids))
+ /*
+ * Recurse if immediate parent is not the top parent. Keep in mind that in a
+ * case of asymmetric JOIN top_parent_relids can contain relids which aren't
+ * part of an append node.
+ */
+ if (!bms_equal(parent_relids, top_parent_relids) &&
+ !bms_is_subset(parent_relids, top_parent_relids))
node = adjust_appendrel_attrs_multilevel(root, node, parent_relids,
top_parent_relids);
@@ -515,6 +520,7 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
node = adjust_appendrel_attrs(root, node, nappinfos, appinfos);
pfree(appinfos);
+ pfree(parent_relids);
return node;
}
@@ -565,6 +571,7 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
AppendRelInfo **appinfos;
int nappinfos;
Relids parent_relids = NULL;
+ Relids normal_relids = NULL;
Relids result;
Relids tmp_result = NULL;
int cnt;
@@ -579,12 +586,17 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of the given child. */
+ normal_relids = bms_copy(child_relids);
for (cnt = 0; cnt < nappinfos; cnt++)
{
AppendRelInfo *appinfo = appinfos[cnt];
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
+ normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
}
+ parent_relids = bms_union(parent_relids, normal_relids);
+ if (normal_relids)
+ bms_free(normal_relids);
/* Recurse if immediate parent is not the top parent. */
if (!bms_equal(parent_relids, top_parent_relids))
@@ -715,11 +727,11 @@ AppendRelInfo **
find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
{
AppendRelInfo **appinfos;
+ int nrooms = bms_num_members(relids);
int cnt = 0;
int i;
- *nappinfos = bms_num_members(relids);
- appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * *nappinfos);
+ appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * nrooms);
i = -1;
while ((i = bms_next_member(relids, i)) >= 0)
@@ -727,10 +739,11 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- elog(ERROR, "child rel %d not found in append_rel_array", i);
+ continue;
appinfos[cnt++] = appinfo;
}
+ *nappinfos = cnt;
return appinfos;
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 9ce5f95e3b..03c0640473 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -4226,7 +4226,14 @@ do { \
MemoryContextSwitchTo(oldcontext);
}
- bms_free(required_outer);
+
+ /*
+ * If adjust_child_relids_multilevel don't do replacements it returns
+ * the original set, not a copy. It is possible in the case of asymmetric
+ * JOIN and child_rel->relids contains relids only of plane relations.
+ */
+ if (required_outer != old_ppi->ppi_req_outer)
+ bms_free(required_outer);
new_path->param_info = new_ppi;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e105a4d5f1..bb15fb2716 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -790,11 +790,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
joinrel->reloptkind = RELOPT_OTHER_JOINREL;
joinrel->relids = bms_union(outer_rel->relids, inner_rel->relids);
@@ -851,8 +848,11 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
- joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
- inner_rel->top_parent_relids);
+ joinrel->top_parent_relids =
+ bms_union(IS_OTHER_REL(outer_rel) ?
+ outer_rel->top_parent_relids : outer_rel->relids,
+ IS_OTHER_REL(inner_rel) ?
+ inner_rel->top_parent_relids : inner_rel->relids);
/* Compute information relevant to foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
@@ -2033,9 +2033,8 @@ build_child_join_reltarget(PlannerInfo *root,
{
/* Build the targetlist */
childrel->reltarget->exprs = (List *)
- adjust_appendrel_attrs(root,
- (Node *) parentrel->reltarget->exprs,
- nappinfos, appinfos);
+ adjust_appendrel_attrs_multilevel(root, (Node *)parentrel->reltarget->exprs,
+ childrel->relids, parentrel->relids);
/* Set the cost and width fields */
childrel->reltarget->cost.startup = parentrel->reltarget->cost.startup;
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index f1d111063c..a0106dc539 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -109,7 +109,12 @@ extern bool have_join_order_restriction(PlannerInfo *root,
extern bool have_dangerous_phv(PlannerInfo *root,
Relids outer_relids, Relids inner_params);
extern void mark_dummy_rel(RelOptInfo *rel);
-
+extern void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra);
/*
* equivclass.c
* routines for managing EquivalenceClasses
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 27f7525b3e..601f08662e 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2320,6 +2320,312 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
375 | 0375 | 375 | 0375
(8 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+SET max_parallel_workers_per_gather = 0;
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ -> Seq Scan on prt5_p0
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p1.a = t5_1.aid)
+ -> Seq Scan on prt5_p1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p2.a = t5_1.aid)
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Hash Join
+ Hash Cond: (prt5.a = prt6.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Seq Scan on prt6_p1 prt6_2
+ Filter: (alabel ~~ '%abc%'::text)
+(12 rows)
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+----------------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = sq1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ -> Seq Scan on prt6_p1 prt6_2
+ -> Hash Join
+ Hash Cond: (prt5_2.a = sq1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_4
+ -> Seq Scan on prt6_p1 prt6_5
+ -> Hash Join
+ Hash Cond: (prt5_3.a = sq1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_7
+ -> Seq Scan on prt6_p1 prt6_8
+(31 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_1.b = t5_2.bid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_2.b = t5_2.bid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_3.b = t5_2.bid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+(34 rows)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Left Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+ANALYZE prta,prtb,e;
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+ QUERY PLAN
+-----------------------------------------------------------
+ Append
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Append
+ -> Nested Loop
+ Join Filter: (prta_1.id = e_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Index Scan using e1_id_idx on e1 e_1
+ Index Cond: (id = prtb_1.id)
+ -> Nested Loop
+ Join Filter: (prta_2.id = e_2.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+ -> Index Scan using e2_id_idx on e2 e_2
+ Index Cond: (id = prtb_2.id)
+(17 rows)
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index d97b5b69ff..51b81ce3f8 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -536,6 +536,132 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
+
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+
+ANALYZE prta,prtb,e;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
--
2.31.1
On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru>
wrote:
On 18/6/21 15:02, Alexander Pyhalov wrote:
Andrey Lepikhov писал 2021-05-27 07:27:
Next version of the patch.
For searching any problems I forced this patch during 'make check'
tests. Some bugs were found and fixed.Hi.
I've tested this patch and haven't found issues, but I have somecomments.
Thank you for review!Variable names - child_join_rel and child_join_relids seem to be
inconsistent with rest of the filefixed
When build_child_join_rel() can return NULL?
Fixed
Missing word:
each bitmapset variable will large => each bitmapset variable will belarge
FixedWhat hook do you refer to?
Removed> You've changed function to copy appinfo, so now comment is
incorrect.
Thanks, fixed> Do I understand correctly that now parent_relids also
contains relids ofrelations from 'global' inner relation, which we join to childs?
Yes
--
regards,
Andrey Lepikhov
Postgres Professional
Hi,
relations because it could cause CPU and memory huge consumption
during reparameterization of NestLoop path.
CPU and memory huge consumption -> huge consumption of CPU and memory
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one
of
+ * these JOINs are impossible to build.
at least one of these JOINs are impossible to build. -> at least one
of these JOINs is impossible to build.
+ * Can't imagine situation when join relation already exists.
But in
+ * the 'partition_join' regression test it happens.
+ * It may be an indicator of possible problems.
Should a log be added in the above case ?
+is_asymmetric_join_capable(PlannerInfo *root,
is_asymmetric_join_capable -> is_asymmetric_join_feasible
Cheers
On 5/7/21 23:15, Zhihong Yu wrote:
On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote: + * Can't imagine situation when join relation already exists. But in + * the 'partition_join' regression test it happens. + * It may be an indicator of possible problems. Should a log be added in the above case ?
I made additional analysis of this branch of code. This situation can
happen in the case of one child or if we join two plane tables with
partitioned. Both situations are legal and I think we don't needed to
add any log message here.
Other mistakes were fixed.
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
v16-0001-Asymmetric-partitionwise-join.patchtext/plain; charset=UTF-8; name=v16-0001-Asymmetric-partitionwise-join.patch; x-mac-creator=0; x-mac-type=0Download
From b9d7a148ab51f297c98345127e975f864fe6ef92 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Date: Fri, 2 Apr 2021 11:02:20 +0500
Subject: [PATCH] Asymmetric partitionwise join.
Teach optimizer to consider partitionwise join of non-partitioned
table with each partition of partitioned table.
Disallow asymmetric machinery for joining of two partitioned (or appended)
relations because it could cause huge consumption of CPU and memory
during reparameterization of NestLoop path.
---
src/backend/optimizer/path/joinpath.c | 9 +
src/backend/optimizer/path/joinrels.c | 172 +++++++++++
src/backend/optimizer/plan/setrefs.c | 17 +-
src/backend/optimizer/util/appendinfo.c | 37 ++-
src/backend/optimizer/util/pathnode.c | 9 +-
src/backend/optimizer/util/relnode.c | 19 +-
src/include/optimizer/paths.h | 7 +-
src/test/regress/expected/partition_join.out | 306 +++++++++++++++++++
src/test/regress/sql/partition_join.sql | 126 ++++++++
9 files changed, 672 insertions(+), 30 deletions(-)
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index b67b517770..7a1a7b2e86 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
jointype, &extra);
+
+ /*
+ * 7. If outer relation is delivered from partition-tables, consider
+ * distributing inner relation to every partition-leaf prior to
+ * append these leafs.
+ */
+ try_asymmetric_partitionwise_join(root, joinrel,
+ outerrel, innerrel,
+ jointype, &extra);
}
/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..e9cbff9709 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
#include "miscadmin.h"
#include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
#include "optimizer/joininfo.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
@@ -1551,6 +1552,177 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs is impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ AppendPath *append_path,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ foreach (lc, append_path->subpaths)
+ {
+ Path *child_path = lfirst(lc);
+ RelOptInfo *child_rel = child_path->parent;
+ Relids child_joinrelids;
+ Relids parent_relids;
+ RelOptInfo *child_joinrel;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+
+ child_joinrelids = bms_union(child_rel->relids, inner_rel->relids);
+ parent_relids = bms_union(append_path->path.parent->relids,
+ inner_rel->relids);
+
+ child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+ child_rel->relids,
+ inner_rel->relids);
+ child_restrictlist = (List *)
+ adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist,
+ child_joinrelids, parent_relids);
+
+ child_joinrel = find_join_rel(root, child_joinrelids);
+ if (!child_joinrel)
+ child_joinrel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+ else
+ {
+ /*
+ * The join relation already exists. For example, it could happen if
+ * we join two plane tables with partitioned table(s). Do nothing.
+ */
+ }
+
+ populate_joinrel_with_paths(root,
+ child_rel,
+ inner_rel,
+ child_joinrel,
+ child_sjinfo,
+ child_restrictlist);
+
+ /* Give up if asymmetric partition-wise join is not available */
+ if (child_joinrel->pathlist == NIL)
+ return NIL;
+
+ set_cheapest(child_joinrel);
+ result = lappend(result, child_joinrel);
+ }
+ return result;
+}
+
+static bool
+is_asymmetric_join_feasible(PlannerInfo *root,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype)
+{
+ ListCell *lc;
+
+ if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+ return false;
+
+ if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+ return false;
+
+ /* Disallow recursive usage of asymmertic join machinery */
+ if (root->join_rel_level == NULL)
+ return false;
+
+ /*
+ * Don't allow asymmetric JOIN of two append subplans.
+ * In the case of a parameterized NL join, a reparameterization procedure
+ * will lead to large memory allocations and a CPU consumption:
+ * each reparameterization will induce subpath duplication, creating new
+ * ParamPathInfo instance and increasing of ppilist up to number of
+ * partitions in the inner. Also, if we have many partitions, each bitmapset
+ * variable will be large and many leaks of such variable (caused by relid
+ * replacement) will highly increase memory consumption.
+ * So, we deny such paths for now.
+ */
+ foreach(lc, inner_rel->pathlist)
+ {
+ if (IsA(lfirst(lc), AppendPath))
+ return false;
+ }
+
+ return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ ListCell *lc;
+
+ /*
+ * Try this kind of paths if we allow complex partitionwise joins and we know
+ * we can build this join safely.
+ */
+ if (!enable_partitionwise_join ||
+ !is_asymmetric_join_feasible(root, outer_rel, inner_rel, jointype))
+ return;
+
+ foreach (lc, outer_rel->pathlist)
+ {
+ AppendPath *append_path = lfirst(lc);
+
+ /*
+ * We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. Asymmetric join isn't needed if the append node
+ * has only one child.
+ */
+ if (IsA(append_path, AppendPath) &&
+ list_length(append_path->subpaths) > 1)
+ {
+ List **join_rel_level_saved;
+ List *live_childrels = NIL;
+
+ join_rel_level_saved = root->join_rel_level;
+ PG_TRY();
+ {
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
+
+ live_childrels =
+ extract_asymmetric_partitionwise_subjoin(root,
+ joinrel,
+ append_path,
+ inner_rel,
+ jointype,
+ extra);
+ }
+ PG_FINALLY();
+ {
+ root->join_rel_level = join_rel_level_saved;
+ }
+ PG_END_TRY();
+
+ if (live_childrels != NIL)
+ add_paths_to_append_rel(root, joinrel, live_childrels);
+
+ break;
+ }
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 61ccfd300b..834843ba33 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -281,24 +281,29 @@ set_plan_references(PlannerInfo *root, Plan *plan)
/*
* Adjust RT indexes of AppendRelInfos and add to final appendrels list.
- * We assume the AppendRelInfos were built during planning and don't need
- * to be copied.
+ * The AppendRelInfos are copied, because as a part of a subplan its could
+ * be visited many times in the case of asymmetric join.
*/
foreach(lc, root->append_rel_list)
{
AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+ AppendRelInfo *newappinfo;
+
+ /* flat copy is enough since all valuable fields are scalars */
+ newappinfo = (AppendRelInfo *) palloc(sizeof(AppendRelInfo));
+ memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));
/* adjust RT indexes */
- appinfo->parent_relid += rtoffset;
- appinfo->child_relid += rtoffset;
+ newappinfo->parent_relid += rtoffset;
+ newappinfo->child_relid += rtoffset;
/*
* Rather than adjust the translated_vars entries, just drop 'em.
* Neither the executor nor EXPLAIN currently need that data.
*/
- appinfo->translated_vars = NIL;
+ newappinfo->translated_vars = NIL;
- glob->appendRelations = lappend(glob->appendRelations, appinfo);
+ glob->appendRelations = lappend(glob->appendRelations, newappinfo);
}
/* Now fix the Plan tree */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index af46f581ac..ec8dee99b0 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -202,7 +202,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.appinfos = appinfos;
/* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/* Should never be translating a Query tree. */
Assert(node == NULL || !IsA(node, Query));
@@ -489,12 +491,10 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
Relids child_relids,
Relids top_parent_relids)
{
- AppendRelInfo **appinfos;
- Bitmapset *parent_relids = NULL;
- int nappinfos;
- int cnt;
-
- Assert(bms_num_members(child_relids) == bms_num_members(top_parent_relids));
+ AppendRelInfo **appinfos;
+ Relids parent_relids = NULL;
+ int nappinfos;
+ int cnt;
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
@@ -506,8 +506,13 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
}
- /* Recurse if immediate parent is not the top parent. */
- if (!bms_equal(parent_relids, top_parent_relids))
+ /*
+ * Recurse if immediate parent is not the top parent. Keep in mind that in a
+ * case of asymmetric JOIN top_parent_relids can contain relids which aren't
+ * part of an append node.
+ */
+ if (!bms_equal(parent_relids, top_parent_relids) &&
+ !bms_is_subset(parent_relids, top_parent_relids))
node = adjust_appendrel_attrs_multilevel(root, node, parent_relids,
top_parent_relids);
@@ -515,6 +520,7 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
node = adjust_appendrel_attrs(root, node, nappinfos, appinfos);
pfree(appinfos);
+ pfree(parent_relids);
return node;
}
@@ -565,6 +571,7 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
AppendRelInfo **appinfos;
int nappinfos;
Relids parent_relids = NULL;
+ Relids normal_relids = NULL;
Relids result;
Relids tmp_result = NULL;
int cnt;
@@ -579,12 +586,17 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of the given child. */
+ normal_relids = bms_copy(child_relids);
for (cnt = 0; cnt < nappinfos; cnt++)
{
AppendRelInfo *appinfo = appinfos[cnt];
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
+ normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
}
+ parent_relids = bms_union(parent_relids, normal_relids);
+ if (normal_relids)
+ bms_free(normal_relids);
/* Recurse if immediate parent is not the top parent. */
if (!bms_equal(parent_relids, top_parent_relids))
@@ -715,11 +727,11 @@ AppendRelInfo **
find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
{
AppendRelInfo **appinfos;
+ int nrooms = bms_num_members(relids);
int cnt = 0;
int i;
- *nappinfos = bms_num_members(relids);
- appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * *nappinfos);
+ appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * nrooms);
i = -1;
while ((i = bms_next_member(relids, i)) >= 0)
@@ -727,10 +739,11 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- elog(ERROR, "child rel %d not found in append_rel_array", i);
+ continue;
appinfos[cnt++] = appinfo;
}
+ *nappinfos = cnt;
return appinfos;
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 9ce5f95e3b..03c0640473 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -4226,7 +4226,14 @@ do { \
MemoryContextSwitchTo(oldcontext);
}
- bms_free(required_outer);
+
+ /*
+ * If adjust_child_relids_multilevel don't do replacements it returns
+ * the original set, not a copy. It is possible in the case of asymmetric
+ * JOIN and child_rel->relids contains relids only of plane relations.
+ */
+ if (required_outer != old_ppi->ppi_req_outer)
+ bms_free(required_outer);
new_path->param_info = new_ppi;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e105a4d5f1..bb15fb2716 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -790,11 +790,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
joinrel->reloptkind = RELOPT_OTHER_JOINREL;
joinrel->relids = bms_union(outer_rel->relids, inner_rel->relids);
@@ -851,8 +848,11 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
- joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
- inner_rel->top_parent_relids);
+ joinrel->top_parent_relids =
+ bms_union(IS_OTHER_REL(outer_rel) ?
+ outer_rel->top_parent_relids : outer_rel->relids,
+ IS_OTHER_REL(inner_rel) ?
+ inner_rel->top_parent_relids : inner_rel->relids);
/* Compute information relevant to foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
@@ -2033,9 +2033,8 @@ build_child_join_reltarget(PlannerInfo *root,
{
/* Build the targetlist */
childrel->reltarget->exprs = (List *)
- adjust_appendrel_attrs(root,
- (Node *) parentrel->reltarget->exprs,
- nappinfos, appinfos);
+ adjust_appendrel_attrs_multilevel(root, (Node *)parentrel->reltarget->exprs,
+ childrel->relids, parentrel->relids);
/* Set the cost and width fields */
childrel->reltarget->cost.startup = parentrel->reltarget->cost.startup;
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index f1d111063c..a0106dc539 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -109,7 +109,12 @@ extern bool have_join_order_restriction(PlannerInfo *root,
extern bool have_dangerous_phv(PlannerInfo *root,
Relids outer_relids, Relids inner_params);
extern void mark_dummy_rel(RelOptInfo *rel);
-
+extern void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra);
/*
* equivclass.c
* routines for managing EquivalenceClasses
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 27f7525b3e..601f08662e 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2320,6 +2320,312 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
375 | 0375 | 375 | 0375
(8 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+SET max_parallel_workers_per_gather = 0;
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ -> Seq Scan on prt5_p0
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p1.a = t5_1.aid)
+ -> Seq Scan on prt5_p1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p2.a = t5_1.aid)
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Hash Join
+ Hash Cond: (prt5.a = prt6.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Seq Scan on prt6_p1 prt6_2
+ Filter: (alabel ~~ '%abc%'::text)
+(12 rows)
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+----------------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = sq1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ -> Seq Scan on prt6_p1 prt6_2
+ -> Hash Join
+ Hash Cond: (prt5_2.a = sq1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_4
+ -> Seq Scan on prt6_p1 prt6_5
+ -> Hash Join
+ Hash Cond: (prt5_3.a = sq1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_7
+ -> Seq Scan on prt6_p1 prt6_8
+(31 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_1.b = t5_2.bid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_2.b = t5_2.bid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_3.b = t5_2.bid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+(34 rows)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Left Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+ANALYZE prta,prtb,e;
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+ QUERY PLAN
+-----------------------------------------------------------
+ Append
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Append
+ -> Nested Loop
+ Join Filter: (prta_1.id = e_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Index Scan using e1_id_idx on e1 e_1
+ Index Cond: (id = prtb_1.id)
+ -> Nested Loop
+ Join Filter: (prta_2.id = e_2.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+ -> Index Scan using e2_id_idx on e2 e_2
+ Index Cond: (id = prtb_2.id)
+(17 rows)
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index d97b5b69ff..51b81ce3f8 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -536,6 +536,132 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (1000.0 * random())::int,
+ (1000.0 * random())::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
+
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+
+ANALYZE prta,prtb,e;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
--
2.31.1
Andrey Lepikhov писал 2021-07-06 12:28:
On 5/7/21 23:15, Zhihong Yu wrote:
On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote: + * Can't imagine situation when join relation already exists. But in + * the 'partition_join' regression test it happens. + * It may be an indicator of possible problems. Should a log be added in the above case ?I made additional analysis of this branch of code. This situation can
happen in the case of one child or if we join two plane tables with
partitioned. Both situations are legal and I think we don't needed to
add any log message here.
Other mistakes were fixed.
Hi.
Small typo in comment in src/backend/optimizer/plan/setrefs.c:
281
282 /*
283 * Adjust RT indexes of AppendRelInfos and add to final
appendrels list.
284 * The AppendRelInfos are copied, because as a part of a
subplan its could
285 * be visited many times in the case of asymmetric join.
286 */
287 foreach(lc, root->append_rel_list)
288 {
its -> it (or they) ?
--
Best regards,
Alexander Pyhalov,
Postgres Professional
On 5/7/21 23:15, Zhihong Yu wrote:
On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote: + * Can't imagine situation when join relation already exists. But in + * the 'partition_join' regression test it happens. + * It may be an indicator of possible problems.Should a log be added in the above case ?
I worked more on this case and found more serious mistake. During
population of additional paths on the existed RelOptInfo we can remove
some previously generated paths that pointed from a higher-level list of
subplans and it could cause to lost of subplan links. I prohibit such
situation (you can read comments in the new version of the patch).
Also, choosing of a cheapest path after appendrel creation was added.
Unstable tests were fixed.
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
v17-0001-Asymmetric-partitionwise-join.patchtext/plain; charset=UTF-8; name=v17-0001-Asymmetric-partitionwise-join.patch; x-mac-creator=0; x-mac-type=0Download
From 7b05d8535b11ed7b5c43ed591d9aeea955f2c2ed Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Date: Fri, 2 Apr 2021 11:02:20 +0500
Subject: [PATCH] Asymmetric partitionwise join.
Teach optimizer to consider partitionwise join of non-partitioned
table with each partition of partitioned table.
Disallow asymmetric machinery for joining of two partitioned (or appended)
relations because it could cause huge consumption of CPU and memory
during reparameterization of NestLoop path.
---
src/backend/optimizer/path/joinpath.c | 9 +
src/backend/optimizer/path/joinrels.c | 187 +++++++++++
src/backend/optimizer/plan/setrefs.c | 17 +-
src/backend/optimizer/util/appendinfo.c | 37 ++-
src/backend/optimizer/util/pathnode.c | 9 +-
src/backend/optimizer/util/relnode.c | 19 +-
src/include/optimizer/paths.h | 7 +-
src/test/regress/expected/partition_join.out | 332 +++++++++++++++++++
src/test/regress/sql/partition_join.sql | 133 ++++++++
9 files changed, 720 insertions(+), 30 deletions(-)
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 6407ede12a..32618ebbd5 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
jointype, &extra);
+
+ /*
+ * 7. If outer relation is delivered from partition-tables, consider
+ * distributing inner relation to every partition-leaf prior to
+ * append these leafs.
+ */
+ try_asymmetric_partitionwise_join(root, joinrel,
+ outerrel, innerrel,
+ jointype, &extra);
}
/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..c8db6f0969 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
#include "miscadmin.h"
#include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
#include "optimizer/joininfo.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
@@ -1551,6 +1552,192 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs is impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ AppendPath *append_path,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ foreach (lc, append_path->subpaths)
+ {
+ Path *child_path = lfirst(lc);
+ RelOptInfo *child_rel = child_path->parent;
+ Relids child_joinrelids;
+ Relids parent_relids;
+ RelOptInfo *child_joinrel;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+
+ child_joinrelids = bms_union(child_rel->relids, inner_rel->relids);
+ parent_relids = bms_union(append_path->path.parent->relids,
+ inner_rel->relids);
+
+ child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+ child_rel->relids,
+ inner_rel->relids);
+ child_restrictlist = (List *)
+ adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist,
+ child_joinrelids, parent_relids);
+
+ child_joinrel = find_join_rel(root, child_joinrelids);
+ if (!child_joinrel)
+ child_joinrel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+ else
+ {
+ /*
+ * The join relation already exists. For example, it could happen if
+ * we join two plane tables with partitioned table(s).
+ * Populating this join with additional paths could push out some
+ * previously added paths which could be pointed in a subplans list
+ * of an higher level append.
+ * Of course, we could save such paths before generating new. But it
+ * can increase too much the number of paths in complex queries. It
+ * can be a task for future work.
+ */
+ return NIL;
+ }
+
+ populate_joinrel_with_paths(root,
+ child_rel,
+ inner_rel,
+ child_joinrel,
+ child_sjinfo,
+ child_restrictlist);
+
+ /* Give up if asymmetric partition-wise join is not available */
+ if (child_joinrel->pathlist == NIL)
+ return NIL;
+
+ set_cheapest(child_joinrel);
+ result = lappend(result, child_joinrel);
+ }
+ return result;
+}
+
+static bool
+is_asymmetric_join_feasible(PlannerInfo *root,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype)
+{
+ ListCell *lc;
+
+ if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+ return false;
+
+ if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+ return false;
+
+ /* Disallow recursive usage of asymmertic join machinery */
+ if (root->join_rel_level == NULL)
+ return false;
+
+ /*
+ * Don't allow asymmetric JOIN of two append subplans.
+ * In the case of a parameterized NL join, a reparameterization procedure
+ * will lead to large memory allocations and a CPU consumption:
+ * each reparameterization will induce subpath duplication, creating new
+ * ParamPathInfo instance and increasing of ppilist up to number of
+ * partitions in the inner. Also, if we have many partitions, each bitmapset
+ * variable will be large and many leaks of such variable (caused by relid
+ * replacement) will highly increase memory consumption.
+ * So, we deny such paths for now.
+ */
+ foreach(lc, inner_rel->pathlist)
+ {
+ if (IsA(lfirst(lc), AppendPath))
+ return false;
+ }
+
+ return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ ListCell *lc;
+
+ /*
+ * Try this kind of paths if we allow complex partitionwise joins and we know
+ * we can build this join safely.
+ */
+ if (!enable_partitionwise_join ||
+ !is_asymmetric_join_feasible(root, outer_rel, inner_rel, jointype))
+ return;
+
+ foreach (lc, outer_rel->pathlist)
+ {
+ AppendPath *append_path = lfirst(lc);
+
+ /*
+ * We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. Asymmetric join isn't needed if the append node
+ * has only one child.
+ */
+ if (IsA(append_path, AppendPath) &&
+ list_length(append_path->subpaths) > 1)
+ {
+ List **join_rel_level_saved;
+ List *live_childrels = NIL;
+
+ join_rel_level_saved = root->join_rel_level;
+ PG_TRY();
+ {
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
+
+ live_childrels =
+ extract_asymmetric_partitionwise_subjoin(root,
+ joinrel,
+ append_path,
+ inner_rel,
+ jointype,
+ extra);
+ }
+ PG_FINALLY();
+ {
+ root->join_rel_level = join_rel_level_saved;
+ }
+ PG_END_TRY();
+
+ if (live_childrels != NIL)
+ {
+ /*
+ * Add new append relation. We must choose cheapest paths after
+ * this operation because the pathlist possibly contains
+ * joinrels and appendrels that can be suboptimal.
+ */
+ add_paths_to_append_rel(root, joinrel, live_childrels);
+ set_cheapest(joinrel);
+ }
+
+ break;
+ }
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 26f6872b4b..6ffa68efba 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -281,24 +281,29 @@ set_plan_references(PlannerInfo *root, Plan *plan)
/*
* Adjust RT indexes of AppendRelInfos and add to final appendrels list.
- * We assume the AppendRelInfos were built during planning and don't need
- * to be copied.
+ * The AppendRelInfos are copied, because as a part of a subplan they could
+ * be visited many times in the case of asymmetric join.
*/
foreach(lc, root->append_rel_list)
{
AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+ AppendRelInfo *newappinfo;
+
+ /* flat copy is enough since all valuable fields are scalars */
+ newappinfo = (AppendRelInfo *) palloc(sizeof(AppendRelInfo));
+ memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));
/* adjust RT indexes */
- appinfo->parent_relid += rtoffset;
- appinfo->child_relid += rtoffset;
+ newappinfo->parent_relid += rtoffset;
+ newappinfo->child_relid += rtoffset;
/*
* Rather than adjust the translated_vars entries, just drop 'em.
* Neither the executor nor EXPLAIN currently need that data.
*/
- appinfo->translated_vars = NIL;
+ newappinfo->translated_vars = NIL;
- glob->appendRelations = lappend(glob->appendRelations, appinfo);
+ glob->appendRelations = lappend(glob->appendRelations, newappinfo);
}
/* Now fix the Plan tree */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index af46f581ac..ec8dee99b0 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -202,7 +202,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.appinfos = appinfos;
/* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/* Should never be translating a Query tree. */
Assert(node == NULL || !IsA(node, Query));
@@ -489,12 +491,10 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
Relids child_relids,
Relids top_parent_relids)
{
- AppendRelInfo **appinfos;
- Bitmapset *parent_relids = NULL;
- int nappinfos;
- int cnt;
-
- Assert(bms_num_members(child_relids) == bms_num_members(top_parent_relids));
+ AppendRelInfo **appinfos;
+ Relids parent_relids = NULL;
+ int nappinfos;
+ int cnt;
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
@@ -506,8 +506,13 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
}
- /* Recurse if immediate parent is not the top parent. */
- if (!bms_equal(parent_relids, top_parent_relids))
+ /*
+ * Recurse if immediate parent is not the top parent. Keep in mind that in a
+ * case of asymmetric JOIN top_parent_relids can contain relids which aren't
+ * part of an append node.
+ */
+ if (!bms_equal(parent_relids, top_parent_relids) &&
+ !bms_is_subset(parent_relids, top_parent_relids))
node = adjust_appendrel_attrs_multilevel(root, node, parent_relids,
top_parent_relids);
@@ -515,6 +520,7 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
node = adjust_appendrel_attrs(root, node, nappinfos, appinfos);
pfree(appinfos);
+ pfree(parent_relids);
return node;
}
@@ -565,6 +571,7 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
AppendRelInfo **appinfos;
int nappinfos;
Relids parent_relids = NULL;
+ Relids normal_relids = NULL;
Relids result;
Relids tmp_result = NULL;
int cnt;
@@ -579,12 +586,17 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of the given child. */
+ normal_relids = bms_copy(child_relids);
for (cnt = 0; cnt < nappinfos; cnt++)
{
AppendRelInfo *appinfo = appinfos[cnt];
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
+ normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
}
+ parent_relids = bms_union(parent_relids, normal_relids);
+ if (normal_relids)
+ bms_free(normal_relids);
/* Recurse if immediate parent is not the top parent. */
if (!bms_equal(parent_relids, top_parent_relids))
@@ -715,11 +727,11 @@ AppendRelInfo **
find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
{
AppendRelInfo **appinfos;
+ int nrooms = bms_num_members(relids);
int cnt = 0;
int i;
- *nappinfos = bms_num_members(relids);
- appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * *nappinfos);
+ appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * nrooms);
i = -1;
while ((i = bms_next_member(relids, i)) >= 0)
@@ -727,10 +739,11 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- elog(ERROR, "child rel %d not found in append_rel_array", i);
+ continue;
appinfos[cnt++] = appinfo;
}
+ *nappinfos = cnt;
return appinfos;
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 0c94cbe767..e4828b3647 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -4224,7 +4224,14 @@ do { \
MemoryContextSwitchTo(oldcontext);
}
- bms_free(required_outer);
+
+ /*
+ * If adjust_child_relids_multilevel don't do replacements it returns
+ * the original set, not a copy. It is possible in the case of asymmetric
+ * JOIN and child_rel->relids contains relids only of plane relations.
+ */
+ if (required_outer != old_ppi->ppi_req_outer)
+ bms_free(required_outer);
new_path->param_info = new_ppi;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e105a4d5f1..bb15fb2716 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -790,11 +790,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
joinrel->reloptkind = RELOPT_OTHER_JOINREL;
joinrel->relids = bms_union(outer_rel->relids, inner_rel->relids);
@@ -851,8 +848,11 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
- joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
- inner_rel->top_parent_relids);
+ joinrel->top_parent_relids =
+ bms_union(IS_OTHER_REL(outer_rel) ?
+ outer_rel->top_parent_relids : outer_rel->relids,
+ IS_OTHER_REL(inner_rel) ?
+ inner_rel->top_parent_relids : inner_rel->relids);
/* Compute information relevant to foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
@@ -2033,9 +2033,8 @@ build_child_join_reltarget(PlannerInfo *root,
{
/* Build the targetlist */
childrel->reltarget->exprs = (List *)
- adjust_appendrel_attrs(root,
- (Node *) parentrel->reltarget->exprs,
- nappinfos, appinfos);
+ adjust_appendrel_attrs_multilevel(root, (Node *)parentrel->reltarget->exprs,
+ childrel->relids, parentrel->relids);
/* Set the cost and width fields */
childrel->reltarget->cost.startup = parentrel->reltarget->cost.startup;
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index f1d111063c..a0106dc539 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -109,7 +109,12 @@ extern bool have_join_order_restriction(PlannerInfo *root,
extern bool have_dangerous_phv(PlannerInfo *root,
Relids outer_relids, Relids inner_params);
extern void mark_dummy_rel(RelOptInfo *rel);
-
+extern void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra);
/*
* equivclass.c
* routines for managing EquivalenceClasses
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 27f7525b3e..753b3de47f 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2320,6 +2320,338 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
375 | 0375 | 375 | 0375
(8 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (x % 1000)::int,
+ ((x+1) % 1000)::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+SET max_parallel_workers_per_gather = 0;
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ -> Seq Scan on prt5_p0
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p1.a = t5_1.aid)
+ -> Seq Scan on prt5_p1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p2.a = t5_1.aid)
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (prt5.a = prt6.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Seq Scan on prt6_p1 prt6_2
+ Filter: (alabel ~~ '%abc%'::text)
+(13 rows)
+
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ count
+-------
+ 4000
+(1 row)
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+----------------------------------------------------------------------
+ Aggregate
+ -> Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = sq1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ -> Seq Scan on prt6_p1 prt6_2
+ -> Hash Join
+ Hash Cond: (prt5_2.a = sq1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_4
+ -> Seq Scan on prt6_p1 prt6_5
+ -> Hash Join
+ Hash Cond: (prt5_3.a = sq1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_7
+ -> Seq Scan on prt6_p1 prt6_8
+(32 rows)
+
+SELECT count(*) FROM prt5 JOIN (SELECT * FROM prt6 LIMIT 1000) AS sq1
+ ON a = aid AND alabel like '%abc%';
+ count
+-------
+ 4000
+(1 row)
+
+-- Asymmetric JOIN of two plane tables and one partitioned
+EXPLAIN (COSTS OFF)
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------------
+ Aggregate
+ -> Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_1.b = t5_2.bid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_2.b = t5_2.bid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Hash Join
+ Hash Cond: (prt5_3.b = t5_2.bid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+(35 rows)
+
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ count
+-------
+ 10000
+(1 row)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Left Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+SET enable_partitionwise_join = on;
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+ANALYZE prta,prtb,e;
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+ QUERY PLAN
+-----------------------------------------------------------
+ Append
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Append
+ -> Nested Loop
+ Join Filter: (prta_1.id = e_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Index Scan using e1_id_idx on e1 e_1
+ Index Cond: (id = prtb_1.id)
+ -> Nested Loop
+ Join Filter: (prta_2.id = e_2.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+ -> Index Scan using e2_id_idx on e2 e_2
+ Index Cond: (id = prtb_2.id)
+(17 rows)
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index d97b5b69ff..aca7f885a3 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -536,6 +536,139 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (x % 1000)::int,
+ ((x+1) % 1000)::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 800) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+SELECT count(*) FROM prt5 JOIN (SELECT * FROM prt6 LIMIT 1000) AS sq1
+ ON a = aid AND alabel like '%abc%';
+
+-- Asymmetric JOIN of two plane tables and one partitioned
+EXPLAIN (COSTS OFF)
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+RESET max_parallel_workers_per_gather;
+SET enable_partitionwise_join = true;
+
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+
+ANALYZE prta,prtb,e;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
--
2.31.1
On Thu, Jul 15, 2021 at 11:32 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru>
wrote:
On 5/7/21 23:15, Zhihong Yu wrote:
On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote: + * Can't imagine situation when join relation already exists. But in + * the 'partition_join' regression test it happens. + * It may be an indicator of possible problems.Should a log be added in the above case ?
I worked more on this case and found more serious mistake. During
population of additional paths on the existed RelOptInfo we can remove
some previously generated paths that pointed from a higher-level list of
subplans and it could cause to lost of subplan links. I prohibit such
situation (you can read comments in the new version of the patch).
Also, choosing of a cheapest path after appendrel creation was added.
Unstable tests were fixed.--
regards,
Andrey Lepikhov
Postgres Professional
Patch is failing the regression, can you please take a look at that.
partition_join ... FAILED 6328 ms
--
Ibrar Ahmed
It looks like this patch needs to be updated. According to http://cfbot.cputube.org/ it applies but doesn't pass any tests. Changing the status to save time for reviewers.
The new status of this patch is: Waiting on Author
On Thu, Sep 09, 2021 at 09:50:46AM +0000, Aleksander Alekseev wrote:
It looks like this patch needs to be updated. According to http://cfbot.cputube.org/ it applies but doesn't pass any tests. Changing the status to save time for reviewers.
The new status of this patch is: Waiting on Author
Just to give some more info to work on I found this patch made postgres
crash with a segmentation fault.
"""
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000556e37ef1b55 in bms_equal (a=0x7f6e37a9c5b0, b=0x7f6e37a9c5b0) at bitmapset.c:126
126 if (shorter->words[i] != longer->words[i])
"""
attached are the query that triggers the crash and the backtrace.
--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
On 9/9/21 8:38 PM, Jaime Casanova wrote:
On Thu, Sep 09, 2021 at 09:50:46AM +0000, Aleksander Alekseev wrote:
It looks like this patch needs to be updated. According to http://cfbot.cputube.org/ it applies but doesn't pass any tests. Changing the status to save time for reviewers.
The new status of this patch is: Waiting on Author
Just to give some more info to work on I found this patch made postgres
crash with a segmentation fault."""
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000556e37ef1b55 in bms_equal (a=0x7f6e37a9c5b0, b=0x7f6e37a9c5b0) at bitmapset.c:126
126 if (shorter->words[i] != longer->words[i])
"""attached are the query that triggers the crash and the backtrace.
Thank you for this good catch!
The problem was in the adjust_child_relids_multilevel routine. The
tmp_result variable sometimes points to original required_outer.
This patch adds new ways which optimizer can generate plans. One
possible way is optimizer reparameterizes an inner by a plain relation
from the outer (maybe as a result of join of the plain relation and
partitioned relation). In this case we have to compare tmp_result with
original pointer to realize, it was changed or not.
The patch in attachment fixes this problem. Additional regression test
added.
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
v18-0001-Asymmetric-partitionwise-join.patchtext/x-patch; charset=UTF-8; name=v18-0001-Asymmetric-partitionwise-join.patchDownload
From 6976e463e950f91a6a18e9f2630af1c4cb73b94b Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Date: Fri, 2 Apr 2021 11:02:20 +0500
Subject: [PATCH] Asymmetric partitionwise join.
Teach optimizer to consider partitionwise join of non-partitioned
table with each partition of partitioned table.
Disallow asymmetric machinery for joining of two partitioned (or appended)
relations because it could cause huge consumption of CPU and memory
during reparameterization of NestLoop path.
Change logic of the multilevel child relids adjustment, because this
feature allows the optimizer to plan in new way.
---
src/backend/optimizer/path/joinpath.c | 9 +
src/backend/optimizer/path/joinrels.c | 187 +++++++++
src/backend/optimizer/plan/setrefs.c | 17 +-
src/backend/optimizer/util/appendinfo.c | 51 ++-
src/backend/optimizer/util/pathnode.c | 9 +-
src/backend/optimizer/util/relnode.c | 19 +-
src/include/optimizer/paths.h | 7 +-
src/test/regress/expected/partition_join.out | 378 +++++++++++++++++++
src/test/regress/sql/partition_join.sql | 167 ++++++++
9 files changed, 808 insertions(+), 36 deletions(-)
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 6407ede12a..32618ebbd5 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
jointype, &extra);
+
+ /*
+ * 7. If outer relation is delivered from partition-tables, consider
+ * distributing inner relation to every partition-leaf prior to
+ * append these leafs.
+ */
+ try_asymmetric_partitionwise_join(root, joinrel,
+ outerrel, innerrel,
+ jointype, &extra);
}
/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 8b69870cf4..9453258f83 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
#include "miscadmin.h"
#include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
#include "optimizer/joininfo.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
@@ -1552,6 +1553,192 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs is impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ AppendPath *append_path,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ foreach (lc, append_path->subpaths)
+ {
+ Path *child_path = lfirst(lc);
+ RelOptInfo *child_rel = child_path->parent;
+ Relids child_joinrelids;
+ Relids parent_relids;
+ RelOptInfo *child_joinrel;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+
+ child_joinrelids = bms_union(child_rel->relids, inner_rel->relids);
+ parent_relids = bms_union(append_path->path.parent->relids,
+ inner_rel->relids);
+
+ child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+ child_rel->relids,
+ inner_rel->relids);
+ child_restrictlist = (List *)
+ adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist,
+ child_joinrelids, parent_relids);
+
+ child_joinrel = find_join_rel(root, child_joinrelids);
+ if (!child_joinrel)
+ child_joinrel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+ else
+ {
+ /*
+ * The join relation already exists. For example, it could happen if
+ * we join two plane tables with partitioned table(s).
+ * Populating this join with additional paths could push out some
+ * previously added paths which could be pointed in a subplans list
+ * of an higher level append.
+ * Of course, we could save such paths before generating new. But it
+ * can increase too much the number of paths in complex queries. It
+ * can be a task for future work.
+ */
+ return NIL;
+ }
+
+ populate_joinrel_with_paths(root,
+ child_rel,
+ inner_rel,
+ child_joinrel,
+ child_sjinfo,
+ child_restrictlist);
+
+ /* Give up if asymmetric partition-wise join is not available */
+ if (child_joinrel->pathlist == NIL)
+ return NIL;
+
+ set_cheapest(child_joinrel);
+ result = lappend(result, child_joinrel);
+ }
+ return result;
+}
+
+static bool
+is_asymmetric_join_feasible(PlannerInfo *root,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype)
+{
+ ListCell *lc;
+
+ if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+ return false;
+
+ if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+ return false;
+
+ /* Disallow recursive usage of asymmertic join machinery */
+ if (root->join_rel_level == NULL)
+ return false;
+
+ /*
+ * Don't allow asymmetric JOIN of two append subplans.
+ * In the case of a parameterized NL join, a reparameterization procedure
+ * will lead to large memory allocations and a CPU consumption:
+ * each reparameterization will induce subpath duplication, creating new
+ * ParamPathInfo instance and increasing of ppilist up to number of
+ * partitions in the inner. Also, if we have many partitions, each bitmapset
+ * variable will be large and many leaks of such variable (caused by relid
+ * replacement) will highly increase memory consumption.
+ * So, we deny such paths for now.
+ */
+ foreach(lc, inner_rel->pathlist)
+ {
+ if (IsA(lfirst(lc), AppendPath))
+ return false;
+ }
+
+ return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ ListCell *lc;
+
+ /*
+ * Try this kind of paths if we allow complex partitionwise joins and we know
+ * we can build this join safely.
+ */
+ if (!enable_partitionwise_join ||
+ !is_asymmetric_join_feasible(root, outer_rel, inner_rel, jointype))
+ return;
+
+ foreach (lc, outer_rel->pathlist)
+ {
+ AppendPath *append_path = lfirst(lc);
+
+ /*
+ * We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. Asymmetric join isn't needed if the append node
+ * has only one child.
+ */
+ if (IsA(append_path, AppendPath) &&
+ list_length(append_path->subpaths) > 1)
+ {
+ List **join_rel_level_saved;
+ List *live_childrels = NIL;
+
+ join_rel_level_saved = root->join_rel_level;
+ PG_TRY();
+ {
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
+
+ live_childrels =
+ extract_asymmetric_partitionwise_subjoin(root,
+ joinrel,
+ append_path,
+ inner_rel,
+ jointype,
+ extra);
+ }
+ PG_FINALLY();
+ {
+ root->join_rel_level = join_rel_level_saved;
+ }
+ PG_END_TRY();
+
+ if (live_childrels != NIL)
+ {
+ /*
+ * Add new append relation. We must choose cheapest paths after
+ * this operation because the pathlist possibly contains
+ * joinrels and appendrels that can be suboptimal.
+ */
+ add_paths_to_append_rel(root, joinrel, live_childrels);
+ set_cheapest(joinrel);
+ }
+
+ break;
+ }
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index e50624c465..fccc0685d7 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -281,24 +281,29 @@ set_plan_references(PlannerInfo *root, Plan *plan)
/*
* Adjust RT indexes of AppendRelInfos and add to final appendrels list.
- * We assume the AppendRelInfos were built during planning and don't need
- * to be copied.
+ * The AppendRelInfos are copied, because as a part of a subplan they could
+ * be visited many times in the case of asymmetric join.
*/
foreach(lc, root->append_rel_list)
{
AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+ AppendRelInfo *newappinfo;
+
+ /* flat copy is enough since all valuable fields are scalars */
+ newappinfo = (AppendRelInfo *) palloc(sizeof(AppendRelInfo));
+ memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));
/* adjust RT indexes */
- appinfo->parent_relid += rtoffset;
- appinfo->child_relid += rtoffset;
+ newappinfo->parent_relid += rtoffset;
+ newappinfo->child_relid += rtoffset;
/*
* Rather than adjust the translated_vars entries, just drop 'em.
* Neither the executor nor EXPLAIN currently need that data.
*/
- appinfo->translated_vars = NIL;
+ newappinfo->translated_vars = NIL;
- glob->appendRelations = lappend(glob->appendRelations, appinfo);
+ glob->appendRelations = lappend(glob->appendRelations, newappinfo);
}
/* Now fix the Plan tree */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index af46f581ac..3ae25f3444 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -202,7 +202,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.appinfos = appinfos;
/* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/* Should never be translating a Query tree. */
Assert(node == NULL || !IsA(node, Query));
@@ -489,12 +491,10 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
Relids child_relids,
Relids top_parent_relids)
{
- AppendRelInfo **appinfos;
- Bitmapset *parent_relids = NULL;
- int nappinfos;
- int cnt;
-
- Assert(bms_num_members(child_relids) == bms_num_members(top_parent_relids));
+ AppendRelInfo **appinfos;
+ Relids parent_relids = NULL;
+ int nappinfos;
+ int cnt;
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
@@ -506,8 +506,13 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
}
- /* Recurse if immediate parent is not the top parent. */
- if (!bms_equal(parent_relids, top_parent_relids))
+ /*
+ * Recurse if immediate parent is not the top parent. Keep in mind that in a
+ * case of asymmetric JOIN top_parent_relids can contain relids which aren't
+ * part of an append node.
+ */
+ if (!bms_equal(parent_relids, top_parent_relids) &&
+ !bms_is_subset(parent_relids, top_parent_relids))
node = adjust_appendrel_attrs_multilevel(root, node, parent_relids,
top_parent_relids);
@@ -515,12 +520,13 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
node = adjust_appendrel_attrs(root, node, nappinfos, appinfos);
pfree(appinfos);
+ pfree(parent_relids);
return node;
}
/*
- * Substitute child relids for parent relids in a Relid set. The array of
+ * Substitute child relids for parent relids in a Relid set. The array of
* appinfos specifies the substitutions to be performed.
*/
Relids
@@ -565,8 +571,9 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
AppendRelInfo **appinfos;
int nappinfos;
Relids parent_relids = NULL;
+ Relids normal_relids = NULL;
Relids result;
- Relids tmp_result = NULL;
+ Relids tmp_result = relids;
int cnt;
/*
@@ -579,12 +586,17 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of the given child. */
+ normal_relids = bms_copy(child_relids);
for (cnt = 0; cnt < nappinfos; cnt++)
{
AppendRelInfo *appinfo = appinfos[cnt];
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
+ normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
}
+ parent_relids = bms_union(parent_relids, normal_relids);
+ if (normal_relids)
+ bms_free(normal_relids);
/* Recurse if immediate parent is not the top parent. */
if (!bms_equal(parent_relids, top_parent_relids))
@@ -592,13 +604,15 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
tmp_result = adjust_child_relids_multilevel(root, relids,
parent_relids,
top_parent_relids);
- relids = tmp_result;
}
- result = adjust_child_relids(relids, nappinfos, appinfos);
+ result = adjust_child_relids(tmp_result, nappinfos, appinfos);
- /* Free memory consumed by any intermediate result. */
- if (tmp_result)
+ /*
+ * Free memory consumed by any intermediate result. If recursive pass did't
+ * replace any relids, tmp_result point to the original set, not a copy.
+ */
+ if (tmp_result != relids)
bms_free(tmp_result);
bms_free(parent_relids);
pfree(appinfos);
@@ -715,11 +729,11 @@ AppendRelInfo **
find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
{
AppendRelInfo **appinfos;
+ int nrooms = bms_num_members(relids);
int cnt = 0;
int i;
- *nappinfos = bms_num_members(relids);
- appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * *nappinfos);
+ appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * nrooms);
i = -1;
while ((i = bms_next_member(relids, i)) >= 0)
@@ -727,10 +741,11 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- elog(ERROR, "child rel %d not found in append_rel_array", i);
+ continue;
appinfos[cnt++] = appinfo;
}
+ *nappinfos = cnt;
return appinfos;
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index cedb3848dd..17e215b72d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -4225,7 +4225,14 @@ do { \
MemoryContextSwitchTo(oldcontext);
}
- bms_free(required_outer);
+
+ /*
+ * If adjust_child_relids_multilevel don't do replacements it returns
+ * the original set, not a copy. It is possible in the case of asymmetric
+ * JOIN and child_rel->relids contains relids only of plane relations.
+ */
+ if (required_outer != old_ppi->ppi_req_outer)
+ bms_free(required_outer);
new_path->param_info = new_ppi;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 47769cea45..ddf0f5a876 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -792,11 +792,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
joinrel->reloptkind = RELOPT_OTHER_JOINREL;
joinrel->relids = bms_union(outer_rel->relids, inner_rel->relids);
@@ -854,8 +851,11 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
- joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
- inner_rel->top_parent_relids);
+ joinrel->top_parent_relids =
+ bms_union(IS_OTHER_REL(outer_rel) ?
+ outer_rel->top_parent_relids : outer_rel->relids,
+ IS_OTHER_REL(inner_rel) ?
+ inner_rel->top_parent_relids : inner_rel->relids);
/* Compute information relevant to foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
@@ -2036,9 +2036,8 @@ build_child_join_reltarget(PlannerInfo *root,
{
/* Build the targetlist */
childrel->reltarget->exprs = (List *)
- adjust_appendrel_attrs(root,
- (Node *) parentrel->reltarget->exprs,
- nappinfos, appinfos);
+ adjust_appendrel_attrs_multilevel(root, (Node *)parentrel->reltarget->exprs,
+ childrel->relids, parentrel->relids);
/* Set the cost and width fields */
childrel->reltarget->cost.startup = parentrel->reltarget->cost.startup;
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index f1d111063c..a0106dc539 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -109,7 +109,12 @@ extern bool have_join_order_restriction(PlannerInfo *root,
extern bool have_dangerous_phv(PlannerInfo *root,
Relids outer_relids, Relids inner_params);
extern void mark_dummy_rel(RelOptInfo *rel);
-
+extern void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra);
/*
* equivclass.c
* routines for managing EquivalenceClasses
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 27f7525b3e..327505676c 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2320,6 +2320,384 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
375 | 0375 | 375 | 0375
(8 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (x % 1000)::int,
+ ((x+1) % 1000)::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+SET max_parallel_workers_per_gather = 0;
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ -> Seq Scan on prt5_p0
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p1.a = t5_1.aid)
+ -> Seq Scan on prt5_p1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p2.a = t5_1.aid)
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (prt5.a = prt6.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Seq Scan on prt6_p1 prt6_2
+ Filter: (alabel ~~ '%abc%'::text)
+(13 rows)
+
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ count
+-------
+ 4000
+(1 row)
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+----------------------------------------------------------------------
+ Aggregate
+ -> Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = sq1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ -> Seq Scan on prt6_p1 prt6_2
+ -> Hash Join
+ Hash Cond: (prt5_2.a = sq1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_4
+ -> Seq Scan on prt6_p1 prt6_5
+ -> Hash Join
+ Hash Cond: (prt5_3.a = sq1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_7
+ -> Seq Scan on prt6_p1 prt6_8
+(32 rows)
+
+SELECT count(*) FROM prt5 JOIN (SELECT * FROM prt6 LIMIT 1000) AS sq1
+ ON a = aid AND alabel like '%abc%';
+ count
+-------
+ 2000
+(1 row)
+
+-- Asymmetric JOIN of two plane tables and one partitioned
+EXPLAIN (COSTS OFF)
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------------
+ Aggregate
+ -> Append
+ -> Hash Join
+ Hash Cond: (prt5_1.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+(35 rows)
+
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ count
+-------
+ 11000
+(1 row)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Left Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+SET enable_partitionwise_join = on;
+-- Check reparameterization code when required_outer set contains relid of plain
+-- relation entry, not 'other' entry.
+SET enable_hashjoin = 'off';
+SET enable_mergejoin = 'off';
+SET enable_material = 'off';
+CREATE TABLE big AS SELECT x AS x FROM generate_series(1,1E4) x;
+CREATE INDEX ON big(x);
+CREATE TABLE small AS SELECT x, -x AS y FROM generate_series(1,100) x;
+CREATE TABLE part_l0 (x int, y int, z int) PARTITION BY HASH (y);
+CREATE TABLE part0_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 0) PARTITION BY HASH (z);
+CREATE TABLE part0_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE part1_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE part1_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO part_l0 (x, y, z) (SELECT x,x,x FROM generate_series(1,1E4) x);
+ANALYZE big,small,part_l0;
+EXPLAIN
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------
+ Nested Loop Left Join (cost=0.29..33230.19 rows=33 width=8)
+ -> Append (cost=0.00..33008.08 rows=67 width=8)
+ -> Nested Loop (cost=0.00..16604.25 rows=33 width=8)
+ Join Filter: ((small.y = part_l0_1.y) AND ((small.y + part_l0_1.y) < 10))
+ -> Seq Scan on small (cost=0.00..2.00 rows=100 width=8)
+ -> Seq Scan on part0_l2 part_l0_1 (cost=0.00..78.19 rows=5019 width=4)
+ -> Nested Loop (cost=0.00..3.75 rows=1 width=8)
+ Join Filter: ((small.y = part_l0_2.y) AND ((small.y + part_l0_2.y) < 10))
+ -> Seq Scan on part1_l2 part_l0_2 (cost=0.00..0.00 rows=1 width=4)
+ -> Seq Scan on small (cost=0.00..2.00 rows=100 width=8)
+ -> Nested Loop (cost=0.00..16399.75 rows=33 width=8)
+ Join Filter: ((small.y = part_l0_3.y) AND ((small.y + part_l0_3.y) < 10))
+ -> Seq Scan on small (cost=0.00..2.00 rows=100 width=8)
+ -> Seq Scan on part1_l1 part_l0_3 (cost=0.00..76.81 rows=4981 width=4)
+ -> Index Only Scan using big_x_idx on big (cost=0.29..3.31 rows=1 width=5)
+ Index Cond: (x = (small.x)::numeric)
+(16 rows)
+
+DROP TABLE IF EXISTS big,small,part_l0 CASCADE;
+RESET enable_hashjoin;
+RESET enable_mergejoin;
+RESET enable_material;
+RESET max_parallel_workers_per_gather;
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+ANALYZE prta,prtb,e;
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+ QUERY PLAN
+-----------------------------------------------------------
+ Append
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Append
+ -> Nested Loop
+ Join Filter: (prta_1.id = e_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Index Scan using e1_id_idx on e1 e_1
+ Index Cond: (id = prtb_1.id)
+ -> Nested Loop
+ Join Filter: (prta_2.id = e_2.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+ -> Index Scan using e2_id_idx on e2 e_2
+ Index Cond: (id = prtb_2.id)
+(17 rows)
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index d97b5b69ff..364b3216f8 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -536,6 +536,173 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (x % 1000)::int,
+ ((x+1) % 1000)::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+SELECT count(*) FROM prt5 JOIN (SELECT * FROM prt6 LIMIT 1000) AS sq1
+ ON a = aid AND alabel like '%abc%';
+
+-- Asymmetric JOIN of two plane tables and one partitioned
+EXPLAIN (COSTS OFF)
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+SET enable_partitionwise_join = on;
+
+-- Check reparameterization code when required_outer set contains relid of plain
+-- relation entry, not 'other' entry.
+
+SET enable_hashjoin = 'off';
+SET enable_mergejoin = 'off';
+SET enable_material = 'off';
+
+CREATE TABLE big AS SELECT x AS x FROM generate_series(1,1E4) x;
+CREATE INDEX ON big(x);
+CREATE TABLE small AS SELECT x, -x AS y FROM generate_series(1,100) x;
+
+CREATE TABLE part_l0 (x int, y int, z int) PARTITION BY HASH (y);
+CREATE TABLE part0_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 0) PARTITION BY HASH (z);
+CREATE TABLE part0_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE part1_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE part1_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO part_l0 (x, y, z) (SELECT x,x,x FROM generate_series(1,1E4) x);
+
+ANALYZE big,small,part_l0;
+
+EXPLAIN
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+
+DROP TABLE IF EXISTS big,small,part_l0 CASCADE;
+RESET enable_hashjoin;
+RESET enable_mergejoin;
+RESET enable_material;
+RESET max_parallel_workers_per_gather;
+
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+
+ANALYZE prta,prtb,e;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
--
2.25.1
On 14/9/21 11:37, Andrey V. Lepikhov wrote:
Thank you for this good catch!
The problem was in the adjust_child_relids_multilevel routine. The
tmp_result variable sometimes points to original required_outer.
This patch adds new ways which optimizer can generate plans. One
possible way is optimizer reparameterizes an inner by a plain relation
from the outer (maybe as a result of join of the plain relation and
partitioned relation). In this case we have to compare tmp_result with
original pointer to realize, it was changed or not.
The patch in attachment fixes this problem. Additional regression test
added.
I thought more and realized there isn't necessary to recurse in the
adjust_child_relids_multilevel() routine if required_outer contains only
normal_relids.
Also, regression tests were improved a bit.
--
regards,
Andrey Lepikhov
Postgres Professional
Attachments:
v19-0001-Asymmetric-partitionwise-join.patchtext/plain; charset=UTF-8; name=v19-0001-Asymmetric-partitionwise-join.patch; x-mac-creator=0; x-mac-type=0Download
From ee627d07282629a85785a63341cf875bfb0decb2 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Date: Fri, 2 Apr 2021 11:02:20 +0500
Subject: [PATCH] Asymmetric partitionwise join.
Teach optimizer to consider partitionwise join of non-partitioned
table with each partition of partitioned table.
Disallow asymmetric machinery for joining of two partitioned (or appended)
relations because it could cause huge consumption of CPU and memory
during reparameterization of NestLoop path.
Change logic of the multilevel child relids adjustment, because this
feature allows the optimizer to plan in new way.
---
src/backend/optimizer/path/joinpath.c | 9 +
src/backend/optimizer/path/joinrels.c | 187 ++++++++
src/backend/optimizer/plan/setrefs.c | 17 +-
src/backend/optimizer/util/appendinfo.c | 44 +-
src/backend/optimizer/util/pathnode.c | 9 +-
src/backend/optimizer/util/relnode.c | 19 +-
src/include/optimizer/paths.h | 7 +-
src/test/regress/expected/partition_join.out | 425 +++++++++++++++++++
src/test/regress/sql/partition_join.sql | 180 ++++++++
9 files changed, 867 insertions(+), 30 deletions(-)
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 6407ede12a..32618ebbd5 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
jointype, &extra);
+
+ /*
+ * 7. If outer relation is delivered from partition-tables, consider
+ * distributing inner relation to every partition-leaf prior to
+ * append these leafs.
+ */
+ try_asymmetric_partitionwise_join(root, joinrel,
+ outerrel, innerrel,
+ jointype, &extra);
}
/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 8b69870cf4..9453258f83 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
#include "miscadmin.h"
#include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
#include "optimizer/joininfo.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
@@ -1552,6 +1553,192 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs is impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ AppendPath *append_path,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ foreach (lc, append_path->subpaths)
+ {
+ Path *child_path = lfirst(lc);
+ RelOptInfo *child_rel = child_path->parent;
+ Relids child_joinrelids;
+ Relids parent_relids;
+ RelOptInfo *child_joinrel;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+
+ child_joinrelids = bms_union(child_rel->relids, inner_rel->relids);
+ parent_relids = bms_union(append_path->path.parent->relids,
+ inner_rel->relids);
+
+ child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+ child_rel->relids,
+ inner_rel->relids);
+ child_restrictlist = (List *)
+ adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist,
+ child_joinrelids, parent_relids);
+
+ child_joinrel = find_join_rel(root, child_joinrelids);
+ if (!child_joinrel)
+ child_joinrel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+ else
+ {
+ /*
+ * The join relation already exists. For example, it could happen if
+ * we join two plane tables with partitioned table(s).
+ * Populating this join with additional paths could push out some
+ * previously added paths which could be pointed in a subplans list
+ * of an higher level append.
+ * Of course, we could save such paths before generating new. But it
+ * can increase too much the number of paths in complex queries. It
+ * can be a task for future work.
+ */
+ return NIL;
+ }
+
+ populate_joinrel_with_paths(root,
+ child_rel,
+ inner_rel,
+ child_joinrel,
+ child_sjinfo,
+ child_restrictlist);
+
+ /* Give up if asymmetric partition-wise join is not available */
+ if (child_joinrel->pathlist == NIL)
+ return NIL;
+
+ set_cheapest(child_joinrel);
+ result = lappend(result, child_joinrel);
+ }
+ return result;
+}
+
+static bool
+is_asymmetric_join_feasible(PlannerInfo *root,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype)
+{
+ ListCell *lc;
+
+ if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+ return false;
+
+ if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+ return false;
+
+ /* Disallow recursive usage of asymmertic join machinery */
+ if (root->join_rel_level == NULL)
+ return false;
+
+ /*
+ * Don't allow asymmetric JOIN of two append subplans.
+ * In the case of a parameterized NL join, a reparameterization procedure
+ * will lead to large memory allocations and a CPU consumption:
+ * each reparameterization will induce subpath duplication, creating new
+ * ParamPathInfo instance and increasing of ppilist up to number of
+ * partitions in the inner. Also, if we have many partitions, each bitmapset
+ * variable will be large and many leaks of such variable (caused by relid
+ * replacement) will highly increase memory consumption.
+ * So, we deny such paths for now.
+ */
+ foreach(lc, inner_rel->pathlist)
+ {
+ if (IsA(lfirst(lc), AppendPath))
+ return false;
+ }
+
+ return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ ListCell *lc;
+
+ /*
+ * Try this kind of paths if we allow complex partitionwise joins and we know
+ * we can build this join safely.
+ */
+ if (!enable_partitionwise_join ||
+ !is_asymmetric_join_feasible(root, outer_rel, inner_rel, jointype))
+ return;
+
+ foreach (lc, outer_rel->pathlist)
+ {
+ AppendPath *append_path = lfirst(lc);
+
+ /*
+ * We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. Asymmetric join isn't needed if the append node
+ * has only one child.
+ */
+ if (IsA(append_path, AppendPath) &&
+ list_length(append_path->subpaths) > 1)
+ {
+ List **join_rel_level_saved;
+ List *live_childrels = NIL;
+
+ join_rel_level_saved = root->join_rel_level;
+ PG_TRY();
+ {
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
+
+ live_childrels =
+ extract_asymmetric_partitionwise_subjoin(root,
+ joinrel,
+ append_path,
+ inner_rel,
+ jointype,
+ extra);
+ }
+ PG_FINALLY();
+ {
+ root->join_rel_level = join_rel_level_saved;
+ }
+ PG_END_TRY();
+
+ if (live_childrels != NIL)
+ {
+ /*
+ * Add new append relation. We must choose cheapest paths after
+ * this operation because the pathlist possibly contains
+ * joinrels and appendrels that can be suboptimal.
+ */
+ add_paths_to_append_rel(root, joinrel, live_childrels);
+ set_cheapest(joinrel);
+ }
+
+ break;
+ }
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index e50624c465..fccc0685d7 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -281,24 +281,29 @@ set_plan_references(PlannerInfo *root, Plan *plan)
/*
* Adjust RT indexes of AppendRelInfos and add to final appendrels list.
- * We assume the AppendRelInfos were built during planning and don't need
- * to be copied.
+ * The AppendRelInfos are copied, because as a part of a subplan they could
+ * be visited many times in the case of asymmetric join.
*/
foreach(lc, root->append_rel_list)
{
AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+ AppendRelInfo *newappinfo;
+
+ /* flat copy is enough since all valuable fields are scalars */
+ newappinfo = (AppendRelInfo *) palloc(sizeof(AppendRelInfo));
+ memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));
/* adjust RT indexes */
- appinfo->parent_relid += rtoffset;
- appinfo->child_relid += rtoffset;
+ newappinfo->parent_relid += rtoffset;
+ newappinfo->child_relid += rtoffset;
/*
* Rather than adjust the translated_vars entries, just drop 'em.
* Neither the executor nor EXPLAIN currently need that data.
*/
- appinfo->translated_vars = NIL;
+ newappinfo->translated_vars = NIL;
- glob->appendRelations = lappend(glob->appendRelations, appinfo);
+ glob->appendRelations = lappend(glob->appendRelations, newappinfo);
}
/* Now fix the Plan tree */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index af46f581ac..f4d12f76e1 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -202,7 +202,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.appinfos = appinfos;
/* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/* Should never be translating a Query tree. */
Assert(node == NULL || !IsA(node, Query));
@@ -490,12 +492,10 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
Relids top_parent_relids)
{
AppendRelInfo **appinfos;
- Bitmapset *parent_relids = NULL;
+ Relids parent_relids = NULL;
int nappinfos;
int cnt;
- Assert(bms_num_members(child_relids) == bms_num_members(top_parent_relids));
-
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of given child. */
@@ -506,8 +506,13 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
}
- /* Recurse if immediate parent is not the top parent. */
- if (!bms_equal(parent_relids, top_parent_relids))
+ /*
+ * Recurse if immediate parent is not the top parent. Keep in mind that in a
+ * case of asymmetric JOIN top_parent_relids can contain relids which aren't
+ * part of an append node.
+ */
+ if (!bms_equal(parent_relids, top_parent_relids) &&
+ !bms_is_subset(parent_relids, top_parent_relids))
node = adjust_appendrel_attrs_multilevel(root, node, parent_relids,
top_parent_relids);
@@ -515,12 +520,13 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
node = adjust_appendrel_attrs(root, node, nappinfos, appinfos);
pfree(appinfos);
+ pfree(parent_relids);
return node;
}
/*
- * Substitute child relids for parent relids in a Relid set. The array of
+ * Substitute child relids for parent relids in a Relid set. The array of
* appinfos specifies the substitutions to be performed.
*/
Relids
@@ -565,6 +571,7 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
AppendRelInfo **appinfos;
int nappinfos;
Relids parent_relids = NULL;
+ Relids normal_relids = NULL;
Relids result;
Relids tmp_result = NULL;
int cnt;
@@ -579,13 +586,24 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of the given child. */
+ normal_relids = bms_copy(child_relids);
for (cnt = 0; cnt < nappinfos; cnt++)
{
AppendRelInfo *appinfo = appinfos[cnt];
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
+ normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
}
+ if (bms_is_subset(relids, normal_relids))
+ {
+ /* Nothing to do. Parameters set points to plain relations only. */
+ result = relids;
+ goto cleanup;
+ }
+
+ parent_relids = bms_union(parent_relids, normal_relids);
+
/* Recurse if immediate parent is not the top parent. */
if (!bms_equal(parent_relids, top_parent_relids))
{
@@ -597,10 +615,11 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
result = adjust_child_relids(relids, nappinfos, appinfos);
+cleanup:
/* Free memory consumed by any intermediate result. */
- if (tmp_result)
- bms_free(tmp_result);
+ bms_free(tmp_result);
bms_free(parent_relids);
+ bms_free(normal_relids);
pfree(appinfos);
return result;
@@ -715,11 +734,11 @@ AppendRelInfo **
find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
{
AppendRelInfo **appinfos;
+ int nrooms = bms_num_members(relids);
int cnt = 0;
int i;
- *nappinfos = bms_num_members(relids);
- appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * *nappinfos);
+ appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * nrooms);
i = -1;
while ((i = bms_next_member(relids, i)) >= 0)
@@ -727,10 +746,11 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- elog(ERROR, "child rel %d not found in append_rel_array", i);
+ continue;
appinfos[cnt++] = appinfo;
}
+ *nappinfos = cnt;
return appinfos;
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index cedb3848dd..17e215b72d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -4225,7 +4225,14 @@ do { \
MemoryContextSwitchTo(oldcontext);
}
- bms_free(required_outer);
+
+ /*
+ * If adjust_child_relids_multilevel don't do replacements it returns
+ * the original set, not a copy. It is possible in the case of asymmetric
+ * JOIN and child_rel->relids contains relids only of plane relations.
+ */
+ if (required_outer != old_ppi->ppi_req_outer)
+ bms_free(required_outer);
new_path->param_info = new_ppi;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 47769cea45..ddf0f5a876 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -792,11 +792,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
joinrel->reloptkind = RELOPT_OTHER_JOINREL;
joinrel->relids = bms_union(outer_rel->relids, inner_rel->relids);
@@ -854,8 +851,11 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
- joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
- inner_rel->top_parent_relids);
+ joinrel->top_parent_relids =
+ bms_union(IS_OTHER_REL(outer_rel) ?
+ outer_rel->top_parent_relids : outer_rel->relids,
+ IS_OTHER_REL(inner_rel) ?
+ inner_rel->top_parent_relids : inner_rel->relids);
/* Compute information relevant to foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
@@ -2036,9 +2036,8 @@ build_child_join_reltarget(PlannerInfo *root,
{
/* Build the targetlist */
childrel->reltarget->exprs = (List *)
- adjust_appendrel_attrs(root,
- (Node *) parentrel->reltarget->exprs,
- nappinfos, appinfos);
+ adjust_appendrel_attrs_multilevel(root, (Node *)parentrel->reltarget->exprs,
+ childrel->relids, parentrel->relids);
/* Set the cost and width fields */
childrel->reltarget->cost.startup = parentrel->reltarget->cost.startup;
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index f1d111063c..a0106dc539 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -109,7 +109,12 @@ extern bool have_join_order_restriction(PlannerInfo *root,
extern bool have_dangerous_phv(PlannerInfo *root,
Relids outer_relids, Relids inner_params);
extern void mark_dummy_rel(RelOptInfo *rel);
-
+extern void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra);
/*
* equivclass.c
* routines for managing EquivalenceClasses
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 27f7525b3e..3eddd9bec3 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2320,6 +2320,431 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
375 | 0375 | 375 | 0375
(8 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (x % 1000)::int,
+ ((x+1) % 1000)::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+SET max_parallel_workers_per_gather = 0;
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ -> Seq Scan on prt5_p0
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p1.a = t5_1.aid)
+ -> Seq Scan on prt5_p1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p2.a = t5_1.aid)
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (prt5.a = prt6.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Seq Scan on prt6_p1 prt6_2
+ Filter: (alabel ~~ '%abc%'::text)
+(13 rows)
+
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ count
+-------
+ 4000
+(1 row)
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+----------------------------------------------------------------------
+ Aggregate
+ -> Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = sq1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ -> Seq Scan on prt6_p1 prt6_2
+ -> Hash Join
+ Hash Cond: (prt5_2.a = sq1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_4
+ -> Seq Scan on prt6_p1 prt6_5
+ -> Hash Join
+ Hash Cond: (prt5_3.a = sq1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_7
+ -> Seq Scan on prt6_p1 prt6_8
+(32 rows)
+
+SELECT count(*) FROM prt5 JOIN (SELECT * FROM prt6 LIMIT 1000) AS sq1
+ ON a = aid AND alabel like '%abc%';
+ count
+-------
+ 2000
+(1 row)
+
+-- Asymmetric JOIN of two plane tables and one partitioned
+EXPLAIN (COSTS OFF)
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------------
+ Aggregate
+ -> Append
+ -> Hash Join
+ Hash Cond: (prt5_1.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+(35 rows)
+
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ count
+-------
+ 11000
+(1 row)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Left Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+SET enable_partitionwise_join = on;
+-- Check reparameterization code when an optimizer have to make two level relids
+-- adjustment.
+SET enable_hashjoin = 'off';
+SET enable_mergejoin = 'off';
+SET enable_material = 'off';
+CREATE TABLE big AS SELECT x AS x FROM generate_series(1,1E4) x;
+CREATE INDEX ON big(x);
+CREATE TABLE small AS SELECT x, -x AS y FROM generate_series(1,100) x;
+CREATE TABLE part_l0 (x int, y int, z int) PARTITION BY HASH (y);
+CREATE TABLE part0_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 0) PARTITION BY HASH (z);
+CREATE TABLE part0_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE part1_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE part1_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO part_l0 (x, y, z) (SELECT x,x,x FROM generate_series(1,1E4) x);
+ANALYZE big,small,part_l0;
+-- Parameter have to be reparameterized by a plane relation.
+EXPLAIN (COSTS OFF)
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------
+ Nested Loop Left Join
+ -> Append
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_1.y) AND ((small.y + part_l0_1.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part0_l2 part_l0_1
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_2.y) AND ((small.y + part_l0_2.y) < 10))
+ -> Seq Scan on part1_l2 part_l0_2
+ -> Seq Scan on small
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_3.y) AND ((small.y + part_l0_3.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part1_l1 part_l0_3
+ -> Index Only Scan using big_x_idx on big
+ Index Cond: (x = (small.x)::numeric)
+(16 rows)
+
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+ x | y
+---+---
+(0 rows)
+
+-- Parameters have to be reparameterized by plane and partitioned relations.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ -> Append
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_1.y) AND ((small.y + part_l0_1.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part0_l2 part_l0_1
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_2.y) AND ((small.y + part_l0_2.y) < 10))
+ -> Seq Scan on part1_l2 part_l0_2
+ -> Seq Scan on small
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_3.y) AND ((small.y + part_l0_3.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part1_l1 part_l0_3
+ -> Bitmap Heap Scan on big
+ Recheck Cond: ((x = (small.x)::numeric) OR (x = (part_l0.x)::numeric))
+ -> BitmapOr
+ -> Bitmap Index Scan on big_x_idx
+ Index Cond: (x = (small.x)::numeric)
+ -> Bitmap Index Scan on big_x_idx
+ Index Cond: (x = (part_l0.x)::numeric)
+(22 rows)
+
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+ count
+-------
+ 0
+(1 row)
+
+DROP TABLE IF EXISTS big,small,part_l0 CASCADE;
+RESET enable_hashjoin;
+RESET enable_mergejoin;
+RESET enable_material;
+RESET max_parallel_workers_per_gather;
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+ANALYZE prta,prtb,e;
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+ QUERY PLAN
+-----------------------------------------------------------
+ Append
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Append
+ -> Nested Loop
+ Join Filter: (prta_1.id = e_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Index Scan using e1_id_idx on e1 e_1
+ Index Cond: (id = prtb_1.id)
+ -> Nested Loop
+ Join Filter: (prta_2.id = e_2.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+ -> Index Scan using e2_id_idx on e2 e_2
+ Index Cond: (id = prtb_2.id)
+(17 rows)
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index d97b5b69ff..4d1dd5bec1 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -536,6 +536,186 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (x % 1000)::int,
+ ((x+1) % 1000)::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+SELECT count(*) FROM prt5 JOIN (SELECT * FROM prt6 LIMIT 1000) AS sq1
+ ON a = aid AND alabel like '%abc%';
+
+-- Asymmetric JOIN of two plane tables and one partitioned
+EXPLAIN (COSTS OFF)
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+SET enable_partitionwise_join = on;
+
+-- Check reparameterization code when an optimizer have to make two level relids
+-- adjustment.
+
+SET enable_hashjoin = 'off';
+SET enable_mergejoin = 'off';
+SET enable_material = 'off';
+
+CREATE TABLE big AS SELECT x AS x FROM generate_series(1,1E4) x;
+CREATE INDEX ON big(x);
+CREATE TABLE small AS SELECT x, -x AS y FROM generate_series(1,100) x;
+
+CREATE TABLE part_l0 (x int, y int, z int) PARTITION BY HASH (y);
+CREATE TABLE part0_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 0) PARTITION BY HASH (z);
+CREATE TABLE part0_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE part1_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE part1_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO part_l0 (x, y, z) (SELECT x,x,x FROM generate_series(1,1E4) x);
+
+ANALYZE big,small,part_l0;
+
+-- Parameter have to be reparameterized by a plane relation.
+EXPLAIN (COSTS OFF)
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+
+-- Parameters have to be reparameterized by plane and partitioned relations.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+
+DROP TABLE IF EXISTS big,small,part_l0 CASCADE;
+RESET enable_hashjoin;
+RESET enable_mergejoin;
+RESET enable_material;
+RESET max_parallel_workers_per_gather;
+
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+
+ANALYZE prta,prtb,e;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
--
2.33.0
Andrey Lepikhov писал 2021-09-15 09:31:
On 14/9/21 11:37, Andrey V. Lepikhov wrote:
Thank you for this good catch!
The problem was in the adjust_child_relids_multilevel routine. The
tmp_result variable sometimes points to original required_outer.
This patch adds new ways which optimizer can generate plans. One
possible way is optimizer reparameterizes an inner by a plain relation
from the outer (maybe as a result of join of the plain relation and
partitioned relation). In this case we have to compare tmp_result with
original pointer to realize, it was changed or not.
The patch in attachment fixes this problem. Additional regression test
added.I thought more and realized there isn't necessary to recurse in the
adjust_child_relids_multilevel() routine if required_outer contains
only
normal_relids.
Also, regression tests were improved a bit.
Hi.
The patch does not longer apply cleanly, so I rebased it. Attaching
rebased version.
I've looked through it once again and have several questions.
1) In adjust_appendrel_attrs_multilevel(), can it happen that
child_relids is zero-length list (in this case pfree's will fail)? It
seems, no, but should we at least assert this? Note that in
adjust_appendrel_attrs() we add logic for nappinfos being 0.
2) In try_asymmetric_partitionwise_join() we state that 'Asymmetric join
isn't needed if the append node has only one child'. This is not
completely correct. Asymmetric join with one partition can be
advantageous when JOIN(A, UNION(B)) is more expensive than UNION(JOIN
(A, B)). The later is true, for example, when we join partitioned table
having foreign partitions with another foreign table and only one
partition is left.
Let's take the attached case (foreign_join.sql). When
list_length(append_path->subpaths) > 1 is present, we get the following
plan
set enable_partitionwise_join = on;
explain SELECT t1.a,t2.b FROM fprt1 t1 INNER JOIN ftprt2_p1 t2 ON (t1.a
= t2.b) WHERE t1.a < 250 AND t2.c like '%0004' ORDER BY 1,2;
QUERY PLAN
---------------------------------------------------------------------------------------
Sort (cost=208.65..208.69 rows=17 width=8)
Sort Key: t1.a
-> Hash Join (cost=202.60..208.30 rows=17 width=8)
Hash Cond: (t1.a = t2.b)
-> Foreign Scan on ftprt1_p1 t1 (cost=100.00..105.06 rows=125
width=4)
-> Hash (cost=102.39..102.39 rows=17 width=4)
-> Foreign Scan on ftprt2_p1 t2 (cost=100.00..102.39
rows=17 width=4)
In case when we change it to list_length(append_path->subpaths) > 0, we
get foreign join and cheaper plan:
explain verbose SELECT t1.a,t2.b FROM fprt1 t1 INNER JOIN ftprt2_p1 t2
ON (t1.a = t2.b) WHERE t1.a < 250 AND t2.c like '%0004' ORDER BY 1,2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=106.15..106.19 rows=17 width=8)
Output: t1.a, t2.b
Sort Key: t1.a
-> Foreign Scan (cost=102.26..105.80 rows=17 width=8)
Output: t1.a, t2.b
Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1
t2)
Remote SQL: SELECT r4.a, r2.b FROM (public.fprt1_p1 r4 INNER
JOIN public.fprt2_p1 r2 ON (((r4.a = r2.b)) AND ((r2.c ~~ '%0004')) AND
((r4.a < 250))))
--
Best regards,
Alexander Pyhalov,
Postgres Professional
Attachments:
v20-0001-Asymmetric-partitionwise-join.patchtext/x-diff; name=v20-0001-Asymmetric-partitionwise-join.patchDownload
From 029f0662f5401e79468a315a658b05b2f2a4e7a6 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Date: Mon, 17 Jan 2022 11:33:03 +0300
Subject: [PATCH] Asymmetric partitionwise join.
Teach optimizer to consider partitionwise join of non-partitioned
table with each partition of partitioned table.
Disallow asymmetric machinery for joining of two partitioned (or appended)
relations because it could cause huge consumption of CPU and memory
during reparameterization of NestLoop path.
Change logic of the multilevel child relids adjustment, because this
feature allows the optimizer to plan in new way.
---
src/backend/optimizer/path/joinpath.c | 9 +
src/backend/optimizer/path/joinrels.c | 187 ++++++++
src/backend/optimizer/plan/setrefs.c | 17 +-
src/backend/optimizer/util/appendinfo.c | 45 +-
src/backend/optimizer/util/pathnode.c | 9 +-
src/backend/optimizer/util/relnode.c | 19 +-
src/include/optimizer/paths.h | 7 +-
src/test/regress/expected/partition_join.out | 425 +++++++++++++++++++
src/test/regress/sql/partition_join.sql | 180 ++++++++
9 files changed, 867 insertions(+), 31 deletions(-)
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index f96fc9fd282..6531981d0d8 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root,
if (set_join_pathlist_hook)
set_join_pathlist_hook(root, joinrel, outerrel, innerrel,
jointype, &extra);
+
+ /*
+ * 7. If outer relation is delivered from partition-tables, consider
+ * distributing inner relation to every partition-leaf prior to
+ * append these leafs.
+ */
+ try_asymmetric_partitionwise_join(root, joinrel,
+ outerrel, innerrel,
+ jointype, &extra);
}
/*
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 9da3ff2f9ab..dadb08ddb06 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -16,6 +16,7 @@
#include "miscadmin.h"
#include "optimizer/appendinfo.h"
+#include "optimizer/cost.h"
#include "optimizer/joininfo.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
@@ -1552,6 +1553,192 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+/*
+ * Build RelOptInfo on JOIN of each partition of the outer relation and the inner
+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs is impossible to build.
+ */
+static List *
+extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ AppendPath *append_path,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ foreach (lc, append_path->subpaths)
+ {
+ Path *child_path = lfirst(lc);
+ RelOptInfo *child_rel = child_path->parent;
+ Relids child_joinrelids;
+ Relids parent_relids;
+ RelOptInfo *child_joinrel;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+
+ child_joinrelids = bms_union(child_rel->relids, inner_rel->relids);
+ parent_relids = bms_union(append_path->path.parent->relids,
+ inner_rel->relids);
+
+ child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo,
+ child_rel->relids,
+ inner_rel->relids);
+ child_restrictlist = (List *)
+ adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist,
+ child_joinrelids, parent_relids);
+
+ child_joinrel = find_join_rel(root, child_joinrelids);
+ if (!child_joinrel)
+ child_joinrel = build_child_join_rel(root,
+ child_rel,
+ inner_rel,
+ joinrel,
+ child_restrictlist,
+ child_sjinfo,
+ jointype);
+ else
+ {
+ /*
+ * The join relation already exists. For example, it could happen if
+ * we join two plane tables with partitioned table(s).
+ * Populating this join with additional paths could push out some
+ * previously added paths which could be pointed in a subplans list
+ * of an higher level append.
+ * Of course, we could save such paths before generating new. But it
+ * can increase too much the number of paths in complex queries. It
+ * can be a task for future work.
+ */
+ return NIL;
+ }
+
+ populate_joinrel_with_paths(root,
+ child_rel,
+ inner_rel,
+ child_joinrel,
+ child_sjinfo,
+ child_restrictlist);
+
+ /* Give up if asymmetric partition-wise join is not available */
+ if (child_joinrel->pathlist == NIL)
+ return NIL;
+
+ set_cheapest(child_joinrel);
+ result = lappend(result, child_joinrel);
+ }
+ return result;
+}
+
+static bool
+is_asymmetric_join_feasible(PlannerInfo *root,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype)
+{
+ ListCell *lc;
+
+ if (jointype != JOIN_INNER && jointype != JOIN_LEFT)
+ return false;
+
+ if (IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel))
+ return false;
+
+ /* Disallow recursive usage of asymmertic join machinery */
+ if (root->join_rel_level == NULL)
+ return false;
+
+ /*
+ * Don't allow asymmetric JOIN of two append subplans.
+ * In the case of a parameterized NL join, a reparameterization procedure
+ * will lead to large memory allocations and a CPU consumption:
+ * each reparameterization will induce subpath duplication, creating new
+ * ParamPathInfo instance and increasing of ppilist up to number of
+ * partitions in the inner. Also, if we have many partitions, each bitmapset
+ * variable will be large and many leaks of such variable (caused by relid
+ * replacement) will highly increase memory consumption.
+ * So, we deny such paths for now.
+ */
+ foreach(lc, inner_rel->pathlist)
+ {
+ if (IsA(lfirst(lc), AppendPath))
+ return false;
+ }
+
+ return true;
+}
+
+void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra)
+{
+ ListCell *lc;
+
+ /*
+ * Try this kind of paths if we allow complex partitionwise joins and we know
+ * we can build this join safely.
+ */
+ if (!enable_partitionwise_join ||
+ !is_asymmetric_join_feasible(root, outer_rel, inner_rel, jointype))
+ return;
+
+ foreach (lc, outer_rel->pathlist)
+ {
+ AppendPath *append_path = lfirst(lc);
+
+ /*
+ * We assume this pathlist keeps at least one AppendPath that
+ * represents partitioned table-scan, symmetric or asymmetric
+ * partition-wise join. Asymmetric join isn't needed if the append node
+ * has only one child.
+ */
+ if (IsA(append_path, AppendPath) &&
+ list_length(append_path->subpaths) > 1)
+ {
+ List **join_rel_level_saved;
+ List *live_childrels = NIL;
+
+ join_rel_level_saved = root->join_rel_level;
+ PG_TRY();
+ {
+ /* temporary disables "dynamic programming" algorithm */
+ root->join_rel_level = NULL;
+
+ live_childrels =
+ extract_asymmetric_partitionwise_subjoin(root,
+ joinrel,
+ append_path,
+ inner_rel,
+ jointype,
+ extra);
+ }
+ PG_FINALLY();
+ {
+ root->join_rel_level = join_rel_level_saved;
+ }
+ PG_END_TRY();
+
+ if (live_childrels != NIL)
+ {
+ /*
+ * Add new append relation. We must choose cheapest paths after
+ * this operation because the pathlist possibly contains
+ * joinrels and appendrels that can be suboptimal.
+ */
+ add_paths_to_append_rel(root, joinrel, live_childrels);
+ set_cheapest(joinrel);
+ }
+
+ break;
+ }
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index e44ae971b4b..7cc7c863b80 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -282,24 +282,29 @@ set_plan_references(PlannerInfo *root, Plan *plan)
/*
* Adjust RT indexes of AppendRelInfos and add to final appendrels list.
- * We assume the AppendRelInfos were built during planning and don't need
- * to be copied.
+ * The AppendRelInfos are copied, because as a part of a subplan they could
+ * be visited many times in the case of asymmetric join.
*/
foreach(lc, root->append_rel_list)
{
AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+ AppendRelInfo *newappinfo;
+
+ /* flat copy is enough since all valuable fields are scalars */
+ newappinfo = (AppendRelInfo *) palloc(sizeof(AppendRelInfo));
+ memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));
/* adjust RT indexes */
- appinfo->parent_relid += rtoffset;
- appinfo->child_relid += rtoffset;
+ newappinfo->parent_relid += rtoffset;
+ newappinfo->child_relid += rtoffset;
/*
* Rather than adjust the translated_vars entries, just drop 'em.
* Neither the executor nor EXPLAIN currently need that data.
*/
- appinfo->translated_vars = NIL;
+ newappinfo->translated_vars = NIL;
- glob->appendRelations = lappend(glob->appendRelations, appinfo);
+ glob->appendRelations = lappend(glob->appendRelations, newappinfo);
}
/* If needed, create workspace for processing AlternativeSubPlans */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index b8039c323b6..76d922a9778 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -201,8 +201,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.nappinfos = nappinfos;
context.appinfos = appinfos;
- /* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/* Should never be translating a Query tree. */
Assert(node == NULL || !IsA(node, Query));
@@ -490,12 +491,10 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
Relids top_parent_relids)
{
AppendRelInfo **appinfos;
- Bitmapset *parent_relids = NULL;
+ Relids parent_relids = NULL;
int nappinfos;
int cnt;
- Assert(bms_num_members(child_relids) == bms_num_members(top_parent_relids));
-
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of given child. */
@@ -506,8 +505,13 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
}
- /* Recurse if immediate parent is not the top parent. */
- if (!bms_equal(parent_relids, top_parent_relids))
+ /*
+ * Recurse if immediate parent is not the top parent. Keep in mind that in a
+ * case of asymmetric JOIN top_parent_relids can contain relids which aren't
+ * part of an append node.
+ */
+ if (!bms_equal(parent_relids, top_parent_relids) &&
+ !bms_is_subset(parent_relids, top_parent_relids))
node = adjust_appendrel_attrs_multilevel(root, node, parent_relids,
top_parent_relids);
@@ -515,12 +519,13 @@ adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
node = adjust_appendrel_attrs(root, node, nappinfos, appinfos);
pfree(appinfos);
+ pfree(parent_relids);
return node;
}
/*
- * Substitute child relids for parent relids in a Relid set. The array of
+ * Substitute child relids for parent relids in a Relid set. The array of
* appinfos specifies the substitutions to be performed.
*/
Relids
@@ -565,6 +570,7 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
AppendRelInfo **appinfos;
int nappinfos;
Relids parent_relids = NULL;
+ Relids normal_relids = NULL;
Relids result;
Relids tmp_result = NULL;
int cnt;
@@ -579,13 +585,24 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);
/* Construct relids set for the immediate parent of the given child. */
+ normal_relids = bms_copy(child_relids);
for (cnt = 0; cnt < nappinfos; cnt++)
{
AppendRelInfo *appinfo = appinfos[cnt];
parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
+ normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
}
+ if (bms_is_subset(relids, normal_relids))
+ {
+ /* Nothing to do. Parameters set points to plain relations only. */
+ result = relids;
+ goto cleanup;
+ }
+
+ parent_relids = bms_union(parent_relids, normal_relids);
+
/* Recurse if immediate parent is not the top parent. */
if (!bms_equal(parent_relids, top_parent_relids))
{
@@ -597,10 +614,11 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
result = adjust_child_relids(relids, nappinfos, appinfos);
+cleanup:
/* Free memory consumed by any intermediate result. */
- if (tmp_result)
- bms_free(tmp_result);
+ bms_free(tmp_result);
bms_free(parent_relids);
+ bms_free(normal_relids);
pfree(appinfos);
return result;
@@ -715,11 +733,11 @@ AppendRelInfo **
find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
{
AppendRelInfo **appinfos;
+ int nrooms = bms_num_members(relids);
int cnt = 0;
int i;
- *nappinfos = bms_num_members(relids);
- appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * *nappinfos);
+ appinfos = (AppendRelInfo **) palloc(sizeof(AppendRelInfo *) * nrooms);
i = -1;
while ((i = bms_next_member(relids, i)) >= 0)
@@ -727,10 +745,11 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- elog(ERROR, "child rel %d not found in append_rel_array", i);
+ continue;
appinfos[cnt++] = appinfo;
}
+ *nappinfos = cnt;
return appinfos;
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 5c32c96b71c..31374ceea4e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -4227,7 +4227,14 @@ do { \
MemoryContextSwitchTo(oldcontext);
}
- bms_free(required_outer);
+
+ /*
+ * If adjust_child_relids_multilevel don't do replacements it returns
+ * the original set, not a copy. It is possible in the case of asymmetric
+ * JOIN and child_rel->relids contains relids only of plane relations.
+ */
+ if (required_outer != old_ppi->ppi_req_outer)
+ bms_free(required_outer);
new_path->param_info = new_ppi;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 520409f4ba0..bc750fda2a4 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -792,11 +792,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
joinrel->reloptkind = RELOPT_OTHER_JOINREL;
joinrel->relids = bms_union(outer_rel->relids, inner_rel->relids);
@@ -854,8 +851,11 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
- joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
- inner_rel->top_parent_relids);
+ joinrel->top_parent_relids =
+ bms_union(IS_OTHER_REL(outer_rel) ?
+ outer_rel->top_parent_relids : outer_rel->relids,
+ IS_OTHER_REL(inner_rel) ?
+ inner_rel->top_parent_relids : inner_rel->relids);
/* Compute information relevant to foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
@@ -2036,9 +2036,8 @@ build_child_join_reltarget(PlannerInfo *root,
{
/* Build the targetlist */
childrel->reltarget->exprs = (List *)
- adjust_appendrel_attrs(root,
- (Node *) parentrel->reltarget->exprs,
- nappinfos, appinfos);
+ adjust_appendrel_attrs_multilevel(root, (Node *)parentrel->reltarget->exprs,
+ childrel->relids, parentrel->relids);
/* Set the cost and width fields */
childrel->reltarget->cost.startup = parentrel->reltarget->cost.startup;
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0c3a0b90c85..9ed55e1ec85 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -109,7 +109,12 @@ extern bool have_join_order_restriction(PlannerInfo *root,
extern bool have_dangerous_phv(PlannerInfo *root,
Relids outer_relids, Relids inner_params);
extern void mark_dummy_rel(RelOptInfo *rel);
-
+extern void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ JoinType jointype,
+ JoinPathExtraData *extra);
/*
* equivclass.c
* routines for managing EquivalenceClasses
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index bb5b7c47a45..0acc5c007bd 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2320,6 +2320,431 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
375 | 0375 | 375 | 0375
(8 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (x % 1000)::int,
+ ((x+1) % 1000)::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+SET max_parallel_workers_per_gather = 0;
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ -> Seq Scan on prt5_p0
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p1.a = t5_1.aid)
+ -> Seq Scan on prt5_p1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_p2.a = t5_1.aid)
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (prt5.a = prt6.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Seq Scan on prt6_p1 prt6_2
+ Filter: (alabel ~~ '%abc%'::text)
+(13 rows)
+
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ count
+-------
+ 4000
+(1 row)
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+----------------------------------------------------------------------
+ Aggregate
+ -> Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = sq1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ -> Seq Scan on prt6_p1 prt6_2
+ -> Hash Join
+ Hash Cond: (prt5_2.a = sq1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_4
+ -> Seq Scan on prt6_p1 prt6_5
+ -> Hash Join
+ Hash Cond: (prt5_3.a = sq1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_7
+ -> Seq Scan on prt6_p1 prt6_8
+(32 rows)
+
+SELECT count(*) FROM prt5 JOIN (SELECT * FROM prt6 LIMIT 1000) AS sq1
+ ON a = aid AND alabel like '%abc%';
+ count
+-------
+ 2000
+(1 row)
+
+-- Asymmetric JOIN of two plane tables and one partitioned
+EXPLAIN (COSTS OFF)
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------------
+ Aggregate
+ -> Append
+ -> Hash Join
+ Hash Cond: (prt5_1.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+(35 rows)
+
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ count
+-------
+ 11000
+(1 row)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Left Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+SET enable_partitionwise_join = on;
+-- Check reparameterization code when an optimizer have to make two level relids
+-- adjustment.
+SET enable_hashjoin = 'off';
+SET enable_mergejoin = 'off';
+SET enable_material = 'off';
+CREATE TABLE big AS SELECT x AS x FROM generate_series(1,1E4) x;
+CREATE INDEX ON big(x);
+CREATE TABLE small AS SELECT x, -x AS y FROM generate_series(1,100) x;
+CREATE TABLE part_l0 (x int, y int, z int) PARTITION BY HASH (y);
+CREATE TABLE part0_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 0) PARTITION BY HASH (z);
+CREATE TABLE part0_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE part1_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE part1_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO part_l0 (x, y, z) (SELECT x,x,x FROM generate_series(1,1E4) x);
+ANALYZE big,small,part_l0;
+-- Parameter have to be reparameterized by a plane relation.
+EXPLAIN (COSTS OFF)
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------
+ Nested Loop Left Join
+ -> Append
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_1.y) AND ((small.y + part_l0_1.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part0_l2 part_l0_1
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_2.y) AND ((small.y + part_l0_2.y) < 10))
+ -> Seq Scan on part1_l2 part_l0_2
+ -> Seq Scan on small
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_3.y) AND ((small.y + part_l0_3.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part1_l1 part_l0_3
+ -> Index Only Scan using big_x_idx on big
+ Index Cond: (x = (small.x)::numeric)
+(16 rows)
+
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+ x | y
+---+---
+(0 rows)
+
+-- Parameters have to be reparameterized by plane and partitioned relations.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ -> Append
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_1.y) AND ((small.y + part_l0_1.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part0_l2 part_l0_1
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_2.y) AND ((small.y + part_l0_2.y) < 10))
+ -> Seq Scan on part1_l2 part_l0_2
+ -> Seq Scan on small
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_3.y) AND ((small.y + part_l0_3.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part1_l1 part_l0_3
+ -> Bitmap Heap Scan on big
+ Recheck Cond: ((x = (small.x)::numeric) OR (x = (part_l0.x)::numeric))
+ -> BitmapOr
+ -> Bitmap Index Scan on big_x_idx
+ Index Cond: (x = (small.x)::numeric)
+ -> Bitmap Index Scan on big_x_idx
+ Index Cond: (x = (part_l0.x)::numeric)
+(22 rows)
+
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+ count
+-------
+ 0
+(1 row)
+
+DROP TABLE IF EXISTS big,small,part_l0 CASCADE;
+RESET enable_hashjoin;
+RESET enable_mergejoin;
+RESET enable_material;
+RESET max_parallel_workers_per_gather;
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+ANALYZE prta,prtb,e;
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+ QUERY PLAN
+-----------------------------------------------------------
+ Append
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Append
+ -> Nested Loop
+ Join Filter: (prta_1.id = e_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Index Scan using e1_id_idx on e1 e_1
+ Index Cond: (id = prtb_1.id)
+ -> Nested Loop
+ Join Filter: (prta_2.id = e_2.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+ -> Index Scan using e2_id_idx on e2 e_2
+ Index Cond: (id = prtb_2.id)
+(17 rows)
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index 67f506361f8..428cdaf424b 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -536,6 +536,186 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (x % 1000)::int,
+ ((x+1) % 1000)::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+
+-- Check asymmetric JOIN with Subquery
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+SELECT count(*) FROM prt5 JOIN (SELECT * FROM prt6 LIMIT 1000) AS sq1
+ ON a = aid AND alabel like '%abc%';
+
+-- Asymmetric JOIN of two plane tables and one partitioned
+EXPLAIN (COSTS OFF)
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+SET enable_partitionwise_join = on;
+
+-- Check reparameterization code when an optimizer have to make two level relids
+-- adjustment.
+
+SET enable_hashjoin = 'off';
+SET enable_mergejoin = 'off';
+SET enable_material = 'off';
+
+CREATE TABLE big AS SELECT x AS x FROM generate_series(1,1E4) x;
+CREATE INDEX ON big(x);
+CREATE TABLE small AS SELECT x, -x AS y FROM generate_series(1,100) x;
+
+CREATE TABLE part_l0 (x int, y int, z int) PARTITION BY HASH (y);
+CREATE TABLE part0_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 0) PARTITION BY HASH (z);
+CREATE TABLE part0_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE part1_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE part1_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO part_l0 (x, y, z) (SELECT x,x,x FROM generate_series(1,1E4) x);
+
+ANALYZE big,small,part_l0;
+
+-- Parameter have to be reparameterized by a plane relation.
+EXPLAIN (COSTS OFF)
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+
+-- Parameters have to be reparameterized by plane and partitioned relations.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+
+DROP TABLE IF EXISTS big,small,part_l0 CASCADE;
+RESET enable_hashjoin;
+RESET enable_mergejoin;
+RESET enable_material;
+RESET max_parallel_workers_per_gather;
+
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+
+ANALYZE prta,prtb,e;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
--
2.25.1
Hi Alexander,
Hi Andrey,
Thank you for your work on this subject.
On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:
The patch does not longer apply cleanly, so I rebased it. Attaching
rebased version.
Not surprising that the patch doesn't apply after 1.5 years since the
last message. Could you please rebase it?
I read the thread and the patch. The patch improves the joining of
partitioned tables with non-partitioned relations. Let's denote
non-partitioned relation as A, partitions as P1 ... PN. The patch
allows to Append(Join(A, P1), ... Join(A, PN) instead of Join(A,
Append(P1, ... PN). That could be cheaper because it's generally
cheaper to join small pieces rather than do one big join. The
drawback is the need to scan A multiple times. But is this really
necessary and acceptable? Let's consider multiple options.
1) A is non-table. For instance, A is a function scan. In this case,
doing multiple scans of A is not just expensive, but could lead to
unexpected side effects. When the user includes a function once in
the FROM clause, she expects this function to be evaluated once. I
propose that we should materialize a scan of non-table relations. So,
materialized representation will be scanned multiple times, but the
source only scanned once. That would be similar to CTE.
2) A is the table to be scanned with the parametrized path in the
inner part of the nested loop join. In this case, there is no big
scan of A and nothing to materialize.
3) A is the table to be used in merge join or outer part of nested
loop join. In this case, it would be nice to consider materialize.
It's not always good to materialize, because materialization has its
additional costs. I think that could be a cost-based decision.
4) A is used in the hash join. Could we re-use the hashed
representation of A between multiple joins? I read upthread it was
proposed to share a hashed table between multiple background workers
via shared memory. But the first step would be to just share it
between multiple join nodes within the same process.
As we consider joining with each partition individually, there could
be chosen different join methods. As I get, the current patch
considers joining with each of the partitions as a separate isolated
optimization task. However, if we share resources between the
multiple joins, then rises a need for some global optimization. For
instance, a join type could be expensive when applied to an individual
partition, but cheap when applied to all the partitions thanks to
saving the common work.
My idea is to consider generated common resources (such as
materialized scans) as a property of the path. For instance, if the
nested loop join is cheaper than the hash join, but the hash join
generates a common hash map of table A, we don't drop hash join
immediately from the consideration and leave it to see how it could
help join other partitions. What do you think?
------
Regards,
Alexander Korotkov
On 15/10/2023 07:18, Alexander Korotkov wrote:
Hi Alexander,
Hi Andrey,Thank you for your work on this subject.
On Mon, Jan 17, 2022 at 1:42 PM Alexander Pyhalov
<a.pyhalov@postgrespro.ru> wrote:The patch does not longer apply cleanly, so I rebased it. Attaching
rebased version.Not surprising that the patch doesn't apply after 1.5 years since the
last message. Could you please rebase it?I read the thread and the patch. The patch improves the joining of
partitioned tables with non-partitioned relations. Let's denote
non-partitioned relation as A, partitions as P1 ... PN. The patch
allows to Append(Join(A, P1), ... Join(A, PN) instead of Join(A,
Append(P1, ... PN). That could be cheaper because it's generally
cheaper to join small pieces rather than do one big join. The
drawback is the need to scan A multiple times. But is this really
necessary and acceptable? Let's consider multiple options.1) A is non-table. For instance, A is a function scan. In this case,
doing multiple scans of A is not just expensive, but could lead to
unexpected side effects. When the user includes a function once in
the FROM clause, she expects this function to be evaluated once. I
propose that we should materialize a scan of non-table relations. So,
materialized representation will be scanned multiple times, but the
source only scanned once. That would be similar to CTE.
2) A is the table to be scanned with the parametrized path in the
inner part of the nested loop join. In this case, there is no big
scan of A and nothing to materialize.
3) A is the table to be used in merge join or outer part of nested
loop join. In this case, it would be nice to consider materialize.
It's not always good to materialize, because materialization has its
additional costs. I think that could be a cost-based decision.
4) A is used in the hash join. Could we re-use the hashed
representation of A between multiple joins? I read upthread it was
proposed to share a hashed table between multiple background workers
via shared memory. But the first step would be to just share it
between multiple join nodes within the same process.As we consider joining with each partition individually, there could
be chosen different join methods. As I get, the current patch
considers joining with each of the partitions as a separate isolated
optimization task. However, if we share resources between the
multiple joins, then rises a need for some global optimization. For
instance, a join type could be expensive when applied to an individual
partition, but cheap when applied to all the partitions thanks to
saving the common work.My idea is to consider generated common resources (such as
materialized scans) as a property of the path. For instance, if the
nested loop join is cheaper than the hash join, but the hash join
generates a common hash map of table A, we don't drop hash join
immediately from the consideration and leave it to see how it could
help join other partitions. What do you think?
Thanks for such detailed feedback!
The rationale for this patch was to give the optimizer additional ways
to push down more joins into foreign servers. And, because of
asynchronous append, the benefit of that optimization was obvious.
Unfortunately, we hadn't found other applications for this feature,
which was why this patch was postponed in the core.
You have brought new ideas about applying this idea locally. Moreover,
the main issue of the patch was massive memory consumption in the case
of many joins and partitions - because of reparameterization. But now,
postponing the reparameterization proposed in the thread [1]Oversight in reparameterize_path_by_child leading to executor crash /messages/by-id/CAMbWs496+N=UAjOc=rcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw@mail.gmail.com resolves
that problem and gives some insights into the reparameterization
technique of some fields, like lateral references.
Hence, I think we can restart this work.
The first thing here (after rebase, of course) is to figure out and
implement in the cost model cases of effectiveness when asymmetric join
would give significant performance.
[1]: Oversight in reparameterize_path_by_child leading to executor crash /messages/by-id/CAMbWs496+N=UAjOc=rcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw@mail.gmail.com
/messages/by-id/CAMbWs496+N=UAjOc=rcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw@mail.gmail.com
--
regards,
Andrey Lepikhov
Postgres Professional
On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
Thanks for such detailed feedback!
The rationale for this patch was to give the optimizer additional ways
to push down more joins into foreign servers. And, because of
asynchronous append, the benefit of that optimization was obvious.
Unfortunately, we hadn't found other applications for this feature,
which was why this patch was postponed in the core.
You have brought new ideas about applying this idea locally. Moreover,
the main issue of the patch was massive memory consumption in the case
of many joins and partitions - because of reparameterization. But now,
postponing the reparameterization proposed in the thread [1] resolves
that problem and gives some insights into the reparameterization
technique of some fields, like lateral references.
Hence, I think we can restart this work.
The first thing here (after rebase, of course) is to figure out and
implement in the cost model cases of effectiveness when asymmetric join
would give significant performance.
Great! I'm looking forward to the revised patch.
------
Regards,
Alexander Korotkov
On 15/10/2023 17:25, Alexander Korotkov wrote:
On Sun, Oct 15, 2023 at 8:40 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:Thanks for such detailed feedback!
The rationale for this patch was to give the optimizer additional ways
to push down more joins into foreign servers. And, because of
asynchronous append, the benefit of that optimization was obvious.
Unfortunately, we hadn't found other applications for this feature,
which was why this patch was postponed in the core.
You have brought new ideas about applying this idea locally. Moreover,
the main issue of the patch was massive memory consumption in the case
of many joins and partitions - because of reparameterization. But now,
postponing the reparameterization proposed in the thread [1] resolves
that problem and gives some insights into the reparameterization
technique of some fields, like lateral references.
Hence, I think we can restart this work.
The first thing here (after rebase, of course) is to figure out and
implement in the cost model cases of effectiveness when asymmetric join
would give significant performance.Great! I'm looking forward to the revised patch
Before preparing a new patch, it would be better to find the common
ground in the next issue:
So far, this optimization stays aside, proposing an alternative path for
a join RelOptInfo if we have an underlying append path in the outer.
My back burner is redesigning the approach: asymmetric join doesn't
change the partitioning scheme and bounds of the partitioned side. So,
it looks consecutive to make it a part of partitionwise_join machinery
and implement it as a part of the try_partitionwise_join /
generate_partitionwise_join_paths routines.
--
regards,
Andrey Lepikhov
Postgres Professional
On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
Great! I'm looking forward to the revised patch
Before preparing a new patch, it would be better to find the common
ground in the next issue:
So far, this optimization stays aside, proposing an alternative path for
a join RelOptInfo if we have an underlying append path in the outer.
My back burner is redesigning the approach: asymmetric join doesn't
change the partitioning scheme and bounds of the partitioned side. So,
it looks consecutive to make it a part of partitionwise_join machinery
and implement it as a part of the try_partitionwise_join /
generate_partitionwise_join_paths routines.
I think we need an example where such a join will be faster than
non-partitioned join when both the sides are local. It might be
possible to come up with such an example without writing any code. The
idea would be to rewrite SQL as union of joins.
Whenever I visited this idea, I hit one issue prominently - how would
we differentiate different scans of the non-partitioned relation.
Normally we do that using different Relids but in this case we
wouldn't be able to know the number of such relations involved in the
query unless we start planning such a join. It's late to add new base
relations and assign them new Relids. Of course I haven't thought hard
about it. I haven't looked at the patch to see whether this problem is
solved and how.
--
Best Wishes,
Ashutosh Bapat
On 16/10/2023 23:21, Ashutosh Bapat wrote:
On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
Whenever I visited this idea, I hit one issue prominently - how would
we differentiate different scans of the non-partitioned relation.
Normally we do that using different Relids but in this case we
wouldn't be able to know the number of such relations involved in the
query unless we start planning such a join. It's late to add new base
relations and assign them new Relids. Of course I haven't thought hard
about it. I haven't looked at the patch to see whether this problem is
solved and how.
I'm curious, which type of problems do you afraid here? Why we need a
range table entry for each scan of non-partitioned relation?
--
regards,
Andrey Lepikhov
Postgres Professional
On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
On 16/10/2023 23:21, Ashutosh Bapat wrote:
On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
Whenever I visited this idea, I hit one issue prominently - how would
we differentiate different scans of the non-partitioned relation.
Normally we do that using different Relids but in this case we
wouldn't be able to know the number of such relations involved in the
query unless we start planning such a join. It's late to add new base
relations and assign them new Relids. Of course I haven't thought hard
about it. I haven't looked at the patch to see whether this problem is
solved and how.I'm curious, which type of problems do you afraid here? Why we need a
range table entry for each scan of non-partitioned relation?
Not RTE but RelOptInfo.
Using the same example as Alexander Korotkov, let's say A is the
nonpartitioned table and P is partitioned table with partitions P1,
P2, ... Pn. The partitionwise join would need to compute AP1, AP2, ...
APn. Each of these joins may have different properties and thus will
require creating paths. In order to save these paths, we need
RelOptInfos which are indentified by relids. Let's assume that the
relids of these join RelOptInfos are created by union of relid of A
and relid of Px (the partition being joined). This is notionally
misleading but doable.
But the clauses of A parameterized by P will produce different
translations for each of the partitions. I think we will need
different RelOptInfos (for A) to store these translations.
The relid is also used to track the scans at executor level. Since we
have so many scans on A, each may be using different plan, we will
need different ids for those.
But if you have developed a way to use a single RelOptInfo of A to do
all this, may be we don't need all this. Will take a look at your next
version of patch.
--
Best Wishes,
Ashutosh Bapat
On 17/10/2023 17:09, Ashutosh Bapat wrote:
On Tue, Oct 17, 2023 at 2:05 PM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:On 16/10/2023 23:21, Ashutosh Bapat wrote:
On Mon, Oct 16, 2023 at 10:24 AM Andrei Lepikhov
Whenever I visited this idea, I hit one issue prominently - how would
we differentiate different scans of the non-partitioned relation.
Normally we do that using different Relids but in this case we
wouldn't be able to know the number of such relations involved in the
query unless we start planning such a join. It's late to add new base
relations and assign them new Relids. Of course I haven't thought hard
about it. I haven't looked at the patch to see whether this problem is
solved and how.I'm curious, which type of problems do you afraid here? Why we need a
range table entry for each scan of non-partitioned relation?Not RTE but RelOptInfo.
Using the same example as Alexander Korotkov, let's say A is the
nonpartitioned table and P is partitioned table with partitions P1,
P2, ... Pn. The partitionwise join would need to compute AP1, AP2, ...
APn. Each of these joins may have different properties and thus will
require creating paths. In order to save these paths, we need
RelOptInfos which are indentified by relids. Let's assume that the
relids of these join RelOptInfos are created by union of relid of A
and relid of Px (the partition being joined). This is notionally
misleading but doable.
Ok, now I see your disquiet. In current patch we have built RelOptInfo
for each JOIN(A, Pi) by the build_child_join_rel() routine. And of
course, they all have different sets of cheapest paths (it is one more
point of optimality). At this point the RelOptInfo of relation A is
fully formed and upper joins use the pathlist "as is", without changes.
But the clauses of A parameterized by P will produce different
translations for each of the partitions. I think we will need
different RelOptInfos (for A) to store these translations.
Does the answer above resolved this issue?
The relid is also used to track the scans at executor level. Since we
have so many scans on A, each may be using different plan, we will
need different ids for those.
I don't understand this sentence. Which way executor uses this index of
RelOptInfo ?
--
regards,
Andrey Lepikhov
Postgres Professional
On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
But the clauses of A parameterized by P will produce different
translations for each of the partitions. I think we will need
different RelOptInfos (for A) to store these translations.Does the answer above resolved this issue?
May be. There are other problematic areas like EvalPlanQual, Rescans,
reparameterised paths which can blow up if we use the same RelOptInfo
for different scans of the same relation. It will be good to test
those. And also A need not be a simple relation; it could be join as
well.
The relid is also used to track the scans at executor level. Since we
have so many scans on A, each may be using different plan, we will
need different ids for those.I don't understand this sentence. Which way executor uses this index of
RelOptInfo ?
See Scan::scanrelid
--
Best Wishes,
Ashutosh Bapat
On 18/10/2023 16:59, Ashutosh Bapat wrote:
On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:But the clauses of A parameterized by P will produce different
translations for each of the partitions. I think we will need
different RelOptInfos (for A) to store these translations.Does the answer above resolved this issue?
May be. There are other problematic areas like EvalPlanQual, Rescans,
reparameterised paths which can blow up if we use the same RelOptInfo
for different scans of the same relation. It will be good to test
Yeah, now I got it. It is already the second place where I see some
reference to a kind of hidden rule that the rte entry (or RelOptInfo)
must correspond to only one plan node. I don't have a quick answer for
now - maybe it is a kind of architectural agreement - and I will
consider this issue during the development.
those. And also A need not be a simple relation; it could be join as
well.
For a join RelOptInfo, as well as for any subtree, we have the same
logic: the pathlist of this subtree is already formed during the
previous level of the search and will not be changed.
The relid is also used to track the scans at executor level. Since we
have so many scans on A, each may be using different plan, we will
need different ids for those.I don't understand this sentence. Which way executor uses this index of
RelOptInfo ?See Scan::scanrelid
--
regards,
Andrey Lepikhov
Postgres Professional
On 15/10/2023 13:25, Alexander Korotkov wrote:
Great! I'm looking forward to the revised patch.
Revising the code and opinions before restarting this work, I found two
different possible strategies mentioned in the thread:
1. 'Common Resources' shares the materialised result of the inner table
scan (a hash table in the case of HashJoin) to join each partition one
by one. It gives us a profit in the case of parallel append and possibly
other cases, like the one shown in the initial message.
2. 'Individual strategies' - By limiting the AJ feature to cases when
the JOIN clause contains a partitioning expression, we can push an
additional scan clause into each copy of the inner table scan, reduce
the number of tuples scanned, and even prune something because of proven
zero input.
I see the pros and cons of both approaches. The first option is more
straightforward, and its outcome is obvious in the case of parallel
append. But how can we guarantee the same join type for each join? Why
should we ignore the positive effect of different strategies for
different partitions?
The second strategy is more expensive for the optimiser, especially in
the multipartition case. But as I can predict, it is easier to implement
and looks more natural for the architecture. What do you think about that?
--
regards,
Andrei Lepikhov
Postgres Professional
On 18/10/2023 16:59, Ashutosh Bapat wrote:
On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
The relid is also used to track the scans at executor level. Since we
have so many scans on A, each may be using different plan, we will
need different ids for those.I don't understand this sentence. Which way executor uses this index of
RelOptInfo ?See Scan::scanrelid
Hi,
In the attachment, you will find a fresh version of the patch.
I've analysed the danger of the same RelOptInfo index for the executor.
In the examples I found (scared), it is still not a problem because
ExecQual() does all the jobs at one operation and doesn't intersect with
over operations. Of course, it is not a good design, and we will work on
this issue. But at least this code can be used in experiments.
Furthermore, I've shared some reflections on this feature. To avoid
cluttering the thread, I've published them in [1]https://danolivo.substack.com/p/postgresql-asymmetric-join-technique?r=34q1yy. These thoughts
provide additional context and considerations for our ongoing work.
[1]: https://danolivo.substack.com/p/postgresql-asymmetric-join-technique?r=34q1yy
https://danolivo.substack.com/p/postgresql-asymmetric-join-technique?r=34q1yy
--
regards, Andrei Lepikhov
Attachments:
v21-0001-Introduce-the-Asymmetric-Join-AJ-feature.patchtext/plain; charset=UTF-8; name=v21-0001-Introduce-the-Asymmetric-Join-AJ-feature.patchDownload
From 02df30cf9f1520688a0b474f7ecd106ba004b201 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Sat, 4 May 2024 13:05:45 +0700
Subject: [PATCH] Introduce the Asymmetric Join (AJ) feature. Asymmetric Join
is an addition to the partitionwise join strategy: it allows the optimiser
when it sees join of plain table with partitioned one to attempt joining the
table with each of partitions separately.
Here, as usual, we have some pro and cons:
* Smaller JOIN can survive memory allocations or data skews
* Different strategies for different pairs may be more effective.
* It may allow to prune more partitions and with partition constrain select smaller volume of data.
* But of course, here we have increasing complexity of planning.
To implement this feature we add consider_asymmetric_join into the RelOptInfo
structure. It must be true iff consider_partitionwise_join is false.
Implementation follows the logic of partitionwise_join.
We disallow any partitioned relations in the inner of AJ. The key point here
is to resolve the problem with different ways of achieving RelOptInfo which can
need different number of part_rels.
One annoying feature of the implementation is some sort of asymmetry introduced:
we have dependency on the order of incoming relations. If AJ was approved for
a joinrel, possible PWJ for this joinrel with different combination of inner and
outer will be declined.
New GUC enable_asymmetric_join allows to disable this feature, if necessary.
A lot of regression tests were changed or newly created.
TODO:
1. Here we doesn't resolved Ashutosh Bapat's warning on scanning of one
RangeEntryEntry in different parts of the plan. PostgreSQL used to implicitly
assume that each leaf RangeTable (and RelOptInfo) corresponds only one scan node.
For example, such partition pruning code is designed with such assumption.
An qual evaluation code also references index of scan node.
After analysis We didn't found any problems which can be caused by this
inconsistency. Unfortunately, it doesn't guarantee any issues in the future.
2. We should discover on the partition pruning more. May a join order change
impact pruning outcome and how it can influence partitioning schema?
---
.../postgres_fdw/expected/postgres_fdw.out | 211 +++--
contrib/postgres_fdw/sql/postgres_fdw.sql | 9 +
src/backend/optimizer/path/allpaths.c | 7 +-
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/joinrels.c | 222 ++++-
src/backend/optimizer/plan/createplan.c | 5 +-
src/backend/optimizer/plan/setrefs.c | 17 +-
src/backend/optimizer/util/appendinfo.c | 13 +-
src/backend/optimizer/util/pathnode.c | 151 +++-
src/backend/optimizer/util/relnode.c | 192 ++++-
src/backend/utils/misc/guc_tables.c | 10 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/nodes/pathnodes.h | 2 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 6 +-
src/test/regress/expected/partition_join.out | 810 ++++++++++++++++++
src/test/regress/expected/partition_prune.out | 440 +++++++---
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/partition_join.sql | 276 ++++++
19 files changed, 2138 insertions(+), 239 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 078b8a966f..5fc1f57f72 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -10070,6 +10070,60 @@ SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a
400 | 400
(4 rows)
+-- Apply AJ to foreign tables
+EXPLAIN (COSTS OFF)
+SELECT t1.a,t2.b FROM fprt1 t1 , ftprt2_p1 t2 WHERE (t1.a = t2.b) AND t2.c like '%0004' ORDER BY 1,2;
+ QUERY PLAN
+---------------------------------------------------------------
+ Merge Append
+ Sort Key: t1.a
+ -> Foreign Scan
+ Relations: (ftprt1_p1 t1_1) INNER JOIN (ftprt2_p1 t2)
+ -> Foreign Scan
+ Relations: (ftprt1_p2 t1_2) INNER JOIN (ftprt2_p1 t2)
+(6 rows)
+
+SELECT t1.a,t2.b FROM fprt1 t1 , ftprt2_p1 t2 WHERE (t1.a = t2.b) AND t2.c like '%0004' ORDER BY 1,2;
+ a | b
+-----+-----
+ 204 | 204
+ 210 | 210
+ 216 | 216
+ 222 | 222
+ 228 | 228
+ 234 | 234
+ 240 | 240
+ 246 | 246
+(8 rows)
+
+-- FOR UPDATE requires whole-row reference, and so asymmetric join doesn't apply
+EXPLAIN (COSTS OFF)
+SELECT t1.a,t2.b FROM fprt1 t1 , ftprt2_p1 t2 WHERE (t1.a = t2.b) AND t2.c like '%0004' ORDER BY 1,2 FOR UPDATE OF t2;
+ QUERY PLAN
+--------------------------------------------------
+ LockRows
+ -> Merge Join
+ Merge Cond: (t1.a = t2.b)
+ -> Append
+ -> Foreign Scan on ftprt1_p1 t1_1
+ -> Foreign Scan on ftprt1_p2 t1_2
+ -> Materialize
+ -> Foreign Scan on ftprt2_p1 t2
+(8 rows)
+
+SELECT t1.a,t2.b FROM fprt1 t1 , ftprt2_p1 t2 WHERE (t1.a = t2.b) AND t2.c like '%0004' ORDER BY 1,2 FOR UPDATE OF t2;
+ a | b
+-----+-----
+ 204 | 204
+ 210 | 210
+ 216 | 216
+ 222 | 222
+ 228 | 228
+ 234 | 234
+ 240 | 240
+ 246 | 246
+(8 rows)
+
RESET enable_partitionwise_join;
-- ===================================================================
-- test partitionwise aggregates
@@ -11338,25 +11392,32 @@ RESET enable_partitionwise_join;
SET enable_hashjoin TO false;
EXPLAIN (VERBOSE, COSTS OFF)
INSERT INTO join_tbl SELECT * FROM async_p1 t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
- QUERY PLAN
-----------------------------------------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Insert on public.join_tbl
- -> Nested Loop
- Output: t1.a, t1.b, t1.c, t2.a, t2.b, t2.c
- Join Filter: ((t1.a = t2.a) AND (t1.b = t2.b))
- -> Foreign Scan on public.async_p1 t1
- Output: t1.a, t1.b, t1.c
- Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 100) = 0))
- -> Append
- -> Async Foreign Scan on public.async_p1 t2_1
- Output: t2_1.a, t2_1.b, t2_1.c
- Remote SQL: SELECT a, b, c FROM public.base_tbl1
- -> Async Foreign Scan on public.async_p2 t2_2
+ -> Append
+ -> Async Foreign Scan
+ Output: t1.a, t1.b, t1.c, t2_1.a, t2_1.b, t2_1.c
+ Relations: (public.async_p1 t2_1) INNER JOIN (public.async_p1 t1)
+ Remote SQL: SELECT r3.a, r3.b, r3.c, r5.a, r5.b, r5.c FROM (public.base_tbl1 r5 INNER JOIN public.base_tbl1 r3 ON (((r3.a = r5.a)) AND ((r3.b = r5.b)) AND (((r3.b % 100) = 0))))
+ -> Nested Loop
+ Output: t1.a, t1.b, t1.c, t2_2.a, t2_2.b, t2_2.c
+ Join Filter: ((t1.a = t2_2.a) AND (t1.b = t2_2.b))
+ -> Foreign Scan on public.async_p1 t1
+ Output: t1.a, t1.b, t1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 100) = 0))
+ -> Foreign Scan on public.async_p2 t2_2
Output: t2_2.a, t2_2.b, t2_2.c
Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Nested Loop
+ Output: t1.a, t1.b, t1.c, t2_3.a, t2_3.b, t2_3.c
+ Join Filter: ((t1.a = t2_3.a) AND (t1.b = t2_3.b))
+ -> Foreign Scan on public.async_p1 t1
+ Output: t1.a, t1.b, t1.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 100) = 0))
-> Seq Scan on public.async_p3 t2_3
Output: t2_3.a, t2_3.b, t2_3.c
-(16 rows)
+(23 rows)
INSERT INTO join_tbl SELECT * FROM async_p1 t1, async_pt t2 WHERE t1.a = t2.a AND t1.b = t2.b AND t1.b % 100 = 0;
SELECT * FROM join_tbl ORDER BY a1;
@@ -11464,39 +11525,61 @@ ALTER FOREIGN TABLE async_p1 OPTIONS (use_remote_estimate 'true');
ALTER FOREIGN TABLE async_p2 OPTIONS (use_remote_estimate 'true');
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
- QUERY PLAN
-------------------------------------------------------------------------------------------
- Nested Loop
- Output: local_tbl.a, local_tbl.b, local_tbl.c, async_pt.a, async_pt.b, async_pt.c
- -> Seq Scan on public.local_tbl
- Output: local_tbl.a, local_tbl.b, local_tbl.c
- Filter: (local_tbl.c = 'bar'::text)
- -> Append
- -> Async Foreign Scan on public.async_p1 async_pt_1
+ QUERY PLAN
+-------------------------------------------------------------------------------------------------
+ Append
+ -> Nested Loop
+ Output: local_tbl.a, local_tbl.b, local_tbl.c, async_pt_1.a, async_pt_1.b, async_pt_1.c
+ -> Seq Scan on public.local_tbl
+ Output: local_tbl.a, local_tbl.b, local_tbl.c
+ Filter: (local_tbl.c = 'bar'::text)
+ -> Foreign Scan on public.async_p1 async_pt_1
Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE ((a = $1::integer))
- -> Async Foreign Scan on public.async_p2 async_pt_2
+ -> Nested Loop
+ Output: local_tbl.a, local_tbl.b, local_tbl.c, async_pt_2.a, async_pt_2.b, async_pt_2.c
+ -> Seq Scan on public.local_tbl
+ Output: local_tbl.a, local_tbl.b, local_tbl.c
+ Filter: (local_tbl.c = 'bar'::text)
+ -> Foreign Scan on public.async_p2 async_pt_2
Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE ((a = $1::integer))
+ -> Hash Join
+ Output: local_tbl.a, local_tbl.b, local_tbl.c, async_pt_3.a, async_pt_3.b, async_pt_3.c
+ Hash Cond: (async_pt_3.a = local_tbl.a)
-> Seq Scan on public.async_p3 async_pt_3
Output: async_pt_3.a, async_pt_3.b, async_pt_3.c
- Filter: (async_pt_3.a = local_tbl.a)
-(15 rows)
+ -> Hash
+ Output: local_tbl.a, local_tbl.b, local_tbl.c
+ -> Seq Scan on public.local_tbl
+ Output: local_tbl.a, local_tbl.b, local_tbl.c
+ Filter: (local_tbl.c = 'bar'::text)
+(27 rows)
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
- QUERY PLAN
--------------------------------------------------------------------------------
- Nested Loop (actual rows=1 loops=1)
- -> Seq Scan on local_tbl (actual rows=1 loops=1)
- Filter: (c = 'bar'::text)
- Rows Removed by Filter: 1
- -> Append (actual rows=1 loops=1)
- -> Async Foreign Scan on async_p1 async_pt_1 (never executed)
- -> Async Foreign Scan on async_p2 async_pt_2 (actual rows=1 loops=1)
- -> Seq Scan on async_p3 async_pt_3 (never executed)
- Filter: (a = local_tbl.a)
-(9 rows)
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Append (actual rows=1 loops=1)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on local_tbl (actual rows=1 loops=1)
+ Filter: (c = 'bar'::text)
+ Rows Removed by Filter: 1
+ -> Foreign Scan on async_p1 async_pt_1 (actual rows=0 loops=1)
+ -> Nested Loop (actual rows=1 loops=1)
+ -> Seq Scan on local_tbl (actual rows=1 loops=1)
+ Filter: (c = 'bar'::text)
+ Rows Removed by Filter: 1
+ -> Foreign Scan on async_p2 async_pt_2 (actual rows=1 loops=1)
+ -> Hash Join (actual rows=0 loops=1)
+ Hash Cond: (async_pt_3.a = local_tbl.a)
+ -> Seq Scan on async_p3 async_pt_3 (actual rows=200 loops=1)
+ -> Hash (actual rows=1 loops=1)
+ Buckets: 1024 Batches: 1 Memory Usage: 9kB
+ -> Seq Scan on local_tbl (actual rows=1 loops=1)
+ Filter: (c = 'bar'::text)
+ Rows Removed by Filter: 1
+(19 rows)
SELECT * FROM local_tbl, async_pt WHERE local_tbl.a = async_pt.a AND local_tbl.c = 'bar';
a | b | c | a | b | c
@@ -11670,29 +11753,45 @@ SET enable_mergejoin TO false;
SET enable_hashjoin TO false;
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
- QUERY PLAN
-----------------------------------------------------------------
- Nested Loop
- Output: t1.a, t1.b, t1.c, t2.a, t2.b, t2.c
- Join Filter: (t1.a = t2.a)
- -> Append
- -> Async Foreign Scan on public.async_p1 t1_1
+ QUERY PLAN
+----------------------------------------------------------------------
+ Append
+ -> Nested Loop
+ Output: t1_1.a, t1_1.b, t1_1.c, t2.a, t2.b, t2.c
+ Join Filter: (t1_1.a = t2.a)
+ -> Foreign Scan on public.async_p2 t2
+ Output: t2.a, t2.b, t2.c
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Materialize
Output: t1_1.a, t1_1.b, t1_1.c
- Filter: (t1_1.b === 505)
- Remote SQL: SELECT a, b, c FROM public.base_tbl1
- -> Async Foreign Scan on public.async_p2 t1_2
- Output: t1_2.a, t1_2.b, t1_2.c
- Filter: (t1_2.b === 505)
+ -> Foreign Scan on public.async_p1 t1_1
+ Output: t1_1.a, t1_1.b, t1_1.c
+ Filter: (t1_1.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl1
+ -> Nested Loop
+ Output: t1_2.a, t1_2.b, t1_2.c, t2.a, t2.b, t2.c
+ Join Filter: (t1_2.a = t2.a)
+ -> Foreign Scan on public.async_p2 t2
+ Output: t2.a, t2.b, t2.c
Remote SQL: SELECT a, b, c FROM public.base_tbl2
- -> Seq Scan on public.async_p3 t1_3
- Output: t1_3.a, t1_3.b, t1_3.c
- Filter: (t1_3.b === 505)
- -> Materialize
- Output: t2.a, t2.b, t2.c
+ -> Materialize
+ Output: t1_2.a, t1_2.b, t1_2.c
+ -> Foreign Scan on public.async_p2 t1_2
+ Output: t1_2.a, t1_2.b, t1_2.c
+ Filter: (t1_2.b === 505)
+ Remote SQL: SELECT a, b, c FROM public.base_tbl2
+ -> Nested Loop
+ Output: t1_3.a, t1_3.b, t1_3.c, t2.a, t2.b, t2.c
+ Join Filter: (t1_3.a = t2.a)
-> Foreign Scan on public.async_p2 t2
Output: t2.a, t2.b, t2.c
Remote SQL: SELECT a, b, c FROM public.base_tbl2
-(20 rows)
+ -> Materialize
+ Output: t1_3.a, t1_3.b, t1_3.c
+ -> Seq Scan on public.async_p3 t1_3
+ Output: t1_3.a, t1_3.b, t1_3.c
+ Filter: (t1_3.b === 505)
+(36 rows)
SELECT * FROM async_pt t1, async_p2 t2 WHERE t1.a = t2.a AND t1.b === 505;
a | b | c | a | b | c
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 09ba234e43..35e65bba2b 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3139,6 +3139,15 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
SELECT t1.a, t2.b FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) WHERE t1.a % 25 = 0 ORDER BY 1,2 FOR UPDATE OF t1;
+-- Apply AJ to foreign tables
+EXPLAIN (COSTS OFF)
+SELECT t1.a,t2.b FROM fprt1 t1 , ftprt2_p1 t2 WHERE (t1.a = t2.b) AND t2.c like '%0004' ORDER BY 1,2;
+SELECT t1.a,t2.b FROM fprt1 t1 , ftprt2_p1 t2 WHERE (t1.a = t2.b) AND t2.c like '%0004' ORDER BY 1,2;
+-- FOR UPDATE requires whole-row reference, and so asymmetric join doesn't apply
+EXPLAIN (COSTS OFF)
+SELECT t1.a,t2.b FROM fprt1 t1 , ftprt2_p1 t2 WHERE (t1.a = t2.b) AND t2.c like '%0004' ORDER BY 1,2 FOR UPDATE OF t2;
+SELECT t1.a,t2.b FROM fprt1 t1 , ftprt2_p1 t2 WHERE (t1.a = t2.b) AND t2.c like '%0004' ORDER BY 1,2 FOR UPDATE OF t2;
+
RESET enable_partitionwise_join;
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index cc51ae1757..e7c943bdf5 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -4315,8 +4315,11 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel)
if (!IS_PARTITIONED_REL(rel))
return;
- /* The relation should have consider_partitionwise_join set. */
- Assert(rel->consider_partitionwise_join);
+ /*
+ * The relation should have consider_partitionwise_join or
+ * consider_asymmetric_join set.
+ */
+ Assert(rel->consider_partitionwise_join ^ rel->consider_asymmetric_join);
/* Guard against stack overflow due to overly deep partition hierarchy. */
check_stack_depth();
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ee23ed7835..cbe724a223 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -146,6 +146,7 @@ bool enable_mergejoin = true;
bool enable_hashjoin = true;
bool enable_gathermerge = true;
bool enable_partitionwise_join = false;
+bool enable_asymmetric_join = true;
bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true;
bool enable_parallel_hash = true;
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index f3a9412d18..6af5e7c79f 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -42,6 +42,12 @@ static void try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1,
RelOptInfo *rel2, RelOptInfo *joinrel,
SpecialJoinInfo *parent_sjinfo,
List *parent_restrictlist);
+static void try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *inner_rel,
+ RelOptInfo *prel,
+ RelOptInfo *joinrel,
+ SpecialJoinInfo *parent_sjinfo,
+ List *parent_restrictlist);
static SpecialJoinInfo *build_child_join_sjinfo(PlannerInfo *root,
SpecialJoinInfo *parent_sjinfo,
Relids left_relids, Relids right_relids);
@@ -1044,6 +1050,13 @@ populate_joinrel_with_paths(PlannerInfo *root, RelOptInfo *rel1,
/* Apply partitionwise join technique, if possible. */
try_partitionwise_join(root, rel1, rel2, joinrel, sjinfo, restrictlist);
+
+ /*
+ * Consider joining inner relation to every leaf of a partitioned relation
+ */
+ try_asymmetric_partitionwise_join(root, rel1, rel2, joinrel,
+ sjinfo, restrictlist);
+
}
@@ -1491,7 +1504,8 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
check_stack_depth();
/* Nothing to do, if the join relation is not partitioned. */
- if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
+ if (joinrel->part_scheme == NULL || joinrel->nparts == 0 ||
+ !joinrel->consider_partitionwise_join)
return;
/* The join relation should have consider_partitionwise_join set. */
@@ -1507,8 +1521,8 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
Assert(REL_HAS_ALL_PART_PROPS(rel1) && REL_HAS_ALL_PART_PROPS(rel2));
/* The joining relations should have consider_partitionwise_join set. */
- Assert(rel1->consider_partitionwise_join &&
- rel2->consider_partitionwise_join);
+ Assert((rel1->consider_partitionwise_join || rel1->consider_asymmetric_join) &&
+ (rel2->consider_partitionwise_join || rel2->consider_asymmetric_join));
/*
* The partition scheme of the join relation should match that of the
@@ -1681,6 +1695,208 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
}
}
+/*
+ * DEV NOTE:
+ * Remember to set joinrel->nparts to 0 in the case we reject this way.
+ */
+static void
+try_asymmetric_partitionwise_join(PlannerInfo *root,
+ RelOptInfo *outer,
+ RelOptInfo *inner,
+ RelOptInfo *joinrel,
+ SpecialJoinInfo *parent_sjinfo,
+ List *parent_restrictlist)
+{
+ int cnt_parts;
+ RelOptInfo *inner_rel = inner;
+ RelOptInfo *prel = outer;
+
+ /* Fast path if AJ is forbidden */
+ if (!joinrel->consider_asymmetric_join)
+ return;
+
+ Assert(joinrel->part_scheme != NULL);
+
+ /*
+ * The caller will not juggle the inner and outer sides of the joinrel.
+ * Hence, we need to check reverse order here ifi direct order isn't
+ * acceptable one.
+ */
+ if (!IS_PARTITIONED_REL(prel))
+ {
+ if (parent_sjinfo->jointype != JOIN_INNER)
+ /*
+ * Can't change the order of inputs because left or anti join
+ * depend on the order.
+ */
+ return;
+
+ prel = inner;
+ inner_rel = outer;
+ }
+
+ if (!IS_PARTITIONED_REL(prel))
+ return;
+
+ Assert(REL_HAS_ALL_PART_PROPS(prel));
+
+ /*
+ * Recheck AJ conditions. XXX: do we need just to store allowed inner/outer
+ * combinations in some sort of cache?
+ */
+ if (!is_inner_rel_safe_for_asymmetric_join(root, inner_rel))
+ return;
+
+ /*
+ * Compute partition bounds. Remember, after that point we can't get out
+ * of this routine without paths population.
+ */
+ if (joinrel->boundinfo == NULL)
+ {
+ Assert(joinrel->nparts == -1);
+ joinrel->boundinfo = prel->boundinfo;
+ joinrel->nparts = prel->nparts;
+ joinrel->part_rels =
+ (RelOptInfo **) palloc0(sizeof(RelOptInfo *) * prel->nparts);
+ }
+ else
+ Assert(joinrel->nparts == prel->nparts && joinrel->part_rels != NULL);
+
+ /*
+ * Create child-join relations for this asymmetric join, if those don't
+ * exist. Add paths to child-joins for a pair of child relations
+ * corresponding to the given pair of parent relations.
+ */
+ for (cnt_parts = 0; cnt_parts < joinrel->nparts; cnt_parts++)
+ {
+ RelOptInfo *outer_child;
+ bool child_empty;
+ SpecialJoinInfo *child_sjinfo;
+ List *child_restrictlist;
+ RelOptInfo *child_joinrel;
+ AppendRelInfo **appinfos;
+ int nappinfos;
+
+ outer_child = prel->part_rels[cnt_parts];
+
+ Assert(!IS_DUMMY_REL(inner_rel));
+ child_empty = (outer_child == NULL || IS_DUMMY_REL(outer_child));
+
+ /* Remember, inner already can't be dummy at this point */
+ switch (parent_sjinfo->jointype)
+ {
+ case JOIN_INNER:
+ case JOIN_SEMI:
+ case JOIN_LEFT:
+ case JOIN_ANTI:
+ if (child_empty)
+ continue; /* ignore this join segment */
+ break;
+ default:
+ /* other values not expected here */
+ elog(ERROR, "unrecognized join type: %d",
+ (int) parent_sjinfo->jointype);
+ break;
+ }
+
+ if (outer_child == NULL)
+ {
+ /*
+ * Mark the joinrel as unpartitioned so that later functions treat
+ * it correctly.
+ * It is not obvious that we need return here. But we follow the
+ * commit 7ad6498 and just reduce risk of fault for quite rare case.
+ * Anyway, it can be fixed in the future.
+ */
+ joinrel->nparts = 0;
+ return;
+ }
+
+ /*
+ * XXX:
+ * If we want to provide each child join with personal copy of inner
+ * relation - it is a good place to do.
+ * The main reason why we still don't do it - what if AJ had
+ * implemented over arbitrary inner subtree, not only a baserel? In that
+ * case we inevitable must copy whole subtree of RelOptInfos. But we
+ * don't have standard machinery to do this and even insight on how to
+ * do it with strong guarantees.
+ */
+
+ Assert(!bms_overlap(inner_rel->relids, outer_child->relids));
+
+ /*
+ * Construct SpecialJoinInfo from parent join relations's
+ * SpecialJoinInfo.
+ */
+ child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo,
+ inner_rel->relids,
+ outer_child->relids);
+
+ /* Find the AppendRelInfo structures */
+ appinfos = find_appinfos_by_relids(root,
+ outer_child->relids,
+ &nappinfos);
+
+ child_restrictlist =
+ (List *) adjust_appendrel_attrs(root,
+ (Node *) parent_restrictlist,
+ nappinfos, appinfos);
+
+ if (joinrel->part_rels[cnt_parts] == NULL)
+ {
+ child_joinrel = build_child_join_rel(root, outer_child, inner_rel,
+ joinrel, child_restrictlist,
+ child_sjinfo);
+ joinrel->part_rels[cnt_parts] = child_joinrel;
+ joinrel->live_parts = bms_add_member(joinrel->live_parts, cnt_parts);
+ joinrel->all_partrels = bms_add_members(joinrel->all_partrels,
+ child_joinrel->relids);
+ }
+ else
+ child_joinrel = joinrel->part_rels[cnt_parts];
+
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Check correctness of the adjustment here. It is much faster and
+ * easier to fix issues, detected here, than in setrefs.c module.
+ */
+ {
+ Bitmapset *parent = bms_del_members(bms_copy(joinrel->relids),
+ child_joinrel->relids);
+ ListCell *lc;
+
+ Assert(!bms_is_empty(parent));
+
+ foreach(lc, child_restrictlist)
+ {
+ RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);
+
+ if(bms_intersect(parent, rinfo->clause_relids))
+ elog(ERROR, "Incorrect relids in the child join clause");
+ }
+ }
+#endif
+ Assert(bms_equal(child_joinrel->relids,
+ adjust_child_relids(joinrel->relids,
+ nappinfos, appinfos)));
+
+ /* And make paths for the child join */
+ populate_joinrel_with_paths(root, outer_child, inner_rel,
+ child_joinrel, child_sjinfo,
+ child_restrictlist);
+
+ pfree(appinfos);
+
+ /*
+ * free_child_join_sjinfo(child_sjinfo);
+ * We can't free it here, because min_righthand value isn't changed
+ * in the AJ optimisation yet - remember that the same copy of inner
+ * relation participates in each child join.
+ */
+ }
+}
+
/*
* Construct the SpecialJoinInfo for a child-join by translating
* SpecialJoinInfo for the join between parents. left_relids and right_relids
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 3b77886567..7264d1ebd3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4357,6 +4357,8 @@ create_nestloop_plan(PlannerInfo *root,
Relids outerrelids;
List *nestParams;
Relids saveOuterRels = root->curOuterRels;
+ bool needFlatCopy =
+ is_asymmetric_join((Path *) best_path) ? true : false;
/*
* If the inner path is parameterized by the topmost parent of the outer
@@ -4366,7 +4368,8 @@ create_nestloop_plan(PlannerInfo *root,
best_path->jpath.innerjoinpath =
reparameterize_path_by_child(root,
best_path->jpath.innerjoinpath,
- best_path->jpath.outerjoinpath->parent);
+ best_path->jpath.outerjoinpath->parent,
+ needFlatCopy);
/*
* Failure here probably means that reparameterize_path_by_child() is not
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 37abcb4701..dce6124abe 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -319,24 +319,29 @@ set_plan_references(PlannerInfo *root, Plan *plan)
/*
* Adjust RT indexes of AppendRelInfos and add to final appendrels list.
- * We assume the AppendRelInfos were built during planning and don't need
- * to be copied.
+ * The AppendRelInfos are copied, because as a part of a subplan they could
+ * be visited many times in the case of asymmetric join.
*/
foreach(lc, root->append_rel_list)
{
AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+ AppendRelInfo *newappinfo;
+
+ /* flat copy is enough since all valuable fields are scalars */
+ newappinfo = (AppendRelInfo *) palloc(sizeof(AppendRelInfo));
+ memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));
/* adjust RT indexes */
- appinfo->parent_relid += rtoffset;
- appinfo->child_relid += rtoffset;
+ newappinfo->parent_relid += rtoffset;
+ newappinfo->child_relid += rtoffset;
/*
* Rather than adjust the translated_vars entries, just drop 'em.
* Neither the executor nor EXPLAIN currently need that data.
*/
- appinfo->translated_vars = NIL;
+ newappinfo->translated_vars = NIL;
- glob->appendRelations = lappend(glob->appendRelations, appinfo);
+ glob->appendRelations = lappend(glob->appendRelations, newappinfo);
}
/* If needed, create workspace for processing AlternativeSubPlans */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index 6ba4eba224..3b08ea7131 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -202,8 +202,9 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
context.nappinfos = nappinfos;
context.appinfos = appinfos;
- /* If there's nothing to adjust, don't call this function. */
- Assert(nappinfos >= 1 && appinfos != NULL);
+ /* If there's nothing to adjust, just return a duplication */
+ if (nappinfos == 0)
+ return copyObject(node);
/* Should never be translating a Query tree. */
Assert(node == NULL || !IsA(node, Query));
@@ -746,13 +747,7 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
AppendRelInfo *appinfo = root->append_rel_array[i];
if (!appinfo)
- {
- /* Probably i is an OJ index, but let's check */
- if (find_base_rel_ignore_join(root, i) == NULL)
- continue;
- /* It's a base rel, but we lack an append_rel_array entry */
- elog(ERROR, "child rel %d not found in append_rel_array", i);
- }
+ continue;
appinfos[cnt++] = appinfo;
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3cf1dac087..471e34109e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -51,7 +51,8 @@ static int append_total_cost_compare(const ListCell *a, const ListCell *b);
static int append_startup_cost_compare(const ListCell *a, const ListCell *b);
static List *reparameterize_pathlist_by_child(PlannerInfo *root,
List *pathlist,
- RelOptInfo *child_rel);
+ RelOptInfo *child_rel,
+ bool needFlatCopy);
static bool pathlist_is_reparameterizable_by_child(List *pathlist,
RelOptInfo *child_rel);
@@ -4084,6 +4085,59 @@ reparameterize_path(PlannerInfo *root, Path *path,
return NULL;
}
+
+#define IS_PARTITION(rel) \
+ (rel->reloptkind == RELOPT_OTHER_JOINREL || \
+ rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
+
+/*
+ * AJ path is a path which has OTHER_REL parent, non-other, non inherited relation on one side and partition on the opposite side
+ * Partitioned side of the AJ can be RELOPT_OTHER_MEMBER_REL (base table) or
+ * RELOPT_OTHER_JOINREL for a join. May it be OTHER_UPPER_REL? - not for now.
+ * We should differ AJ from trivial UNION ALL.
+ */
+bool
+is_asymmetric_join(Path *path)
+{
+ RelOptInfo *rel = ((Path *) path)->parent;
+ RelOptInfo *plainrel;
+ RelOptInfo *prel;
+ JoinPath *jpath;
+
+ Assert(IsA(path, NestPath) || IsA(path, MergePath) || IsA(path, HashPath));
+ jpath = (JoinPath *) path;
+
+ if (!IS_OTHER_REL(rel))
+ return false;
+
+ Assert(rel->reloptkind != RELOPT_OTHER_UPPER_REL);
+
+ /* Discover left and right sides of the join */
+
+ plainrel = jpath->innerjoinpath->parent;
+ prel = jpath->outerjoinpath->parent;
+
+ /*
+ * RelOptInfo can be implemented by a bushy path tree. In that case we
+ * should dive below, but it is impractical right now - we can't check out
+ * the code and test it because the case doesn't exists.
+ */
+ Assert(prel != plainrel);
+
+ /* Identify prospective inner and outer of the AJ */
+ if (!IS_PARTITION(prel))
+ {
+ /* Inner join allows to change inner and outer sides of AJ */
+ plainrel = prel;
+ prel = jpath->innerjoinpath->parent;
+ }
+
+ if (!IS_PARTITION(prel) || IS_OTHER_REL(plainrel))
+ return false;
+
+ return true;
+}
+
/*
* reparameterize_path_by_child
* Given a path parameterized by the parent of the given child relation,
@@ -4094,6 +4148,11 @@ reparameterize_path(PlannerInfo *root, Path *path,
* recursively reparameterized. Other fields that refer to specific relids
* also need adjustment.
*
+ * In the case of asymmetric join we utilise the same RelOptInfo in different
+ * parts of the plan. Being kludge example of coding it needs to make a copy of
+ * the node before modifying it. Recheck needFlatCopy decision each time it
+ * finds a join node during reparameterization, if it still not needed.
+ *
* The cost, number of rows, width and parallel path properties depend upon
* path->parent, which does not change during the translation. So we need
* not change those.
@@ -4110,13 +4169,22 @@ reparameterize_path(PlannerInfo *root, Path *path,
*/
Path *
reparameterize_path_by_child(PlannerInfo *root, Path *path,
- RelOptInfo *child_rel)
+ RelOptInfo *child_rel, bool needFlatCopy)
{
Path *new_path;
ParamPathInfo *new_ppi;
ParamPathInfo *old_ppi;
Relids required_outer;
+#define FLAT_COPY_PATH(newnode, node, nodetype) \
+ if (needFlatCopy) \
+ { \
+ (newnode) = makeNode(nodetype); \
+ memcpy((newnode), (node), sizeof(nodetype)); \
+ } \
+ else \
+ (newnode) = (nodetype *) node;
+
#define ADJUST_CHILD_ATTRS(node) \
((node) = (void *) adjust_appendrel_attrs_multilevel(root, \
(Node *) (node), \
@@ -4125,7 +4193,7 @@ reparameterize_path_by_child(PlannerInfo *root, Path *path,
#define REPARAMETERIZE_CHILD_PATH(path) \
do { \
- (path) = reparameterize_path_by_child(root, (path), child_rel); \
+ (path) = reparameterize_path_by_child(root, (path), child_rel, needFlatCopy); \
if ((path) == NULL) \
return NULL; \
} while(0)
@@ -4135,7 +4203,7 @@ do { \
if ((pathlist) != NIL) \
{ \
(pathlist) = reparameterize_pathlist_by_child(root, (pathlist), \
- child_rel); \
+ child_rel, needFlatCopy); \
if ((pathlist) == NIL) \
return NULL; \
} \
@@ -4164,7 +4232,7 @@ do { \
switch (nodeTag(path))
{
case T_Path:
- new_path = path;
+ FLAT_COPY_PATH(new_path, path, Path);
ADJUST_CHILD_ATTRS(new_path->parent->baserestrictinfo);
if (path->pathtype == T_SampleScan)
{
@@ -4183,8 +4251,9 @@ do { \
case T_IndexPath:
{
- IndexPath *ipath = (IndexPath *) path;
+ IndexPath *ipath;
+ FLAT_COPY_PATH(ipath, path, IndexPath);
ADJUST_CHILD_ATTRS(ipath->indexinfo->indrestrictinfo);
ADJUST_CHILD_ATTRS(ipath->indexclauses);
new_path = (Path *) ipath;
@@ -4193,8 +4262,9 @@ do { \
case T_BitmapHeapPath:
{
- BitmapHeapPath *bhpath = (BitmapHeapPath *) path;
+ BitmapHeapPath *bhpath;
+ FLAT_COPY_PATH(bhpath, path, BitmapHeapPath);
ADJUST_CHILD_ATTRS(bhpath->path.parent->baserestrictinfo);
REPARAMETERIZE_CHILD_PATH(bhpath->bitmapqual);
new_path = (Path *) bhpath;
@@ -4203,8 +4273,9 @@ do { \
case T_BitmapAndPath:
{
- BitmapAndPath *bapath = (BitmapAndPath *) path;
+ BitmapAndPath *bapath;
+ FLAT_COPY_PATH(bapath, path, BitmapAndPath);
REPARAMETERIZE_CHILD_PATH_LIST(bapath->bitmapquals);
new_path = (Path *) bapath;
}
@@ -4212,8 +4283,9 @@ do { \
case T_BitmapOrPath:
{
- BitmapOrPath *bopath = (BitmapOrPath *) path;
+ BitmapOrPath *bopath;
+ FLAT_COPY_PATH(bopath, path, BitmapOrPath);
REPARAMETERIZE_CHILD_PATH_LIST(bopath->bitmapquals);
new_path = (Path *) bopath;
}
@@ -4221,9 +4293,10 @@ do { \
case T_ForeignPath:
{
- ForeignPath *fpath = (ForeignPath *) path;
+ ForeignPath *fpath;
ReparameterizeForeignPathByChild_function rfpc_func;
+ FLAT_COPY_PATH(fpath, path, ForeignPath);
ADJUST_CHILD_ATTRS(fpath->path.parent->baserestrictinfo);
if (fpath->fdw_outerpath)
REPARAMETERIZE_CHILD_PATH(fpath->fdw_outerpath);
@@ -4242,8 +4315,9 @@ do { \
case T_CustomPath:
{
- CustomPath *cpath = (CustomPath *) path;
+ CustomPath *cpath;
+ FLAT_COPY_PATH(cpath, path, CustomPath);
ADJUST_CHILD_ATTRS(cpath->path.parent->baserestrictinfo);
REPARAMETERIZE_CHILD_PATH_LIST(cpath->custom_paths);
if (cpath->custom_restrictinfo)
@@ -4260,9 +4334,14 @@ do { \
case T_NestPath:
{
- NestPath *npath = (NestPath *) path;
- JoinPath *jpath = (JoinPath *) npath;
+ NestPath *npath;
+ JoinPath *jpath;
+
+ if (!needFlatCopy)
+ needFlatCopy = is_asymmetric_join(path) ? true : false;
+ FLAT_COPY_PATH(npath, path, NestPath);
+ jpath = (JoinPath *) npath;
REPARAMETERIZE_CHILD_PATH(jpath->outerjoinpath);
REPARAMETERIZE_CHILD_PATH(jpath->innerjoinpath);
ADJUST_CHILD_ATTRS(jpath->joinrestrictinfo);
@@ -4272,9 +4351,15 @@ do { \
case T_MergePath:
{
- MergePath *mpath = (MergePath *) path;
- JoinPath *jpath = (JoinPath *) mpath;
+ MergePath *mpath;
+ JoinPath *jpath;
+ if (!needFlatCopy)
+ needFlatCopy = is_asymmetric_join(path) ? true : false;
+
+ FLAT_COPY_PATH(mpath, path, MergePath);
+
+ jpath = (JoinPath *) mpath;
REPARAMETERIZE_CHILD_PATH(jpath->outerjoinpath);
REPARAMETERIZE_CHILD_PATH(jpath->innerjoinpath);
ADJUST_CHILD_ATTRS(jpath->joinrestrictinfo);
@@ -4285,9 +4370,15 @@ do { \
case T_HashPath:
{
- HashPath *hpath = (HashPath *) path;
- JoinPath *jpath = (JoinPath *) hpath;
+ HashPath *hpath;
+ JoinPath *jpath;
+
+ if (!needFlatCopy)
+ needFlatCopy = is_asymmetric_join(path) ? true : false;
+ FLAT_COPY_PATH(hpath, path, HashPath);
+
+ jpath = (JoinPath *) hpath;
REPARAMETERIZE_CHILD_PATH(jpath->outerjoinpath);
REPARAMETERIZE_CHILD_PATH(jpath->innerjoinpath);
ADJUST_CHILD_ATTRS(jpath->joinrestrictinfo);
@@ -4298,8 +4389,9 @@ do { \
case T_AppendPath:
{
- AppendPath *apath = (AppendPath *) path;
+ AppendPath *apath;
+ FLAT_COPY_PATH(apath, path, AppendPath);
REPARAMETERIZE_CHILD_PATH_LIST(apath->subpaths);
new_path = (Path *) apath;
}
@@ -4307,8 +4399,9 @@ do { \
case T_MaterialPath:
{
- MaterialPath *mpath = (MaterialPath *) path;
+ MaterialPath *mpath;
+ FLAT_COPY_PATH(mpath, path, MaterialPath);
REPARAMETERIZE_CHILD_PATH(mpath->subpath);
new_path = (Path *) mpath;
}
@@ -4316,8 +4409,9 @@ do { \
case T_MemoizePath:
{
- MemoizePath *mpath = (MemoizePath *) path;
+ MemoizePath *mpath;
+ FLAT_COPY_PATH(mpath, path, MemoizePath);
REPARAMETERIZE_CHILD_PATH(mpath->subpath);
ADJUST_CHILD_ATTRS(mpath->param_exprs);
new_path = (Path *) mpath;
@@ -4326,8 +4420,9 @@ do { \
case T_GatherPath:
{
- GatherPath *gpath = (GatherPath *) path;
+ GatherPath *gpath;
+ FLAT_COPY_PATH(gpath, path, GatherPath);
REPARAMETERIZE_CHILD_PATH(gpath->subpath);
new_path = (Path *) gpath;
}
@@ -4374,7 +4469,14 @@ do { \
MemoryContextSwitchTo(oldcontext);
}
- bms_free(required_outer);
+
+ /*
+ * If adjust_child_relids_multilevel don't do replacements it returns
+ * the original set, not a copy. It is possible in the case of asymmetric
+ * JOIN and child_rel->relids contains relids only of plane relations.
+ */
+ if (required_outer != old_ppi->ppi_req_outer)
+ bms_free(required_outer);
new_path->param_info = new_ppi;
@@ -4538,7 +4640,8 @@ do { \
static List *
reparameterize_pathlist_by_child(PlannerInfo *root,
List *pathlist,
- RelOptInfo *child_rel)
+ RelOptInfo *child_rel,
+ bool needFlatCopy)
{
ListCell *lc;
List *result = NIL;
@@ -4546,7 +4649,7 @@ reparameterize_pathlist_by_child(PlannerInfo *root,
foreach(lc, pathlist)
{
Path *path = reparameterize_path_by_child(root, lfirst(lc),
- child_rel);
+ child_rel, needFlatCopy);
if (path == NULL)
{
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e05b21c884..a3e774bcf6 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -276,6 +276,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
rel->joininfo = NIL;
rel->has_eclass_joins = false;
rel->consider_partitionwise_join = false; /* might get changed later */
+ rel->consider_asymmetric_join = false;
rel->part_scheme = NULL;
rel->nparts = -1;
rel->boundinfo = NULL;
@@ -693,6 +694,27 @@ build_join_rel(PlannerInfo *root,
outer_rel,
inner_rel,
sjinfo);
+
+#ifdef USE_ASSERT_CHECKING
+ if (enable_asymmetric_join && joinrel->part_scheme == NULL)
+ {
+ /*
+ * Potentially, Asymmetric JOIN can be rejected because of a reason
+ * in case of, for example JOIN(JOIN(P1,R1),P2).
+ * But for a case of JOIN(JOIN(P1,P2),R1) it may be possible.
+ * Right now we can't find any ways for that case, but check it.
+ */
+ restrictlist = restrictlist_ptr ? *restrictlist_ptr :
+ build_joinrel_restrictlist(root,
+ joinrel,
+ outer_rel,
+ inner_rel,
+ sjinfo);
+ build_joinrel_partition_info(root, joinrel, outer_rel,
+ inner_rel, sjinfo, restrictlist);
+ Assert(joinrel->part_scheme == NULL);
+ }
+#endif
return joinrel;
}
@@ -755,6 +777,7 @@ build_join_rel(PlannerInfo *root,
joinrel->joininfo = NIL;
joinrel->has_eclass_joins = false;
joinrel->consider_partitionwise_join = false; /* might get changed later */
+ joinrel->consider_asymmetric_join = false;
joinrel->parent = NULL;
joinrel->top_parent = NULL;
joinrel->top_parent_relids = NULL;
@@ -886,11 +909,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
AppendRelInfo **appinfos;
int nappinfos;
- /* Only joins between "other" relations land here. */
- Assert(IS_OTHER_REL(outer_rel) && IS_OTHER_REL(inner_rel));
-
- /* The parent joinrel should have consider_partitionwise_join set. */
- Assert(parent_joinrel->consider_partitionwise_join);
+ /* Either of relations must be "other" relation at least. */
+ Assert(IS_OTHER_REL(outer_rel) || IS_OTHER_REL(inner_rel));
/*
* Find the AppendRelInfo structures for the child baserels. We'll need
@@ -949,6 +969,7 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->joininfo = NIL;
joinrel->has_eclass_joins = false;
joinrel->consider_partitionwise_join = false; /* might get changed later */
+ joinrel->consider_asymmetric_join = false;
joinrel->parent = parent_joinrel;
joinrel->top_parent = parent_joinrel->top_parent ? parent_joinrel->top_parent : parent_joinrel;
joinrel->top_parent_relids = joinrel->top_parent->relids;
@@ -2007,6 +2028,128 @@ get_param_path_clause_serials(Path *path)
}
}
+/*
+ * Check inner rel properties, required to participate in asymmetric join
+ * XXX: Looks like it needs a lot of refactoring
+ */
+bool
+is_inner_rel_safe_for_asymmetric_join(PlannerInfo *root, RelOptInfo *inner_rel)
+{
+ int i = -1;
+
+ /*
+ * Don't allow AJ with partitioned inner.
+ * Also, don't allow lateral references to avoid re-parameterization issues
+ * with relations like TABLESAMPLE, where repartitioning code changes RTE.
+ */
+ if (inner_rel->part_scheme != NULL ||
+ !bms_is_empty(inner_rel->lateral_relids))
+ return false;
+
+ while ((i = bms_next_member(inner_rel->relids, i)) > 0)
+ {
+ RangeTblEntry *rte;
+
+ /*
+ * No one relation in the underlying inner join tree is a partitioned
+ * one: we can't allow the situation when the part_rels would
+ * not correspond partitioning schema.
+ */
+ if (root->simple_rel_array[i] &&
+ root->simple_rel_array[i]->part_scheme != NULL)
+ return false;
+
+ rte = root->simple_rte_array[i];
+ Assert(rte);
+
+ switch (rte->rtekind)
+ {
+ case RTE_RELATION:
+ if (rte->tablesample != NULL)
+ return false;
+ /* Allow asymmetric join */
+ case RTE_JOIN:
+ break;
+ case RTE_FUNCTION:
+ {
+ ListCell *lc;
+
+ foreach(lc, rte->functions)
+ {
+ RangeTblFunction *rtfunc = (RangeTblFunction *) lfirst(lc);
+
+ /* Allow only immutable functions on constants */
+ if (contain_var_clause(rtfunc->funcexpr) ||
+ contain_mutable_functions(rtfunc->funcexpr) ||
+ contain_subplans(rtfunc->funcexpr))
+ return false;
+ }
+ break;
+ }
+
+ /*
+ * Prohibit following RTEs in the inner of AJ.
+ * Right now it means we don't invest enough time into analysis of
+ * consecuences and corner cases of AJ, applied to inner, containing
+ * such RTE. This list can be revised in the future.
+ */
+ case RTE_VALUES:
+ case RTE_TABLEFUNC:
+ case RTE_SUBQUERY:
+ case RTE_CTE:
+ case RTE_NAMEDTUPLESTORE:
+ case RTE_RESULT:
+ return false;
+
+ default:
+ elog(ERROR, "unsupported RTE type %u", rte->rtekind);
+ break;
+ }
+ }
+ return true;
+}
+
+static void
+build_joinrel_partition_info_asymm(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *prel,
+ RelOptInfo *inner_rel,
+ SpecialJoinInfo *sjinfo,
+ List *restrictlist)
+{
+ /* Check feasibility */
+
+ if (!IS_PARTITIONED_REL(prel))
+ return;
+
+ /* Enable asymmetric join only if target list doesn't contain WholeRowVar */
+ if ((prel->attr_needed &&
+ !bms_is_empty(prel->attr_needed[InvalidAttrNumber - prel->min_attr])) ||
+ (inner_rel->attr_needed &&
+ !bms_is_empty(inner_rel->attr_needed[InvalidAttrNumber - inner_rel->min_attr])))
+ return;
+
+ /* Check that inner rel is suitable */
+ if (!is_inner_rel_safe_for_asymmetric_join(root, inner_rel))
+ return;
+
+ /*
+ * Dev NOTE:
+ * boundinfo, nparts and part_rels will be initialized later -
+ * See code of the compute_partition_bounds as an explanation.
+ */
+ joinrel->part_scheme = prel->part_scheme;
+
+ set_joinrel_partition_key_exprs(joinrel, prel, inner_rel, sjinfo->jointype);
+ joinrel->consider_asymmetric_join = true;
+
+ /*
+ * It is impossible to be successful in both partitionwise strategies.
+ * At least for now
+ */
+ Assert(!joinrel->consider_partitionwise_join);
+}
+
/*
* build_joinrel_partition_info
* Checks if the two relations being joined can use partitionwise join
@@ -2021,11 +2164,14 @@ build_joinrel_partition_info(PlannerInfo *root,
{
PartitionScheme part_scheme;
+ if (joinrel->consider_partitionwise_join || joinrel->consider_asymmetric_join)
+ return;
+
/* Nothing to do if partitionwise join technique is disabled. */
if (!enable_partitionwise_join)
{
Assert(!IS_PARTITIONED_REL(joinrel));
- return;
+ goto asymmetric;
}
/*
@@ -2040,20 +2186,20 @@ build_joinrel_partition_info(PlannerInfo *root,
* the joins. Please see optimizer/README for details.
*/
if (outer_rel->part_scheme == NULL || inner_rel->part_scheme == NULL ||
- !outer_rel->consider_partitionwise_join ||
- !inner_rel->consider_partitionwise_join ||
+ (!outer_rel->consider_partitionwise_join && !outer_rel->consider_asymmetric_join) ||
+ (!inner_rel->consider_partitionwise_join && !inner_rel->consider_asymmetric_join) ||
outer_rel->part_scheme != inner_rel->part_scheme ||
!have_partkey_equi_join(root, joinrel, outer_rel, inner_rel,
sjinfo->jointype, restrictlist))
{
Assert(!IS_PARTITIONED_REL(joinrel));
- return;
+ goto asymmetric;
}
part_scheme = outer_rel->part_scheme;
/*
- * This function will be called only once for each joinrel, hence it
+ * This place can be achieved only once for each joinrel, hence it
* should not have partitioning fields filled yet.
*/
Assert(!joinrel->part_scheme && !joinrel->partexprs &&
@@ -2074,9 +2220,25 @@ build_joinrel_partition_info(PlannerInfo *root,
/*
* Set the consider_partitionwise_join flag.
*/
- Assert(outer_rel->consider_partitionwise_join);
- Assert(inner_rel->consider_partitionwise_join);
+ Assert(outer_rel->consider_partitionwise_join ^ outer_rel->consider_asymmetric_join);
+ Assert(inner_rel->consider_partitionwise_join ^ inner_rel->consider_asymmetric_join);
joinrel->consider_partitionwise_join = true;
+
+ /* Don't consider asymmetric join after successful partitioned decision. */
+ return;
+
+asymmetric:
+ if (!enable_asymmetric_join || sjinfo->jointype == JOIN_FULL)
+ return;
+
+ /*
+ * Try to setup asymmetric join scheme. Because we are here only once for
+ * the combination of inner and outer, we must check both variants at once.
+ */
+ build_joinrel_partition_info_asymm(root, joinrel, outer_rel,
+ inner_rel, sjinfo, restrictlist);
+ build_joinrel_partition_info_asymm(root, joinrel, inner_rel,
+ outer_rel, sjinfo, restrictlist);
}
/*
@@ -2300,8 +2462,10 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel,
/* mark these const to enforce that we copy them properly */
const List *outer_expr = outer_rel->partexprs[cnt];
const List *outer_null_expr = outer_rel->nullable_partexprs[cnt];
- const List *inner_expr = inner_rel->partexprs[cnt];
- const List *inner_null_expr = inner_rel->nullable_partexprs[cnt];
+ const List *inner_expr = (inner_rel->partexprs != NULL) ?
+ inner_rel->partexprs[cnt] : NIL;
+ const List *inner_null_expr = (inner_rel->nullable_partexprs != NULL) ?
+ inner_rel->nullable_partexprs[cnt] : NIL;
List *partexpr = NIL;
List *nullable_partexpr = NIL;
ListCell *lc;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 3fd0b14dd8..e9407edcd0 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -919,6 +919,16 @@ struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+ {
+ {"enable_asymmetric_join", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables asymmetric partitionwise join."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_asymmetric_join,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_partitionwise_aggregate", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables partitionwise aggregation and grouping."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2166ea4a87..74f9a7c4b8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -407,6 +407,7 @@
#enable_parallel_hash = on
#enable_partition_pruning = on
#enable_partitionwise_join = off
+#enable_asymmetric_join = on
#enable_partitionwise_aggregate = off
#enable_presorted_aggregate = on
#enable_seqscan = on
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 91a6ce90d8..ab0fe2fb3f 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -988,6 +988,8 @@ typedef struct RelOptInfo
/* consider partitionwise join paths? (if partitioned rel) */
bool consider_partitionwise_join;
+ bool consider_asymmetric_join;
+
/*
* inheritance links, if this is an otherrel (otherwise NULL):
*/
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b1c51a4e70..8e4f864e12 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -64,6 +64,7 @@ extern PGDLLIMPORT bool enable_mergejoin;
extern PGDLLIMPORT bool enable_hashjoin;
extern PGDLLIMPORT bool enable_gathermerge;
extern PGDLLIMPORT bool enable_partitionwise_join;
+extern PGDLLIMPORT bool enable_asymmetric_join;
extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index c5c4756b0f..fd9237a87a 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -297,14 +297,18 @@ extern void adjust_limit_rows_costs(double *rows,
extern Path *reparameterize_path(PlannerInfo *root, Path *path,
Relids required_outer,
double loop_count);
+extern bool is_asymmetric_join(Path *path);
extern Path *reparameterize_path_by_child(PlannerInfo *root, Path *path,
- RelOptInfo *child_rel);
+ RelOptInfo *child_rel,
+ bool needFlatCopy);
extern bool path_is_reparameterizable_by_child(Path *path,
RelOptInfo *child_rel);
/*
* prototypes for relnode.c
*/
+extern bool is_inner_rel_safe_for_asymmetric_join(PlannerInfo *root,
+ RelOptInfo *rel);
extern void setup_simple_rel_arrays(PlannerInfo *root);
extern void expand_planner_arrays(PlannerInfo *root, int add_size);
extern RelOptInfo *build_simple_rel(PlannerInfo *root, int relid,
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 6d07f86b9b..7b364d4bea 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -2535,6 +2535,792 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
375 | 0375 | 375 | 0375
(8 rows)
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+INSERT INTO prt5 (SELECT x, (x % 1000)::int,
+ ((x+1) % 1000)::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+SET max_parallel_workers_per_gather = 0;
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- AJ: the case of WholeRowVar
+EXPLAIN (COSTS OFF)
+SELECT (t1.*)::prt1 FROM prt1 t1, prt2_p1 t2 WHERE (t1.a = t2.b);
+ QUERY PLAN
+--------------------------------------
+ Hash Join
+ Hash Cond: (t1.a = t2.b)
+ -> Append
+ -> Seq Scan on prt1_p1 t1_1
+ -> Seq Scan on prt1_p2 t1_2
+ -> Seq Scan on prt1_p3 t1_3
+ -> Hash
+ -> Seq Scan on prt2_p1 t2
+(8 rows)
+
+-- AJ with partitioned table (one partition has pruned)
+EXPLAIN (COSTS OFF)
+SELECT (t1.*)::prt1
+FROM prt1 t1, prt2_p1 t2
+WHERE (t1.a = t2.b) AND (t1.a = 542 OR t1.a = 1);
+ QUERY PLAN
+----------------------------------------------------
+ Hash Join
+ Hash Cond: (t2.b = t1.a)
+ -> Seq Scan on prt2_p1 t2
+ -> Hash
+ -> Append
+ -> Seq Scan on prt1_p1 t1_1
+ Filter: ((a = 542) OR (a = 1))
+ -> Seq Scan on prt1_p3 t1_2
+ Filter: ((a = 542) OR (a = 1))
+(9 rows)
+
+-- Test AJ on anti join
+-- First test mostly for the future SJE with support of partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt1 t1, prt2_p1 t2
+WHERE NOT EXISTS (SELECT * FROM prt1 WHERE prt1.a = t1.a);
+ QUERY PLAN
+----------------------------------------------------------
+ Aggregate
+ -> Append
+ -> Nested Loop
+ -> Hash Anti Join
+ Hash Cond: (t1_1.a = prt1_1.a)
+ -> Seq Scan on prt1_p1 t1_1
+ -> Hash
+ -> Seq Scan on prt1_p1 prt1_1
+ -> Seq Scan on prt2_p1 t2
+ -> Nested Loop
+ -> Hash Anti Join
+ Hash Cond: (t1_2.a = prt1_2.a)
+ -> Seq Scan on prt1_p2 t1_2
+ -> Hash
+ -> Seq Scan on prt1_p2 prt1_2
+ -> Seq Scan on prt2_p1 t2
+ -> Nested Loop
+ -> Hash Anti Join
+ Hash Cond: (t1_3.a = prt1_3.a)
+ -> Seq Scan on prt1_p3 t1_3
+ -> Hash
+ -> Seq Scan on prt1_p3 prt1_3
+ -> Seq Scan on prt2_p1 t2
+(23 rows)
+
+SELECT count(*) FROM prt1 t1, prt2_p1 t2
+WHERE NOT EXISTS (SELECT * FROM prt1 WHERE prt1.a = t1.a);
+ count
+-------
+ 0
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt2_p1 t1
+WHERE NOT EXISTS (SELECT * FROM prt1 WHERE prt1.a = t1.b);
+ QUERY PLAN
+----------------------------------------------
+ Aggregate
+ -> Hash Right Anti Join
+ Hash Cond: (prt1.a = t1.b)
+ -> Append
+ -> Seq Scan on prt1_p1 prt1_1
+ -> Seq Scan on prt1_p2 prt1_2
+ -> Seq Scan on prt1_p3 prt1_3
+ -> Hash
+ -> Seq Scan on prt2_p1 t1
+(9 rows)
+
+SELECT count(*) FROM prt2_p1 t1
+WHERE NOT EXISTS (SELECT * FROM prt1 WHERE prt1.a = t1.b);
+ count
+-------
+ 42
+(1 row)
+
+-- Can't use AJ because of complex inner as well as PWJ (one partition pruned)
+EXPLAIN (COSTS OFF)
+SELECT (t1.*)::prt1
+FROM prt1 t1, prt2_p1 t2, prt1 t3
+WHERE (t1.a = t2.b) AND (t1.a = 542 OR t1.a = 1) AND t2.a = t3.a;
+ QUERY PLAN
+----------------------------------------------------------------
+ Merge Join
+ Merge Cond: (t3.a = t2.a)
+ -> Merge Append
+ Sort Key: t3.a
+ -> Index Only Scan using iprt1_p1_a on prt1_p1 t3_1
+ -> Index Only Scan using iprt1_p2_a on prt1_p2 t3_2
+ -> Index Only Scan using iprt1_p3_a on prt1_p3 t3_3
+ -> Sort
+ Sort Key: t2.a
+ -> Hash Join
+ Hash Cond: (t2.b = t1.a)
+ -> Seq Scan on prt2_p1 t2
+ -> Hash
+ -> Append
+ -> Seq Scan on prt1_p1 t1_1
+ Filter: ((a = 542) OR (a = 1))
+ -> Seq Scan on prt1_p3 t1_2
+ Filter: ((a = 542) OR (a = 1))
+(18 rows)
+
+-- TODO: We don't see any AJ or PWJ, but may be it's a game of costs?
+EXPLAIN (COSTS OFF)
+SELECT (t1.*)::prt1
+FROM prt1 t1, prt2_p1 t2, prt1 t3
+WHERE (t1.a = t2.b) AND (t1.a = 542 OR t1.a = 1)
+ AND (t2.a = t3.a) AND (t3.a = 543 OR t3.a = 2);
+ QUERY PLAN
+--------------------------------------------------------------
+ Nested Loop
+ -> Hash Join
+ Hash Cond: (t2.b = t1.a)
+ -> Seq Scan on prt2_p1 t2
+ -> Hash
+ -> Append
+ -> Seq Scan on prt1_p1 t1_1
+ Filter: ((a = 542) OR (a = 1))
+ -> Seq Scan on prt1_p3 t1_2
+ Filter: ((a = 542) OR (a = 1))
+ -> Append
+ -> Index Only Scan using iprt1_p1_a on prt1_p1 t3_1
+ Index Cond: (a = t2.a)
+ Filter: ((a = 543) OR (a = 2))
+ -> Index Only Scan using iprt1_p3_a on prt1_p3 t3_2
+ Index Cond: (a = t2.a)
+ Filter: ((a = 543) OR (a = 2))
+(17 rows)
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------
+ Hash Join
+ Hash Cond: (prt5_p0.a = t5_1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0
+ -> Seq Scan on prt5_p1
+ -> Seq Scan on prt5_p2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(9 rows)
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (prt5.a = prt6.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Seq Scan on prt6_p1 prt6_2
+ Filter: (alabel ~~ '%abc%'::text)
+(13 rows)
+
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+ count
+-------
+ 4000
+(1 row)
+
+-- Check that asymmetric JOIN with Subquery is forbidden
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+----------------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (prt5.a = sq1.aid)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Subquery Scan on sq1
+ Filter: (sq1.alabel ~~ '%abc%'::text)
+ -> Limit
+ -> Append
+ -> Seq Scan on prt6_p0 prt6_1
+ -> Seq Scan on prt6_p1 prt6_2
+(14 rows)
+
+SELECT count(*) FROM prt5 JOIN (SELECT * FROM prt6 LIMIT 1000) AS sq1
+ ON a = aid AND alabel like '%abc%';
+ count
+-------
+ 2000
+(1 row)
+
+-- Asymmetric JOIN of two plane tables and one partitioned
+EXPLAIN (COSTS OFF)
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ QUERY PLAN
+------------------------------------------------------------------
+ Aggregate
+ -> Append
+ -> Hash Join
+ Hash Cond: (prt5_1.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_2.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+ -> Hash Join
+ Hash Cond: (prt5_3.b = t5_2.bid)
+ -> Hash Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%ab%'::text)
+ -> Hash
+ -> Seq Scan on t5_2
+ Filter: (blabel ~~ '%cd%'::text)
+(35 rows)
+
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+ count
+-------
+ 11000
+(1 row)
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-----------------------------------------------
+ Hash Right Join
+ Hash Cond: (prt5.a = t5_1.aid)
+ Join Filter: (t5_1.alabel ~~ '%abc%'::text)
+ -> Append
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+(9 rows)
+
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+ QUERY PLAN
+-------------------------------------------------------
+ Append
+ -> Hash Left Join
+ Hash Cond: (prt5_1.a = t5_1.aid)
+ -> Seq Scan on prt5_p0 prt5_1
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Left Join
+ Hash Cond: (prt5_2.a = t5_1.aid)
+ -> Seq Scan on prt5_p1 prt5_2
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+ -> Hash Left Join
+ Hash Cond: (prt5_3.a = t5_1.aid)
+ -> Seq Scan on prt5_p2 prt5_3
+ -> Hash
+ -> Seq Scan on t5_1
+ Filter: (alabel ~~ '%abc%'::text)
+(19 rows)
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SET enable_partitionwise_join = off;
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+ hkey | a | b | aid | alabel
+------+---+---+-----+--------
+(0 rows)
+
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+ hkey | a | b | aid | alabel | bid | blabel
+------+---+---+-----+--------+-----+--------
+(0 rows)
+
+SET enable_partitionwise_join = on;
+-- Trying different JOIN combinations optimiser can provide partitioned relation
+-- buried deeply in the inner subtree. Check that it doesn't provide
+-- inconsistency in the plan.
+EXPLAIN (COSTS OFF)
+SELECT * from prt1 d1, prt2 d2, unnest(array[3,4]) n, unnest(array[3,4]) g
+WHERE d1.a = n AND d2.b = d1.a and g = n;
+ QUERY PLAN
+--------------------------------------------------------
+ Append
+ -> Nested Loop
+ Join Filter: (d1_1.a = g.g)
+ -> Hash Join
+ Hash Cond: (d1_1.a = n.n)
+ -> Hash Join
+ Hash Cond: (d1_1.a = d2_1.b)
+ -> Seq Scan on prt1_p1 d1_1
+ -> Hash
+ -> Seq Scan on prt2_p1 d2_1
+ -> Hash
+ -> Function Scan on unnest n
+ -> Function Scan on unnest g
+ -> Nested Loop
+ Join Filter: (d1_2.a = g.g)
+ -> Hash Join
+ Hash Cond: (d1_2.a = n.n)
+ -> Hash Join
+ Hash Cond: (d1_2.a = d2_2.b)
+ -> Seq Scan on prt1_p2 d1_2
+ -> Hash
+ -> Seq Scan on prt2_p2 d2_2
+ -> Hash
+ -> Function Scan on unnest n
+ -> Function Scan on unnest g
+ -> Nested Loop
+ Join Filter: (d1_3.a = g.g)
+ -> Hash Join
+ Hash Cond: (d1_3.a = n.n)
+ -> Hash Join
+ Hash Cond: (d1_3.a = d2_3.b)
+ -> Seq Scan on prt1_p3 d1_3
+ -> Hash
+ -> Seq Scan on prt2_p3 d2_3
+ -> Hash
+ -> Function Scan on unnest n
+ -> Function Scan on unnest g
+(37 rows)
+
+-- TODO:
+-- According to current logic decision on AJ or PWJ for specific joinrel
+-- can depend on the order of relations in the FROM list. See how it works.
+-- Can we resolve this issue somehow?
+EXPLAIN (COSTS OFF) -- PWJ on top
+SELECT * from prt1 d1, unnest(array[3,4]) n, prt2 d2
+WHERE d1.a = n AND d2.b = d1.a;
+ QUERY PLAN
+---------------------------------------------------------
+ Append
+ -> Nested Loop
+ Join Filter: (d1_1.a = n.n)
+ -> Hash Join
+ Hash Cond: (d2_1.b = n.n)
+ -> Seq Scan on prt2_p1 d2_1
+ -> Hash
+ -> Function Scan on unnest n
+ -> Index Scan using iprt1_p1_a on prt1_p1 d1_1
+ Index Cond: (a = d2_1.b)
+ -> Nested Loop
+ Join Filter: (d1_2.a = n.n)
+ -> Hash Join
+ Hash Cond: (d2_2.b = n.n)
+ -> Seq Scan on prt2_p2 d2_2
+ -> Hash
+ -> Function Scan on unnest n
+ -> Index Scan using iprt1_p2_a on prt1_p2 d1_2
+ Index Cond: (a = d2_2.b)
+ -> Nested Loop
+ Join Filter: (d1_3.a = n.n)
+ -> Hash Join
+ Hash Cond: (d2_3.b = n.n)
+ -> Seq Scan on prt2_p3 d2_3
+ -> Hash
+ -> Function Scan on unnest n
+ -> Index Scan using iprt1_p3_a on prt1_p3 d1_3
+ Index Cond: (a = d2_3.b)
+(28 rows)
+
+EXPLAIN (COSTS OFF) -- AJ on top
+SELECT * from prt1 d1, prt2 d2, unnest(array[3,4]) n
+WHERE d1.a = n AND d2.b = d1.a;
+ QUERY PLAN
+--------------------------------------------------
+ Append
+ -> Hash Join
+ Hash Cond: (d1_1.a = n.n)
+ -> Hash Join
+ Hash Cond: (d1_1.a = d2_1.b)
+ -> Seq Scan on prt1_p1 d1_1
+ -> Hash
+ -> Seq Scan on prt2_p1 d2_1
+ -> Hash
+ -> Function Scan on unnest n
+ -> Hash Join
+ Hash Cond: (d1_2.a = n.n)
+ -> Hash Join
+ Hash Cond: (d1_2.a = d2_2.b)
+ -> Seq Scan on prt1_p2 d1_2
+ -> Hash
+ -> Seq Scan on prt2_p2 d2_2
+ -> Hash
+ -> Function Scan on unnest n
+ -> Hash Join
+ Hash Cond: (d1_3.a = n.n)
+ -> Hash Join
+ Hash Cond: (d1_3.a = d2_3.b)
+ -> Seq Scan on prt1_p3 d1_3
+ -> Hash
+ -> Seq Scan on prt2_p3 d2_3
+ -> Hash
+ -> Function Scan on unnest n
+(28 rows)
+
+-- Allow JOINs in the inner side of AJ
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt1 d1 LEFT JOIN (unnest(array[3,4]) n
+ FULL OUTER JOIN unnest(array[3,4]) g ON (n=g))
+ON (d1.a = n);
+ QUERY PLAN
+---------------------------------------------------------
+ Append
+ -> Hash Left Join
+ Hash Cond: (d1_1.a = n.n)
+ -> Seq Scan on prt1_p1 d1_1
+ -> Hash
+ -> Hash Left Join
+ Hash Cond: (n.n = g.g)
+ -> Function Scan on unnest n
+ -> Hash
+ -> Function Scan on unnest g
+ -> Hash Left Join
+ Hash Cond: (d1_2.a = n.n)
+ -> Seq Scan on prt1_p2 d1_2
+ -> Hash
+ -> Hash Left Join
+ Hash Cond: (n.n = g.g)
+ -> Function Scan on unnest n
+ -> Hash
+ -> Function Scan on unnest g
+ -> Hash Left Join
+ Hash Cond: (d1_3.a = n.n)
+ -> Seq Scan on prt1_p3 d1_3
+ -> Hash
+ -> Hash Left Join
+ Hash Cond: (n.n = g.g)
+ -> Function Scan on unnest n
+ -> Hash
+ -> Function Scan on unnest g
+(28 rows)
+
+-- Check reparameterization code when an optimizer have to make two level relids
+-- adjustment.
+SET enable_hashjoin = 'off';
+SET enable_mergejoin = 'off';
+SET enable_material = 'off';
+CREATE TABLE big AS SELECT x AS x FROM generate_series(1,1E4) x;
+CREATE INDEX ON big(x);
+CREATE TABLE small AS SELECT x, -x AS y FROM generate_series(1,100) x;
+CREATE TABLE part_l0 (x int, y int, z int) PARTITION BY HASH (y);
+CREATE TABLE part0_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 0) PARTITION BY HASH (z);
+CREATE TABLE part0_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE part1_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE part1_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO part_l0 (x, y, z) (SELECT x,x,x FROM generate_series(1,1E4) x);
+ANALYZE big,small,part_l0;
+-- Parameter have to be reparameterized by a plane relation.
+EXPLAIN (COSTS OFF)
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------
+ Append
+ -> Nested Loop Left Join
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_1.y) AND ((small.y + part_l0_1.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part0_l2 part_l0_1
+ -> Index Only Scan using big_x_idx on big
+ Index Cond: (x = (small.x)::numeric)
+ -> Nested Loop Left Join
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_2.y) AND ((small.y + part_l0_2.y) < 10))
+ -> Seq Scan on part1_l2 part_l0_2
+ -> Seq Scan on small
+ -> Index Only Scan using big_x_idx on big
+ Index Cond: (x = (small.x)::numeric)
+ -> Nested Loop Left Join
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_3.y) AND ((small.y + part_l0_3.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part1_l1 part_l0_3
+ -> Index Only Scan using big_x_idx on big
+ Index Cond: (x = (small.x)::numeric)
+(22 rows)
+
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+ x | y
+---+---
+(0 rows)
+
+-- Parameters have to be reparameterized by plane and partitioned relations.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------
+ Aggregate
+ -> Append
+ -> Nested Loop Left Join
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_1.y) AND ((small.y + part_l0_1.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part0_l2 part_l0_1
+ -> Bitmap Heap Scan on big
+ Recheck Cond: ((x = (small.x)::numeric) OR (x = (part_l0_1.x)::numeric))
+ -> BitmapOr
+ -> Bitmap Index Scan on big_x_idx
+ Index Cond: (x = (small.x)::numeric)
+ -> Bitmap Index Scan on big_x_idx
+ Index Cond: (x = (part_l0_1.x)::numeric)
+ -> Nested Loop Left Join
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_2.y) AND ((small.y + part_l0_2.y) < 10))
+ -> Seq Scan on part1_l2 part_l0_2
+ -> Seq Scan on small
+ -> Bitmap Heap Scan on big
+ Recheck Cond: ((x = (small.x)::numeric) OR (x = (part_l0_2.x)::numeric))
+ -> BitmapOr
+ -> Bitmap Index Scan on big_x_idx
+ Index Cond: (x = (small.x)::numeric)
+ -> Bitmap Index Scan on big_x_idx
+ Index Cond: (x = (part_l0_2.x)::numeric)
+ -> Nested Loop Left Join
+ -> Nested Loop
+ Join Filter: ((small.y = part_l0_3.y) AND ((small.y + part_l0_3.y) < 10))
+ -> Seq Scan on small
+ -> Seq Scan on part1_l1 part_l0_3
+ -> Bitmap Heap Scan on big
+ Recheck Cond: ((x = (small.x)::numeric) OR (x = (part_l0_3.x)::numeric))
+ -> BitmapOr
+ -> Bitmap Index Scan on big_x_idx
+ Index Cond: (x = (small.x)::numeric)
+ -> Bitmap Index Scan on big_x_idx
+ Index Cond: (x = (part_l0_3.x)::numeric)
+(38 rows)
+
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+ count
+-------
+ 0
+(1 row)
+
+DROP TABLE IF EXISTS big,small,part_l0 CASCADE;
+RESET enable_hashjoin;
+RESET enable_mergejoin;
+RESET enable_material;
+RESET max_parallel_workers_per_gather;
+--
+-- The case of AJ under partitionwise PNL which needs reparameterization
+--
+CREATE TABLE small (x int);
+CREATE TABLE big (x int, y int);
+CREATE TABLE part42 (x int, y int) PARTITION BY RANGE (x);
+CREATE TABLE p1 PARTITION OF part42 (x) FOR VALUES FROM (0) TO (50);
+CREATE TABLE p2 PARTITION OF part42 (x) FOR VALUES FROM (50) TO (101);
+INSERT INTO big (SELECT gs%10,gs FROM generate_series(1,10000) AS gs);
+INSERT INTO small (x) SELECT gs FROM generate_series (1,1) AS gs;
+INSERT INTO part42 SELECT gs%100, gs%100 FROM generate_series(1,100000) AS gs;
+CREATE INDEX big_idx ON big(y);
+VACUUM ANALYZE small,big,part42;
+SET enable_hashjoin = off;
+SET enable_mergejoin = off;
+SET max_parallel_workers_per_gather = 0; -- Reduce test unstability
+EXPLAIN (COSTS OFF)
+SELECT * FROM part42 ext LEFT JOIN big
+ JOIN part42 p2 ON (p2.x=big.y)
+ON (ext.x=p2.x AND ext.y=big.y);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Append
+ -> Nested Loop Left Join
+ Join Filter: ((ext_1.x = p2_1.x) AND (ext_1.y = big.y))
+ -> Seq Scan on p1 ext_1
+ -> Materialize
+ -> Nested Loop
+ -> Seq Scan on p1 p2_1
+ -> Memoize
+ Cache Key: p2_1.x
+ Cache Mode: logical
+ -> Index Scan using big_idx on big
+ Index Cond: (y = p2_1.x)
+ -> Nested Loop Left Join
+ Join Filter: ((ext_2.x = p2_2.x) AND (ext_2.y = big.y))
+ -> Seq Scan on p2 ext_2
+ -> Materialize
+ -> Nested Loop
+ -> Seq Scan on p2 p2_2
+ -> Memoize
+ Cache Key: p2_2.x
+ Cache Mode: logical
+ -> Index Scan using big_idx on big
+ Index Cond: (y = p2_2.x)
+(23 rows)
+
+RESET max_parallel_workers_per_gather;
+RESET enable_mergejoin;
+RESET enable_hashjoin;
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+ANALYZE prta,prtb,e;
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+ QUERY PLAN
+-----------------------------------------------------------
+ Append
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+(9 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+ QUERY PLAN
+-----------------------------------------------------------------
+ Append
+ -> Nested Loop
+ Join Filter: (prta_1.id = e_1.id)
+ -> Nested Loop
+ -> Seq Scan on prta1 prta_1
+ -> Index Scan using prtb1_id_idx on prtb1 prtb_1
+ Index Cond: (id = prta_1.id)
+ -> Index Scan using e1_id_idx on e1 e_1
+ Index Cond: (id = prtb_1.id)
+ -> Nested Loop
+ Join Filter: (prta_2.id = e_2.id)
+ -> Nested Loop
+ -> Seq Scan on prta2 prta_2
+ -> Index Scan using prtb2_id_idx on prtb2 prtb_2
+ Index Cond: (id = prta_2.id)
+ -> Index Scan using e2_id_idx on e2 e_2
+ Index Cond: (id = prtb_2.id)
+(17 rows)
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
@@ -3380,6 +4166,30 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
DROP TABLE prt1_adv;
DROP TABLE prt2_adv;
+DROP TABLE prt1;
+-- Test function scan with lateral relids
+CREATE TABLE prt1 (id int, data json) PARTITION BY HASH(id);
+CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES WITH (MODULUS 3, REMAINDER 0);
+CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES WITH (MODULUS 3, REMAINDER 1);
+CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES WITH (MODULUS 3, REMAINDER 2);
+INSERT INTO prt1 SELECT i, format('{ "name": "value%s", "Test": {"a":"b"}}', i)::json from generate_series(1,100) i;
+ANALYZE prt1;
+SELECT id, key, value from prt1, json_each(data->'Test') ORDER BY id LIMIT 10;
+ id | key | value
+----+-----+-------
+ 1 | a | "b"
+ 2 | a | "b"
+ 3 | a | "b"
+ 4 | a | "b"
+ 5 | a | "b"
+ 6 | a | "b"
+ 7 | a | "b"
+ 8 | a | "b"
+ 9 | a | "b"
+ 10 | a | "b"
+(10 rows)
+
+DROP TABLE prt1;
-- Tests for list-partitioned tables
CREATE TABLE plt1_adv (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt1_adv_p1 PARTITION OF plt1_adv FOR VALUES IN ('0001', '0003');
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 46b78ba3c4..4bfd92736b 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -2485,32 +2485,56 @@ select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on
--------------------------------------------------------------------------------------------------------
Finalize Aggregate (actual rows=N loops=N)
-> Gather (actual rows=N loops=N)
- Workers Planned: 1
+ Workers Planned: 2
Workers Launched: N
-> Partial Aggregate (actual rows=N loops=N)
- -> Nested Loop (actual rows=N loops=N)
- -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
- Filter: (a = ANY ('{0,0,1}'::integer[]))
- -> Append (actual rows=N loops=N)
+ -> Parallel Append (actual rows=N loops=N)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
-> Index Scan using ab_a1_b1_a_idx on ab_a1_b1 ab_1 (actual rows=N loops=N)
Index Cond: (a = a.a)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
-> Index Scan using ab_a1_b2_a_idx on ab_a1_b2 ab_2 (actual rows=N loops=N)
Index Cond: (a = a.a)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
-> Index Scan using ab_a1_b3_a_idx on ab_a1_b3 ab_3 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b1_a_idx on ab_a2_b1 ab_4 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a2_b1_a_idx on ab_a2_b1 ab_4 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b2_a_idx on ab_a2_b2 ab_5 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a2_b2_a_idx on ab_a2_b2 ab_5 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b3_a_idx on ab_a2_b3 ab_6 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a2_b3_a_idx on ab_a2_b3 ab_6 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a3_b1_a_idx on ab_a3_b1 ab_7 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a3_b1_a_idx on ab_a3_b1 ab_7 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a3_b2_a_idx on ab_a3_b2 ab_8 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a3_b2_a_idx on ab_a3_b2 ab_8 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a3_b3_a_idx on ab_a3_b3 ab_9 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a3_b3_a_idx on ab_a3_b3 ab_9 (actual rows=N loops=N)
Index Cond: (a = a.a)
-(27 rows)
+(51 rows)
-- Ensure the same partitions are pruned when we make the nested loop
-- parameter an Expr rather than a plain Param.
@@ -2519,32 +2543,56 @@ select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on
--------------------------------------------------------------------------------------------------------
Finalize Aggregate (actual rows=N loops=N)
-> Gather (actual rows=N loops=N)
- Workers Planned: 1
+ Workers Planned: 2
Workers Launched: N
-> Partial Aggregate (actual rows=N loops=N)
- -> Nested Loop (actual rows=N loops=N)
- -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
- Filter: (a = ANY ('{0,0,1}'::integer[]))
- -> Append (actual rows=N loops=N)
+ -> Parallel Append (actual rows=N loops=N)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
-> Index Scan using ab_a1_b1_a_idx on ab_a1_b1 ab_1 (actual rows=N loops=N)
Index Cond: (a = (a.a + 0))
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
-> Index Scan using ab_a1_b2_a_idx on ab_a1_b2 ab_2 (actual rows=N loops=N)
Index Cond: (a = (a.a + 0))
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
-> Index Scan using ab_a1_b3_a_idx on ab_a1_b3 ab_3 (actual rows=N loops=N)
Index Cond: (a = (a.a + 0))
- -> Index Scan using ab_a2_b1_a_idx on ab_a2_b1 ab_4 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a2_b1_a_idx on ab_a2_b1 ab_4 (actual rows=N loops=N)
Index Cond: (a = (a.a + 0))
- -> Index Scan using ab_a2_b2_a_idx on ab_a2_b2 ab_5 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a2_b2_a_idx on ab_a2_b2 ab_5 (actual rows=N loops=N)
Index Cond: (a = (a.a + 0))
- -> Index Scan using ab_a2_b3_a_idx on ab_a2_b3 ab_6 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a2_b3_a_idx on ab_a2_b3 ab_6 (actual rows=N loops=N)
Index Cond: (a = (a.a + 0))
- -> Index Scan using ab_a3_b1_a_idx on ab_a3_b1 ab_7 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a3_b1_a_idx on ab_a3_b1 ab_7 (actual rows=N loops=N)
Index Cond: (a = (a.a + 0))
- -> Index Scan using ab_a3_b2_a_idx on ab_a3_b2 ab_8 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a3_b2_a_idx on ab_a3_b2 ab_8 (actual rows=N loops=N)
Index Cond: (a = (a.a + 0))
- -> Index Scan using ab_a3_b3_a_idx on ab_a3_b3 ab_9 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{0,0,1}'::integer[]))
+ -> Index Scan using ab_a3_b3_a_idx on ab_a3_b3 ab_9 (actual rows=N loops=N)
Index Cond: (a = (a.a + 0))
-(27 rows)
+(51 rows)
insert into lprt_a values(3),(3);
select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(1, 0, 3)');
@@ -2552,99 +2600,187 @@ select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on
--------------------------------------------------------------------------------------------------------
Finalize Aggregate (actual rows=N loops=N)
-> Gather (actual rows=N loops=N)
- Workers Planned: 1
+ Workers Planned: 2
Workers Launched: N
-> Partial Aggregate (actual rows=N loops=N)
- -> Nested Loop (actual rows=N loops=N)
- -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
- Filter: (a = ANY ('{1,0,3}'::integer[]))
- -> Append (actual rows=N loops=N)
+ -> Parallel Append (actual rows=N loops=N)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,3}'::integer[]))
-> Index Scan using ab_a1_b1_a_idx on ab_a1_b1 ab_1 (actual rows=N loops=N)
Index Cond: (a = a.a)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,3}'::integer[]))
-> Index Scan using ab_a1_b2_a_idx on ab_a1_b2 ab_2 (actual rows=N loops=N)
Index Cond: (a = a.a)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,3}'::integer[]))
-> Index Scan using ab_a1_b3_a_idx on ab_a1_b3 ab_3 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b1_a_idx on ab_a2_b1 ab_4 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,3}'::integer[]))
+ -> Index Scan using ab_a2_b1_a_idx on ab_a2_b1 ab_4 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b2_a_idx on ab_a2_b2 ab_5 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,3}'::integer[]))
+ -> Index Scan using ab_a2_b2_a_idx on ab_a2_b2 ab_5 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b3_a_idx on ab_a2_b3 ab_6 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,3}'::integer[]))
+ -> Index Scan using ab_a2_b3_a_idx on ab_a2_b3 ab_6 (actual rows=N loops=N)
Index Cond: (a = a.a)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,3}'::integer[]))
-> Index Scan using ab_a3_b1_a_idx on ab_a3_b1 ab_7 (actual rows=N loops=N)
Index Cond: (a = a.a)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,3}'::integer[]))
-> Index Scan using ab_a3_b2_a_idx on ab_a3_b2 ab_8 (actual rows=N loops=N)
Index Cond: (a = a.a)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,3}'::integer[]))
-> Index Scan using ab_a3_b3_a_idx on ab_a3_b3 ab_9 (actual rows=N loops=N)
Index Cond: (a = a.a)
-(27 rows)
+(51 rows)
select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(1, 0, 0)');
explain_parallel_append
--------------------------------------------------------------------------------------------------------
Finalize Aggregate (actual rows=N loops=N)
-> Gather (actual rows=N loops=N)
- Workers Planned: 1
+ Workers Planned: 2
Workers Launched: N
-> Partial Aggregate (actual rows=N loops=N)
- -> Nested Loop (actual rows=N loops=N)
- -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
- Filter: (a = ANY ('{1,0,0}'::integer[]))
- Rows Removed by Filter: N
- -> Append (actual rows=N loops=N)
+ -> Parallel Append (actual rows=N loops=N)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
-> Index Scan using ab_a1_b1_a_idx on ab_a1_b1 ab_1 (actual rows=N loops=N)
Index Cond: (a = a.a)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
-> Index Scan using ab_a1_b2_a_idx on ab_a1_b2 ab_2 (actual rows=N loops=N)
Index Cond: (a = a.a)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
-> Index Scan using ab_a1_b3_a_idx on ab_a1_b3 ab_3 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b1_a_idx on ab_a2_b1 ab_4 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a2_b1_a_idx on ab_a2_b1 ab_4 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b2_a_idx on ab_a2_b2 ab_5 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a2_b2_a_idx on ab_a2_b2 ab_5 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b3_a_idx on ab_a2_b3 ab_6 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a2_b3_a_idx on ab_a2_b3 ab_6 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a3_b1_a_idx on ab_a3_b1 ab_7 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a3_b1_a_idx on ab_a3_b1 ab_7 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a3_b2_a_idx on ab_a3_b2 ab_8 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a3_b2_a_idx on ab_a3_b2 ab_8 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a3_b3_a_idx on ab_a3_b3 ab_9 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a3_b3_a_idx on ab_a3_b3 ab_9 (actual rows=N loops=N)
Index Cond: (a = a.a)
-(28 rows)
+(60 rows)
delete from lprt_a where a = 1;
select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(1, 0, 0)');
- explain_parallel_append
--------------------------------------------------------------------------------------------------
+ explain_parallel_append
+--------------------------------------------------------------------------------------------------------
Finalize Aggregate (actual rows=N loops=N)
-> Gather (actual rows=N loops=N)
- Workers Planned: 1
+ Workers Planned: 2
Workers Launched: N
-> Partial Aggregate (actual rows=N loops=N)
- -> Nested Loop (actual rows=N loops=N)
- -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
- Filter: (a = ANY ('{1,0,0}'::integer[]))
- Rows Removed by Filter: N
- -> Append (actual rows=N loops=N)
- -> Index Scan using ab_a1_b1_a_idx on ab_a1_b1 ab_1 (never executed)
+ -> Parallel Append (actual rows=N loops=N)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a1_b1_a_idx on ab_a1_b1 ab_1 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a1_b2_a_idx on ab_a1_b2 ab_2 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a1_b2_a_idx on ab_a1_b2 ab_2 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a1_b3_a_idx on ab_a1_b3 ab_3 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a1_b3_a_idx on ab_a1_b3 ab_3 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b1_a_idx on ab_a2_b1 ab_4 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a2_b1_a_idx on ab_a2_b1 ab_4 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b2_a_idx on ab_a2_b2 ab_5 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a2_b2_a_idx on ab_a2_b2 ab_5 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a2_b3_a_idx on ab_a2_b3 ab_6 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a2_b3_a_idx on ab_a2_b3 ab_6 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a3_b1_a_idx on ab_a3_b1 ab_7 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a3_b1_a_idx on ab_a3_b1 ab_7 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a3_b2_a_idx on ab_a3_b2 ab_8 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a3_b2_a_idx on ab_a3_b2 ab_8 (actual rows=N loops=N)
Index Cond: (a = a.a)
- -> Index Scan using ab_a3_b3_a_idx on ab_a3_b3 ab_9 (never executed)
+ -> Nested Loop (actual rows=N loops=N)
+ -> Parallel Seq Scan on lprt_a a (actual rows=N loops=N)
+ Filter: (a = ANY ('{1,0,0}'::integer[]))
+ Rows Removed by Filter: N
+ -> Index Scan using ab_a3_b3_a_idx on ab_a3_b3 ab_9 (actual rows=N loops=N)
Index Cond: (a = a.a)
-(28 rows)
+(60 rows)
reset enable_hashjoin;
reset enable_mergejoin;
@@ -2998,43 +3134,63 @@ explain (analyze, costs off, summary off, timing off)
select * from tbl1 join tprt on tbl1.col1 > tprt.col1;
QUERY PLAN
--------------------------------------------------------------------------
- Nested Loop (actual rows=6 loops=1)
- -> Seq Scan on tbl1 (actual rows=2 loops=1)
- -> Append (actual rows=3 loops=2)
+ Append (actual rows=6 loops=1)
+ -> Nested Loop (actual rows=4 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
-> Index Scan using tprt1_idx on tprt_1 (actual rows=2 loops=2)
Index Cond: (col1 < tbl1.col1)
- -> Index Scan using tprt2_idx on tprt_2 (actual rows=2 loops=1)
+ -> Nested Loop (actual rows=2 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
+ -> Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
Index Cond: (col1 < tbl1.col1)
- -> Index Scan using tprt3_idx on tprt_3 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
+ -> Index Scan using tprt3_idx on tprt_3 (actual rows=0 loops=2)
Index Cond: (col1 < tbl1.col1)
- -> Index Scan using tprt4_idx on tprt_4 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
+ -> Index Scan using tprt4_idx on tprt_4 (actual rows=0 loops=2)
Index Cond: (col1 < tbl1.col1)
- -> Index Scan using tprt5_idx on tprt_5 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
+ -> Index Scan using tprt5_idx on tprt_5 (actual rows=0 loops=2)
Index Cond: (col1 < tbl1.col1)
- -> Index Scan using tprt6_idx on tprt_6 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
+ -> Index Scan using tprt6_idx on tprt_6 (actual rows=0 loops=2)
Index Cond: (col1 < tbl1.col1)
-(15 rows)
+(25 rows)
explain (analyze, costs off, summary off, timing off)
select * from tbl1 join tprt on tbl1.col1 = tprt.col1;
QUERY PLAN
--------------------------------------------------------------------------
- Nested Loop (actual rows=2 loops=1)
- -> Seq Scan on tbl1 (actual rows=2 loops=1)
- -> Append (actual rows=1 loops=2)
- -> Index Scan using tprt1_idx on tprt_1 (never executed)
+ Append (actual rows=2 loops=1)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
+ -> Index Scan using tprt1_idx on tprt_1 (actual rows=0 loops=2)
Index Cond: (col1 = tbl1.col1)
+ -> Nested Loop (actual rows=2 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
-> Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt3_idx on tprt_3 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
+ -> Index Scan using tprt3_idx on tprt_3 (actual rows=0 loops=2)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt4_idx on tprt_4 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
+ -> Index Scan using tprt4_idx on tprt_4 (actual rows=0 loops=2)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt5_idx on tprt_5 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
+ -> Index Scan using tprt5_idx on tprt_5 (actual rows=0 loops=2)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt6_idx on tprt_6 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=2 loops=1)
+ -> Index Scan using tprt6_idx on tprt_6 (actual rows=0 loops=2)
Index Cond: (col1 = tbl1.col1)
-(15 rows)
+(25 rows)
select tbl1.col1, tprt.col1 from tbl1
inner join tprt on tbl1.col1 > tprt.col1
@@ -3064,43 +3220,63 @@ explain (analyze, costs off, summary off, timing off)
select * from tbl1 inner join tprt on tbl1.col1 > tprt.col1;
QUERY PLAN
--------------------------------------------------------------------------
- Nested Loop (actual rows=23 loops=1)
- -> Seq Scan on tbl1 (actual rows=5 loops=1)
- -> Append (actual rows=5 loops=5)
+ Append (actual rows=23 loops=1)
+ -> Nested Loop (actual rows=10 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
-> Index Scan using tprt1_idx on tprt_1 (actual rows=2 loops=5)
Index Cond: (col1 < tbl1.col1)
- -> Index Scan using tprt2_idx on tprt_2 (actual rows=3 loops=4)
+ -> Nested Loop (actual rows=11 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
+ -> Index Scan using tprt2_idx on tprt_2 (actual rows=2 loops=5)
Index Cond: (col1 < tbl1.col1)
- -> Index Scan using tprt3_idx on tprt_3 (actual rows=1 loops=2)
+ -> Nested Loop (actual rows=2 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
+ -> Index Scan using tprt3_idx on tprt_3 (actual rows=0 loops=5)
Index Cond: (col1 < tbl1.col1)
- -> Index Scan using tprt4_idx on tprt_4 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
+ -> Index Scan using tprt4_idx on tprt_4 (actual rows=0 loops=5)
Index Cond: (col1 < tbl1.col1)
- -> Index Scan using tprt5_idx on tprt_5 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
+ -> Index Scan using tprt5_idx on tprt_5 (actual rows=0 loops=5)
Index Cond: (col1 < tbl1.col1)
- -> Index Scan using tprt6_idx on tprt_6 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
+ -> Index Scan using tprt6_idx on tprt_6 (actual rows=0 loops=5)
Index Cond: (col1 < tbl1.col1)
-(15 rows)
+(25 rows)
explain (analyze, costs off, summary off, timing off)
select * from tbl1 inner join tprt on tbl1.col1 = tprt.col1;
QUERY PLAN
--------------------------------------------------------------------------
- Nested Loop (actual rows=3 loops=1)
- -> Seq Scan on tbl1 (actual rows=5 loops=1)
- -> Append (actual rows=1 loops=5)
- -> Index Scan using tprt1_idx on tprt_1 (never executed)
+ Append (actual rows=3 loops=1)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
+ -> Index Scan using tprt1_idx on tprt_1 (actual rows=0 loops=5)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt2_idx on tprt_2 (actual rows=1 loops=2)
+ -> Nested Loop (actual rows=2 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
+ -> Index Scan using tprt2_idx on tprt_2 (actual rows=0 loops=5)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt3_idx on tprt_3 (actual rows=0 loops=3)
+ -> Nested Loop (actual rows=1 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
+ -> Index Scan using tprt3_idx on tprt_3 (actual rows=0 loops=5)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt4_idx on tprt_4 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
+ -> Index Scan using tprt4_idx on tprt_4 (actual rows=0 loops=5)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt5_idx on tprt_5 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
+ -> Index Scan using tprt5_idx on tprt_5 (actual rows=0 loops=5)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt6_idx on tprt_6 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=5 loops=1)
+ -> Index Scan using tprt6_idx on tprt_6 (actual rows=0 loops=5)
Index Cond: (col1 = tbl1.col1)
-(15 rows)
+(25 rows)
select tbl1.col1, tprt.col1 from tbl1
inner join tprt on tbl1.col1 > tprt.col1
@@ -3149,22 +3325,32 @@ explain (analyze, costs off, summary off, timing off)
select * from tbl1 join tprt on tbl1.col1 < tprt.col1;
QUERY PLAN
--------------------------------------------------------------------------
- Nested Loop (actual rows=1 loops=1)
- -> Seq Scan on tbl1 (actual rows=1 loops=1)
- -> Append (actual rows=1 loops=1)
- -> Index Scan using tprt1_idx on tprt_1 (never executed)
+ Append (actual rows=1 loops=1)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
+ -> Index Scan using tprt1_idx on tprt_1 (actual rows=0 loops=1)
Index Cond: (col1 > tbl1.col1)
- -> Index Scan using tprt2_idx on tprt_2 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
+ -> Index Scan using tprt2_idx on tprt_2 (actual rows=0 loops=1)
Index Cond: (col1 > tbl1.col1)
- -> Index Scan using tprt3_idx on tprt_3 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
+ -> Index Scan using tprt3_idx on tprt_3 (actual rows=0 loops=1)
Index Cond: (col1 > tbl1.col1)
- -> Index Scan using tprt4_idx on tprt_4 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
+ -> Index Scan using tprt4_idx on tprt_4 (actual rows=0 loops=1)
Index Cond: (col1 > tbl1.col1)
- -> Index Scan using tprt5_idx on tprt_5 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
+ -> Index Scan using tprt5_idx on tprt_5 (actual rows=0 loops=1)
Index Cond: (col1 > tbl1.col1)
+ -> Nested Loop (actual rows=1 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
-> Index Scan using tprt6_idx on tprt_6 (actual rows=1 loops=1)
Index Cond: (col1 > tbl1.col1)
-(15 rows)
+(25 rows)
select tbl1.col1, tprt.col1 from tbl1
inner join tprt on tbl1.col1 < tprt.col1
@@ -3179,24 +3365,34 @@ delete from tbl1;
insert into tbl1 values (10000);
explain (analyze, costs off, summary off, timing off)
select * from tbl1 join tprt on tbl1.col1 = tprt.col1;
- QUERY PLAN
--------------------------------------------------------------------
- Nested Loop (actual rows=0 loops=1)
- -> Seq Scan on tbl1 (actual rows=1 loops=1)
- -> Append (actual rows=0 loops=1)
- -> Index Scan using tprt1_idx on tprt_1 (never executed)
+ QUERY PLAN
+--------------------------------------------------------------------------
+ Append (actual rows=0 loops=1)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
+ -> Index Scan using tprt1_idx on tprt_1 (actual rows=0 loops=1)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt2_idx on tprt_2 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
+ -> Index Scan using tprt2_idx on tprt_2 (actual rows=0 loops=1)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt3_idx on tprt_3 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
+ -> Index Scan using tprt3_idx on tprt_3 (actual rows=0 loops=1)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt4_idx on tprt_4 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
+ -> Index Scan using tprt4_idx on tprt_4 (actual rows=0 loops=1)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt5_idx on tprt_5 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
+ -> Index Scan using tprt5_idx on tprt_5 (actual rows=0 loops=1)
Index Cond: (col1 = tbl1.col1)
- -> Index Scan using tprt6_idx on tprt_6 (never executed)
+ -> Nested Loop (actual rows=0 loops=1)
+ -> Seq Scan on tbl1 (actual rows=1 loops=1)
+ -> Index Scan using tprt6_idx on tprt_6 (actual rows=0 loops=1)
Index Cond: (col1 = tbl1.col1)
-(15 rows)
+(25 rows)
select tbl1.col1, tprt.col1 from tbl1
inner join tprt on tbl1.col1 = tprt.col1
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 2f3eb4e7f1..7922f56861 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -134,6 +134,7 @@ select count(*) = 0 as ok from pg_stat_wal_receiver;
select name, setting from pg_settings where name like 'enable%';
name | setting
--------------------------------+---------
+ enable_asymmetric_join | on
enable_async_append | on
enable_bitmapscan | on
enable_gathermerge | on
@@ -157,7 +158,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(23 rows)
+(24 rows)
-- There are always wait event descriptions for various types.
select type, count(*) > 0 as ok FROM pg_wait_events
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index 128ce8376e..4b5bb30f57 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -585,6 +585,271 @@ EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a = t2.b) WHERE t1.b = 0 ORDER BY t1.a, t2.b;
+--
+-- For asymmetric partition-wise join
+--
+CREATE TABLE prt5 (hkey int, a int, b int) PARTITION BY HASH(hkey);
+CREATE TABLE prt5_p0 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 0);
+CREATE TABLE prt5_p1 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 1);
+CREATE TABLE prt5_p2 PARTITION OF prt5
+ FOR VALUES WITH (modulus 3, remainder 2);
+CREATE TABLE prt6 (aid int, alabel text) PARTITION BY HASH(aid);
+CREATE TABLE prt6_p0 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prt6_p1 PARTITION OF prt6
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE t5_1 (aid int, alabel text);
+CREATE TABLE t5_2 (bid int, blabel text);
+
+INSERT INTO prt5 (SELECT x, (x % 1000)::int,
+ ((x+1) % 1000)::int
+ FROM generate_series(1,1000000) x);
+INSERT INTO t5_1 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO t5_2 (SELECT x, md5(x::text) FROM generate_series(-200, 1500) x);
+INSERT INTO prt6 (SELECT * FROM t5_1);
+
+VACUUM ANALYZE prt5,prt6,t5_1,t5_2;
+
+SET max_parallel_workers_per_gather = 0;
+
+-- Trivial asymmetric JOIN of partitioned table with a relation
+EXPLAIN (COSTS OFF)
+SELECT *
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- AJ: the case of WholeRowVar
+EXPLAIN (COSTS OFF)
+SELECT (t1.*)::prt1 FROM prt1 t1, prt2_p1 t2 WHERE (t1.a = t2.b);
+
+-- AJ with partitioned table (one partition has pruned)
+EXPLAIN (COSTS OFF)
+SELECT (t1.*)::prt1
+FROM prt1 t1, prt2_p1 t2
+WHERE (t1.a = t2.b) AND (t1.a = 542 OR t1.a = 1);
+
+-- Test AJ on anti join
+-- First test mostly for the future SJE with support of partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt1 t1, prt2_p1 t2
+WHERE NOT EXISTS (SELECT * FROM prt1 WHERE prt1.a = t1.a);
+SELECT count(*) FROM prt1 t1, prt2_p1 t2
+WHERE NOT EXISTS (SELECT * FROM prt1 WHERE prt1.a = t1.a);
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt2_p1 t1
+WHERE NOT EXISTS (SELECT * FROM prt1 WHERE prt1.a = t1.b);
+SELECT count(*) FROM prt2_p1 t1
+WHERE NOT EXISTS (SELECT * FROM prt1 WHERE prt1.a = t1.b);
+
+-- Can't use AJ because of complex inner as well as PWJ (one partition pruned)
+EXPLAIN (COSTS OFF)
+SELECT (t1.*)::prt1
+FROM prt1 t1, prt2_p1 t2, prt1 t3
+WHERE (t1.a = t2.b) AND (t1.a = 542 OR t1.a = 1) AND t2.a = t3.a;
+
+-- TODO: We don't see any AJ or PWJ, but may be it's a game of costs?
+EXPLAIN (COSTS OFF)
+SELECT (t1.*)::prt1
+FROM prt1 t1, prt2_p1 t2, prt1 t3
+WHERE (t1.a = t2.b) AND (t1.a = 542 OR t1.a = 1)
+ AND (t2.a = t3.a) AND (t3.a = 543 OR t3.a = 2);
+
+-- The same, but appended with UNION ALL
+EXPLAIN (COSTS OFF)
+SELECT * FROM (
+ (SELECT * FROM prt5_p0)
+ UNION ALL
+ (SELECT * FROM prt5_p1)
+ UNION ALL
+ (SELECT * FROM prt5_p2)
+ ) AS sq1
+JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- Don't allow asymmetric JOIN of two partitioned tables.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+SELECT count(*) FROM prt5 JOIN prt6 ON a = aid AND alabel like '%abc%';
+
+-- Check that asymmetric JOIN with Subquery is forbidden
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM prt5 JOIN (
+ SELECT * FROM prt6 LIMIT 1000
+) AS sq1 ON a = aid AND alabel like '%abc%';
+SELECT count(*) FROM prt5 JOIN (SELECT * FROM prt6 LIMIT 1000) AS sq1
+ ON a = aid AND alabel like '%abc%';
+
+-- Asymmetric JOIN of two plane tables and one partitioned
+EXPLAIN (COSTS OFF)
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+SELECT count(*)
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+-- unable to extract non-partitioned right relation
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 RIGHT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+-- left side can be extracted, but no cost benefit
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt5 LEFT JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+-- validation of the results with/without asymmetric partition-wise join
+SELECT * INTO pg_temp.result01a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02a
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SET enable_partitionwise_join = off;
+
+SELECT * INTO pg_temp.result01b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%abc%';
+
+SELECT * INTO pg_temp.result02b
+ FROM prt5 JOIN t5_1 ON a = aid AND alabel like '%ab%'
+ JOIN t5_2 ON b = bid AND blabel like '%cd%';
+
+SELECT * FROM pg_temp.result01a EXCEPT SELECT * FROM pg_temp.result01b;
+SELECT * FROM pg_temp.result02a EXCEPT SELECT * FROM pg_temp.result02b;
+
+SET enable_partitionwise_join = on;
+
+-- Trying different JOIN combinations optimiser can provide partitioned relation
+-- buried deeply in the inner subtree. Check that it doesn't provide
+-- inconsistency in the plan.
+EXPLAIN (COSTS OFF)
+SELECT * from prt1 d1, prt2 d2, unnest(array[3,4]) n, unnest(array[3,4]) g
+WHERE d1.a = n AND d2.b = d1.a and g = n;
+
+-- TODO:
+-- According to current logic decision on AJ or PWJ for specific joinrel
+-- can depend on the order of relations in the FROM list. See how it works.
+-- Can we resolve this issue somehow?
+EXPLAIN (COSTS OFF) -- PWJ on top
+SELECT * from prt1 d1, unnest(array[3,4]) n, prt2 d2
+WHERE d1.a = n AND d2.b = d1.a;
+EXPLAIN (COSTS OFF) -- AJ on top
+SELECT * from prt1 d1, prt2 d2, unnest(array[3,4]) n
+WHERE d1.a = n AND d2.b = d1.a;
+
+-- Allow JOINs in the inner side of AJ
+EXPLAIN (COSTS OFF)
+SELECT * FROM prt1 d1 LEFT JOIN (unnest(array[3,4]) n
+ FULL OUTER JOIN unnest(array[3,4]) g ON (n=g))
+ON (d1.a = n);
+
+-- Check reparameterization code when an optimizer have to make two level relids
+-- adjustment.
+
+SET enable_hashjoin = 'off';
+SET enable_mergejoin = 'off';
+SET enable_material = 'off';
+
+CREATE TABLE big AS SELECT x AS x FROM generate_series(1,1E4) x;
+CREATE INDEX ON big(x);
+CREATE TABLE small AS SELECT x, -x AS y FROM generate_series(1,100) x;
+
+CREATE TABLE part_l0 (x int, y int, z int) PARTITION BY HASH (y);
+CREATE TABLE part0_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 0) PARTITION BY HASH (z);
+CREATE TABLE part0_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE part1_l2 PARTITION OF part0_l1 (z)
+ FOR VALUES WITH (modulus 2, remainder 1);
+CREATE TABLE part1_l1 PARTITION OF part_l0 (y)
+ FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO part_l0 (x, y, z) (SELECT x,x,x FROM generate_series(1,1E4) x);
+
+ANALYZE big,small,part_l0;
+
+-- Parameter have to be reparameterized by a plane relation.
+EXPLAIN (COSTS OFF)
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+SELECT small.* FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x;
+
+-- Parameters have to be reparameterized by plane and partitioned relations.
+EXPLAIN (COSTS OFF)
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+SELECT count(*) FROM small
+ JOIN part_l0 ON small.y = part_l0.y AND small.y + part_l0.y < 10
+ LEFT JOIN big ON big.x = small.x OR big.x = part_l0.x;
+
+DROP TABLE IF EXISTS big,small,part_l0 CASCADE;
+RESET enable_hashjoin;
+RESET enable_mergejoin;
+RESET enable_material;
+RESET max_parallel_workers_per_gather;
+
+--
+-- The case of AJ under partitionwise PNL which needs reparameterization
+--
+CREATE TABLE small (x int);
+CREATE TABLE big (x int, y int);
+CREATE TABLE part42 (x int, y int) PARTITION BY RANGE (x);
+CREATE TABLE p1 PARTITION OF part42 (x) FOR VALUES FROM (0) TO (50);
+CREATE TABLE p2 PARTITION OF part42 (x) FOR VALUES FROM (50) TO (101);
+INSERT INTO big (SELECT gs%10,gs FROM generate_series(1,10000) AS gs);
+INSERT INTO small (x) SELECT gs FROM generate_series (1,1) AS gs;
+INSERT INTO part42 SELECT gs%100, gs%100 FROM generate_series(1,100000) AS gs;
+CREATE INDEX big_idx ON big(y);
+VACUUM ANALYZE small,big,part42;
+
+SET enable_hashjoin = off;
+SET enable_mergejoin = off;
+SET max_parallel_workers_per_gather = 0; -- Reduce test unstability
+EXPLAIN (COSTS OFF)
+SELECT * FROM part42 ext LEFT JOIN big
+ JOIN part42 p2 ON (p2.x=big.y)
+ON (ext.x=p2.x AND ext.y=big.y);
+RESET max_parallel_workers_per_gather;
+RESET enable_mergejoin;
+RESET enable_hashjoin;
+
+-- Parameterized path examples.
+CREATE TABLE prta (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prta1 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prta2 PARTITION OF prta FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prta1 (id);
+CREATE UNIQUE INDEX ON prta2 (id);
+INSERT INTO prta (id, payload)
+ (SELECT *, ('abc' || id)::text AS payload
+ FROM generate_series(1,1) AS id);
+
+CREATE TABLE prtb (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE prtb1 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE prtb2 PARTITION OF prtb FOR VALUES WITH (modulus 2, remainder 1);
+CREATE UNIQUE INDEX ON prtb1 (id);
+CREATE UNIQUE INDEX ON prtb2 (id);
+INSERT INTO prtb (id, payload)
+ (SELECT *, ('def' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+
+CREATE TABLE e (id integer, payload text) PARTITION BY HASH (id);
+CREATE TABLE e1 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 0);
+CREATE TABLE e2 PARTITION OF e FOR VALUES WITH (modulus 2, remainder 1);
+INSERT INTO e (id, payload)
+ (SELECT *, ('ghi' || id)::text AS payload
+ FROM generate_series(1,1000) AS id);
+CREATE UNIQUE INDEX ON e1 (id);
+CREATE UNIQUE INDEX ON e2 (id);
+
+ANALYZE prta,prtb,e;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb WHERE prta.id=prtb.id;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM prta,prtb,e WHERE prta.id=prtb.id AND prta.id=e.id;
+
-- semi join
EXPLAIN (COSTS OFF)
SELECT t1.* FROM prt1_adv t1 WHERE EXISTS (SELECT 1 FROM prt2_adv t2 WHERE t1.a = t2.b) AND t1.b = 0 ORDER BY t1.a;
@@ -777,6 +1042,17 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_adv t1 INNER JOIN prt2_adv t2 ON (t1.a =
DROP TABLE prt1_adv;
DROP TABLE prt2_adv;
+DROP TABLE prt1;
+
+-- Test function scan with lateral relids
+CREATE TABLE prt1 (id int, data json) PARTITION BY HASH(id);
+CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES WITH (MODULUS 3, REMAINDER 0);
+CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES WITH (MODULUS 3, REMAINDER 1);
+CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES WITH (MODULUS 3, REMAINDER 2);
+INSERT INTO prt1 SELECT i, format('{ "name": "value%s", "Test": {"a":"b"}}', i)::json from generate_series(1,100) i;
+ANALYZE prt1;
+SELECT id, key, value from prt1, json_each(data->'Test') ORDER BY id LIMIT 10;
+DROP TABLE prt1;
-- Tests for list-partitioned tables
CREATE TABLE plt1_adv (a int, b int, c text) PARTITION BY LIST (c);
--
2.45.0
Hi!
On Tue, Apr 2, 2024 at 6:07 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
On 15/10/2023 13:25, Alexander Korotkov wrote:
Great! I'm looking forward to the revised patch.
Revising the code and opinions before restarting this work, I found two
different possible strategies mentioned in the thread:
1. 'Common Resources' shares the materialised result of the inner table
scan (a hash table in the case of HashJoin) to join each partition one
by one. It gives us a profit in the case of parallel append and possibly
other cases, like the one shown in the initial message.
2. 'Individual strategies' - By limiting the AJ feature to cases when
the JOIN clause contains a partitioning expression, we can push an
additional scan clause into each copy of the inner table scan, reduce
the number of tuples scanned, and even prune something because of proven
zero input.I see the pros and cons of both approaches. The first option is more
straightforward, and its outcome is obvious in the case of parallel
append. But how can we guarantee the same join type for each join? Why
should we ignore the positive effect of different strategies for
different partitions?
The second strategy is more expensive for the optimiser, especially in
the multipartition case. But as I can predict, it is easier to implement
and looks more natural for the architecture. What do you think about that?
Actually, the idea I tried to express is the combination of #1 and #2:
to build individual plan for every partition, but consider the 'Common
Resources'. Let me explain this a bit more.
Right now, we basically we consider the following properties during
selection of paths.
1) Cost. The cheaper path wins. There a two criteria though: startup
cost and total cost. So, we can keep both paths with cheaper startup
costs and paths with cheaper total cost.
2) Pathkeys. We can keep a path with more expensive path, which has
pathkeys potentially useful in future.
My idea is to introduce a new property for paths selection.
3) Usage of common resources. The common resource can be: hash
representation of relation, memoize over relation scan, etc. We can
exclude the cost of common resource generation from the path cost, but
keep the reference for the common resource with its generation cost.
If one path uses more common resources than another path, it could
cost-dominate another one only if its cheaper together with its extra
common resources cost. If one path uses less or equal common
resources than another, it could normally cost-dominate another one.
Using these rules, we can gather the the plurality of paths for each
child join taking common resources into account. After that we can
apply some global optimization finding generation of which common
resources can reduce the global cost.
However, I understand this is huge amount of work given we have to
introduce new basic optimizer concepts. I get that the main
application of this patch is sharding. If we have global tables
residing each shard, we can push down any joins with them. Given this
patch gives some optimization for non-sharded case, I think we
*probably* can accept its concept even that it this optimization is
obviously not perfect.
------
Regards,
Alexander Korotkov
Supabase
On Sun, May 5, 2024 at 5:55 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 18/10/2023 16:59, Ashutosh Bapat wrote:
On Wed, Oct 18, 2023 at 10:55 AM Andrei Lepikhov
The relid is also used to track the scans at executor level. Since we
have so many scans on A, each may be using different plan, we will
need different ids for those.I don't understand this sentence. Which way executor uses this index of
RelOptInfo ?See Scan::scanrelid
Hi,
In the attachment, you will find a fresh version of the patch.
I've analysed the danger of the same RelOptInfo index for the executor.
In the examples I found (scared), it is still not a problem because
ExecQual() does all the jobs at one operation and doesn't intersect with
over operations. Of course, it is not a good design, and we will work on
this issue. But at least this code can be used in experiments.
Furthermore, I've shared some reflections on this feature. To avoid
cluttering the thread, I've published them in [1]. These thoughts
provide additional context and considerations for our ongoing work.[1]
https://danolivo.substack.com/p/postgresql-asymmetric-join-technique?r=34q1yy
I've rebased the patch to the current master. Also, I didn't like the
needFlatCopy argument to reparameterize_path_by_child(). It looks
quite awkward. Instead, as soon as we need to copy paths, I've
enabled native copy of paths. Now, we can do just copyObject() over
path in caller. Looks much cleaner for me. What do you think?
Other notes:
1) I think we need to cover the cases, which
is_inner_rel_safe_for_asymmetric_join() filters out, by regression
tests.
2) is_asymmetric_join() looks awkward for me. Should we instead make
a flag in JoinPath?
3) I understand that you have re-use RelOptInfo multiple times. It's
too late stage of query processing to add a simple relation into
planner structs. I tried rescans issued by cursors, EvalPlanQual()
caused by concurrent updates, but didn't manage to break this. It
seems that even if same relation index is used multiple times in
different places of a query, it never get used simultaneously. But
even if this somehow is OK, this is significant change of assumptions
in planner/executor data structures. Perhaps, we need at least Tom's
opinion on this.
------
Regards,
Alexander Korotkov
Supabase
On 1/8/2024 20:56, Alexander Korotkov wrote:
On Tue, Apr 2, 2024 at 6:07 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
Actually, the idea I tried to express is the combination of #1 and #2:
to build individual plan for every partition, but consider the 'Common
Resources'. Let me explain this a bit more.
Thanks for keeping your eye on it!
My idea is to introduce a new property for paths selection.
3) Usage of common resources. The common resource can be: hash
representation of relation, memoize over relation scan, etc. We can
exclude the cost of common resource generation from the path cost, but
keep the reference for the common resource with its generation cost.
If one path uses more common resources than another path, it could
cost-dominate another one only if its cheaper together with its extra
common resources cost. If one path uses less or equal common
resources than another, it could normally cost-dominate another one.
The most challenging part for me is the cost calculation, which is
bonded with estimations of other paths. To correctly estimate the
effect, we need to remember at least the whole number of paths sharing
resources.
Also, I wonder if it can cause some corner cases where prediction error
on a shared resource will cause an even worse situation upstream.
I think we could push off here from an example and a counter-example,
but I still can't find them.
However, I understand this is huge amount of work given we have to
introduce new basic optimizer concepts. I get that the main
application of this patch is sharding. If we have global tables
residing each shard, we can push down any joins with them. Given this
patch gives some optimization for non-sharded case, I think we
*probably* can accept its concept even that it this optimization is
obviously not perfect.
Yes, right now sharding is the most profitable case. We can push down
parts of the plan which references only some common resources:
FunctionScan, ValueScan, tables which can be proved are existed
everywhere and provide the same output. But for now it is too far from
the core code, IMO. - So, I search for cases that can be helpful for a
single instance.
--
regards,
Andrei Lepikhov
Postgres Professional