Asymmetric partition-wise JOIN

Started by KaiGai Kohei, 44 messages (pgsql-hackers)
#1 KaiGai Kohei
kaigai@ak.jp.nec.com

Hello,

At present, the PostgreSQL optimizer considers join pairs only between
two non-partitioned relations or between two partition leaves. However,
it is harmless, and makes sense, to also consider a join pair between a
non-partitioned relation and a partition leaf.

See the example below. ptable is partitioned by hash and contains 10M
rows; ftable is not partitioned and contains 50 rows. Most values of
ptable.fkey have no matching row in this join.

create table ptable (fkey int, dist text) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3, remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3, remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3, remainder 2);
insert into ptable (select x % 10000, md5(x::text) from generate_series(1,10000000) x);

create table ftable (pkey int primary key, memo text);
insert into ftable (select x, 'ftable__#' || x::text from generate_series(1,50) x);
vacuum analyze;

postgres=# explain analyze select count(*) from ptable p, ftable f where p.fkey = f.pkey;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=266393.38..266393.39 rows=1 width=8) (actual time=2333.193..2333.194 rows=1 loops=1)
   ->  Hash Join  (cost=2.12..260143.38 rows=2500000 width=0) (actual time=0.056..2330.079 rows=50000 loops=1)
         Hash Cond: (p.fkey = f.pkey)
         ->  Append  (cost=0.00..233335.00 rows=10000000 width=4) (actual time=0.012..1617.268 rows=10000000 loops=1)
               ->  Seq Scan on ptable_p0 p  (cost=0.00..61101.96 rows=3332796 width=4) (actual time=0.011..351.137 rows=3332796 loops=1)
               ->  Seq Scan on ptable_p1 p_1  (cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..272.925 rows=3333025 loops=1)
               ->  Seq Scan on ptable_p2 p_2  (cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.006..416.141 rows=3334179 loops=1)
         ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.033..0.034 rows=50 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 10kB
               ->  Seq Scan on ftable f  (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.017 rows=50 loops=1)
 Planning Time: 0.286 ms
 Execution Time: 2333.264 ms
(12 rows)

We can manually rewrite this query as follows:

postgres=# explain analyze select count(*) from (
    select * from ptable_p0 p, ftable f where p.fkey = f.pkey union all
    select * from ptable_p1 p, ftable f where p.fkey = f.pkey union all
    select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;

Because the Append no longer has to process tuples that have no matching
tuple in ftable, this query has a cheaper cost and a shorter execution
time (2333ms --> 1396ms).

postgres=# explain analyze select count(*) from (
    select * from ptable_p0 p, ftable f where p.fkey = f.pkey union all
    select * from ptable_p1 p, ftable f where p.fkey = f.pkey union all
    select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
                                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=210478.25..210478.26 rows=1 width=8) (actual time=1396.024..1396.024 rows=1 loops=1)
   ->  Append  (cost=2.12..210353.14 rows=50042 width=0) (actual time=0.058..1393.008 rows=50000 loops=1)
         ->  Subquery Scan on "*SELECT* 1"  (cost=2.12..70023.66 rows=16726 width=0) (actual time=0.057..573.197 rows=16789 loops=1)
               ->  Hash Join  (cost=2.12..69856.40 rows=16726 width=72) (actual time=0.056..571.718 rows=16789 loops=1)
                     Hash Cond: (p.fkey = f.pkey)
                     ->  Seq Scan on ptable_p0 p  (cost=0.00..61101.96 rows=3332796 width=4) (actual time=0.009..255.791 rows=3332796 loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.034..0.035 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f  (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.019 rows=50 loops=1)
         ->  Subquery Scan on "*SELECT* 2"  (cost=2.12..70027.43 rows=16617 width=0) (actual time=0.036..409.712 rows=16578 loops=1)
               ->  Hash Join  (cost=2.12..69861.26 rows=16617 width=72) (actual time=0.036..408.626 rows=16578 loops=1)
                     Hash Cond: (p_1.fkey = f_1.pkey)
                     ->  Seq Scan on ptable_p1 p_1  (cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..181.422 rows=3333025 loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.020..0.020 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f_1  (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50 loops=1)
         ->  Subquery Scan on "*SELECT* 3"  (cost=2.12..70051.84 rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1)
               ->  Hash Join  (cost=2.12..69884.85 rows=16699 width=72) (actual time=0.025..406.048 rows=16633 loops=1)
                     Hash Cond: (p_2.fkey = f_2.pkey)
                     ->  Seq Scan on ptable_p2 p_2  (cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015 rows=3334179 loops=1)
                     ->  Hash  (cost=1.50..1.50 rows=50 width=4) (actual time=0.014..0.014 rows=50 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 10kB
                           ->  Seq Scan on ftable f_2  (cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50 loops=1)
 Planning Time: 0.614 ms
 Execution Time: 1396.131 ms
(25 rows)

What are your opinions on this kind of asymmetric partition-wise JOIN
support in the optimizer? I think we can harmlessly push down inner
joins, and left joins when the partition leaf is on the left side.

We probably need to implement two key pieces of functionality.
1. Construction of a RelOptInfo for the join between the non-partitioned
   table and each partition leaf. Instead of JoinPaths, this logic adds
   an AppendPath that takes the asymmetric partition-wise join paths as
   sub-paths. The rest of the optimization logic is the same as what we
   do today.
2. Allowing the hash table built from the non-partitioned table's scan
   to be shared across the individual partition leaves. In the example
   above, the SeqScan on ftable and the associated Hash path build an
   identical hash table for each upcoming hash join. If sibling paths
   produce equivalent results, it is reasonable to reuse the result.
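As a toy illustration of point 2 (illustrative Python, not planner code;
all data is hypothetical), the inner hash table can be built once and
probed by each sibling join instead of being rebuilt per leaf:

```python
ftable = [1, 2, 3]                      # pkey values in the non-partitioned table
ptable_p0 = [1, 99]                     # fkey values in partition leaf 0
ptable_p1 = [2, 3, 100]                 # fkey values in partition leaf 1

shared = set(ftable)                    # "hash table" built once, not once per leaf
result = []
for leaf in (ptable_p0, ptable_p1):     # one sibling hash join per leaf
    result.extend(k for k in leaf if k in shared)
print(sorted(result))                   # [1, 2, 3]
```

The point is only that every sibling probes the same structure; the real
work is making the executor's Hash node share its table safely.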

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>

#2 KaiGai Kohei
kaigai@ak.jp.nec.com
In reply to: KaiGai Kohei (#1)
Re: Asymmetric partition-wise JOIN

Hello,

Even though nobody has responded to the thread yet, I tried to make a
prototype of the asymmetric partition-wise join support.
This feature tries to join the non-partitioned and partitioned relations
before the Append.

See the example below:

create table ptable (dist int, a int, b int) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3, remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3, remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3, remainder 2);
create table t1 (aid int, label text);
create table t2 (bid int, label text);
insert into ptable (select x, (1000*random())::int, (1000*random())::int from generate_series(1,1000000) x);
insert into t1 (select x, md5(x::text) from generate_series(1,50) x);
insert into t2 (select x, md5(x::text) from generate_series(1,50) x);
vacuum analyze ptable;
vacuum analyze t1;
vacuum analyze t2;

ptable.a has values between 0 and 1000, and t1.aid has values between 1 and 50.
Therefore, joining ptable and t1 on a = aid reduces the rows by almost 95%.
On the other hand, t1 is not partitioned and the join keys are not partition
keys. So, Append must process a million rows first, then HashJoin processes
the rows read from the partitioned table, and 95% of them are eventually
dropped. In other words, 95% of the work done by Append is a waste of time
and CPU cycles.
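A back-of-the-envelope check of that 95% figure (illustrative Python
sampling, hypothetical and not part of the patch):

```python
import random

random.seed(1)
rows = [random.randint(0, 1000) for _ in range(100_000)]  # ptable.a values
aids = set(range(1, 51))                                  # t1.aid values
match_frac = sum(r in aids for r in rows) / len(rows)
print(round(match_frac, 3))   # roughly 0.05: ~95% of rows find no match
```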

postgres=# explain select * from ptable, t1 where a = aid;
QUERY PLAN
------------------------------------------------------------------------------
Hash Join (cost=2.12..24658.62 rows=49950 width=49)
Hash Cond: (ptable_p0.a = t1.aid)
-> Append (cost=0.00..20407.00 rows=1000000 width=12)
-> Seq Scan on ptable_p0 (cost=0.00..5134.63 rows=333263 width=12)
-> Seq Scan on ptable_p1 (cost=0.00..5137.97 rows=333497 width=12)
-> Seq Scan on ptable_p2 (cost=0.00..5134.40 rows=333240 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
(8 rows)

The asymmetric partition-wise join allows joining non-partitioned tables
and partitioned tables prior to the Append.

postgres=# set enable_partitionwise_join = on;
SET
postgres=# explain select * from ptable, t1 where a = aid;
QUERY PLAN
------------------------------------------------------------------------------
Append (cost=2.12..19912.62 rows=49950 width=49)
-> Hash Join (cost=2.12..6552.96 rows=16647 width=49)
Hash Cond: (ptable_p0.a = t1.aid)
-> Seq Scan on ptable_p0 (cost=0.00..5134.63 rows=333263 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
-> Hash Join (cost=2.12..6557.29 rows=16658 width=49)
Hash Cond: (ptable_p1.a = t1.aid)
-> Seq Scan on ptable_p1 (cost=0.00..5137.97 rows=333497 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
-> Hash Join (cost=2.12..6552.62 rows=16645 width=49)
Hash Cond: (ptable_p2.a = t1.aid)
-> Seq Scan on ptable_p2 (cost=0.00..5134.40 rows=333240 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
(16 rows)

We can consider that the join ptable X t1 above is equivalent to:
(ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result; however, the rows are already reduced by
the HashJoin under each leaf of the Append, so the CPU cycles consumed by
the Append node become cheaper.
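The identity above can be checked with a toy model (illustrative Python,
hypothetical data; a set-membership test stands in for the hash join):

```python
t1 = {1, 2, 3}                                   # aid values present in t1
parts = [[1, 7, 8], [2, 9], [3, 10, 11]]         # a values per partition leaf

# Join after Append: concatenate the leaves first, then match against t1
joined_after_append = [a for leaf in parts for a in leaf if a in t1]

# Asymmetric partition-wise join: match each leaf against t1, then Append
joined_before_append = []
for leaf in parts:
    joined_before_append += [a for a in leaf if a in t1]

print(sorted(joined_after_append) == sorted(joined_before_append))  # True
```

The join distributes over the union of partitions, which is exactly why
the plan transformation preserves the result.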

On the other hand, it has a downside: t1 must be read three times, and
the hash table must also be built three times. This increases the
estimated cost, so the planner may not choose the asymmetric
partition-wise join plan.

One idea I have is for sibling HashJoins to share a hash table that is
built once by any of the sibling Hash plans. Right now, this is not
implemented yet.

What are your thoughts on this feature?

Best regards,


--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>

Attachments:

pgsql13-asymmetric-partitionwise-join.v1.patch (application/octet-stream, +376/-21)
#3 Thomas Munro
thomas.munro@gmail.com
In reply to: KaiGai Kohei (#2)
Re: Asymmetric partition-wise JOIN

On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai@heterodb.com> wrote:

We can consider that the join ptable X t1 above is equivalent to:
(ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result; however, the rows are already reduced by
the HashJoin under each leaf of the Append, so the CPU cycles consumed by
the Append node become cheaper.

On the other hand, it has a downside: t1 must be read three times, and
the hash table must also be built three times. This increases the
estimated cost, so the planner may not choose the asymmetric
partition-wise join plan.

What if you include the partition constraint as a filter on t1? So you get:

ptable X t1 =
(ptable_p0 X (σ hash(dist)%4=0 (t1))) +
(ptable_p1 X (σ hash(dist)%4=1 (t1))) +
(ptable_p2 X (σ hash(dist)%4=2 (t1))) +
(ptable_p3 X (σ hash(dist)%4=3 (t1)))

Pros:
1. The hash tables will not contain unnecessary junk.
2. You'll get the right answer if t1 is on the outer side of an outer join.
3. If this runs underneath a Parallel Append and t1 is big enough
then workers will hopefully cooperate and do a synchronised scan of
t1.
4. The filter might enable a selective and efficient plan like an index scan.

Cons:
1. The filter might not enable a selective and efficient plan, and
therefore cause extra work.

(It's a little weird in this example because we don't usually see hash
functions in WHERE clauses, but that could just as easily be dist
BETWEEN 1 AND 99 or any other partition constraint.)
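Thomas's transformation can be sketched with a toy model (illustrative
Python; `part_hash` is a hypothetical stand-in for the real partition
hash function, and the data is made up):

```python
MODULUS = 4
def part_hash(v):
    # stand-in for PostgreSQL's partition hash function
    return v % MODULUS

t1 = list(range(8))                                # dist values 0..7
parts = {r: [v for v in range(40) if part_hash(v) == r] for r in range(MODULUS)}

# Join without the pushed-down filter: every leaf probes all of t1
full_join = [(p, t) for r in parts for p in parts[r] for t in t1 if p == t]

# Join with the partition constraint pushed onto t1
filtered_join = []
for r, leaf in parts.items():
    t1_filtered = [t for t in t1 if part_hash(t) == r]   # σ hash(dist)%4=r (t1)
    filtered_join += [(p, t) for p in leaf for t in t1_filtered if p == t]

print(sorted(full_join) == sorted(filtered_join))   # same answer
```

Each per-leaf "hash table" shrinks from 8 candidate rows to 2, which is
pro 1 above: no unnecessary junk in the hash tables.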

One idea I have is for sibling HashJoins to share a hash table that is
built once by any of the sibling Hash plans. Right now, this is not
implemented yet.

Yeah, I've thought a little bit about that in the context of Parallel
Repartition. I'm interested in combining intra-node partitioning
(where a single plan node repartitions data among workers on the fly)
with inter-node partitioning (like PWJ, where partitions are handled
by different parts of the plan, considered at planning time); you
finish up needing to have nodes in the plan that 'receive' tuples for
each partition, to match up with the PWJ plan structure. That's not
entirely unlike CTE references, and not entirely unlike your idea of
somehow sharing the same hash table. I ran into a number of problems
while thinking about that, which I should write about in another
thread.

--
Thomas Munro
https://enterprisedb.com

#4 KaiGai Kohei
kaigai@ak.jp.nec.com
In reply to: Thomas Munro (#3)
Re: Asymmetric partition-wise JOIN

On Sat, Aug 24, 2019 at 7:02 Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai@heterodb.com> wrote:

We can consider that the join ptable X t1 above is equivalent to:
(ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result; however, the rows are already reduced by
the HashJoin under each leaf of the Append, so the CPU cycles consumed by
the Append node become cheaper.

On the other hand, it has a downside: t1 must be read three times, and
the hash table must also be built three times. This increases the
estimated cost, so the planner may not choose the asymmetric
partition-wise join plan.

What if you include the partition constraint as a filter on t1? So you get:

ptable X t1 =
(ptable_p0 X (σ hash(dist)%4=0 (t1))) +
(ptable_p1 X (σ hash(dist)%4=1 (t1))) +
(ptable_p2 X (σ hash(dist)%4=2 (t1))) +
(ptable_p3 X (σ hash(dist)%4=3 (t1)))

Pros:
1. The hash tables will not contain unnecessary junk.
2. You'll get the right answer if t1 is on the outer side of an outer join.
3. If this runs underneath a Parallel Append and t1 is big enough
then workers will hopefully cooperate and do a synchronised scan of
t1.
4. The filter might enable a selective and efficient plan like an index scan.

Cons:
1. The filter might not enable a selective and efficient plan, and
therefore cause extra work.

(It's a little weird in this example because we don't usually see hash
functions in WHERE clauses, but that could just as easily be dist
BETWEEN 1 AND 99 or any other partition constraint.)

It requires that the join key include the partition key and that it be
an equality join, doesn't it? If ptable and t1 are joined using
ptable.dist = t1.foo, we can indeed distribute t1 to each leaf table
with "WHERE hash(foo)%4 = xxx" according to the partition bounds.

In the case where some partition leaves are pruned, it is clearly
beneficial, because the rows that would only be referenced by the pruned
child relations are a waste of memory.

On the other hand, if no leaves are pruned, and if we could extend the
Hash node to share the hash table with sibling join nodes, it eventually
consumes almost the same amount of memory to load the inner relations.

One idea I have is for sibling HashJoins to share a hash table that is
built once by any of the sibling Hash plans. Right now, this is not
implemented yet.

Yeah, I've thought a little bit about that in the context of Parallel
Repartition. I'm interested in combining intra-node partitioning
(where a single plan node repartitions data among workers on the fly)
with inter-node partitioning (like PWJ, where partitions are handled
by different parts of the plan, considered at planning time); you
finish up needing to have nodes in the plan that 'receive' tuples for
each partition, to match up with the PWJ plan structure. That's not
entirely unlike CTE references, and not entirely unlike your idea of
somehow sharing the same hash table. I ran into a number of problems
while thinking about that, which I should write about in another
thread.

Hmm. Do you intend that the inner path may behave differently according
to the partition-bounds definition of the outer path it is joined to?
Let me investigate its pros and cons.

The reasons why I think the idea of sharing the same hash table is
reasonable in this scenario are:
1. We can easily extend the idea for parallel optimization. A hash table
   on a DSM segment, once built, can be shared by all the siblings
   across all the parallel workers.
2. We can save memory regardless of the join keys and partition keys,
   even if these are not involved in the query.

On the other hand, below are the downsides. Potentially, combined use of
your idea may help these cases:
3. A distributed inner relation cannot be the outer side of an OUTER JOIN.
4. The hash table contains rows that would only be referenced by pruned
   partition leaves.

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>

#5 Michael Paquier
michael@paquier.xyz
In reply to: KaiGai Kohei (#4)
Re: Asymmetric partition-wise JOIN

On Sat, Aug 24, 2019 at 05:33:01PM +0900, Kohei KaiGai wrote:

On the other hand, if no leaves are pruned, and if we could extend the
Hash node to share the hash table with sibling join nodes, it eventually
consumes almost the same amount of memory to load the inner relations.

The patch crashes when running the regression tests, per the report of
the automatic patch tester. Could you look at that? I have moved the
patch to the next CF, waiting on author.
--
Michael

#6 KaiGai Kohei
kaigai@ak.jp.nec.com
In reply to: Michael Paquier (#5)
Re: Asymmetric partition-wise JOIN

Hello,

This crash was reproduced in our environment also.
It looks to me like adjust_child_relids_multilevel() didn't expect the
case where the supplied 'relids' (partially) reference a normal,
non-partitioned relation. It tries to build a new 'parent_relids' that
is the set of appinfo->parent_relid entries related to the supplied
'child_relids'. However, the bits in child_relids that reference normal
relations are unintentionally dropped here. Then
adjust_child_relids_multilevel() goes into an infinite recursion until
it hits the stack limit.

The attached v2 fixes the problem, and the regression tests now finish
correctly.

Best regards,

On Sun, Dec 1, 2019 at 12:24 Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Aug 24, 2019 at 05:33:01PM +0900, Kohei KaiGai wrote:

On the other hand, if no leaves are pruned, and if we could extend the
Hash node to share the hash table with sibling join nodes, it eventually
consumes almost the same amount of memory to load the inner relations.

The patch crashes when running the regression tests, per the report of
the automatic patch tester. Could you look at that? I have moved the
patch to the next CF, waiting on author.
--
Michael

--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>

Attachments:

gdb_back_trace.txt (text/plain)
pgsql13-asymmetric-partitionwise-join.v2.patch (application/octet-stream, +382/-21)
#7 David Steele
david@pgmasters.net
In reply to: KaiGai Kohei (#6)
Re: Asymmetric partition-wise JOIN

Hi Thomas,

On 12/27/19 2:34 AM, Kohei KaiGai wrote:

This crash was reproduced in our environment also.
It looks to me like adjust_child_relids_multilevel() didn't expect the
case where the supplied 'relids' (partially) reference a normal,
non-partitioned relation. It tries to build a new 'parent_relids' that
is the set of appinfo->parent_relid entries related to the supplied
'child_relids'. However, the bits in child_relids that reference normal
relations are unintentionally dropped here. Then
adjust_child_relids_multilevel() goes into an infinite recursion until
it hits the stack limit.

The attached v2 fixes the problem, and the regression tests now finish
correctly.

Any thoughts on the new version of this patch?

Regards,
--
-David
david@pgmasters.net

#8 Daniel Gustafsson
daniel@yesql.se
In reply to: KaiGai Kohei (#6)
Re: Asymmetric partition-wise JOIN

On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:

The attached v2 fixed the problem, and regression test finished correctly.

This patch no longer applies to HEAD; please submit a rebased version.
Marking the entry Waiting on Author in the meantime.

cheers ./daniel

#9 Andrei Lepikhov
lepihov@gmail.com
In reply to: KaiGai Kohei (#6)
Re: Asymmetric partition-wise JOIN

On 12/27/19 12:34 PM, Kohei KaiGai wrote:

The attached v2 fixed the problem, and regression test finished correctly.

Using your patch I saw an incorrect value for the predicted rows at the
top node of the plan: "Append (cost=270.02..35165.37 rows=40004 width=16)".
See the full explain of the query plan in the attachment
explain_with_asymmetric.sql.

If I disable enable_partitionwise_join, then:
"Hash Join (cost=270.02..38855.25 rows=10001 width=16)"
Full explain: explain_no_asymmetric.sql.

I thought this was a case of incorrect usage of cached values of
norm_selec, but it is a corner-case problem in the eqjoinsel() routine:

selectivity = 1/size_of_larger_relation; (selfuncs.c:2567)
tuples = selectivity * outer_tuples * inner_tuples; (costsize.c:4607)

i.e. the number of tuples depends only on the size of the smaller
relation. It is not a bug in your patch, but I think you need to know
about it because it may affect the planner's decision.
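The arithmetic behind that corner case can be restated numerically
(illustrative Python; the row counts roughly mirror the test case below
and the two quoted source lines, not the exact planner code path):

```python
outer_tuples = 1_000_000     # rows in the partitioned "parts" table
inner_tuples = 10_001        # rows in the plain "t0" table

selectivity = 1.0 / max(outer_tuples, inner_tuples)   # the selfuncs.c fallback
tuples = selectivity * outer_tuples * inner_tuples    # the costsize.c estimate
print(tuples)   # collapses to roughly the smaller relation's size
```

Whatever the larger side is, it cancels out, so the estimate is always
about min(outer_tuples, inner_tuples).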

===
P.S. Test case:
CREATE TABLE t0 (a serial, b int);
INSERT INTO t0 (b) (SELECT * FROM generate_series(1e4, 2e4) as g);
CREATE TABLE parts (a serial, b int) PARTITION BY HASH(a);
INSERT INTO parts (b) (SELECT * FROM generate_series(1, 1e6) as g);

--
regards,
Andrey Lepikhov
Postgres Professional

Attachments:

explain_with_asymmetric.sql (application/sql)
explain_no_asymmetric.sql (application/sql)
#10 Andrei Lepikhov
lepihov@gmail.com
In reply to: Daniel Gustafsson (#8)
Re: Asymmetric partition-wise JOIN

On 7/1/20 2:10 PM, Daniel Gustafsson wrote:

On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:

The attached v2 fixed the problem, and regression test finished correctly.

This patch no longer applies to HEAD; please submit a rebased version.
Marking the entry Waiting on Author in the meantime.

Rebased version of the patch on current master (d259afa736).

I rebased it because it is the base of my experimental feature, in which
we don't break the partition-wise join between a relation with a foreign
partition and a local relation, if we have information that the remote
server has a foreign table linked to the local relation (by analogy with
shippable extensions).

Maybe mark it as 'Needs review'?

--
regards,
Andrey Lepikhov
Postgres Professional

Attachments:

pgsql13-asymmetric-partitionwise-join.v3.patch (text/x-patch, +385/-22)
#11 Daniel Gustafsson
daniel@yesql.se
In reply to: Andrei Lepikhov (#10)
Re: Asymmetric partition-wise JOIN

On 21 Aug 2020, at 08:02, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote:

On 7/1/20 2:10 PM, Daniel Gustafsson wrote:

On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:
The attached v2 fixed the problem, and regression test finished correctly.

This patch no longer applies to HEAD; please submit a rebased version.
Marking the entry Waiting on Author in the meantime.

Rebased version of the patch on current master (d259afa736).

I rebased it because it is the base of my experimental feature, in which we don't break the partition-wise join between a relation with a foreign partition and a local relation, if we have information that the remote server has a foreign table linked to the local relation (by analogy with shippable extensions).

Maybe mark as 'Needs review'?

Thanks for the rebase, I've updated the commitfest entry to reflect that it
needs a round of review.

cheers ./daniel

#12 Amul Sul
sulamul@gmail.com
In reply to: KaiGai Kohei (#4)
Re: Asymmetric partition-wise JOIN

On Sat, Aug 24, 2019 at 2:03 PM Kohei KaiGai <kaigai@heterodb.com> wrote:

On Sat, Aug 24, 2019 at 7:02 Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai@heterodb.com> wrote:

We can consider that the join ptable X t1 above is equivalent to:
(ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result; however, the rows are already reduced by
the HashJoin under each leaf of the Append, so the CPU cycles consumed by
the Append node become cheaper.

On the other hand, it has a downside: t1 must be read three times, and
the hash table must also be built three times. This increases the
estimated cost, so the planner may not choose the asymmetric
partition-wise join plan.

What if you include the partition constraint as a filter on t1? So you get:

ptable X t1 =
(ptable_p0 X (σ hash(dist)%4=0 (t1))) +
(ptable_p1 X (σ hash(dist)%4=1 (t1))) +
(ptable_p2 X (σ hash(dist)%4=2 (t1))) +
(ptable_p3 X (σ hash(dist)%4=3 (t1)))

Pros:
1. The hash tables will not contain unnecessary junk.
2. You'll get the right answer if t1 is on the outer side of an outer join.
3. If this runs underneath a Parallel Append and t1 is big enough
then workers will hopefully cooperate and do a synchronised scan of
t1.
4. The filter might enable a selective and efficient plan like an index scan.

Cons:
1. The filter might not enable a selective and efficient plan, and
therefore cause extra work.

(It's a little weird in this example because we don't usually see hash
functions in WHERE clauses, but that could just as easily be dist
BETWEEN 1 AND 99 or any other partition constraint.)

It requires that the join key include the partition key and that the join
be an equality join, doesn't it?
If ptable and t1 are joined using ptable.dist = t1.foo, we can distribute
t1 to each leaf table with "WHERE hash(foo)%4 = xxx" according to
the partition bounds, indeed.
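Spelled out in SQL, such a manual distribution might look like this for one
leaf (a hypothetical sketch: t1, its column foo, and the modulus/remainder
values are illustrative; satisfies_hash_partition() is the built-in function
PostgreSQL itself uses in hash-partition constraints):

```sql
-- One arm of the rewritten join: ptable_p0 joined only against the slice
-- of t1 whose join key hashes into ptable_p0's partition (modulus 4,
-- remainder 0, following Thomas's example).
SELECT *
  FROM ptable_p0 p
  JOIN (SELECT *
          FROM t1
         WHERE satisfies_hash_partition('ptable'::regclass, 4, 0, foo)) s
    ON p.dist = s.foo;
```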

In the case when some of the partition leaves are pruned, it is clearly
beneficial, because rows that would only be referenced by the pruned child
relations are a waste of memory.

On the other hand, it eventually consumes an almost equivalent amount
of memory to load the inner relation if no leaves are pruned, and if we
could extend the Hash node to share the hash table with sibling join nodes.

One idea I have is that sibling HashJoins share a hash table that was built
once by any of the sibling Hash plans. Right now, it is not implemented yet.

Yeah, I've thought a little bit about that in the context of Parallel
Repartition. I'm interested in combining intra-node partitioning
(where a single plan node repartitions data among workers on the fly)
with inter-node partitioning (like PWJ, where partitions are handled
by different parts of the plan, considered at planning time); you
finish up needing to have nodes in the plan that 'receive' tuples for
each partition, to match up with the PWJ plan structure. That's not
entirely unlike CTE references, and not entirely unlike your idea of
somehow sharing the same hash table. I ran into a number of problems
while thinking about that, which I should write about in another
thread.

Hmm. Do you mean the inner path may behave differently according to the
partition-bounds definition of the outer path it is joined with?
Let me investigate its pros & cons.

The reasons why I think the idea of sharing the same hash table is reasonable
in this scenario are:
1. We can easily extend the idea for parallel optimization. A hash table on a
DSM segment, once built, can be shared by all the siblings in all the
parallel workers.
2. We can save memory regardless of the join keys and partition keys, even
if these are not involved in the query.

On the other hand, below are the downsides. Potentially, combined use of
your idea may help in these cases:
3. The distributed inner relation cannot be the outer side of an OUTER JOIN.
4. The hash table contains rows referenced only by pruned partition leaves.

+many for the sharable hash of the inner table of the join. IMHO,
this could be the most interesting and captivating thing about this feature.
But it might be a complicated piece; is that still on the plan?

Regards,
Amul

#13Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Andrei Lepikhov (#10)
Re: Asymmetric partition-wise JOIN

On 21.08.2020 09:02, Andrey V. Lepikhov wrote:

On 7/1/20 2:10 PM, Daniel Gustafsson wrote:

On 27 Dec 2019, at 08:34, Kohei KaiGai <kaigai@heterodb.com> wrote:

The attached v2 fixed the problem, and the regression test finished
correctly.

This patch no longer applies to HEAD, please submit a rebased version.
Marking the entry Waiting on Author in the meantime.

Rebased version of the patch on current master (d259afa736).

I rebased it because it is the basis of my experimental feature, so that we
don't break partitionwise join of a relation with a foreign partition
and a local relation if we have info that the remote server has a foreign
table linked to the local relation (by analogy with shippable extensions).

Maybe mark as 'Needs review'?

Status update for a commitfest entry.

According to cfbot, the patch fails to apply. Could you please send a
rebased version?

This thread was inactive for quite some time. Is anyone going to
continue working on it?

I see some interest in the idea of sharable hash, but I don't see even a
prototype in this thread. So, probably, it is a matter of a separate
discussion.

Also, I took a look at the code. It looks like it needs some extra work.
I am not a big expert in this area, so I'm sorry if questions are obvious.

1. What would happen if this assumption is not met?

+         * MEMO: We assume this pathlist keeps at least one AppendPath that
+         * represents partitioned table-scan, symmetric or asymmetric
+         * partition-wise join. It is not correct right now, however, a hook
+         * on add_path() to give additional decision for path removel allows
+         * to retain this kind of AppendPath, regardless of its cost.

2. Why do we wrap extract_asymmetric_partitionwise_subjoin() call into
PG_TRY/PG_CATCH? What errors do we expect?

3. It looks like a crutch. If it isn't, I'd like to see a better comment
about why "dynamic programming" is not applicable here.
And shouldn't we also handle a root->join_cur_level?

+                /* temporary disables "dynamic programming" algorithm */
+                root->join_rel_level = NULL;

4. This change looks like it can lead to a memory leak for old code.
Maybe it is never the case, but again I think it's worth a comment.

-    /* If there's nothing to adjust, don't call this function. */
-    Assert(nappinfos >= 1 && appinfos != NULL);
+    /* If there's nothing to adjust, just return a duplication */
+    if (nappinfos == 0)
+        return copyObject(node);

5. extract_asymmetric_partitionwise_subjoin() lacks a comment

The new status of this patch is: Waiting on Author

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#14Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Anastasia Lubennikova (#13)
Re: Asymmetric partition-wise JOIN

On 09.11.2020 13:53, Anastasia Lubennikova wrote:

Status update for a commitfest entry.

According to cfbot, the patch fails to apply. Could you please send a
rebased version?

[...]

The new status of this patch is: Waiting on Author

Status update for a commitfest entry.

This entry was inactive during this CF, so I've marked it as returned
with feedback. Feel free to resubmit an updated version to a future
commitfest.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#15Andrei Lepikhov
lepihov@gmail.com
In reply to: Anastasia Lubennikova (#14)
Re: Asymmetric partition-wise JOIN

On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:

This entry was inactive during this CF, so I've marked it as returned
with feedback. Feel free to resubmit an updated version to a future
commitfest.

The attached version is rebased on current master and fixes problems with
complex parameterized plans (the 'reparameterize by child' machinery).
Problems with the reparameterization machinery can be demonstrated with the
TPC-H benchmark.

--
regards,
Andrey Lepikhov
Postgres Professional

Attachments:

0001-Asymmetric-partitionwise-join.patch (text/x-patch, +509 -18)
#16Andrei Lepikhov
lepihov@gmail.com
In reply to: Anastasia Lubennikova (#14)
Re: Asymmetric partition-wise JOIN

On 11/30/20 7:43 PM, Anastasia Lubennikova wrote:

This entry was inactive during this CF, so I've marked it as returned
with feedback. Feel free to resubmit an updated version to a future
commitfest.

I am returning the patch to the commitfest. My current reason differs from
the original author's.
This patch can open the door to more complex optimizations in the
partitionwise join push-down technique.
I mean, we can push down a join not only of two partitioned tables with
the same partition scheme, but also of a partitioned (sharded) table with an
arbitrary subplan that is provably independent of local resources.

Example:

CREATE TABLE p(a int) PARTITION BY HASH (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE p2 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE p3 PARTITION OF p FOR VALUES WITH (MODULUS 3, REMAINDER 2);

SELECT * FROM p, (SELECT * FROM generate_series(1,2) AS a) AS s
WHERE p.a=s.a;

Hash Join
Hash Cond: (p.a = a.a)
-> Append
-> Seq Scan on p1 p_1
-> Seq Scan on p2 p_2
-> Seq Scan on p3 p_3
-> Hash
-> Function Scan on generate_series a

But with asymmetric join feature we have the plan:

Append
-> Hash Join
Hash Cond: (p_1.a = a.a)
-> Seq Scan on p1 p_1
-> Hash
-> Function Scan on generate_series a
-> Hash Join
Hash Cond: (p_2.a = a.a)
-> Seq Scan on p2 p_2
-> Hash
-> Function Scan on generate_series a
-> Hash Join
Hash Cond: (p_3.a = a.a)
-> Seq Scan on p3 p_3
-> Hash
-> Function Scan on generate_series a

In the case of FDW sharding, this means that if we can prove that the inner
relation is independent of the executing server, we can push down
these joins and execute them in parallel.

--
regards,
Andrey Lepikhov
Postgres Professional

#17Andrei Lepikhov
lepihov@gmail.com
In reply to: Andrei Lepikhov (#16)
Re: Asymmetric partition-wise JOIN

Next version of the patch.
To hunt for problems, I forced this patch on during the 'make check'
tests. Some bugs were found and fixed.

--
regards,
Andrey Lepikhov
Postgres Professional

Attachments:

v14-0001-Asymmetric-partitionwise-join.patch (text/plain, +682 -29)
#18Alexander Pyhalov
a.pyhalov@postgrespro.ru
In reply to: Andrei Lepikhov (#17)
Re: Asymmetric partition-wise JOIN

Andrey Lepikhov wrote on 2021-05-27 07:27:

Next version of the patch.
For searching any problems I forced this patch during 'make check'
tests. Some bugs were found and fixed.

Hi.
I've tested this patch and haven't found issues, but I have some
comments.

src/backend/optimizer/path/joinrels.c:

/*
 * Build RelOptInfo on JOIN of each partition of the outer relation and the
 * inner relation. Return List of such RelOptInfo's. Return NIL, if at
 * least one of these JOINs are impossible to build.
 */
static List *
extract_asymmetric_partitionwise_subjoin(PlannerInfo *root,
                                         RelOptInfo *joinrel,
                                         AppendPath *append_path,
                                         RelOptInfo *inner_rel,
                                         JoinType jointype,
                                         JoinPathExtraData *extra)
{
    List       *result = NIL;
    ListCell   *lc;

    foreach (lc, append_path->subpaths)
    {
        Path       *child_path = lfirst(lc);
        RelOptInfo *child_rel = child_path->parent;
        Relids      child_join_relids;
        Relids      parent_relids;
        RelOptInfo *child_join_rel;
        SpecialJoinInfo *child_sjinfo;
        List       *child_restrictlist;

Variable names - child_join_rel and child_join_relids seem to be
inconsistent with the rest of the file (I see child_joinrelids in
try_partitionwise_join(), and child_joinrel in try_partitionwise_join()
and get_matching_part_pairs()).

        child_join_rel = build_child_join_rel(root,
                                              child_rel,
                                              inner_rel,
                                              joinrel,
                                              child_restrictlist,
                                              child_sjinfo,
                                              jointype);
        if (!child_join_rel)
        {
            /*
             * If can't build JOIN between inner relation and one of the
             * outer partitions - return immediately.
             */
            return NIL;
        }

When can build_child_join_rel() return NULL?
If I read the code correctly, joinrel is created at the beginning of
build_child_join_rel() with makeNode(); makeNode() wraps newNode(), and
newNode() uses MemoryContextAllocZero()/MemoryContextAllocZeroAligned(),
which would error() on allocation failure.

static bool
is_asymmetric_join_capable(PlannerInfo *root,
                           RelOptInfo *outer_rel,
                           RelOptInfo *inner_rel,
                           JoinType jointype)
{

Function misses a comment.

    /*
     * Don't allow asymmetric JOIN of two append subplans.
     * In the case of a parameterized NL join, a reparameterization procedure
     * will lead to large memory allocations and a CPU consumption:
     * each reparameterize will induce subpath duplication, creating new
     * ParamPathInfo instance and increasing of ppilist up to number of
     * partitions in the inner. Also, if we have many partitions, each
     * bitmapset variable will large and many leaks of such variable
     * (caused by relid replacement) will highly increase memory consumption.
     * So, we deny such paths for now.
     */

Missing word:
each bitmapset variable will large => each bitmapset variable will be
large

    foreach (lc, outer_rel->pathlist)
    {
        AppendPath *append_path = lfirst(lc);

        /*
         * MEMO: We assume this pathlist keeps at least one AppendPath that
         * represents partitioned table-scan, symmetric or asymmetric
         * partition-wise join. It is not correct right now, however, a hook
         * on add_path() to give additional decision for path removal allows
         * to retain this kind of AppendPath, regardless of its cost.
         */
        if (IsA(append_path, AppendPath))

What hook do you refer to?

src/backend/optimizer/plan/setrefs.c:

    /*
     * Adjust RT indexes of AppendRelInfos and add to final appendrels list.
     * We assume the AppendRelInfos were built during planning and don't need
     * to be copied.
     */
    foreach(lc, root->append_rel_list)
    {
        AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
        AppendRelInfo *newappinfo;

        /* flat copy is enough since all valuable fields are scalars */
        newappinfo = (AppendRelInfo *) palloc(sizeof(AppendRelInfo));
        memcpy(newappinfo, appinfo, sizeof(AppendRelInfo));

You've changed function to copy appinfo, so now comment is incorrect.

src/backend/optimizer/util/appendinfo.c:
    /* Construct relids set for the immediate parent of the given child. */
    normal_relids = bms_copy(child_relids);
    for (cnt = 0; cnt < nappinfos; cnt++)
    {
        AppendRelInfo *appinfo = appinfos[cnt];

        parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
        normal_relids = bms_del_member(normal_relids, appinfo->child_relid);
    }
    parent_relids = bms_union(parent_relids, normal_relids);

Do I understand correctly that now parent_relids also contains relids of
relations from the 'global' inner relation, which we join to the children?

--
Best regards,
Alexander Pyhalov,
Postgres Professional

#19Andrei Lepikhov
lepihov@gmail.com
In reply to: Alexander Pyhalov (#18)
Re: Asymmetric partition-wise JOIN

On 18/6/21 15:02, Alexander Pyhalov wrote:

Andrey Lepikhov wrote on 2021-05-27 07:27:

Next version of the patch.
For searching any problems I forced this patch during 'make check'
tests. Some bugs were found and fixed.

Hi.
I've tested this patch and haven't found issues, but I have some comments.

Thank you for the review!

Variable names - child_join_rel and child_join_relids seem to be
inconsistent with rest of the file

fixed

When build_child_join_rel() can return NULL?

Fixed

Missing word:
each bitmapset variable will large => each bitmapset variable will be large

Fixed

What hook do you refer to?

Removed

You've changed function to copy appinfo, so now comment is incorrect.

Thanks, fixed

Do I understand correctly that now parent_relids also contains relids of
relations from 'global' inner relation, which we join to childs?

Yes

--
regards,
Andrey Lepikhov
Postgres Professional

Attachments:

v15-0001-Asymmetric-partitionwise-join.patch (text/plain, +672 -31)
#20Zhihong Yu
zyu@yugabyte.com
In reply to: Andrei Lepikhov (#19)
Re: Asymmetric partition-wise JOIN

On Mon, Jul 5, 2021 at 2:57 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru>
wrote:

[...]

Hi,

relations because it could cause CPU and memory huge consumption
during reparameterization of NestLoop path.

CPU and memory huge consumption -> huge consumption of CPU and memory

+ * relation. Return List of such RelOptInfo's. Return NIL, if at least one of
+ * these JOINs are impossible to build.

at least one of these JOINs are impossible to build. -> at least one of these
JOINs is impossible to build.

+            * Can't imagine situation when join relation already exists. But
+            * in the 'partition_join' regression test it happens.
+            * It may be an indicator of possible problems.

Should a log message be added in the above case?

+is_asymmetric_join_capable(PlannerInfo *root,

is_asymmetric_join_capable -> is_asymmetric_join_feasible

Cheers

#21Andrei Lepikhov
lepihov@gmail.com
In reply to: Zhihong Yu (#20)
#22Alexander Pyhalov
a.pyhalov@postgrespro.ru
In reply to: Andrei Lepikhov (#21)
#23Andrei Lepikhov
lepihov@gmail.com
In reply to: Zhihong Yu (#20)
#24Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Andrei Lepikhov (#23)
#25Aleksander Alekseev
aleksander@timescale.com
In reply to: Ibrar Ahmed (#24)
#26Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Aleksander Alekseev (#25)
#27Andrei Lepikhov
lepihov@gmail.com
In reply to: Jaime Casanova (#26)
#28Andrei Lepikhov
lepihov@gmail.com
In reply to: Andrei Lepikhov (#27)
#29Alexander Pyhalov
a.pyhalov@postgrespro.ru
In reply to: Andrei Lepikhov (#28)
#30Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Pyhalov (#29)
#31Andrei Lepikhov
lepihov@gmail.com
In reply to: Alexander Korotkov (#30)
#32Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrei Lepikhov (#31)
#33Andrei Lepikhov
lepihov@gmail.com
In reply to: Alexander Korotkov (#32)
#34Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Andrei Lepikhov (#33)
#35Andrei Lepikhov
lepihov@gmail.com
In reply to: Ashutosh Bapat (#34)
#36Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Andrei Lepikhov (#35)
#37Andrei Lepikhov
lepihov@gmail.com
In reply to: Ashutosh Bapat (#36)
#38Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Andrei Lepikhov (#37)
#39Andrei Lepikhov
lepihov@gmail.com
In reply to: Ashutosh Bapat (#38)
#40Andrei Lepikhov
lepihov@gmail.com
In reply to: Alexander Korotkov (#32)
#41Andrei Lepikhov
lepihov@gmail.com
In reply to: Ashutosh Bapat (#38)
#42Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrei Lepikhov (#40)
#43Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrei Lepikhov (#41)
#44Andrei Lepikhov
lepihov@gmail.com
In reply to: Alexander Korotkov (#42)