d25ea01275 and partitionwise join

Started by Amit Langotealmost 7 years ago28 messageshackers
Jump to latest
#1Amit Langote
Langote_Amit_f8@lab.ntt.co.jp

Hi Tom,

I think an assumption of d25ea01275 breaks partitionwise join. Sorry
it took me a while to report it.

In /messages/by-id/8168.1560446056@sss.pgh.pa.us, Tom wrote:

I poked into this and found the cause. For the sample query, we have
an EquivalenceClass containing the expression
COALESCE(COALESCE(Var_1_1, Var_2_1), Var_3_1)
where each of the Vars belongs to an appendrel parent.
add_child_rel_equivalences() needs to add expressions representing the
transform of that to each child relation. That is, if the children
of table 1 are A1 and A2, of table 2 are B1 and B2, and of table 3
are C1 and C2, what we'd like to add are the expressions
COALESCE(COALESCE(Var_A1_1, Var_2_1), Var_3_1)
COALESCE(COALESCE(Var_A2_1, Var_2_1), Var_3_1)
COALESCE(COALESCE(Var_1_1, Var_B1_1), Var_3_1)
COALESCE(COALESCE(Var_1_1, Var_B2_1), Var_3_1)
COALESCE(COALESCE(Var_1_1, Var_2_1), Var_C1_1)
COALESCE(COALESCE(Var_1_1, Var_2_1), Var_C2_1)
However, what it's actually producing is additional combinations for
each appendrel after the first, because each call also mutates the
previously-added child expressions. So in this example we also get
COALESCE(COALESCE(Var_A1_1, Var_B1_1), Var_3_1)
COALESCE(COALESCE(Var_A2_1, Var_B2_1), Var_3_1)
COALESCE(COALESCE(Var_A1_1, Var_2_1), Var_C1_1)
COALESCE(COALESCE(Var_A2_1, Var_2_1), Var_C2_1)
COALESCE(COALESCE(Var_A1_1, Var_B1_1), Var_C1_1)
COALESCE(COALESCE(Var_A2_1, Var_B2_1), Var_C2_1)
With two appendrels involved, that's O(N^2) expressions; with
three appendrels, more like O(N^3).

This is by no means specific to FULL JOINs; you could get the same
behavior with join clauses like "WHERE t1.a + t2.b + t3.c = t4.d".

These extra expressions don't have any use, since we're not
going to join the children directly to each other.

...unless partition wise join thinks they can be joined. Partition
wise join can't handle 3-way full joins today, but only because it's
broken itself when trying to match a full join clause to the partition
key due to one side being a COALESCE expression. Consider this
example query:

-- p is defined as:
-- create table p (a int) partition by list (a);
-- create table p1 partition of p for values in (1);
-- create table p2 partition of p for values in (2);
explain select * from p t1 full outer join p t2 using (a) full outer
join p t3 using (a) full outer join p t4 using (a) order by 1;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sort (cost=16416733.32..16628145.85 rows=84565012 width=4)
Sort Key: (COALESCE(COALESCE(COALESCE(t1.a, t2.a), t3.a), t4.a))
-> Merge Full Join (cost=536957.40..1813748.77 rows=84565012 width=4)
Merge Cond: (t4.a = (COALESCE(COALESCE(t1.a, t2.a), t3.a)))
-> Sort (cost=410.57..423.32 rows=5100 width=4)
Sort Key: t4.a
-> Append (cost=0.00..96.50 rows=5100 width=4)
-> Seq Scan on p1 t4 (cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on p2 t4_1 (cost=0.00..35.50
rows=2550 width=4)
-> Materialize (cost=536546.83..553128.21 rows=3316275 width=12)
-> Sort (cost=536546.83..544837.52 rows=3316275 width=12)
Sort Key: (COALESCE(COALESCE(t1.a, t2.a), t3.a))
-> Merge Full Join (cost=14254.85..64024.48
rows=3316275 width=12)
Merge Cond: (t3.a = (COALESCE(t1.a, t2.a)))
-> Sort (cost=410.57..423.32 rows=5100 width=4)
Sort Key: t3.a
-> Append (cost=0.00..96.50
rows=5100 width=4)
-> Seq Scan on p1 t3
(cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on p2 t3_1
(cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=13844.29..14169.41
rows=130050 width=8)
Sort Key: (COALESCE(t1.a, t2.a))
-> Merge Full Join
(cost=821.13..2797.38 rows=130050 width=8)
Merge Cond: (t1.a = t2.a)
-> Sort (cost=410.57..423.32
rows=5100 width=4)
Sort Key: t1.a
-> Append
(cost=0.00..96.50 rows=5100 width=4)
-> Seq Scan on p1
t1 (cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on p2
t1_1 (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=410.57..423.32
rows=5100 width=4)
Sort Key: t2.a
-> Append
(cost=0.00..96.50 rows=5100 width=4)
-> Seq Scan on p1
t2 (cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on p2
t2_1 (cost=0.00..35.50 rows=2550 width=4)

-- turn on enable_partitionwise_join
set enable_partitionwise_join to on;
explain select * from p t1 full outer join p t2 using (a) full outer
join p t3 using (a) full outer join p t4 using (a) order by 1;
QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sort (cost=16385259.94..16596672.47 rows=84565012 width=4)
Sort Key: (COALESCE(COALESCE(COALESCE(t1.a, t2.a), t3.a), t4.a))
-> Merge Full Join (cost=505484.02..1782275.39 rows=84565012 width=4)
Merge Cond: (t4.a = (COALESCE(COALESCE(t1.a, t2.a), t3.a)))
-> Sort (cost=410.57..423.32 rows=5100 width=4)
Sort Key: t4.a
-> Append (cost=0.00..96.50 rows=5100 width=4)
-> Seq Scan on p1 t4 (cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on p2 t4_1 (cost=0.00..35.50
rows=2550 width=4)
-> Materialize (cost=505073.45..521654.83 rows=3316275 width=12)
-> Sort (cost=505073.45..513364.14 rows=3316275 width=12)
Sort Key: (COALESCE(COALESCE(t1.a, t2.a), t3.a))
-> Merge Full Join (cost=7653.92..32551.10
rows=3316275 width=12)
Merge Cond: (t3.a = (COALESCE(t1.a, t2.a)))
-> Sort (cost=410.57..423.32 rows=5100 width=4)
Sort Key: t3.a
-> Append (cost=0.00..96.50
rows=5100 width=4)
-> Seq Scan on p1 t3
(cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on p2 t3_1
(cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=7243.35..7405.91 rows=65024 width=8)
Sort Key: (COALESCE(t1.a, t2.a))
-> Result (cost=359.57..2045.11
rows=65024 width=8)
-> Append
(cost=359.57..2045.11 rows=65024 width=8)
-> Merge Full Join
(cost=359.57..860.00 rows=32512 width=8)
Merge Cond: (t1.a = t2.a)
-> Sort
(cost=179.78..186.16 rows=2550 width=4)
Sort Key: t1.a
-> Seq Scan
on p1 t1 (cost=0.00..35.50 rows=2550 width=4)
-> Sort
(cost=179.78..186.16 rows=2550 width=4)
Sort Key: t2.a
-> Seq Scan
on p1 t2 (cost=0.00..35.50 rows=2550 width=4)
-> Merge Full Join
(cost=359.57..860.00 rows=32512 width=8)
Merge Cond: (t1_1.a = t2_1.a)
-> Sort
(cost=179.78..186.16 rows=2550 width=4)
Sort Key: t1_1.a
-> Seq Scan
on p2 t1_1 (cost=0.00..35.50 rows=2550 width=4)
-> Sort
(cost=179.78..186.16 rows=2550 width=4)
Sort Key: t2_1.a
-> Seq Scan
on p2 t2_1 (cost=0.00..35.50 rows=2550 width=4)

See how it only managed to use partition wise join up to 2-way join,
but gives up at 3-way join and higher, because the join condition
looks like this: t3.a = (COALESCE(t1.a, t2.a). When building the join
relation (t1, t2, t3) between (t3) and (t1, t2), it fails to see that
COALESCE(t1.a, t2.a) actually matches the partition key of (t1, t2).
When I fix the code that does the matching and run with merge joins
disabled, I can get a plan where the whole 4-way join is partitioned:

explain select * from p t1 full outer join p t2 using (a) full outer
join p t3 using (a) full outer join p t4 using (a) order by 1;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────
Gather Merge (cost=831480.11..1859235.87 rows=8808720 width=4)
Workers Planned: 2
-> Sort (cost=830480.09..841490.99 rows=4404360 width=4)
Sort Key: (COALESCE(COALESCE(COALESCE(t1.a, t2.a), t3.a), t4.a))
-> Parallel Append (cost=202.12..224012.93 rows=4404360 width=4)
-> Hash Full Join (cost=202.12..201991.13 rows=5285232 width=4)
Hash Cond: (COALESCE(COALESCE(t1.a, t2.a), t3.a) = t4.a)
-> Hash Full Join (cost=134.75..15904.32
rows=414528 width=12)
Hash Cond: (COALESCE(t1.a, t2.a) = t3.a)
-> Hash Full Join (cost=67.38..1247.18
rows=32512 width=8)
Hash Cond: (t1.a = t2.a)
-> Seq Scan on p1 t1
(cost=0.00..35.50 rows=2550 width=4)
-> Hash (cost=35.50..35.50 rows=2550 width=4)
-> Seq Scan on p1 t2
(cost=0.00..35.50 rows=2550 width=4)
-> Hash (cost=35.50..35.50 rows=2550 width=4)
-> Seq Scan on p1 t3
(cost=0.00..35.50 rows=2550 width=4)
-> Hash (cost=35.50..35.50 rows=2550 width=4)
-> Seq Scan on p1 t4 (cost=0.00..35.50
rows=2550 width=4)
-> Hash Full Join (cost=202.12..201991.13 rows=5285232 width=4)
Hash Cond: (COALESCE(COALESCE(t1_1.a, t2_1.a),
t3_1.a) = t4_1.a)
-> Hash Full Join (cost=134.75..15904.32
rows=414528 width=12)
Hash Cond: (COALESCE(t1_1.a, t2_1.a) = t3_1.a)
-> Hash Full Join (cost=67.38..1247.18
rows=32512 width=8)
Hash Cond: (t1_1.a = t2_1.a)
-> Seq Scan on p2 t1_1
(cost=0.00..35.50 rows=2550 width=4)
-> Hash (cost=35.50..35.50 rows=2550 width=4)
-> Seq Scan on p2 t2_1
(cost=0.00..35.50 rows=2550 width=4)
-> Hash (cost=35.50..35.50 rows=2550 width=4)
-> Seq Scan on p2 t3_1
(cost=0.00..35.50 rows=2550 width=4)
-> Hash (cost=35.50..35.50 rows=2550 width=4)
-> Seq Scan on p2 t4_1 (cost=0.00..35.50
rows=2550 width=4)
(31 rows)

But with merge joins enabled:

explain select * from p t1 full outer join p t2 using (a) full outer
join p t3 using (a) full outer join p t4 using (a) order by 1;
ERROR: could not find pathkey item to sort

That's because, there's no child COALESCE(t1_1.a, t2_1.a) expression
in the EC that contains COALESCE(t1.a, t2.a), where t1_1 and t2_1
represent the 1st partition of t1 and t2, resp. The problem is that
add_child_rel_equivalences(), as of d25ea01275, only adds the
following child expressions of COALESCE(t1.a, t2.a):

-- when translating t1
COALESCE(t1_1.a, t2.a)
COALESCE(t1_2.a, t2.a)
-- when translating t2
COALESCE(t1.a, t2_1.a)
COALESCE(t1.a, t2_2.a)

whereas previously, the following would be added too when translating t2:

COALESCE(t1_1.a, t2_1.a)
COALESCE(t1_1.a, t2_2.a)
COALESCE(t1_2.a, t2_1.a)
COALESCE(t1_2.a, t2_2.a)

Note that of those, only COALESCE(t1_1.a, t2_1.a) and COALESCE(t1_2.a,
t2_2.a) are interesting, because partition wise join will only ever
consider pairs (t1_1, t2_1) and (t1_2, t2_2) to be joined.

We can get the needed child expressions and still avoid the
combinatorial explosion in the size of resulting EC members list if we
taught add_child_rel_equivalences() to only translate ECs that the
input parent relation is capable of producing. So, COALESCE(t1.a,
t2.a) will not be translated if the input relation is only (t1) or
(t2), that is, when called from set_append_rel_size(). Instead it
would be translated if it's passed the joinrel (t1, t2). IOW, teach
build_child_join_rel() to call add_child_rel_equivalences(), which
I've tried to implement in the attached.

I have attached two patches.

0001 - fix partitionwise join to work correctly with n-way joins of
which some are full joins (+ cosmetic improvements around the code
that was touched)
0002 - fix to translate multi-relation EC members correctly

Thanks,
Amit

Attachments:

0001-Fix-partitionwise-join-code-to-handle-FULL-OUTER-JOI.patchapplication/octet-stream; name=0001-Fix-partitionwise-join-code-to-handle-FULL-OUTER-JOI.patchDownload+408-78
0002-Translate-multi-relation-EC-members-in-a-separate-pa.patchapplication/octet-stream; name=0002-Translate-multi-relation-EC-members-in-a-separate-pa.patchDownload+285-29
#2Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#1)
Re: d25ea01275 and partitionwise join

On Tue, Jul 2, 2019 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote:

0001 - fix partitionwise join to work correctly with n-way joins of
which some are full joins (+ cosmetic improvements around the code
that was touched)

Here are my comments about the cosmetic improvements: they seem pretty
large to me, so I'd make a separate patch for that. In addition, I'd
move have_partkey_equi_join() and match_expr_to_partition_keys() to
relnode.c, because these functions are only used in that file.

Best regards,
Etsuro Fujita

#3Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#2)
Re: d25ea01275 and partitionwise join

Fujita-san,

Thanks for looking at this.

On Tue, Jul 16, 2019 at 8:22 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Tue, Jul 2, 2019 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote:

0001 - fix partitionwise join to work correctly with n-way joins of
which some are full joins (+ cosmetic improvements around the code
that was touched)

Here are my comments about the cosmetic improvements: they seem pretty
large to me, so I'd make a separate patch for that.

OK, my bad that I added so many cosmetic changes into a patch that is
meant to fix the main issue. Just to clarify, I'm proposing these
cosmetic improvements to better clarify the terminological separation
between nullable and non-nullable partition keys, which I found a bit
hard to understand as is.

I've broken the patch into two: 0001 contains only cosmetic changes
and 0002 the fix for handling full joins properly. Would you rather
that be reversed?

In addition, I'd
move have_partkey_equi_join() and match_expr_to_partition_keys() to
relnode.c, because these functions are only used in that file.

I hadn't noticed that. Makes sense to move them to relnode.c, which
is implemented in 0001.

Thanks,
Amit

Attachments:

v2-0001-Some-cosmetic-improvements-to-partitionwise-join-.patchapplication/octet-stream; name=v2-0001-Some-cosmetic-improvements-to-partitionwise-join-.patchDownload+271-224
v2-0002-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patchapplication/octet-stream; name=v2-0002-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patchDownload+299-17
v2-0003-Add-multi-relation-EC-child-members-in-a-separate.patchapplication/octet-stream; name=v2-0003-Add-multi-relation-EC-child-members-in-a-separate.patchDownload+287-29
#4Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#3)
Re: d25ea01275 and partitionwise join

On Thu, Jul 18, 2019 at 11:18 AM Amit Langote <amitlangote09@gmail.com> wrote:

On Tue, Jul 16, 2019 at 8:22 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Tue, Jul 2, 2019 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote:

0001 - fix partitionwise join to work correctly with n-way joins of
which some are full joins (+ cosmetic improvements around the code
that was touched)

Here are my comments about the cosmetic improvements: they seem pretty
large to me, so I'd make a separate patch for that.

OK, my bad that I added so many cosmetic changes into a patch that is
meant to fix the main issue. Just to clarify, I'm proposing these
cosmetic improvements to better clarify the terminological separation
between nullable and non-nullable partition keys, which I found a bit
hard to understand as is.

OK, thanks for the explanation!

I've broken the patch into two: 0001 contains only cosmetic changes
and 0002 the fix for handling full joins properly. Would you rather
that be reversed?

I like this order.

In addition, I'd
move have_partkey_equi_join() and match_expr_to_partition_keys() to
relnode.c, because these functions are only used in that file.

I hadn't noticed that. Makes sense to move them to relnode.c, which
is implemented in 0001.

Thanks for including that! Will review.

Best regards,
Etsuro Fujita

#5Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#4)
Re: d25ea01275 and partitionwise join

Fujita-san,

On Thu, Jul 18, 2019 at 8:10 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Thu, Jul 18, 2019 at 11:18 AM Amit Langote <amitlangote09@gmail.com> wrote:

On Tue, Jul 16, 2019 at 8:22 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Tue, Jul 2, 2019 at 6:29 PM Amit Langote <amitlangote09@gmail.com> wrote:

0001 - fix partitionwise join to work correctly with n-way joins of
which some are full joins (+ cosmetic improvements around the code
that was touched)

Here are my comments about the cosmetic improvements: they seem pretty
large to me, so I'd make a separate patch for that.

OK, my bad that I added so many cosmetic changes into a patch that is
meant to fix the main issue. Just to clarify, I'm proposing these
cosmetic improvements to better clarify the terminological separation
between nullable and non-nullable partition keys, which I found a bit
hard to understand as is.

OK, thanks for the explanation!

I've broken the patch into two: 0001 contains only cosmetic changes
and 0002 the fix for handling full joins properly. Would you rather
that be reversed?

I like this order.

In addition, I'd
move have_partkey_equi_join() and match_expr_to_partition_keys() to
relnode.c, because these functions are only used in that file.

I hadn't noticed that. Makes sense to move them to relnode.c, which
is implemented in 0001.

Thanks for including that! Will review.

To avoid losing track of this, I've added this to November CF.

https://commitfest.postgresql.org/25/2278/

I know there is one more patch beside the partitionwise join fix, but
I've set the title to suggest that this is related mainly to
partitionwise joins.

Thanks,
Amit

#6Richard Guo
guofenglinux@gmail.com
In reply to: Amit Langote (#5)
Re: d25ea01275 and partitionwise join

Hi Amit,

On Wed, Sep 4, 2019 at 10:01 AM Amit Langote <amitlangote09@gmail.com>
wrote:

Fujita-san,

To avoid losing track of this, I've added this to November CF.

https://commitfest.postgresql.org/25/2278/

I know there is one more patch beside the partitionwise join fix, but
I've set the title to suggest that this is related mainly to
partitionwise joins.

Thank you for working on this. Currently partitionwise join does not
take COALESCE expr into consideration when matching to partition keys.
This is a problem.

BTW, a rebase is needed for the patch set.

Thanks
Richard

#7Richard Guo
guofenglinux@gmail.com
In reply to: Richard Guo (#6)
Re: d25ea01275 and partitionwise join

Hi Amit,

On Wed, Sep 4, 2019 at 3:30 PM Richard Guo <riguo@pivotal.io> wrote:

Hi Amit,

On Wed, Sep 4, 2019 at 10:01 AM Amit Langote <amitlangote09@gmail.com>
wrote:

Fujita-san,

To avoid losing track of this, I've added this to November CF.

https://commitfest.postgresql.org/25/2278/

I know there is one more patch beside the partitionwise join fix, but
I've set the title to suggest that this is related mainly to
partitionwise joins.

Thank you for working on this. Currently partitionwise join does not
take COALESCE expr into consideration when matching to partition keys.
This is a problem.

BTW, a rebase is needed for the patch set.

I'm reviewing v2-0002 and I have concern about how COALESCE expr is
processed in match_join_arg_to_partition_keys().

If there is a COALESCE expr with first arg being non-partition key expr
and second arg being partition key, the patch would match it to the
partition key, which may result in wrong results in some cases.

For instance, consider the partition table below:

create table p (k int, val int) partition by range(k);
create table p_1 partition of p for values from (1) to (10);
create table p_2 partition of p for values from (10) to (100);

So with patch v2-0002, the following query will be planned with
partitionwise join.

# explain (costs off)
select * from (p as t1 full join p as t2 on t1.k = t2.k) as
t12(k1,val1,k2,val2)
full join p as t3 on COALESCE(t12.val1, t12.k1)
= t3.k;
QUERY PLAN
----------------------------------------------------------
Append
-> Hash Full Join
Hash Cond: (COALESCE(t1.val, t1.k) = t3.k)
-> Hash Full Join
Hash Cond: (t1.k = t2.k)
-> Seq Scan on p_1 t1
-> Hash
-> Seq Scan on p_1 t2
-> Hash
-> Seq Scan on p_1 t3
-> Hash Full Join
Hash Cond: (COALESCE(t1_1.val, t1_1.k) = t3_1.k)
-> Hash Full Join
Hash Cond: (t1_1.k = t2_1.k)
-> Seq Scan on p_2 t1_1
-> Hash
-> Seq Scan on p_2 t2_1
-> Hash
-> Seq Scan on p_2 t3_1
(19 rows)

But as t1.val is not a partition key, actually we cannot use
partitionwise join here.

If we insert below data into the table, we will get wrong results for
the query above.

insert into p select 5,15;
insert into p select 15,5;

Thanks
Richard

#8Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Richard Guo (#6)
Re: d25ea01275 and partitionwise join

Hello Richard,

On Wed, Sep 4, 2019 at 4:30 PM Richard Guo <riguo@pivotal.io> wrote:

Hi Amit,

On Wed, Sep 4, 2019 at 10:01 AM Amit Langote <amitlangote09@gmail.com> wrote:

Fujita-san,

To avoid losing track of this, I've added this to November CF.

https://commitfest.postgresql.org/25/2278/

I know there is one more patch beside the partitionwise join fix, but
I've set the title to suggest that this is related mainly to
partitionwise joins.

Thank you for working on this. Currently partitionwise join does not
take COALESCE expr into consideration when matching to partition keys.
This is a problem.

BTW, a rebase is needed for the patch set.

Thanks a lot for looking at this.

I tried rebasing today and found that adopting this patch to the
following recent commit to equivalence processing code would take some
time that I don't currently have.

commit 3373c7155350cf6fcd51dd090f29e1332901e329
Author: David Rowley <drowley@postgresql.org>
Date: Sun Jul 21 17:30:58 2019 +1200

Speed up finding EquivalenceClasses for a given set of rels

I will come back to this in a couple of weeks, along with addressing
your other comments.

Thanks,
Amit

#9Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Richard Guo (#7)
Re: d25ea01275 and partitionwise join

Hi Richard,

Thanks a lot for taking a close look at the patch and sorry about the delay.

On Wed, Sep 4, 2019 at 5:29 PM Richard Guo <riguo@pivotal.io> wrote:

On Wed, Sep 4, 2019 at 10:01 AM Amit Langote <amitlangote09@gmail.com> wrote:

I'm reviewing v2-0002 and I have concern about how COALESCE expr is
processed in match_join_arg_to_partition_keys().

If there is a COALESCE expr with first arg being non-partition key expr
and second arg being partition key, the patch would match it to the
partition key, which may result in wrong results in some cases.

For instance, consider the partition table below:

create table p (k int, val int) partition by range(k);
create table p_1 partition of p for values from (1) to (10);
create table p_2 partition of p for values from (10) to (100);

So with patch v2-0002, the following query will be planned with
partitionwise join.

# explain (costs off)
select * from (p as t1 full join p as t2 on t1.k = t2.k) as t12(k1,val1,k2,val2)
full join p as t3 on COALESCE(t12.val1, t12.k1) = t3.k;
QUERY PLAN
----------------------------------------------------------
Append
-> Hash Full Join
Hash Cond: (COALESCE(t1.val, t1.k) = t3.k)
-> Hash Full Join
Hash Cond: (t1.k = t2.k)
-> Seq Scan on p_1 t1
-> Hash
-> Seq Scan on p_1 t2
-> Hash
-> Seq Scan on p_1 t3
-> Hash Full Join
Hash Cond: (COALESCE(t1_1.val, t1_1.k) = t3_1.k)
-> Hash Full Join
Hash Cond: (t1_1.k = t2_1.k)
-> Seq Scan on p_2 t1_1
-> Hash
-> Seq Scan on p_2 t2_1
-> Hash
-> Seq Scan on p_2 t3_1
(19 rows)

But as t1.val is not a partition key, actually we cannot use
partitionwise join here.

If we insert below data into the table, we will get wrong results for
the query above.

insert into p select 5,15;
insert into p select 15,5;

Good catch! It's quite wrong to use COALESCE(t12.val1, t12.k1) = t3.k
for partitionwise join as the COALESCE expression might as well output
the value of val1 which doesn't conform to partitioning.

I've fixed match_join_arg_to_partition_keys() to catch that case and
fail. Added a test case as well.

Please find attached updated patches.

Thanks,
Amit

Attachments:

v3-0003-Add-multi-relation-EC-child-members-in-a-separate.patchapplication/octet-stream; name=v3-0003-Add-multi-relation-EC-child-members-in-a-separate.patchDownload+304-33
v3-0001-Some-cosmetic-improvements-to-partitionwise-join-.patchapplication/octet-stream; name=v3-0001-Some-cosmetic-improvements-to-partitionwise-join-.patchDownload+271-224
v3-0002-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patchapplication/octet-stream; name=v3-0002-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patchDownload+370-17
#10Richard Guo
guofenglinux@gmail.com
In reply to: Amit Langote (#9)
Re: d25ea01275 and partitionwise join

On Thu, Sep 19, 2019 at 4:15 PM Amit Langote <amitlangote09@gmail.com>
wrote:

Hi Richard,

Thanks a lot for taking a close look at the patch and sorry about the
delay.

On Wed, Sep 4, 2019 at 5:29 PM Richard Guo <riguo@pivotal.io> wrote:

On Wed, Sep 4, 2019 at 10:01 AM Amit Langote <amitlangote09@gmail.com>

wrote:

I'm reviewing v2-0002 and I have concern about how COALESCE expr is
processed in match_join_arg_to_partition_keys().

If there is a COALESCE expr with first arg being non-partition key expr
and second arg being partition key, the patch would match it to the
partition key, which may result in wrong results in some cases.

For instance, consider the partition table below:

create table p (k int, val int) partition by range(k);
create table p_1 partition of p for values from (1) to (10);
create table p_2 partition of p for values from (10) to (100);

So with patch v2-0002, the following query will be planned with
partitionwise join.

# explain (costs off)
select * from (p as t1 full join p as t2 on t1.k = t2.k) as

t12(k1,val1,k2,val2)

full join p as t3 on COALESCE(t12.val1,

t12.k1) = t3.k;

QUERY PLAN
----------------------------------------------------------
Append
-> Hash Full Join
Hash Cond: (COALESCE(t1.val, t1.k) = t3.k)
-> Hash Full Join
Hash Cond: (t1.k = t2.k)
-> Seq Scan on p_1 t1
-> Hash
-> Seq Scan on p_1 t2
-> Hash
-> Seq Scan on p_1 t3
-> Hash Full Join
Hash Cond: (COALESCE(t1_1.val, t1_1.k) = t3_1.k)
-> Hash Full Join
Hash Cond: (t1_1.k = t2_1.k)
-> Seq Scan on p_2 t1_1
-> Hash
-> Seq Scan on p_2 t2_1
-> Hash
-> Seq Scan on p_2 t3_1
(19 rows)

But as t1.val is not a partition key, actually we cannot use
partitionwise join here.

If we insert below data into the table, we will get wrong results for
the query above.

insert into p select 5,15;
insert into p select 15,5;

Good catch! It's quite wrong to use COALESCE(t12.val1, t12.k1) = t3.k
for partitionwise join as the COALESCE expression might as well output
the value of val1 which doesn't conform to partitioning.

I've fixed match_join_arg_to_partition_keys() to catch that case and
fail. Added a test case as well.

Please find attached updated patches.

Thank you for the fix. Will review.

Thanks
Richard

#11Justin Pryzby
pryzby@telsasoft.com
In reply to: Amit Langote (#9)
Re: d25ea01275 and partitionwise join

On Thu, Sep 19, 2019 at 05:15:37PM +0900, Amit Langote wrote:

Please find attached updated patches.

Tom pointed me to this thread, since we hit it in 12.0
/messages/by-id/16802.1570989962@sss.pgh.pa.us

I can't say much about the patch; there's a little typo:
"The nullability of inner relation keys prevents them to"
..should say "prevent them from".

In order to compile it against REL12, I tried to cherry-pick this one:
3373c715: Speed up finding EquivalenceClasses for a given set of rels

But then it crashes in check-world (possibly due to misapplied hunks).

--
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581

#12Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#11)
Re: d25ea01275 and partitionwise join

On Sun, Oct 13, 2019 at 03:02:17PM -0500, Justin Pryzby wrote:

On Thu, Sep 19, 2019 at 05:15:37PM +0900, Amit Langote wrote:

Please find attached updated patches.

Tom pointed me to this thread, since we hit it in 12.0
/messages/by-id/16802.1570989962@sss.pgh.pa.us

I can't say much about the patch; there's a little typo:
"The nullability of inner relation keys prevents them to"
..should say "prevent them from".

In order to compile it against REL12, I tried to cherry-pick this one:
3373c715: Speed up finding EquivalenceClasses for a given set of rels

But then it crashes in check-world (possibly due to misapplied hunks).

I did it again paying more attention and got it to pass.

The PWJ + FULL JOIN query seems ok now.

But I'll leave PWJ disabled until I can look more closely.

$ PGOPTIONS='-c max_parallel_workers_per_gather=0 -c enable_mergejoin=off -c enable_hashagg=off -c enable_partitionwise_join=on' psql postgres -f tmp/sql-2019-10-11.1
SET
Nested Loop (cost=80106964.13..131163200.28 rows=2226681567 width=6)
Join Filter: ((s.site_location = ''::text) OR ((s.site_office)::integer = ((COALESCE(t1.site_id, t2.site_id))::integer)))
-> Group (cost=80106964.13..80837945.46 rows=22491733 width=12)
Group Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer)
-> Merge Append (cost=80106964.13..80613028.13 rows=22491733 width=12)
Sort Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer)
-> Group (cost=25494496.54..25633699.28 rows=11136219 width=12)
Group Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer)
-> Sort (cost=25494496.54..25522337.09 rows=11136219 width=12)
Sort Key: (COALESCE(t1.start_time, t2.start_time)), ((COALESCE(t1.site_id, t2.site_id))::integer)
-> Hash Full Join (cost=28608.75..24191071.36 rows=11136219 width=12)
Hash Cond: ((t1.start_time = t2.start_time) AND (t1.site_id = t2.site_id))
Filter: ((COALESCE(t1.start_time, t2.start_time) >= '2019-10-01 00:00:00'::timestamp without time zone) AND (COALESCE(t1.start_time, t2.start_time) < '2019-10-01 01:00:00'::timestamp without time zone))
-> Seq Scan on t1 (cost=0.00..14495.10 rows=940910 width=10)
-> Hash (cost=14495.10..14495.10 rows=940910 width=10)
-> Seq Scan on t1 t2 (cost=0.00..14495.10 rows=940910 width=10)
-> Group (cost=54612467.58..54754411.51 rows=11355514 width=12)
Group Key: (COALESCE(t1_1.start_time, t2_1.start_time)), ((COALESCE(t1_1.site_id, t2_1.site_id))::integer)
-> Sort (cost=54612467.58..54640856.37 rows=11355514 width=12)
Sort Key: (COALESCE(t1_1.start_time, t2_1.start_time)), ((COALESCE(t1_1.site_id, t2_1.site_id))::integer)
-> Hash Full Join (cost=28608.75..53281777.94 rows=11355514 width=12)
Hash Cond: ((t1_1.start_time = t2_1.start_time) AND (t1_1.site_id = t2_1.site_id))
Filter: ((COALESCE(t1_1.start_time, t2_1.start_time) >= '2019-10-01 00:00:00'::timestamp without time zone) AND (COALESCE(t1_1.start_time, t2_1.start_time) < '2019-10-01 01:00:00'::timestamp without time zone))
-> Seq Scan on t2 t1_1 (cost=0.00..14495.10 rows=940910 width=10)
-> Hash (cost=14495.10..14495.10 rows=940910 width=10)
-> Seq Scan on t2 t2_1 (cost=0.00..14495.10 rows=940910 width=10)
-> Materialize (cost=0.00..2.48 rows=99 width=6)
-> Seq Scan on s (cost=0.00..1.99 rows=99 width=6)

--
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581

#13Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Justin Pryzby (#11)
Re: d25ea01275 and partitionwise join

Hi Justin,

On Mon, Oct 14, 2019 at 5:02 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Thu, Sep 19, 2019 at 05:15:37PM +0900, Amit Langote wrote:

Please find attached updated patches.

Tom pointed me to this thread, since we hit it in 12.0
/messages/by-id/16802.1570989962@sss.pgh.pa.us

I can't say much about the patch; there's a little typo:
"The nullability of inner relation keys prevents them to"
..should say "prevent them from".

Thanks, will fix.

Regards,
Amit

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#13)
Re: d25ea01275 and partitionwise join

Amit Langote <amitlangote09@gmail.com> writes:

On Mon, Oct 14, 2019 at 5:02 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

I can't say much about the patch; there's a little typo:
"The nullability of inner relation keys prevents them to"
..should say "prevent them from".

Thanks, will fix.

Just to leave a breadcrumb in this thread --- the planner failure
induced by d25ea01275 has been fixed in 529ebb20a. The difficulty
with multiway full joins that Amit started this thread with remains
open, but I imagine the posted patches will need rebasing over
529ebb20a.

regards, tom lane

#15Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Tom Lane (#14)
Re: d25ea01275 and partitionwise join

On Wed, Nov 6, 2019 at 2:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Langote <amitlangote09@gmail.com> writes:

On Mon, Oct 14, 2019 at 5:02 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

I can't say much about the patch; there's a little typo:
"The nullability of inner relation keys prevents them to"
..should say "prevent them from".

Thanks, will fix.

Just to leave a breadcrumb in this thread --- the planner failure
induced by d25ea01275 has been fixed in 529ebb20a. The difficulty
with multiway full joins that Amit started this thread with remains
open, but I imagine the posted patches will need rebasing over
529ebb20a.

Here are the rebased patches.

I've divided the patch containing only cosmetic improvements into two
patches, of which the latter only moves around code. Patch 0003
implements the actual fix to the problem with multiway joins.

Thanks,
Amit

Attachments:

0001-Some-cosmetic-improvements-to-partitionwise-join-cod.patchapplication/octet-stream; name=0001-Some-cosmetic-improvements-to-partitionwise-join-cod.patchDownload+109-58
0002-Move-some-code-from-joinrel.c-to-relnode.c.patchapplication/octet-stream; name=0002-Move-some-code-from-joinrel.c-to-relnode.c.patchDownload+180-181
0003-Fix-partitionwise-join-to-handle-FULL-JOINs-correctl.patchapplication/octet-stream; name=0003-Fix-partitionwise-join-to-handle-FULL-JOINs-correctl.patchDownload+241-17
#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#15)
Re: d25ea01275 and partitionwise join

Amit Langote <amitlangote09@gmail.com> writes:

On Wed, Nov 6, 2019 at 2:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Just to leave a breadcrumb in this thread --- the planner failure
induced by d25ea01275 has been fixed in 529ebb20a. The difficulty
with multiway full joins that Amit started this thread with remains
open, but I imagine the posted patches will need rebasing over
529ebb20a.

Here are the rebased patches.

The cfbot shows these patches as failing regression tests. I think
it is just cosmetic fallout from 6ef77cf46 having changed EXPLAIN's
choices of table alias names; but please look closer to confirm,
and post updated patches.

regards, tom lane

#17Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Tom Lane (#16)
Re: d25ea01275 and partitionwise join

On Sat, Feb 29, 2020 at 8:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Langote <amitlangote09@gmail.com> writes:

On Wed, Nov 6, 2019 at 2:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Just to leave a breadcrumb in this thread --- the planner failure
induced by d25ea01275 has been fixed in 529ebb20a. The difficulty
with multiway full joins that Amit started this thread with remains
open, but I imagine the posted patches will need rebasing over
529ebb20a.

Here are the rebased patches.

The cfbot shows these patches as failing regression tests. I think
it is just cosmetic fallout from 6ef77cf46 having changed EXPLAIN's
choices of table alias names; but please look closer to confirm,
and post updated patches.

Thanks for notifying.

Checked and indeed fallout from 6ef77cf46 seems to be the reason a
test is failing.

Updated patches attached.

Thanks,
Amit

Attachments:

v5-0002-Move-some-code-from-joinrel.c-to-relnode.c.patchapplication/octet-stream; name=v5-0002-Move-some-code-from-joinrel.c-to-relnode.c.patchDownload+180-181
v5-0001-Some-cosmetic-improvements-to-partitionwise-join-.patchapplication/octet-stream; name=v5-0001-Some-cosmetic-improvements-to-partitionwise-join-.patchDownload+109-58
v5-0003-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patchapplication/octet-stream; name=v5-0003-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patchDownload+241-17
#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#17)
Re: d25ea01275 and partitionwise join

Amit Langote <amitlangote09@gmail.com> writes:

Updated patches attached.

I looked through these and committed 0001+0002, with some further
comment-polishing. However, I have no faith at all in 0003. It is
blithely digging through COALESCE expressions with no concern for
whether they came from full joins or not, or whether the other values
being coalesced to might completely change the semantics. Digging
through PlaceHolderVars scares me even more; what's that got to do
with the problem, anyway? So while this might fix the complained-of
issue of failing to use a partitionwise join, I think it wouldn't be
hard to create examples that it would incorrectly turn into
partitionwise joins.

I wonder whether it'd be feasible to fix the problem by going in the
other direction; that is, while constructing the nullable_partexprs
lists for a full join, add synthesized COALESCE() expressions for the
output columns (by wrapping COALESCE around copies of the input rels'
partition expressions), and then not need to do anything special in
match_expr_to_partition_keys. We'd still need to convince ourselves
that this did the right thing and not any of the wrong things, but
I think it might be easier to prove it that way.

regards, tom lane

#19Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Tom Lane (#18)
Re: d25ea01275 and partitionwise join

On Sat, Apr 4, 2020 at 6:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Langote <amitlangote09@gmail.com> writes:

Updated patches attached.

I looked through these and committed 0001+0002, with some further
comment-polishing. However, I have no faith at all in 0003.

Thanks for the review.

It is
blithely digging through COALESCE expressions with no concern for
whether they came from full joins or not, or whether the other values
being coalesced to might completely change the semantics. Digging
through PlaceHolderVars scares me even more; what's that got to do
with the problem, anyway? So while this might fix the complained-of
issue of failing to use a partitionwise join, I think it wouldn't be
hard to create examples that it would incorrectly turn into
partitionwise joins.

I wonder whether it'd be feasible to fix the problem by going in the
other direction; that is, while constructing the nullable_partexprs
lists for a full join, add synthesized COALESCE() expressions for the
output columns (by wrapping COALESCE around copies of the input rels'
partition expressions), and then not need to do anything special in
match_expr_to_partition_keys. We'd still need to convince ourselves
that this did the right thing and not any of the wrong things, but
I think it might be easier to prove it that way.

Okay, I tried that in the updated patch. I didn't try to come up with
examples that might break it though.

--
Thank you,

Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v6-0001-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patchapplication/octet-stream; name=v6-0001-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patchDownload+214-21
#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#19)
Re: d25ea01275 and partitionwise join

Amit Langote <amitlangote09@gmail.com> writes:

Okay, I tried that in the updated patch. I didn't try to come up with
examples that might break it though.

I looked through this.

* Wasn't excited about inventing makeCoalesceExpr(); the fact that it only
had two potential call sites seemed to make it not worth the trouble.
Plus, as defined, it could not handle the general case of COALESCE, which
can have N arguments; so that seemed like a recipe for confusion.

* I think your logic for building the coalesce combinations was just
wrong. We need combinations of left-hand inputs with right-hand inputs,
not left-hand with left-hand or right-hand with right-hand. Also the
nesting should already have happened in the inputs, we don't need to
try to generate it here. The looping code was pretty messy as well.

* I don't think using partopcintype is necessarily right; that could be
a polymorphic type, for instance. Safer to copy the type of the input
expressions. Likely we could have got away with using partcollation,
but for consistency I copied that as well.

* You really need to update the data structure definitional comments
when you make a change like this.

* I did not like putting a test case that requires
enable_partitionwise_aggregate in the partition_join test; that seems
misplaced. But it's not necessary to the test, is it?

* I did not follow the point of your second test case. The WHERE
constraint on p1.a allows the planner to strength-reduce the joins,
which is why there's no full join in that explain result, but then
we aren't going to get to this code at all.

I repaired the above in the attached.

I'm actually sort of pleasantly surprised that this worked; I was
not sure that building COALESCEs like this would provide the result
we needed. But it seems okay -- it does fix the behavior in this
3-way test case, as well as the 4-way join you showed at the top
of the thread. It's fairly dependent on the fact that the planner
won't rearrange full joins; otherwise, the COALESCE structures we
build here might not match those made at parse time. But that's
not likely to change anytime soon; and this is hardly the only
place that would break, so I'm not sweating about it. (I have
some vague ideas about getting rid of the COALESCEs as part of
the Var redefinition I've been muttering about, anyway; there
might be a cleaner fix for this if that happens.)

Anyway, I think this is probably OK for now. Given that the
nullable_partexprs lists are only used in one place, it's pretty
hard to see how it would break anything.

regards, tom lane

Attachments:

v7-0001-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.patchtext/x-diff; charset=us-ascii; name*0=v7-0001-Fix-partitionwise-join-to-handle-FULL-JOINs-corre.p; name*1=atchDownload+92-2
#21Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Tom Lane (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#21)
#23Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Tom Lane (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#23)
#25Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Tom Lane (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#25)
#27Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Tom Lane (#26)
#28Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Tom Lane (#26)