A problem about partitionwise join

Started by Richard Guoover 6 years ago48 messages
#1Richard Guo
riguo@pivotal.io

Hi All,

To generate partitionwise join, we need to make sure there exists an
equi-join condition for each pair of partition keys, which is performed
by have_partkey_equi_join(). This makes sense and works well.

But if, let's say, one certain pair of partition keys (foo.k = bar.k)
has formed an equivalence class containing consts, no join clause would
be generated for it, since we have already generated 'foo.k = const' and
'bar.k = const' and pushed them into the proper restrictions earlier.

This will make partitionwise join fail to be planned if there are
multiple partition keys and the pushed-down restrictions 'xxx = const'
fail to prune away any partitions.

Consider the examples below:

create table p (k1 int, k2 int, val int) partition by range(k1,k2);
create table p_1 partition of p for values from (1,1) to (10,100);
create table p_2 partition of p for values from (10,100) to (20,200);

If we are joining on each pair of partition keys, we can generate
partitionwise join:

# explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = bar.k2;
QUERY PLAN
----------------------------------------------------------------------
Append
-> Hash Join
Hash Cond: ((foo.k1 = bar.k1) AND (foo.k2 = bar.k2))
-> Seq Scan on p_1 foo
-> Hash
-> Seq Scan on p_1 bar
-> Hash Join
Hash Cond: ((foo_1.k1 = bar_1.k1) AND (foo_1.k2 = bar_1.k2))
-> Seq Scan on p_2 foo_1
-> Hash
-> Seq Scan on p_2 bar_1
(11 rows)

But if we add another qual 'foo.k2 = const', we will be unable to
generate partitionwise join any more, because have_partkey_equi_join()
thinks not every partition key has an equi-join condition.

# explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = bar.k2
and foo.k2 = 16;
QUERY PLAN
-----------------------------------------
Hash Join
Hash Cond: (foo.k1 = bar.k1)
-> Append
-> Seq Scan on p_1 foo
Filter: (k2 = 16)
-> Seq Scan on p_2 foo_1
Filter: (k2 = 16)
-> Hash
-> Append
-> Seq Scan on p_1 bar
Filter: (k2 = 16)
-> Seq Scan on p_2 bar_1
Filter: (k2 = 16)
(13 rows)

Is this a problem?

Thanks
Richard

#2Amit Langote
amitlangote09@gmail.com
In reply to: Richard Guo (#1)
Re: A problem about partitionwise join

Hi Richard,

On Mon, Aug 26, 2019 at 6:33 PM Richard Guo <riguo@pivotal.io> wrote:

Hi All,

To generate partitionwise join, we need to make sure there exists an
equi-join condition for each pair of partition keys, which is performed
by have_partkey_equi_join(). This makes sense and works well.

But if, let's say, one certain pair of partition keys (foo.k = bar.k)
has formed an equivalence class containing consts, no join clause would
be generated for it, since we have already generated 'foo.k = const' and
'bar.k = const' and pushed them into the proper restrictions earlier.

This will make partitionwise join fail to be planned if there are
multiple partition keys and the pushed-down restrictions 'xxx = const'
fail to prune away any partitions.

Consider the examples below:

create table p (k1 int, k2 int, val int) partition by range(k1,k2);
create table p_1 partition of p for values from (1,1) to (10,100);
create table p_2 partition of p for values from (10,100) to (20,200);

If we are joining on each pair of partition keys, we can generate
partitionwise join:

# explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = bar.k2;
QUERY PLAN
----------------------------------------------------------------------
Append
-> Hash Join
Hash Cond: ((foo.k1 = bar.k1) AND (foo.k2 = bar.k2))
-> Seq Scan on p_1 foo
-> Hash
-> Seq Scan on p_1 bar
-> Hash Join
Hash Cond: ((foo_1.k1 = bar_1.k1) AND (foo_1.k2 = bar_1.k2))
-> Seq Scan on p_2 foo_1
-> Hash
-> Seq Scan on p_2 bar_1
(11 rows)

But if we add another qual 'foo.k2 = const', we will be unable to
generate partitionwise join any more, because have_partkey_equi_join()
thinks not every partition key has an equi-join condition.

# explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = bar.k2 and foo.k2 = 16;
QUERY PLAN
-----------------------------------------
Hash Join
Hash Cond: (foo.k1 = bar.k1)
-> Append
-> Seq Scan on p_1 foo
Filter: (k2 = 16)
-> Seq Scan on p_2 foo_1
Filter: (k2 = 16)
-> Hash
-> Append
-> Seq Scan on p_1 bar
Filter: (k2 = 16)
-> Seq Scan on p_2 bar_1
Filter: (k2 = 16)
(13 rows)

Is this a problem?

Perhaps. Maybe it has to do with the way have_partkey_equi_join() has
been coded. If it was coded such that it figured out on its own that
the equivalence (foo.k2, bar.k2, ...) does exist, then that would
allow partitionwise join to occur, which I think would be OK to do.
But maybe I'm missing something.

Thanks,
Amit

#3Richard Guo
riguo@pivotal.io
In reply to: Amit Langote (#2)
Re: A problem about partitionwise join

On Tue, Aug 27, 2019 at 8:51 AM Amit Langote <amitlangote09@gmail.com>
wrote:

Hi Richard,

On Mon, Aug 26, 2019 at 6:33 PM Richard Guo <riguo@pivotal.io> wrote:

Hi All,

To generate partitionwise join, we need to make sure there exists an
equi-join condition for each pair of partition keys, which is performed
by have_partkey_equi_join(). This makes sense and works well.

But if, let's say, one certain pair of partition keys (foo.k = bar.k)
has formed an equivalence class containing consts, no join clause would
be generated for it, since we have already generated 'foo.k = const' and
'bar.k = const' and pushed them into the proper restrictions earlier.

This will make partitionwise join fail to be planned if there are
multiple partition keys and the pushed-down restrictions 'xxx = const'
fail to prune away any partitions.

Consider the examples below:

create table p (k1 int, k2 int, val int) partition by range(k1,k2);
create table p_1 partition of p for values from (1,1) to (10,100);
create table p_2 partition of p for values from (10,100) to (20,200);

If we are joining on each pair of partition keys, we can generate
partitionwise join:

# explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 =

bar.k2;

QUERY PLAN
----------------------------------------------------------------------
Append
-> Hash Join
Hash Cond: ((foo.k1 = bar.k1) AND (foo.k2 = bar.k2))
-> Seq Scan on p_1 foo
-> Hash
-> Seq Scan on p_1 bar
-> Hash Join
Hash Cond: ((foo_1.k1 = bar_1.k1) AND (foo_1.k2 = bar_1.k2))
-> Seq Scan on p_2 foo_1
-> Hash
-> Seq Scan on p_2 bar_1
(11 rows)

But if we add another qual 'foo.k2 = const', we will be unable to
generate partitionwise join any more, because have_partkey_equi_join()
thinks not every partition key has an equi-join condition.

# explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 =

bar.k2 and foo.k2 = 16;

QUERY PLAN
-----------------------------------------
Hash Join
Hash Cond: (foo.k1 = bar.k1)
-> Append
-> Seq Scan on p_1 foo
Filter: (k2 = 16)
-> Seq Scan on p_2 foo_1
Filter: (k2 = 16)
-> Hash
-> Append
-> Seq Scan on p_1 bar
Filter: (k2 = 16)
-> Seq Scan on p_2 bar_1
Filter: (k2 = 16)
(13 rows)

Is this a problem?

Perhaps. Maybe it has to do with the way have_partkey_equi_join() has
been coded. If it was coded such that it figured out on its own that
the equivalence (foo.k2, bar.k2, ...) does exist, then that would
allow partitionwise join to occur, which I think would be OK to do.
But maybe I'm missing something.

This should be caused by how we deduce join clauses from equivalence
classes. ECs containing consts will not be considered so we cannot
generate (foo.k2 = bar.k2) for the query above.

In addition, when generating join clauses from equivalence classes, we
only select the joinclause with the 'best score', or the first
joinclause with a score of 3. This may make us miss some joinclause on
partition keys.

Check the query below as a more illustrative example:

create table p (k int, val int) partition by range(k);
create table p_1 partition of p for values from (1) to (10);
create table p_2 partition of p for values from (10) to (100);

If we use quals 'foo.k = bar.k and foo.k = bar.val', we can generate
partitionwise join:

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.k and foo.k = bar.val;
QUERY PLAN
-----------------------------------------
Append
-> Hash Join
Hash Cond: (foo.k = bar.k)
-> Seq Scan on p_1 foo
-> Hash
-> Seq Scan on p_1 bar
Filter: (k = val)
-> Hash Join
Hash Cond: (foo_1.k = bar_1.k)
-> Seq Scan on p_2 foo_1
-> Hash
-> Seq Scan on p_2 bar_1
Filter: (k = val)
(13 rows)

But if we exchange the order of the two quals to 'foo.k = bar.val and
foo.k = bar.k', then partitionwise join cannot be generated any more,
because we only have joinclause 'foo.k = bar.val' as it first reached
score of 3. We have missed the joinclause on the partition key although
it does exist.

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k;
QUERY PLAN
-----------------------------------------
Hash Join
Hash Cond: (foo.k = bar.val)
-> Append
-> Seq Scan on p_1 foo
-> Seq Scan on p_2 foo_1
-> Hash
-> Append
-> Seq Scan on p_1 bar
Filter: (val = k)
-> Seq Scan on p_2 bar_1
Filter: (val = k)
(11 rows)

Thanks
Richard

#4Etsuro Fujita
etsuro.fujita@gmail.com
In reply to: Richard Guo (#3)
Re: A problem about partitionwise join

Hi,

On Tue, Aug 27, 2019 at 4:57 PM Richard Guo <riguo@pivotal.io> wrote:

Check the query below as a more illustrative example:

create table p (k int, val int) partition by range(k);
create table p_1 partition of p for values from (1) to (10);
create table p_2 partition of p for values from (10) to (100);

If we use quals 'foo.k = bar.k and foo.k = bar.val', we can generate
partitionwise join:

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.k and foo.k = bar.val;
QUERY PLAN
-----------------------------------------
Append
-> Hash Join
Hash Cond: (foo.k = bar.k)
-> Seq Scan on p_1 foo
-> Hash
-> Seq Scan on p_1 bar
Filter: (k = val)
-> Hash Join
Hash Cond: (foo_1.k = bar_1.k)
-> Seq Scan on p_2 foo_1
-> Hash
-> Seq Scan on p_2 bar_1
Filter: (k = val)
(13 rows)

But if we exchange the order of the two quals to 'foo.k = bar.val and
foo.k = bar.k', then partitionwise join cannot be generated any more,
because we only have joinclause 'foo.k = bar.val' as it first reached
score of 3. We have missed the joinclause on the partition key although
it does exist.

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k;
QUERY PLAN
-----------------------------------------
Hash Join
Hash Cond: (foo.k = bar.val)
-> Append
-> Seq Scan on p_1 foo
-> Seq Scan on p_2 foo_1
-> Hash
-> Append
-> Seq Scan on p_1 bar
Filter: (val = k)
-> Seq Scan on p_2 bar_1
Filter: (val = k)
(11 rows)

I think it would be nice if we can address this issue.

Best regards,
Etsuro Fujita

#5Richard Guo
riguo@pivotal.io
In reply to: Etsuro Fujita (#4)
1 attachment(s)
Re: A problem about partitionwise join

On Wed, Aug 28, 2019 at 6:49 PM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:

Hi,

On Tue, Aug 27, 2019 at 4:57 PM Richard Guo <riguo@pivotal.io> wrote:

Check the query below as a more illustrative example:

create table p (k int, val int) partition by range(k);
create table p_1 partition of p for values from (1) to (10);
create table p_2 partition of p for values from (10) to (100);

If we use quals 'foo.k = bar.k and foo.k = bar.val', we can generate
partitionwise join:

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.k and foo.k =

bar.val;

QUERY PLAN
-----------------------------------------
Append
-> Hash Join
Hash Cond: (foo.k = bar.k)
-> Seq Scan on p_1 foo
-> Hash
-> Seq Scan on p_1 bar
Filter: (k = val)
-> Hash Join
Hash Cond: (foo_1.k = bar_1.k)
-> Seq Scan on p_2 foo_1
-> Hash
-> Seq Scan on p_2 bar_1
Filter: (k = val)
(13 rows)

But if we exchange the order of the two quals to 'foo.k = bar.val and
foo.k = bar.k', then partitionwise join cannot be generated any more,
because we only have joinclause 'foo.k = bar.val' as it first reached
score of 3. We have missed the joinclause on the partition key although
it does exist.

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.val and foo.k =

bar.k;

QUERY PLAN
-----------------------------------------
Hash Join
Hash Cond: (foo.k = bar.val)
-> Append
-> Seq Scan on p_1 foo
-> Seq Scan on p_2 foo_1
-> Hash
-> Append
-> Seq Scan on p_1 bar
Filter: (val = k)
-> Seq Scan on p_2 bar_1
Filter: (val = k)
(11 rows)

I think it would be nice if we can address this issue.

Thank you.

Attached is a patch as an attempt to address this issue. The idea is
quite straightforward. When building partition info for joinrel, we
generate any possible EC-derived joinclauses of form 'outer_em =
inner_em', which will be used together with the original restrictlist to
check if there exists an equi-join condition for each pair of partition
keys.

Any comments are welcome!

Thanks
Richard

Attachments:

v1-0001-Fix-up-partitionwise-join.patchapplication/octet-stream; name=v1-0001-Fix-up-partitionwise-join.patchDownload
From 4f8c49b871386b45393d93ff8c220214a44c7650 Mon Sep 17 00:00:00 2001
From: Richard Guo <riguo@pivotal.io>
Date: Wed, 28 Aug 2019 14:48:54 +0000
Subject: [PATCH] Fix up partitionwise join.

---
 src/backend/optimizer/path/equivclass.c      | 102 +++++++++++++++++++++++++++
 src/backend/optimizer/util/relnode.c         |  19 +++--
 src/include/optimizer/paths.h                |   4 ++
 src/test/regress/expected/partition_join.out |  38 ++++++++++
 src/test/regress/sql/partition_join.sql      |   4 ++
 5 files changed, 161 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index ccc07ba..d6eb739 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -1197,6 +1197,108 @@ generate_join_implied_equalities(PlannerInfo *root,
 }
 
 /*
+ * generate_join_implied_equalities_for_all
+ *	  Create any EC-derived joinclauses of form 'outer_em = inner_em'.
+ *
+ * This is used when building partition info for joinrel.
+ */
+List *
+generate_join_implied_equalities_for_all(PlannerInfo *root,
+										 Relids join_relids,
+										 Relids outer_relids,
+										 Relids inner_relids)
+{
+	List	   *result = NIL;
+	Bitmapset * matching_ecs;
+	int			i;
+
+	/*
+	 * Get all eclasses in common between inner_relids and outer_relids
+	 */
+	matching_ecs = get_common_eclass_indexes(root, inner_relids,
+											 outer_relids);
+
+	i = -1;
+	while ((i = bms_next_member(matching_ecs, i)) >= 0)
+	{
+		EquivalenceClass *ec = (EquivalenceClass *) list_nth(root->eq_classes, i);
+		List	   *outer_members = NIL;
+		List	   *inner_members = NIL;
+		ListCell   *lc1;
+
+		/* Do not consider this EC if it's ec_broken */
+		if (ec->ec_broken)
+			continue;
+
+		/* Single-member ECs won't generate any deductions */
+		if (list_length(ec->ec_members) <= 1)
+			continue;
+
+		/*
+		 * First, scan the EC to identify member values that are computable at the
+		 * outer rel or at the inner rel.
+		 */
+		foreach(lc1, ec->ec_members)
+		{
+			EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1);
+
+			/*
+			 * We don't need to check explicitly for child EC members.  This test
+			 * against join_relids will cause them to be ignored except when
+			 * considering a child inner rel, which is what we want.
+			 */
+			if (!bms_is_subset(cur_em->em_relids, join_relids))
+				continue;			/* not computable yet, or wrong child */
+
+			if (bms_is_subset(cur_em->em_relids, outer_relids))
+				outer_members = lappend(outer_members, cur_em);
+			else if (bms_is_subset(cur_em->em_relids, inner_relids))
+				inner_members = lappend(inner_members, cur_em);
+		}
+
+		/*
+		 * First, select the joinclause if needed.  We can equate any one outer
+		 * member to any one inner member, since we know this EC is not
+		 * ec_broken.
+		 */
+		if (outer_members && inner_members)
+		{
+			RestrictInfo *rinfo;
+
+			foreach(lc1, outer_members)
+			{
+				EquivalenceMember *outer_em = (EquivalenceMember *) lfirst(lc1);
+				ListCell   *lc2;
+
+				foreach(lc2, inner_members)
+				{
+					EquivalenceMember *inner_em = (EquivalenceMember *) lfirst(lc2);
+					Oid			eq_op;
+
+					eq_op = select_equality_operator(ec,
+													 outer_em->em_datatype,
+													 inner_em->em_datatype);
+					if (!OidIsValid(eq_op))
+						continue;
+
+					/*
+					 * Create clause, setting parent_ec to mark it as redundant with other
+					 * joinclauses
+					 */
+					rinfo = create_join_clause(root, ec, eq_op,
+											   outer_em, inner_em,
+											   ec);
+
+					result = lappend(result, rinfo);
+				}
+			}
+		}
+	}
+
+	return result;
+}
+
+/*
  * generate_join_implied_equalities_for_ecs
  *	  As above, but consider only the listed ECs.
  */
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 8541538..07b49d3 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -55,7 +55,7 @@ static List *subbuild_joinrel_joinlist(RelOptInfo *joinrel,
 static void set_foreign_rel_properties(RelOptInfo *joinrel,
 									   RelOptInfo *outer_rel, RelOptInfo *inner_rel);
 static void add_join_rel(PlannerInfo *root, RelOptInfo *joinrel);
-static void build_joinrel_partition_info(RelOptInfo *joinrel,
+static void build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
 										 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
 										 List *restrictlist, JoinType jointype);
 static void build_child_join_reltarget(PlannerInfo *root,
@@ -706,7 +706,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->has_eclass_joins = has_relevant_eclass_joinclause(root, joinrel);
 
 	/* Store the partition information. */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel, restrictlist,
 								 sjinfo->jointype);
 
 	/*
@@ -870,7 +870,7 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
 	joinrel->has_eclass_joins = parent_joinrel->has_eclass_joins;
 
 	/* Is the join between partitions itself partitioned? */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel, restrictlist,
 								 jointype);
 
 	/* Child joinrel is parallel safe if parent is parallel safe. */
@@ -1600,9 +1600,9 @@ find_param_path_info(RelOptInfo *rel, Relids required_outer)
  *		the join relation.
  */
 static void
-build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
-							 RelOptInfo *inner_rel, List *restrictlist,
-							 JoinType jointype)
+build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
+							 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
+							 List *restrictlist, JoinType jointype)
 {
 	int			partnatts;
 	int			cnt;
@@ -1615,6 +1615,13 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
 		return;
 	}
 
+	restrictlist =
+		list_concat_unique_ptr(generate_join_implied_equalities_for_all(root,
+																		joinrel->relids,
+																		outer_rel->relids,
+																		inner_rel->relids),
+				restrictlist);
+
 	/*
 	 * We can only consider this join as an input to further partitionwise
 	 * joins if (a) the input relations are partitioned and have
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137..b640b66 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -140,6 +140,10 @@ extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
 											  Relids outer_relids,
 											  RelOptInfo *inner_rel);
+extern List *generate_join_implied_equalities_for_all(PlannerInfo *root,
+													  Relids join_relids,
+													  Relids outer_relids,
+													  Relids inner_relids);
 extern List *generate_join_implied_equalities_for_ecs(PlannerInfo *root,
 													  List *eclasses,
 													  Relids join_relids,
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 1296edc..e39cf11 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -62,6 +62,44 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
  450 | 0450 | 450 | 0450
 (4 rows)
 
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+                         QUERY PLAN                          
+-------------------------------------------------------------
+ Sort
+   Sort Key: t1.a
+   ->  Append
+         ->  Merge Join
+               Merge Cond: (t1.a = t2.a)
+               ->  Index Scan using iprt1_p1_a on prt1_p1 t1
+               ->  Sort
+                     Sort Key: t2.b
+                     ->  Seq Scan on prt2_p1 t2
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_1.a = t2_1.a)
+               ->  Seq Scan on prt1_p2 t1_1
+               ->  Hash
+                     ->  Seq Scan on prt2_p2 t2_1
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_2.a = t2_2.a)
+               ->  Seq Scan on prt1_p3 t1_2
+               ->  Hash
+                     ->  Seq Scan on prt2_p3 t2_2
+                           Filter: (a = b)
+(22 rows)
+
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+ a  |  c   | b  |  c   
+----+------+----+------
+  0 | 0000 |  0 | 0000
+  6 | 0006 |  6 | 0006
+ 12 | 0012 | 12 | 0012
+ 18 | 0018 | 18 | 0018
+ 24 | 0024 | 24 | 0024
+(5 rows)
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index db9a6b4..dfa2515 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -34,6 +34,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-- 
2.7.4

#6Etsuro Fujita
etsuro.fujita@gmail.com
In reply to: Richard Guo (#5)
Re: A problem about partitionwise join

On Thu, Aug 29, 2019 at 6:45 PM Richard Guo <riguo@pivotal.io> wrote:

On Wed, Aug 28, 2019 at 6:49 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Tue, Aug 27, 2019 at 4:57 PM Richard Guo <riguo@pivotal.io> wrote:

Check the query below as a more illustrative example:

create table p (k int, val int) partition by range(k);
create table p_1 partition of p for values from (1) to (10);
create table p_2 partition of p for values from (10) to (100);

If we use quals 'foo.k = bar.k and foo.k = bar.val', we can generate
partitionwise join:

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.k and foo.k = bar.val;
QUERY PLAN
-----------------------------------------
Append
-> Hash Join
Hash Cond: (foo.k = bar.k)
-> Seq Scan on p_1 foo
-> Hash
-> Seq Scan on p_1 bar
Filter: (k = val)
-> Hash Join
Hash Cond: (foo_1.k = bar_1.k)
-> Seq Scan on p_2 foo_1
-> Hash
-> Seq Scan on p_2 bar_1
Filter: (k = val)
(13 rows)

But if we exchange the order of the two quals to 'foo.k = bar.val and
foo.k = bar.k', then partitionwise join cannot be generated any more,
because we only have joinclause 'foo.k = bar.val' as it first reached
score of 3. We have missed the joinclause on the partition key although
it does exist.

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k;
QUERY PLAN
-----------------------------------------
Hash Join
Hash Cond: (foo.k = bar.val)
-> Append
-> Seq Scan on p_1 foo
-> Seq Scan on p_2 foo_1
-> Hash
-> Append
-> Seq Scan on p_1 bar
Filter: (val = k)
-> Seq Scan on p_2 bar_1
Filter: (val = k)
(11 rows)

I think it would be nice if we can address this issue.

Attached is a patch as an attempt to address this issue. The idea is
quite straightforward. When building partition info for joinrel, we
generate any possible EC-derived joinclauses of form 'outer_em =
inner_em', which will be used together with the original restrictlist to
check if there exists an equi-join condition for each pair of partition
keys.

Thank you for the patch! Will review. Could you add the patch to the
upcoming CF so that it doesn’t get lost?

Best regards,
Etsuro Fujita

#7Richard Guo
riguo@pivotal.io
In reply to: Etsuro Fujita (#6)
Re: A problem about partitionwise join

On Fri, Aug 30, 2019 at 2:08 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:

On Thu, Aug 29, 2019 at 6:45 PM Richard Guo <riguo@pivotal.io> wrote:

On Wed, Aug 28, 2019 at 6:49 PM Etsuro Fujita <etsuro.fujita@gmail.com>

wrote:

On Tue, Aug 27, 2019 at 4:57 PM Richard Guo <riguo@pivotal.io> wrote:

Check the query below as a more illustrative example:

create table p (k int, val int) partition by range(k);
create table p_1 partition of p for values from (1) to (10);
create table p_2 partition of p for values from (10) to (100);

If we use quals 'foo.k = bar.k and foo.k = bar.val', we can generate
partitionwise join:

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.k and foo.k =

bar.val;

QUERY PLAN
-----------------------------------------
Append
-> Hash Join
Hash Cond: (foo.k = bar.k)
-> Seq Scan on p_1 foo
-> Hash
-> Seq Scan on p_1 bar
Filter: (k = val)
-> Hash Join
Hash Cond: (foo_1.k = bar_1.k)
-> Seq Scan on p_2 foo_1
-> Hash
-> Seq Scan on p_2 bar_1
Filter: (k = val)
(13 rows)

But if we exchange the order of the two quals to 'foo.k = bar.val and
foo.k = bar.k', then partitionwise join cannot be generated any more,
because we only have joinclause 'foo.k = bar.val' as it first reached
score of 3. We have missed the joinclause on the partition key

although

it does exist.

# explain (costs off)
select * from p as foo join p as bar on foo.k = bar.val and foo.k =

bar.k;

QUERY PLAN
-----------------------------------------
Hash Join
Hash Cond: (foo.k = bar.val)
-> Append
-> Seq Scan on p_1 foo
-> Seq Scan on p_2 foo_1
-> Hash
-> Append
-> Seq Scan on p_1 bar
Filter: (val = k)
-> Seq Scan on p_2 bar_1
Filter: (val = k)
(11 rows)

I think it would be nice if we can address this issue.

Attached is a patch as an attempt to address this issue. The idea is
quite straightforward. When building partition info for joinrel, we
generate any possible EC-derived joinclauses of form 'outer_em =
inner_em', which will be used together with the original restrictlist to
check if there exists an equi-join condition for each pair of partition
keys.

Thank you for the patch! Will review. Could you add the patch to the
upcoming CF so that it doesn’t get lost?

Added this patch: https://commitfest.postgresql.org/24/2266/

Thanks
Richard

#8Etsuro Fujita
etsuro.fujita@gmail.com
In reply to: Richard Guo (#7)
Re: A problem about partitionwise join

On Fri, Aug 30, 2019 at 12:15 PM Richard Guo <riguo@pivotal.io> wrote:

On Fri, Aug 30, 2019 at 2:08 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Thu, Aug 29, 2019 at 6:45 PM Richard Guo <riguo@pivotal.io> wrote:

Attached is a patch as an attempt to address this issue. The idea is
quite straightforward. When building partition info for joinrel, we
generate any possible EC-derived joinclauses of form 'outer_em =
inner_em', which will be used together with the original restrictlist to
check if there exists an equi-join condition for each pair of partition
keys.

Could you add the patch to the
upcoming CF so that it doesn’t get lost?

Added this patch: https://commitfest.postgresql.org/24/2266/

Thanks!

Best regards,
Etsuro Fujita

#9Alvaro Herrera from 2ndQuadrant
alvherre@alvh.no-ip.org
In reply to: Richard Guo (#5)
Re: A problem about partitionwise join

So in this patch, the input restrictlist is modified to include the
clauses generated by generate_join_implied_equalities_for_all. That
doesn't seem okay -- doesn't it affect downstream usage of the
restrictlist in the caller of set_joinrel_size_estimates?

I wonder if it's possible to do this by using the ECs directly in
have_partkey_equi_join instead of using them to create fake join
clauses.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Richard Guo
riguo@pivotal.io
In reply to: Alvaro Herrera from 2ndQuadrant (#9)
Re: A problem about partitionwise join

Hi Alvaro,

Thank you for reviewing this patch.

On Wed, Sep 11, 2019 at 4:48 AM Alvaro Herrera from 2ndQuadrant <
alvherre@alvh.no-ip.org> wrote:

So in this patch, the input restrictlist is modified to include the
clauses generated by generate_join_implied_equalities_for_all. That
doesn't seem okay -- doesn't it affect downstream usage of the
restrictlist in the caller of set_joinrel_size_estimates?

Actually the joinclauses generated by
generate_join_implied_equalities_for_all only affects the restrictlist
used in have_partkey_equi_join to check equi-join conditions for
partition keys. The input restrictlist would not be altered.

I wonder if it's possible to do this by using the ECs directly in
have_partkey_equi_join instead of using them to create fake join
clauses.

Hmm.. I thought about this option and at last figured that what we need
to do in have_partkey_equi_join with the ECs is actually the same as in
generate_join_implied_equalities_for_all. Maybe I didn't think it
correctly.

Thanks
Richard

#11Dilip Kumar
dilipbalaut@gmail.com
In reply to: Richard Guo (#5)
Re: A problem about partitionwise join

On Thu, Aug 29, 2019 at 3:15 PM Richard Guo <riguo@pivotal.io> wrote:

Attached is a patch as an attempt to address this issue. The idea is
quite straightforward. When building partition info for joinrel, we
generate any possible EC-derived joinclauses of form 'outer_em =
inner_em', which will be used together with the original restrictlist to
check if there exists an equi-join condition for each pair of partition
keys.

Any comments are welcome!

 /*
+ * generate_join_implied_equalities_for_all
+ *   Create any EC-derived joinclauses of form 'outer_em = inner_em'.
+ *
+ * This is used when building partition info for joinrel.
+ */
+List *
+generate_join_implied_equalities_for_all(PlannerInfo *root,
+ Relids join_relids,
+ Relids outer_relids,
+ Relids inner_relids)

I think we need to have more detailed comments about why we need this
separate function, we can also explain that
generate_join_implied_equalities function will avoid
the join clause if EC has the constant but for partition-wise join, we
need that clause too.

+ while ((i = bms_next_member(matching_ecs, i)) >= 0)
+ {
+ EquivalenceClass *ec = (EquivalenceClass *) list_nth(root->eq_classes, i);
+ List    *outer_members = NIL;
+ List    *inner_members = NIL;
+ ListCell   *lc1;
+
+ /* Do not consider this EC if it's ec_broken */
+ if (ec->ec_broken)
+ continue;
+
+ /* Single-member ECs won't generate any deductions */
+ if (list_length(ec->ec_members) <= 1)
+ continue;
+

I am wondering isn't it possible to just process the missing join
clause? I mean 'generate_join_implied_equalities' has only skipped
the ECs which has const so
can't we create join clause only for those ECs and append it the
"Restrictlist" we already have? I might be missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#12Richard Guo
riguo@pivotal.io
In reply to: Dilip Kumar (#11)
Re: A problem about partitionwise join

Hi Dilip,

Thank you for reviewing this patch.

On Fri, Sep 20, 2019 at 12:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 29, 2019 at 3:15 PM Richard Guo <riguo@pivotal.io> wrote:

Attached is a patch as an attempt to address this issue. The idea is
quite straightforward. When building partition info for joinrel, we
generate any possible EC-derived joinclauses of form 'outer_em =
inner_em', which will be used together with the original restrictlist to
check if there exists an equi-join condition for each pair of partition
keys.

Any comments are welcome!

/*
+ * generate_join_implied_equalities_for_all
+ *   Create any EC-derived joinclauses of form 'outer_em = inner_em'.
+ *
+ * This is used when building partition info for joinrel.
+ */
+List *
+generate_join_implied_equalities_for_all(PlannerInfo *root,
+ Relids join_relids,
+ Relids outer_relids,
+ Relids inner_relids)

I think we need to have more detailed comments about why we need this
separate function, we can also explain that
generate_join_implied_equalities function will avoid
the join clause if EC has the constant but for partition-wise join, we
need that clause too.

Thank you for the suggestion.

+ while ((i = bms_next_member(matching_ecs, i)) >= 0)
+ {
+ EquivalenceClass *ec = (EquivalenceClass *) list_nth(root->eq_classes,
i);
+ List    *outer_members = NIL;
+ List    *inner_members = NIL;
+ ListCell   *lc1;
+
+ /* Do not consider this EC if it's ec_broken */
+ if (ec->ec_broken)
+ continue;
+
+ /* Single-member ECs won't generate any deductions */
+ if (list_length(ec->ec_members) <= 1)
+ continue;
+

I am wondering isn't it possible to just process the missing join
clause? I mean 'generate_join_implied_equalities' has only skipped
the ECs which has const so
can't we create join clause only for those ECs and append it the
"Restrictlist" we already have? I might be missing something?

For ECs without const, 'generate_join_implied_equalities' also has
skipped some join clauses since it only selects the joinclause with
'best_score' between outer members and inner members. And the missing
join clauses are needed to generate partitionwise join. That's why the
query below cannot be planned as partitionwise join, as we have missed
joinclause 'foo.k = bar.k'.

select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k;

And yes 'generate_join_implied_equalities_for_all' will create join
clauses that have existed in restrictlist. I think it's OK since the
same RestrictInfo deduced from EC will share the same pointer and
list_concat_unique_ptr will make sure there are no duplicates in the
restrictlist used by have_partkey_equi_join.

Thanks
Richard

#13Dilip Kumar
dilipbalaut@gmail.com
In reply to: Richard Guo (#12)
Re: A problem about partitionwise join

On Fri, Sep 20, 2019 at 2:33 PM Richard Guo <riguo@pivotal.io> wrote:

Hi Dilip,

Thank you for reviewing this patch.

On Fri, Sep 20, 2019 at 12:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Aug 29, 2019 at 3:15 PM Richard Guo <riguo@pivotal.io> wrote:

Attached is a patch as an attempt to address this issue. The idea is
quite straightforward. When building partition info for joinrel, we
generate any possible EC-derived joinclauses of form 'outer_em =
inner_em', which will be used together with the original restrictlist to
check if there exists an equi-join condition for each pair of partition
keys.

Any comments are welcome!

/*
+ * generate_join_implied_equalities_for_all
+ *   Create any EC-derived joinclauses of form 'outer_em = inner_em'.
+ *
+ * This is used when building partition info for joinrel.
+ */
+List *
+generate_join_implied_equalities_for_all(PlannerInfo *root,
+ Relids join_relids,
+ Relids outer_relids,
+ Relids inner_relids)

I think we need to have more detailed comments about why we need this
separate function, we can also explain that
generate_join_implied_equalities function will avoid
the join clause if EC has the constant but for partition-wise join, we
need that clause too.

Thank you for the suggestion.

+ while ((i = bms_next_member(matching_ecs, i)) >= 0)
+ {
+ EquivalenceClass *ec = (EquivalenceClass *) list_nth(root->eq_classes, i);
+ List    *outer_members = NIL;
+ List    *inner_members = NIL;
+ ListCell   *lc1;
+
+ /* Do not consider this EC if it's ec_broken */
+ if (ec->ec_broken)
+ continue;
+
+ /* Single-member ECs won't generate any deductions */
+ if (list_length(ec->ec_members) <= 1)
+ continue;
+

I am wondering isn't it possible to just process the missing join
clause? I mean 'generate_join_implied_equalities' has only skipped
the ECs which has const so
can't we create join clause only for those ECs and append it the
"Restrictlist" we already have? I might be missing something?

For ECs without const, 'generate_join_implied_equalities' also has
skipped some join clauses since it only selects the joinclause with
'best_score' between outer members and inner members. And the missing
join clauses are needed to generate partitionwise join. That's why the
query below cannot be planned as partitionwise join, as we have missed
joinclause 'foo.k = bar.k'.

oh right. I missed that part.

select * from p as foo join p as bar on foo.k = bar.val and foo.k = bar.k;

And yes 'generate_join_implied_equalities_for_all' will create join
clauses that have existed in restrictlist. I think it's OK since the
same RestrictInfo deduced from EC will share the same pointer and
list_concat_unique_ptr will make sure there are no duplicates in the
restrictlist used by have_partkey_equi_join.

ok

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#14Etsuro Fujita
etsuro.fujita@gmail.com
In reply to: Etsuro Fujita (#6)
Re: A problem about partitionwise join

Hi Richard,

On Fri, Aug 30, 2019 at 3:08 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Thu, Aug 29, 2019 at 6:45 PM Richard Guo <riguo@pivotal.io> wrote:

Attached is a patch as an attempt to address this issue. The idea is
quite straightforward. When building partition info for joinrel, we
generate any possible EC-derived joinclauses of form 'outer_em =
inner_em', which will be used together with the original restrictlist to
check if there exists an equi-join condition for each pair of partition
keys.

Will review.

I've just started reviewing this patch. One comment I have for now
is: this is categorized into Bug Fixes, but we have a workaround at
least to the regression test case in the patch (ie, just reorder join
clauses), so this seems to me more like an improvement than a bug fix.

Sorry for the delay.

Best regards,
Etsuro Fujita

#15Michael Paquier
michael@paquier.xyz
In reply to: Etsuro Fujita (#14)
Re: A problem about partitionwise join

On Tue, Nov 26, 2019 at 08:35:33PM +0900, Etsuro Fujita wrote:

I've just started reviewing this patch. One comment I have for now
is: this is categorized into Bug Fixes, but we have a workaround at
least to the regression test case in the patch (ie, just reorder join
clauses), so this seems to me more like an improvement than a bug fix.

Hmm. Agreed. Changed the category and moved to next CF.
--
Michael

#16Richard Guo
riguo@pivotal.io
In reply to: Michael Paquier (#15)
Re: A problem about partitionwise join

On Fri, Nov 29, 2019 at 11:03 AM Michael Paquier <michael@paquier.xyz>
wrote:

On Tue, Nov 26, 2019 at 08:35:33PM +0900, Etsuro Fujita wrote:

I've just started reviewing this patch. One comment I have for now
is: this is categorized into Bug Fixes, but we have a workaround at
least to the regression test case in the patch (ie, just reorder join
clauses), so this seems to me more like an improvement than a bug fix.

Hmm. Agreed. Changed the category and moved to next CF.

Thanks Etsuro for the comment and thanks Michael for the change.

Thanks
Richard

#17Etsuro Fujita
etsuro.fujita@gmail.com
In reply to: Richard Guo (#16)
Re: A problem about partitionwise join

On Fri, Nov 29, 2019 at 12:08 PM Richard Guo <riguo@pivotal.io> wrote:

On Fri, Nov 29, 2019 at 11:03 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Nov 26, 2019 at 08:35:33PM +0900, Etsuro Fujita wrote:

I've just started reviewing this patch. One comment I have for now
is: this is categorized into Bug Fixes, but we have a workaround at
least to the regression test case in the patch (ie, just reorder join
clauses), so this seems to me more like an improvement than a bug fix.

Hmm. Agreed. Changed the category and moved to next CF.

thanks Michael for the change.

+1

Best regards,
Etsuro Fujita

#18Richard Guo
riguo@pivotal.io
In reply to: Etsuro Fujita (#17)
1 attachment(s)
Re: A problem about partitionwise join

Rebased the patch with latest master and also addressed the test case
failure reported by PostgreSQL Patch Tester.

Thanks
Richard

On Fri, Nov 29, 2019 at 11:35 AM Etsuro Fujita <etsuro.fujita@gmail.com>
wrote:

Show quoted text

On Fri, Nov 29, 2019 at 12:08 PM Richard Guo <riguo@pivotal.io> wrote:

On Fri, Nov 29, 2019 at 11:03 AM Michael Paquier <michael@paquier.xyz>

wrote:

On Tue, Nov 26, 2019 at 08:35:33PM +0900, Etsuro Fujita wrote:

I've just started reviewing this patch. One comment I have for now
is: this is categorized into Bug Fixes, but we have a workaround at
least to the regression test case in the patch (ie, just reorder join
clauses), so this seems to me more like an improvement than a bug fix.

Hmm. Agreed. Changed the category and moved to next CF.

thanks Michael for the change.

+1

Best regards,
Etsuro Fujita

Attachments:

v2-0001-Fix-up-partitionwise-join.patchapplication/octet-stream; name=v2-0001-Fix-up-partitionwise-join.patchDownload
From 885b2e87f22ab939e652f6610fba97e4e49c46c6 Mon Sep 17 00:00:00 2001
From: Richard Guo <riguo@pivotal.io>
Date: Sun, 19 Jan 2020 07:44:18 +0000
Subject: [PATCH] Fix up partitionwise join.

---
 src/backend/optimizer/path/equivclass.c      | 102 +++++++++++++++++++++++++++
 src/backend/optimizer/util/relnode.c         |  19 +++--
 src/include/optimizer/paths.h                |   4 ++
 src/test/regress/expected/partition_join.out |  38 ++++++++++
 src/test/regress/sql/partition_join.sql      |   4 ++
 5 files changed, 161 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef1254..76fd9c9 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -1269,6 +1269,108 @@ generate_join_implied_equalities_for_ecs(PlannerInfo *root,
 }
 
 /*
+ * generate_join_implied_equalities_for_all
+ *	  Create any EC-derived joinclauses of form 'outer_em = inner_em'.
+ *
+ * This is used when building partition info for joinrel.
+ */
+List *
+generate_join_implied_equalities_for_all(PlannerInfo *root,
+										 Relids join_relids,
+										 Relids outer_relids,
+										 Relids inner_relids)
+{
+	List	   *result = NIL;
+	Bitmapset * matching_ecs;
+	int			i;
+
+	/*
+	 * Get all eclasses in common between inner_relids and outer_relids
+	 */
+	matching_ecs = get_common_eclass_indexes(root, inner_relids,
+											 outer_relids);
+
+	i = -1;
+	while ((i = bms_next_member(matching_ecs, i)) >= 0)
+	{
+		EquivalenceClass *ec = (EquivalenceClass *) list_nth(root->eq_classes, i);
+		List	   *outer_members = NIL;
+		List	   *inner_members = NIL;
+		ListCell   *lc1;
+
+		/* Do not consider this EC if it's ec_broken */
+		if (ec->ec_broken)
+			continue;
+
+		/* Single-member ECs won't generate any deductions */
+		if (list_length(ec->ec_members) <= 1)
+			continue;
+
+		/*
+		 * First, scan the EC to identify member values that are computable at the
+		 * outer rel or at the inner rel.
+		 */
+		foreach(lc1, ec->ec_members)
+		{
+			EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1);
+
+			/*
+			 * We don't need to check explicitly for child EC members.  This test
+			 * against join_relids will cause them to be ignored except when
+			 * considering a child inner rel, which is what we want.
+			 */
+			if (!bms_is_subset(cur_em->em_relids, join_relids))
+				continue;			/* not computable yet, or wrong child */
+
+			if (bms_is_subset(cur_em->em_relids, outer_relids))
+				outer_members = lappend(outer_members, cur_em);
+			else if (bms_is_subset(cur_em->em_relids, inner_relids))
+				inner_members = lappend(inner_members, cur_em);
+		}
+
+		/*
+		 * First, select the joinclause if needed.  We can equate any one outer
+		 * member to any one inner member, since we know this EC is not
+		 * ec_broken.
+		 */
+		if (outer_members && inner_members)
+		{
+			RestrictInfo *rinfo;
+
+			foreach(lc1, outer_members)
+			{
+				EquivalenceMember *outer_em = (EquivalenceMember *) lfirst(lc1);
+				ListCell   *lc2;
+
+				foreach(lc2, inner_members)
+				{
+					EquivalenceMember *inner_em = (EquivalenceMember *) lfirst(lc2);
+					Oid			eq_op;
+
+					eq_op = select_equality_operator(ec,
+													 outer_em->em_datatype,
+													 inner_em->em_datatype);
+					if (!OidIsValid(eq_op))
+						continue;
+
+					/*
+					 * Create clause, setting parent_ec to mark it as redundant with other
+					 * joinclauses
+					 */
+					rinfo = create_join_clause(root, ec, eq_op,
+											   outer_em, inner_em,
+											   ec);
+
+					result = lappend(result, rinfo);
+				}
+			}
+		}
+	}
+
+	return result;
+}
+
+/*
  * generate_join_implied_equalities for a still-valid EC
  */
 static List *
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 374f938..adf75f5 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -55,7 +55,7 @@ static List *subbuild_joinrel_joinlist(RelOptInfo *joinrel,
 static void set_foreign_rel_properties(RelOptInfo *joinrel,
 									   RelOptInfo *outer_rel, RelOptInfo *inner_rel);
 static void add_join_rel(PlannerInfo *root, RelOptInfo *joinrel);
-static void build_joinrel_partition_info(RelOptInfo *joinrel,
+static void build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
 										 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
 										 List *restrictlist, JoinType jointype);
 static void build_child_join_reltarget(PlannerInfo *root,
@@ -706,7 +706,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->has_eclass_joins = has_relevant_eclass_joinclause(root, joinrel);
 
 	/* Store the partition information. */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel, restrictlist,
 								 sjinfo->jointype);
 
 	/*
@@ -870,7 +870,7 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
 	joinrel->has_eclass_joins = parent_joinrel->has_eclass_joins;
 
 	/* Is the join between partitions itself partitioned? */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel, restrictlist,
 								 jointype);
 
 	/* Child joinrel is parallel safe if parent is parallel safe. */
@@ -1613,9 +1613,9 @@ find_param_path_info(RelOptInfo *rel, Relids required_outer)
  *		the join relation.
  */
 static void
-build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
-							 RelOptInfo *inner_rel, List *restrictlist,
-							 JoinType jointype)
+build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
+							 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
+							 List *restrictlist, JoinType jointype)
 {
 	int			partnatts;
 	int			cnt;
@@ -1628,6 +1628,13 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
 		return;
 	}
 
+	restrictlist =
+		list_concat_unique_ptr(generate_join_implied_equalities_for_all(root,
+																		joinrel->relids,
+																		outer_rel->relids,
+																		inner_rel->relids),
+				restrictlist);
+
 	/*
 	 * We can only consider this join as an input to further partitionwise
 	 * joins if (a) the input relations are partitioned and have
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd..ab19ea3 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -140,6 +140,10 @@ extern List *generate_join_implied_equalities(PlannerInfo *root,
 											  Relids join_relids,
 											  Relids outer_relids,
 											  RelOptInfo *inner_rel);
+extern List *generate_join_implied_equalities_for_all(PlannerInfo *root,
+													  Relids join_relids,
+													  Relids outer_relids,
+													  Relids inner_relids);
 extern List *generate_join_implied_equalities_for_ecs(PlannerInfo *root,
 													  List *eclasses,
 													  Relids join_relids,
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index b3fbe47..20346f3 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -62,6 +62,44 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
  450 | 0450 | 450 | 0450
 (4 rows)
 
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Sort
+   Sort Key: t1.a
+   ->  Append
+         ->  Merge Join
+               Merge Cond: (t1_1.a = t2_1.a)
+               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t2_1.b
+                     ->  Seq Scan on prt2_p1 t2_1
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_2.a = t2_2.a)
+               ->  Seq Scan on prt1_p2 t1_2
+               ->  Hash
+                     ->  Seq Scan on prt2_p2 t2_2
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_3.a = t2_3.a)
+               ->  Seq Scan on prt1_p3 t1_3
+               ->  Hash
+                     ->  Seq Scan on prt2_p3 t2_3
+                           Filter: (a = b)
+(22 rows)
+
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+ a  |  c   | b  |  c   
+----+------+----+------
+  0 | 0000 |  0 | 0000
+  6 | 0006 |  6 | 0006
+ 12 | 0012 | 12 | 0012
+ 18 | 0018 | 18 | 0018
+ 24 | 0024 | 24 | 0024
+(5 rows)
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index 575ba7b..30c981e 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -34,6 +34,10 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-- 
2.7.4

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#18)
1 attachment(s)
Re: A problem about partitionwise join

Richard Guo <riguo@pivotal.io> writes:

Rebased the patch with latest master and also addressed the test case
failure reported by PostgreSQL Patch Tester.

I looked this patch over, but I don't like it too much: it seems very
brute-force (and badly under-commented). Creating all those extra
RestrictInfos isn't too cheap in itself, plus they'll jam up the
equivalence-class machinery for future tests.

There is already something in equivclass.c that would almost do what
we want here: exprs_known_equal() would tell us whether the partkeys
can be found in the same eclass, without having to generate data
structures along the way. The current implementation is not watertight
because it doesn't check opclass semantics, but that consideration
can be bolted on readily enough. So that leads me to something like
the attached.

One argument that could be made against this approach is that if there
are a lot of partkey expressions, this requires O(N^2) calls to
exprs_known_equal, something that's already not too cheap. I think
that that's not a big problem because the number of partkey expressions
would only be equal to the join degree (ie it doesn't scale with the
number of partitions of the baserels) ... but maybe I'm wrong about
that? I also wonder if it's really necessary to check every pair
of partkey expressions. It seems at least plausible that in the
cases we care about, all the partkeys on each side would be in the same
eclasses anyway, so that comparing the first members of each list would
be sufficient. But I haven't beat on that point.

regards, tom lane

Attachments:

v3-0001-Fix-up-partitionwise-join.patchtext/x-diff; charset=us-ascii; name=v3-0001-Fix-up-partitionwise-join.patchDownload
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4ef1254..7c21692 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -2074,15 +2074,17 @@ reconsider_full_join_clause(PlannerInfo *root, RestrictInfo *rinfo)
  *	  Detect whether two expressions are known equal due to equivalence
  *	  relationships.
  *
- * Actually, this only shows that the expressions are equal according
- * to some opfamily's notion of equality --- but we only use it for
- * selectivity estimation, so a fuzzy idea of equality is OK.
+ * If opfamily is given, the expressions must be known equal per the semantics
+ * of that opfamily (note it has to be a btree opfamily, since those are the
+ * only opfamilies equivclass.c deals with).  If opfamily is InvalidOid, we'll
+ * return true if they're equal according to any opfamily, which is fuzzy but
+ * OK for estimation purposes.
  *
  * Note: does not bother to check for "equal(item1, item2)"; caller must
  * check that case if it's possible to pass identical items.
  */
 bool
-exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
+exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2, Oid opfamily)
 {
 	ListCell   *lc1;
 
@@ -2097,6 +2099,17 @@ exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
 		if (ec->ec_has_volatile)
 			continue;
 
+		/*
+		 * It's okay to consider ec_broken ECs here.  Brokenness just means we
+		 * couldn't derive all the implied clauses we'd have liked to; it does
+		 * not invalidate our knowledge that the members are equal.
+		 */
+
+		/* Ignore if this EC doesn't use specified opfamily */
+		if (OidIsValid(opfamily) &&
+			!list_member_oid(ec->ec_opfamilies, opfamily))
+			continue;
+
 		foreach(lc2, ec->ec_members)
 		{
 			EquivalenceMember *em = (EquivalenceMember *) lfirst(lc2);
@@ -2125,8 +2138,7 @@ exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
  * (In principle there might be more than one matching eclass if multiple
  * collations are involved, but since collation doesn't matter for equality,
  * we ignore that fine point here.)  This is much like exprs_known_equal,
- * except that we insist on the comparison operator matching the eclass, so
- * that the result is definite not approximate.
+ * except for the format of the input.
  */
 EquivalenceClass *
 match_eclasses_to_foreign_key_col(PlannerInfo *root,
@@ -2163,7 +2175,7 @@ match_eclasses_to_foreign_key_col(PlannerInfo *root,
 		/* Never match to a volatile EC */
 		if (ec->ec_has_volatile)
 			continue;
-		/* Note: it seems okay to match to "broken" eclasses here */
+		/* It's okay to consider "broken" ECs here, see exprs_known_equal */
 
 		foreach(lc2, ec->ec_members)
 		{
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index af1fb48..18474d8 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -56,10 +56,10 @@ static List *subbuild_joinrel_joinlist(RelOptInfo *joinrel,
 static void set_foreign_rel_properties(RelOptInfo *joinrel,
 									   RelOptInfo *outer_rel, RelOptInfo *inner_rel);
 static void add_join_rel(PlannerInfo *root, RelOptInfo *joinrel);
-static void build_joinrel_partition_info(RelOptInfo *joinrel,
+static void build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
 										 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
 										 List *restrictlist, JoinType jointype);
-static bool have_partkey_equi_join(RelOptInfo *joinrel,
+static bool have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 								   RelOptInfo *rel1, RelOptInfo *rel2,
 								   JoinType jointype, List *restrictlist);
 static int	match_expr_to_partition_keys(Expr *expr, RelOptInfo *rel,
@@ -715,8 +715,8 @@ build_join_rel(PlannerInfo *root,
 	joinrel->has_eclass_joins = has_relevant_eclass_joinclause(root, joinrel);
 
 	/* Store the partition information. */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
-								 sjinfo->jointype);
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel,
+								 restrictlist, sjinfo->jointype);
 
 	/*
 	 * Set estimates of the joinrel's size.
@@ -879,8 +879,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
 	joinrel->has_eclass_joins = parent_joinrel->has_eclass_joins;
 
 	/* Is the join between partitions itself partitioned? */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
-								 jointype);
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel,
+								 restrictlist, jointype);
 
 	/* Child joinrel is parallel safe if parent is parallel safe. */
 	joinrel->consider_parallel = parent_joinrel->consider_parallel;
@@ -1621,9 +1621,9 @@ find_param_path_info(RelOptInfo *rel, Relids required_outer)
  *		partitioned join relation.
  */
 static void
-build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
-							 RelOptInfo *inner_rel, List *restrictlist,
-							 JoinType jointype)
+build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
+							 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
+							 List *restrictlist, JoinType jointype)
 {
 	PartitionScheme part_scheme;
 
@@ -1649,7 +1649,7 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
 		!outer_rel->consider_partitionwise_join ||
 		!inner_rel->consider_partitionwise_join ||
 		outer_rel->part_scheme != inner_rel->part_scheme ||
-		!have_partkey_equi_join(joinrel, outer_rel, inner_rel,
+		!have_partkey_equi_join(root, joinrel, outer_rel, inner_rel,
 								jointype, restrictlist))
 	{
 		Assert(!IS_PARTITIONED_REL(joinrel));
@@ -1713,15 +1713,14 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
  * partition keys.
  */
 static bool
-have_partkey_equi_join(RelOptInfo *joinrel,
+have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 					   RelOptInfo *rel1, RelOptInfo *rel2,
 					   JoinType jointype, List *restrictlist)
 {
 	PartitionScheme part_scheme = rel1->part_scheme;
-	ListCell   *lc;
-	int			cnt_pks;
 	bool		pk_has_clause[PARTITION_MAX_KEYS];
-	bool		strict_op;
+	int			pks_known_equal;
+	ListCell   *lc;
 
 	/*
 	 * This function must only be called when the joined relations have same
@@ -1730,13 +1729,19 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 	Assert(rel1->part_scheme == rel2->part_scheme);
 	Assert(part_scheme);
 
+	/* We use a bool array to track which partkey columns are known equal */
 	memset(pk_has_clause, 0, sizeof(pk_has_clause));
+	/* ... as well as a count of how many are known equal */
+	pks_known_equal = 0;
+
+	/* First, look through the join's restriction clauses */
 	foreach(lc, restrictlist)
 	{
 		RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);
 		OpExpr	   *opexpr;
 		Expr	   *expr1;
 		Expr	   *expr2;
+		bool		strict_op;
 		int			ipk1;
 		int			ipk2;
 
@@ -1796,11 +1801,15 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 		if (ipk1 != ipk2)
 			continue;
 
+		/* Ignore clause if we already proved these keys equal. */
+		if (pk_has_clause[ipk1])
+			continue;
+
 		/*
 		 * The clause allows partitionwise join only if it uses the same
 		 * operator family as that specified by the partition key.
 		 */
-		if (rel1->part_scheme->strategy == PARTITION_STRATEGY_HASH)
+		if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
 		{
 			if (!OidIsValid(rinfo->hashjoinoperator) ||
 				!op_in_opfamily(rinfo->hashjoinoperator,
@@ -1813,16 +1822,88 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 
 		/* Mark the partition key as having an equi-join clause. */
 		pk_has_clause[ipk1] = true;
+
+		/* We can stop examining clauses once we prove all keys equal. */
+		if (++pks_known_equal == part_scheme->partnatts)
+			return true;
 	}
 
-	/* Check whether every partition key has an equi-join condition. */
-	for (cnt_pks = 0; cnt_pks < part_scheme->partnatts; cnt_pks++)
+	/*
+	 * Also check to see if any keys are known equal by equivclass.c.  In most
+	 * cases there would have been a join restriction clause generated from
+	 * any EC that had such knowledge, but there might be no such clause, or
+	 * it might happen to constrain other members of the ECs than the ones we
+	 * are looking for.
+	 */
+	for (int ipk = 0; ipk < part_scheme->partnatts; ipk++)
 	{
-		if (!pk_has_clause[cnt_pks])
-			return false;
+		Oid			btree_opfamily;
+
+		/* Ignore if we already proved these keys equal. */
+		if (pk_has_clause[ipk])
+			continue;
+
+		/*
+		 * We need a btree opfamily to ask equivclass.c about.  If the
+		 * partopfamily is a hash opfamily, look up its equality operator, and
+		 * select some btree opfamily that that operator is part of.  (Any
+		 * such opfamily should be good enough, since equivclass.c will track
+		 * multiple opfamilies as appropriate.)
+		 */
+		if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
+		{
+			Oid			eq_op;
+			List	   *eq_opfamilies;
+
+			eq_op = get_opfamily_member(part_scheme->partopfamily[ipk],
+										part_scheme->partopcintype[ipk],
+										part_scheme->partopcintype[ipk],
+										HTEqualStrategyNumber);
+			if (!OidIsValid(eq_op))
+				break;			/* we're not going to succeed */
+			eq_opfamilies = get_mergejoin_opfamilies(eq_op);
+			if (eq_opfamilies == NIL)
+				break;			/* we're not going to succeed */
+			btree_opfamily = linitial_oid(eq_opfamilies);
+		}
+		else
+			btree_opfamily = part_scheme->partopfamily[ipk];
+
+		/*
+		 * We consider only non-nullable partition keys here; nullable ones
+		 * would not be treated as part of the same equivalence classes as
+		 * non-nullable ones.
+		 */
+		foreach(lc, rel1->partexprs[ipk])
+		{
+			Node	   *expr1 = (Node *) lfirst(lc);
+			ListCell   *lc2;
+
+			foreach(lc2, rel2->partexprs[ipk])
+			{
+				Node	   *expr2 = (Node *) lfirst(lc2);
+
+				if (exprs_known_equal(root, expr1, expr2, btree_opfamily))
+				{
+					pk_has_clause[ipk] = true;
+					break;
+				}
+			}
+			if (pk_has_clause[ipk])
+				break;
+		}
+
+		if (pk_has_clause[ipk])
+		{
+			/* We can stop examining keys once we prove all keys equal. */
+			if (++pks_known_equal == part_scheme->partnatts)
+				return true;
+		}
+		else
+			break;				/* no chance to succeed, give up */
 	}
 
-	return true;
+	return false;
 }
 
 /*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 4fdcb07..5878a14 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3109,10 +3109,11 @@ add_unique_group_var(PlannerInfo *root, List *varinfos,
 
 		/*
 		 * Drop known-equal vars, but only if they belong to different
-		 * relations (see comments for estimate_num_groups)
+		 * relations (see comments for estimate_num_groups).  We aren't too
+		 * fussy about the semantics of "equal" here.
 		 */
 		if (vardata->rel != varinfo->rel &&
-			exprs_known_equal(root, var, varinfo->var))
+			exprs_known_equal(root, var, varinfo->var, InvalidOid))
 		{
 			if (varinfo->ndistinct <= ndistinct)
 			{
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index c689fe8..3756a44 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -142,7 +142,8 @@ extern List *generate_join_implied_equalities_for_ecs(PlannerInfo *root,
 													  Relids join_relids,
 													  Relids outer_relids,
 													  RelOptInfo *inner_rel);
-extern bool exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2);
+extern bool exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2,
+							  Oid opfamily);
 extern EquivalenceClass *match_eclasses_to_foreign_key_col(PlannerInfo *root,
 														   ForeignKeyOptInfo *fkinfo,
 														   int colno);
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index b3fbe47..9e8eeaf 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -62,6 +62,45 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
  450 | 0450 | 450 | 0450
 (4 rows)
 
+-- inner join with partially-redundant join clauses
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Sort
+   Sort Key: t1.a
+   ->  Append
+         ->  Merge Join
+               Merge Cond: (t1_1.a = t2_1.a)
+               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t2_1.b
+                     ->  Seq Scan on prt2_p1 t2_1
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_2.a = t2_2.a)
+               ->  Seq Scan on prt1_p2 t1_2
+               ->  Hash
+                     ->  Seq Scan on prt2_p2 t2_2
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_3.a = t2_3.a)
+               ->  Seq Scan on prt1_p3 t1_3
+               ->  Hash
+                     ->  Seq Scan on prt2_p3 t2_3
+                           Filter: (a = b)
+(22 rows)
+
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+ a  |  c   | b  |  c   
+----+------+----+------
+  0 | 0000 |  0 | 0000
+  6 | 0006 |  6 | 0006
+ 12 | 0012 | 12 | 0012
+ 18 | 0018 | 18 | 0018
+ 24 | 0024 | 24 | 0024
+(5 rows)
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index 575ba7b..1f218c0 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -34,6 +34,11 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 
+-- inner join with partially-redundant join clauses
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
#20Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#19)
Re: A problem about partitionwise join

On Sun, Apr 5, 2020 at 4:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Richard Guo <riguo@pivotal.io> writes:

Rebased the patch with latest master and also addressed the test case
failure reported by PostgreSQL Patch Tester.

I looked this patch over, but I don't like it too much: it seems very
brute-force (and badly under-commented). Creating all those extra
RestrictInfos isn't too cheap in itself, plus they'll jam up the
equivalence-class machinery for future tests.

Thanks for the review.

There is already something in equivclass.c that would almost do what
we want here: exprs_known_equal() would tell us whether the partkeys
can be found in the same eclass, without having to generate data
structures along the way. The current implementation is not watertight
because it doesn't check opclass semantics, but that consideration
can be bolted on readily enough. So that leads me to something like
the attached.

I looked through this patch and it's much more elegant than the previous
one. Thank you for working on it.

For partkeys which fail to be identified as equal by looking through
restrictlist, it's a good idea to check them in ECs with the help of
exprs_known_equal().

I have some concern about we only check non-nullable partexprs. Is it
possible that two nullable partexprs come from the same EC? I tried to
give an example but failed.

One argument that could be made against this approach is that if there
are a lot of partkey expressions, this requires O(N^2) calls to
exprs_known_equal, something that's already not too cheap. I think
that that's not a big problem because the number of partkey expressions
would only be equal to the join degree (ie it doesn't scale with the
number of partitions of the baserels) ... but maybe I'm wrong about
that?

You are right. According to how partexpr is formed for joinrel in
set_joinrel_partition_key_exprs(), each base relation within the join
contributes one partexpr, so the number of partexprs would be equal to
the join degree.

I also wonder if it's really necessary to check every pair
of partkey expressions. It seems at least plausible that in the
cases we care about, all the partkeys on each side would be in the same
eclasses anyway, so that comparing the first members of each list would
be sufficient. But I haven't beat on that point.

Not sure about it. But cannot come out with a counterexample.

Thanks
Richard

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#20)
Re: A problem about partitionwise join

Richard Guo <guofenglinux@gmail.com> writes:

On Sun, Apr 5, 2020 at 4:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

There is already something in equivclass.c that would almost do what
we want here: exprs_known_equal() would tell us whether the partkeys
can be found in the same eclass, without having to generate data
structures along the way. The current implementation is not watertight
because it doesn't check opclass semantics, but that consideration
can be bolted on readily enough. So that leads me to something like
the attached.

I have some concern about we only check non-nullable partexprs. Is it
possible that two nullable partexprs come from the same EC? I tried to
give an example but failed.

Currently the EC infrastructure doesn't really cope with outer join
equijoins. They are not treated as producing true equivalences,
so I think that the case you're worried about can't occur (which is why
I didn't code for it). I have hopes of being able to incorporate outer
joins into the EC logic in a less squishy way in the future, by making
the representation of Vars distinguish explicitly between
value-before-outer-join and value-after-outer-join, after which we could
make bulletproof assertions about what is equal to what, even with outer
joins in the mix. If that works out it might produce a cleaner answer
in this area too.

TBH, now that I have had some exposure to the partitionwise join
matching logic I don't much like any of it. I feel like it's doing
about the same job as ECs, but in an unprincipled and not very
efficient manner. Right now is no time to redesign it, of course,
but maybe at some point we could do that. (I did experiment with
removing all the rest of have_partkey_equi_join() and having it
*only* ask exprs_known_equal() about equivalences, which is more or
less what I'm envisioning here. That caused some of the existing
regression tests to fail, so there's something that the idea isn't
covering. I didn't dig any further at the time, and in particular
failed to check whether the problems were specifically about outer
joins, which'd be unsurprising given the above.)

Anyway, this work has missed the window for v13, so we've got plenty
of time to think about it.

regards, tom lane

#22Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#21)
Re: A problem about partitionwise join

On Thu, Apr 9, 2020 at 1:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Richard Guo <guofenglinux@gmail.com> writes:

On Sun, Apr 5, 2020 at 4:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

There is already something in equivclass.c that would almost do what
we want here: exprs_known_equal() would tell us whether the partkeys
can be found in the same eclass, without having to generate data
structures along the way. The current implementation is not watertight
because it doesn't check opclass semantics, but that consideration
can be bolted on readily enough. So that leads me to something like
the attached.

I have some concern about we only check non-nullable partexprs. Is it
possible that two nullable partexprs come from the same EC? I tried to
give an example but failed.

Currently the EC infrastructure doesn't really cope with outer join
equijoins. They are not treated as producing true equivalences,
so I think that the case you're worried about can't occur (which is why
I didn't code for it). I have hopes of being able to incorporate outer
joins into the EC logic in a less squishy way in the future, by making
the representation of Vars distinguish explicitly between
value-before-outer-join and value-after-outer-join, after which we could
make bulletproof assertions about what is equal to what, even with outer
joins in the mix. If that works out it might produce a cleaner answer
in this area too.

This is very appealing. Do we have ongoing discussions/threads about
this idea?

(I did experiment with
removing all the rest of have_partkey_equi_join() and having it
*only* ask exprs_known_equal() about equivalences, which is more or
less what I'm envisioning here. That caused some of the existing
regression tests to fail, so there's something that the idea isn't
covering. I didn't dig any further at the time, and in particular
failed to check whether the problems were specifically about outer
joins, which'd be unsurprising given the above.)

I think it would not work for outer joins if we only check
exprs_known_equal() for equivalences. If the equi-join conditions
involving pairs of matching partition keys are outer join quals
mentioning nonnullable side rels, they would not exist in any EC
according to the current EC infrastructure. So we still have to look
through restrictlist.

Thanks
Richard

#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#22)
Re: A problem about partitionwise join

Richard Guo <guofenglinux@gmail.com> writes:

On Thu, Apr 9, 2020 at 1:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I have hopes of being able to incorporate outer
joins into the EC logic in a less squishy way in the future, by making
the representation of Vars distinguish explicitly between
value-before-outer-join and value-after-outer-join, after which we could
make bulletproof assertions about what is equal to what, even with outer
joins in the mix. If that works out it might produce a cleaner answer
in this area too.

This is very appealing. Do we have ongoing discussions/threads about
this idea?

There's some preliminary noodling in this thread:

/messages/by-id/15848.1576515643@sss.pgh.pa.us

I've pushed the earlier work discussed there, but stalled out due to
the call of other responsibilities after posting the currently-last
message in the thread. Hoping to get back into that over the summer.

regards, tom lane

#24Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Richard Guo (#22)
Re: A problem about partitionwise join

I think it would not work for outer joins if we only check
exprs_known_equal() for equivalences. If the equi-join conditions
involving pairs of matching partition keys are outer join quals
mentioning nonnullable side rels, they would not exist in any EC
according to the current EC infrastructure. So we still have to look
through restrictlist.

When I wrote that function and even today, EC didn't accommodate outer
join equality conditions. If we can somehow do that,
have_partkey_equi_join() can be completely eliminated.

--
Best Wishes,
Ashutosh Bapat

#25Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Richard Guo (#22)
Re: A problem about partitionwise join

Status update for a commitfest entry.

According to CFbot this patch fails to apply. Richard, can you send an update, please?

Also, I see that the thread was inactive for a while.
Are you going to continue this work? I think it would be helpful, if you could write a short recap about current state of the patch and list open questions for reviewers.

The new status of this patch is: Waiting on Author

#26Richard Guo
guofenglinux@gmail.com
In reply to: Anastasia Lubennikova (#25)
1 attachment(s)
Re: A problem about partitionwise join

On Fri, Nov 6, 2020 at 11:26 PM Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:

Status update for a commitfest entry.

According to CFbot this patch fails to apply. Richard, can you send an
update, please?

Also, I see that the thread was inactive for a while.
Are you going to continue this work? I think it would be helpful, if you
could write a short recap about current state of the patch and list open
questions for reviewers.

The new status of this patch is: Waiting on Author

Thanks Anastasia. I've rebased the patch with latest master.

To recap, the problem we are fixing here is when generating join clauses
from equivalence classes, we only select the joinclause with the 'best
score', or the first joinclause with a score of 3. This may cause us to
miss some joinclause on partition keys and thus fail to generate
partitionwise join.

The initial idea for the fix is to create all the RestrictInfos from ECs
in order to check whether there exist equi-join conditions involving
pairs of matching partition keys of the relations being joined for all
partition keys. And then Tom proposed a much better idea which leverages
function exprs_known_equal() to tell whether the partkeys can be found
in the same eclass, which is the current implementation in the latest
patch.

Thanks
Richard

Attachments:

v4-0001-Fix-up-partitionwise-join.patchapplication/octet-stream; name=v4-0001-Fix-up-partitionwise-join.patchDownload
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index f98fd7b..8b526d6 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -2200,15 +2200,17 @@ reconsider_full_join_clause(PlannerInfo *root, RestrictInfo *rinfo)
  *	  Detect whether two expressions are known equal due to equivalence
  *	  relationships.
  *
- * Actually, this only shows that the expressions are equal according
- * to some opfamily's notion of equality --- but we only use it for
- * selectivity estimation, so a fuzzy idea of equality is OK.
+ * If opfamily is given, the expressions must be known equal per the semantics
+ * of that opfamily (note it has to be a btree opfamily, since those are the
+ * only opfamilies equivclass.c deals with).  If opfamily is InvalidOid, we'll
+ * return true if they're equal according to any opfamily, which is fuzzy but
+ * OK for estimation purposes.
  *
  * Note: does not bother to check for "equal(item1, item2)"; caller must
  * check that case if it's possible to pass identical items.
  */
 bool
-exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
+exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2, Oid opfamily)
 {
 	ListCell   *lc1;
 
@@ -2223,6 +2225,17 @@ exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
 		if (ec->ec_has_volatile)
 			continue;
 
+		/*
+		 * It's okay to consider ec_broken ECs here.  Brokenness just means we
+		 * couldn't derive all the implied clauses we'd have liked to; it does
+		 * not invalidate our knowledge that the members are equal.
+		 */
+
+		/* Ignore if this EC doesn't use specified opfamily */
+		if (OidIsValid(opfamily) &&
+			!list_member_oid(ec->ec_opfamilies, opfamily))
+			continue;
+
 		foreach(lc2, ec->ec_members)
 		{
 			EquivalenceMember *em = (EquivalenceMember *) lfirst(lc2);
@@ -2251,8 +2264,7 @@ exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
  * (In principle there might be more than one matching eclass if multiple
  * collations are involved, but since collation doesn't matter for equality,
  * we ignore that fine point here.)  This is much like exprs_known_equal,
- * except that we insist on the comparison operator matching the eclass, so
- * that the result is definite not approximate.
+ * except for the format of the input.
  *
  * On success, we also set fkinfo->eclass[colno] to the matching eclass,
  * and set fkinfo->fk_eclass_member[colno] to the eclass member for the
@@ -2293,7 +2305,7 @@ match_eclasses_to_foreign_key_col(PlannerInfo *root,
 		/* Never match to a volatile EC */
 		if (ec->ec_has_volatile)
 			continue;
-		/* Note: it seems okay to match to "broken" eclasses here */
+		/* It's okay to consider "broken" ECs here, see exprs_known_equal */
 
 		foreach(lc2, ec->ec_members)
 		{
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 76245c1..9247e75 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -56,10 +56,10 @@ static List *subbuild_joinrel_joinlist(RelOptInfo *joinrel,
 static void set_foreign_rel_properties(RelOptInfo *joinrel,
 									   RelOptInfo *outer_rel, RelOptInfo *inner_rel);
 static void add_join_rel(PlannerInfo *root, RelOptInfo *joinrel);
-static void build_joinrel_partition_info(RelOptInfo *joinrel,
+static void build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
 										 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
 										 List *restrictlist, JoinType jointype);
-static bool have_partkey_equi_join(RelOptInfo *joinrel,
+static bool have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 								   RelOptInfo *rel1, RelOptInfo *rel2,
 								   JoinType jointype, List *restrictlist);
 static int	match_expr_to_partition_keys(Expr *expr, RelOptInfo *rel,
@@ -717,8 +717,8 @@ build_join_rel(PlannerInfo *root,
 	joinrel->has_eclass_joins = has_relevant_eclass_joinclause(root, joinrel);
 
 	/* Store the partition information. */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
-								 sjinfo->jointype);
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel,
+								 restrictlist, sjinfo->jointype);
 
 	/*
 	 * Set estimates of the joinrel's size.
@@ -882,8 +882,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
 	joinrel->has_eclass_joins = parent_joinrel->has_eclass_joins;
 
 	/* Is the join between partitions itself partitioned? */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
-								 jointype);
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel,
+								 restrictlist, jointype);
 
 	/* Child joinrel is parallel safe if parent is parallel safe. */
 	joinrel->consider_parallel = parent_joinrel->consider_parallel;
@@ -1624,9 +1624,9 @@ find_param_path_info(RelOptInfo *rel, Relids required_outer)
  *		partitioned join relation.
  */
 static void
-build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
-							 RelOptInfo *inner_rel, List *restrictlist,
-							 JoinType jointype)
+build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
+							 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
+							 List *restrictlist, JoinType jointype)
 {
 	PartitionScheme part_scheme;
 
@@ -1652,7 +1652,7 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
 		!outer_rel->consider_partitionwise_join ||
 		!inner_rel->consider_partitionwise_join ||
 		outer_rel->part_scheme != inner_rel->part_scheme ||
-		!have_partkey_equi_join(joinrel, outer_rel, inner_rel,
+		!have_partkey_equi_join(root, joinrel, outer_rel, inner_rel,
 								jointype, restrictlist))
 	{
 		Assert(!IS_PARTITIONED_REL(joinrel));
@@ -1695,15 +1695,14 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
  * partition keys.
  */
 static bool
-have_partkey_equi_join(RelOptInfo *joinrel,
+have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 					   RelOptInfo *rel1, RelOptInfo *rel2,
 					   JoinType jointype, List *restrictlist)
 {
 	PartitionScheme part_scheme = rel1->part_scheme;
-	ListCell   *lc;
-	int			cnt_pks;
 	bool		pk_has_clause[PARTITION_MAX_KEYS];
-	bool		strict_op;
+	int			pks_known_equal;
+	ListCell   *lc;
 
 	/*
 	 * This function must only be called when the joined relations have same
@@ -1712,13 +1711,19 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 	Assert(rel1->part_scheme == rel2->part_scheme);
 	Assert(part_scheme);
 
+	/* We use a bool array to track which partkey columns are known equal */
 	memset(pk_has_clause, 0, sizeof(pk_has_clause));
+	/* ... as well as a count of how many are known equal */
+	pks_known_equal = 0;
+
+	/* First, look through the join's restriction clauses */
 	foreach(lc, restrictlist)
 	{
 		RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);
 		OpExpr	   *opexpr;
 		Expr	   *expr1;
 		Expr	   *expr2;
+		bool		strict_op;
 		int			ipk1;
 		int			ipk2;
 
@@ -1778,11 +1783,15 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 		if (ipk1 != ipk2)
 			continue;
 
+		/* Ignore clause if we already proved these keys equal. */
+		if (pk_has_clause[ipk1])
+			continue;
+
 		/*
 		 * The clause allows partitionwise join only if it uses the same
 		 * operator family as that specified by the partition key.
 		 */
-		if (rel1->part_scheme->strategy == PARTITION_STRATEGY_HASH)
+		if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
 		{
 			if (!OidIsValid(rinfo->hashjoinoperator) ||
 				!op_in_opfamily(rinfo->hashjoinoperator,
@@ -1795,16 +1804,88 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 
 		/* Mark the partition key as having an equi-join clause. */
 		pk_has_clause[ipk1] = true;
+
+		/* We can stop examining clauses once we prove all keys equal. */
+		if (++pks_known_equal == part_scheme->partnatts)
+			return true;
 	}
 
-	/* Check whether every partition key has an equi-join condition. */
-	for (cnt_pks = 0; cnt_pks < part_scheme->partnatts; cnt_pks++)
+	/*
+	 * Also check to see if any keys are known equal by equivclass.c.  In most
+	 * cases there would have been a join restriction clause generated from
+	 * any EC that had such knowledge, but there might be no such clause, or
+	 * it might happen to constrain other members of the ECs than the ones we
+	 * are looking for.
+	 */
+	for (int ipk = 0; ipk < part_scheme->partnatts; ipk++)
 	{
-		if (!pk_has_clause[cnt_pks])
-			return false;
+		Oid			btree_opfamily;
+
+		/* Ignore if we already proved these keys equal. */
+		if (pk_has_clause[ipk])
+			continue;
+
+		/*
+		 * We need a btree opfamily to ask equivclass.c about.  If the
+		 * partopfamily is a hash opfamily, look up its equality operator, and
+		 * select some btree opfamily that that operator is part of.  (Any
+		 * such opfamily should be good enough, since equivclass.c will track
+		 * multiple opfamilies as appropriate.)
+		 */
+		if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
+		{
+			Oid			eq_op;
+			List	   *eq_opfamilies;
+
+			eq_op = get_opfamily_member(part_scheme->partopfamily[ipk],
+										part_scheme->partopcintype[ipk],
+										part_scheme->partopcintype[ipk],
+										HTEqualStrategyNumber);
+			if (!OidIsValid(eq_op))
+				break;			/* we're not going to succeed */
+			eq_opfamilies = get_mergejoin_opfamilies(eq_op);
+			if (eq_opfamilies == NIL)
+				break;			/* we're not going to succeed */
+			btree_opfamily = linitial_oid(eq_opfamilies);
+		}
+		else
+			btree_opfamily = part_scheme->partopfamily[ipk];
+
+		/*
+		 * We consider only non-nullable partition keys here; nullable ones
+		 * would not be treated as part of the same equivalence classes as
+		 * non-nullable ones.
+		 */
+		foreach(lc, rel1->partexprs[ipk])
+		{
+			Node	   *expr1 = (Node *) lfirst(lc);
+			ListCell   *lc2;
+
+			foreach(lc2, rel2->partexprs[ipk])
+			{
+				Node	   *expr2 = (Node *) lfirst(lc2);
+
+				if (exprs_known_equal(root, expr1, expr2, btree_opfamily))
+				{
+					pk_has_clause[ipk] = true;
+					break;
+				}
+			}
+			if (pk_has_clause[ipk])
+				break;
+		}
+
+		if (pk_has_clause[ipk])
+		{
+			/* We can stop examining keys once we prove all keys equal. */
+			if (++pks_known_equal == part_scheme->partnatts)
+				return true;
+		}
+		else
+			break;				/* no chance to succeed, give up */
 	}
 
-	return true;
+	return false;
 }
 
 /*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index bec357f..f75bfe4 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3264,10 +3264,11 @@ add_unique_group_var(PlannerInfo *root, List *varinfos,
 
 		/*
 		 * Drop known-equal vars, but only if they belong to different
-		 * relations (see comments for estimate_num_groups)
+		 * relations (see comments for estimate_num_groups).  We aren't too
+		 * fussy about the semantics of "equal" here.
 		 */
 		if (vardata->rel != varinfo->rel &&
-			exprs_known_equal(root, var, varinfo->var))
+			exprs_known_equal(root, var, varinfo->var, InvalidOid))
 		{
 			if (varinfo->ndistinct <= ndistinct)
 			{
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 8a4c6f8..a866d66 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -146,7 +146,8 @@ extern List *generate_join_implied_equalities_for_ecs(PlannerInfo *root,
 													  Relids join_relids,
 													  Relids outer_relids,
 													  RelOptInfo *inner_rel);
-extern bool exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2);
+extern bool exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2,
+							  Oid opfamily);
 extern EquivalenceClass *match_eclasses_to_foreign_key_col(PlannerInfo *root,
 														   ForeignKeyOptInfo *fkinfo,
 														   int colno);
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 0057f41..d8c7584 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -62,6 +62,45 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
  450 | 0450 | 450 | 0450
 (4 rows)
 
+-- inner join with partially-redundant join clauses
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Sort
+   Sort Key: t1.a
+   ->  Append
+         ->  Merge Join
+               Merge Cond: (t1_1.a = t2_1.a)
+               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t2_1.b
+                     ->  Seq Scan on prt2_p1 t2_1
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_2.a = t2_2.a)
+               ->  Seq Scan on prt1_p2 t1_2
+               ->  Hash
+                     ->  Seq Scan on prt2_p2 t2_2
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_3.a = t2_3.a)
+               ->  Seq Scan on prt1_p3 t1_3
+               ->  Hash
+                     ->  Seq Scan on prt2_p3 t2_3
+                           Filter: (a = b)
+(22 rows)
+
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+ a  |  c   | b  |  c   
+----+------+----+------
+  0 | 0000 |  0 | 0000
+  6 | 0006 |  6 | 0006
+ 12 | 0012 | 12 | 0012
+ 18 | 0018 | 18 | 0018
+ 24 | 0024 | 24 | 0024
+(5 rows)
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index d97b5b6..6b7d689 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -34,6 +34,11 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 
+-- inner join with partially-redundant join clauses
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
#27Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Richard Guo (#26)
Re: A problem about partitionwise join

On Tue, Nov 10, 2020 at 2:43 PM Richard Guo <guofenglinux@gmail.com> wrote:

On Fri, Nov 6, 2020 at 11:26 PM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:

Status update for a commitfest entry.

According to CFbot this patch fails to apply. Richard, can you send an update, please?

Also, I see that the thread was inactive for a while.
Are you going to continue this work? I think it would be helpful, if you could write a short recap about current state of the patch and list open questions for reviewers.

The new status of this patch is: Waiting on Author

Thanks Anastasia. I've rebased the patch with latest master.

To recap, the problem we are fixing here is when generating join clauses
from equivalence classes, we only select the joinclause with the 'best
score', or the first joinclause with a score of 3. This may cause us to
miss some joinclause on partition keys and thus fail to generate
partitionwise join.

The initial idea for the fix is to create all the RestrictInfos from ECs
in order to check whether there exist equi-join conditions involving
pairs of matching partition keys of the relations being joined for all
partition keys. And then Tom proposed a much better idea which leverages
function exprs_known_equal() to tell whether the partkeys can be found
in the same eclass, which is the current implementation in the latest
patch.

In the example you gave earlier, the equi join on partition key was
there but it was replaced by individual constant assignment clauses.
So if we keep the original restrictclause in there with a new flag
indicating that it's redundant, have_partkey_equi_join will still be
able to use it without much change. Depending upon where all we need
to use avoid restrictclauses with the redundant flag, this might be an
easier approach. However, with Tom's idea partition-wise join may be
used even when there is no equi-join between partition keys but there
are clauses like pk = const for all tables involved and const is the
same for all such tables.

In the spirit of small improvement made to the performance of
have_partkey_equi_join(), pk_has_clause should be renamed as
pk_known_equal and pks_known_equal as num_equal_pks.

The loop traversing the partition keys at a given position, may be
optimized further if we pass lists to exprs_known_equal() which in
turns checks whether one expression from each list is member of a
given EC. This will avoid traversing all equivalence classes for each
partition key expression, which can be a huge improvement when there
are many ECs. But I think if one of the partition key expression at a
given position is member of an equivalence class all the other
partition key expressions at that position should be part of that
equivalence class since there should be an equi-join between those. So
the loop in loop may not be required to start with.

--
Best Wishes,
Ashutosh Bapat

#28David Steele
david@pgmasters.net
In reply to: Ashutosh Bapat (#27)
Re: A problem about partitionwise join

On 11/27/20 7:05 AM, Ashutosh Bapat wrote:

On Tue, Nov 10, 2020 at 2:43 PM Richard Guo <guofenglinux@gmail.com> wrote:

To recap, the problem we are fixing here is when generating join clauses
from equivalence classes, we only select the joinclause with the 'best
score', or the first joinclause with a score of 3. This may cause us to
miss some joinclause on partition keys and thus fail to generate
partitionwise join.

The initial idea for the fix is to create all the RestrictInfos from ECs
in order to check whether there exist equi-join conditions involving
pairs of matching partition keys of the relations being joined for all
partition keys. And then Tom proposed a much better idea which leverages
function exprs_known_equal() to tell whether the partkeys can be found
in the same eclass, which is the current implementation in the latest
patch.

In the example you gave earlier, the equi join on partition key was
there but it was replaced by individual constant assignment clauses.
So if we keep the original restrictclause in there with a new flag
indicating that it's redundant, have_partkey_equi_join will still be
able to use it without much change. Depending upon where all we need
to use avoid restrictclauses with the redundant flag, this might be an
easier approach. However, with Tom's idea partition-wise join may be
used even when there is no equi-join between partition keys but there
are clauses like pk = const for all tables involved and const is the
same for all such tables.

In the spirit of small improvement made to the performance of
have_partkey_equi_join(), pk_has_clause should be renamed as
pk_known_equal and pks_known_equal as num_equal_pks.

The loop traversing the partition keys at a given position, may be
optimized further if we pass lists to exprs_known_equal() which in
turns checks whether one expression from each list is member of a
given EC. This will avoid traversing all equivalence classes for each
partition key expression, which can be a huge improvement when there
are many ECs. But I think if one of the partition key expression at a
given position is member of an equivalence class all the other
partition key expressions at that position should be part of that
equivalence class since there should be an equi-join between those. So
the loop in loop may not be required to start with.

Richard, any thoughts on Ashutosh's comments?

Regards,
--
-David
david@pgmasters.net

#29Richard Guo
guofenglinux@gmail.com
In reply to: Ashutosh Bapat (#27)
1 attachment(s)
Re: A problem about partitionwise join

On Fri, Nov 27, 2020 at 8:05 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:

On Tue, Nov 10, 2020 at 2:43 PM Richard Guo <guofenglinux@gmail.com>
wrote:

Thanks Anastasia. I've rebased the patch with latest master.

To recap, the problem we are fixing here is when generating join clauses
from equivalence classes, we only select the joinclause with the 'best
score', or the first joinclause with a score of 3. This may cause us to
miss some joinclause on partition keys and thus fail to generate
partitionwise join.

The initial idea for the fix is to create all the RestrictInfos from ECs
in order to check whether there exist equi-join conditions involving
pairs of matching partition keys of the relations being joined for all
partition keys. And then Tom proposed a much better idea which leverages
function exprs_known_equal() to tell whether the partkeys can be found
in the same eclass, which is the current implementation in the latest
patch.

In the example you gave earlier, the equi join on partition key was
there but it was replaced by individual constant assignment clauses.
So if we keep the original restrictclause in there with a new flag
indicating that it's redundant, have_partkey_equi_join will still be
able to use it without much change. Depending upon where all we need
to use avoid restrictclauses with the redundant flag, this might be an
easier approach. However, with Tom's idea partition-wise join may be
used even when there is no equi-join between partition keys but there
are clauses like pk = const for all tables involved and const is the
same for all such tables.

Correct. So with Tom's idea partition-wise join can cope with clauses
such as 'foo.k1 = bar.k1 and foo.k2 = 16 and bar.k2 = 16'.

In the spirit of small improvement made to the performance of
have_partkey_equi_join(), pk_has_clause should be renamed as
pk_known_equal and pks_known_equal as num_equal_pks.

Thanks for the suggestion. Will do that in the new version of patch.

The loop traversing the partition keys at a given position, may be
optimized further if we pass lists to exprs_known_equal() which in
turns checks whether one expression from each list is member of a
given EC. This will avoid traversing all equivalence classes for each
partition key expression, which can be a huge improvement when there
are many ECs. But I think if one of the partition key expression at a
given position is member of an equivalence class all the other
partition key expressions at that position should be part of that
equivalence class since there should be an equi-join between those. So
the loop in loop may not be required to start with.

Good point. Quote from one of Tom's earlier emails,
"It seems at least plausible that in the cases we care about, all the
partkeys on each side would be in the same eclasses anyway, so that
comparing the first members of each list would be sufficient."

But I'm not sure if this holds true in all cases. However, since each
base relation within the join contributes only one partexpr, the number
of partexprs would only be equal to the join degree. Thus the loop in
loop may not be a big problem?

PS. Sorry for delaying so long time!

Thanks
Richard

Attachments:

v5-0001-Fix-up-partitionwise-join.patchapplication/octet-stream; name=v5-0001-Fix-up-partitionwise-join.patchDownload
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 6f1abbe47d..aa05c60bd8 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -2375,15 +2375,17 @@ reconsider_full_join_clause(PlannerInfo *root, RestrictInfo *rinfo)
  *	  Detect whether two expressions are known equal due to equivalence
  *	  relationships.
  *
- * Actually, this only shows that the expressions are equal according
- * to some opfamily's notion of equality --- but we only use it for
- * selectivity estimation, so a fuzzy idea of equality is OK.
+ * If opfamily is given, the expressions must be known equal per the semantics
+ * of that opfamily (note it has to be a btree opfamily, since those are the
+ * only opfamilies equivclass.c deals with).  If opfamily is InvalidOid, we'll
+ * return true if they're equal according to any opfamily, which is fuzzy but
+ * OK for estimation purposes.
  *
  * Note: does not bother to check for "equal(item1, item2)"; caller must
  * check that case if it's possible to pass identical items.
  */
 bool
-exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
+exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2, Oid opfamily)
 {
 	ListCell   *lc1;
 
@@ -2398,6 +2400,17 @@ exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
 		if (ec->ec_has_volatile)
 			continue;
 
+		/*
+		 * It's okay to consider ec_broken ECs here.  Brokenness just means we
+		 * couldn't derive all the implied clauses we'd have liked to; it does
+		 * not invalidate our knowledge that the members are equal.
+		 */
+
+		/* Ignore if this EC doesn't use specified opfamily */
+		if (OidIsValid(opfamily) &&
+			!list_member_oid(ec->ec_opfamilies, opfamily))
+			continue;
+
 		foreach(lc2, ec->ec_members)
 		{
 			EquivalenceMember *em = (EquivalenceMember *) lfirst(lc2);
@@ -2426,8 +2439,7 @@ exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
  * (In principle there might be more than one matching eclass if multiple
  * collations are involved, but since collation doesn't matter for equality,
  * we ignore that fine point here.)  This is much like exprs_known_equal,
- * except that we insist on the comparison operator matching the eclass, so
- * that the result is definite not approximate.
+ * except for the format of the input.
  *
  * On success, we also set fkinfo->eclass[colno] to the matching eclass,
  * and set fkinfo->fk_eclass_member[colno] to the eclass member for the
@@ -2468,7 +2480,7 @@ match_eclasses_to_foreign_key_col(PlannerInfo *root,
 		/* Never match to a volatile EC */
 		if (ec->ec_has_volatile)
 			continue;
-		/* Note: it seems okay to match to "broken" eclasses here */
+		/* It's okay to consider "broken" ECs here, see exprs_known_equal */
 
 		foreach(lc2, ec->ec_members)
 		{
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e105a4d5f1..40b47ca85e 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -56,10 +56,10 @@ static List *subbuild_joinrel_joinlist(RelOptInfo *joinrel,
 static void set_foreign_rel_properties(RelOptInfo *joinrel,
 									   RelOptInfo *outer_rel, RelOptInfo *inner_rel);
 static void add_join_rel(PlannerInfo *root, RelOptInfo *joinrel);
-static void build_joinrel_partition_info(RelOptInfo *joinrel,
+static void build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
 										 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
 										 List *restrictlist, JoinType jointype);
-static bool have_partkey_equi_join(RelOptInfo *joinrel,
+static bool have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 								   RelOptInfo *rel1, RelOptInfo *rel2,
 								   JoinType jointype, List *restrictlist);
 static int	match_expr_to_partition_keys(Expr *expr, RelOptInfo *rel,
@@ -718,8 +718,8 @@ build_join_rel(PlannerInfo *root,
 	joinrel->has_eclass_joins = has_relevant_eclass_joinclause(root, joinrel);
 
 	/* Store the partition information. */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
-								 sjinfo->jointype);
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel,
+								 restrictlist, sjinfo->jointype);
 
 	/*
 	 * Set estimates of the joinrel's size.
@@ -884,8 +884,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
 	joinrel->has_eclass_joins = parent_joinrel->has_eclass_joins;
 
 	/* Is the join between partitions itself partitioned? */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
-								 jointype);
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel,
+								 restrictlist, jointype);
 
 	/* Child joinrel is parallel safe if parent is parallel safe. */
 	joinrel->consider_parallel = parent_joinrel->consider_parallel;
@@ -1642,9 +1642,9 @@ find_param_path_info(RelOptInfo *rel, Relids required_outer)
  *		partitioned join relation.
  */
 static void
-build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
-							 RelOptInfo *inner_rel, List *restrictlist,
-							 JoinType jointype)
+build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
+							 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
+							 List *restrictlist, JoinType jointype)
 {
 	PartitionScheme part_scheme;
 
@@ -1670,7 +1670,7 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
 		!outer_rel->consider_partitionwise_join ||
 		!inner_rel->consider_partitionwise_join ||
 		outer_rel->part_scheme != inner_rel->part_scheme ||
-		!have_partkey_equi_join(joinrel, outer_rel, inner_rel,
+		!have_partkey_equi_join(root, joinrel, outer_rel, inner_rel,
 								jointype, restrictlist))
 	{
 		Assert(!IS_PARTITIONED_REL(joinrel));
@@ -1713,15 +1713,14 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
  * partition keys.
  */
 static bool
-have_partkey_equi_join(RelOptInfo *joinrel,
+have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 					   RelOptInfo *rel1, RelOptInfo *rel2,
 					   JoinType jointype, List *restrictlist)
 {
 	PartitionScheme part_scheme = rel1->part_scheme;
+	bool		pk_known_equal[PARTITION_MAX_KEYS];
+	int			num_equal_pks;
 	ListCell   *lc;
-	int			cnt_pks;
-	bool		pk_has_clause[PARTITION_MAX_KEYS];
-	bool		strict_op;
 
 	/*
 	 * This function must only be called when the joined relations have same
@@ -1730,13 +1729,19 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 	Assert(rel1->part_scheme == rel2->part_scheme);
 	Assert(part_scheme);
 
-	memset(pk_has_clause, 0, sizeof(pk_has_clause));
+	/* We use a bool array to track which partkey columns are known equal */
+	memset(pk_known_equal, 0, sizeof(pk_known_equal));
+	/* ... as well as a count of how many are known equal */
+	num_equal_pks = 0;
+
+	/* First, look through the join's restriction clauses */
 	foreach(lc, restrictlist)
 	{
 		RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);
 		OpExpr	   *opexpr;
 		Expr	   *expr1;
 		Expr	   *expr2;
+		bool		strict_op;
 		int			ipk1;
 		int			ipk2;
 
@@ -1796,11 +1801,15 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 		if (ipk1 != ipk2)
 			continue;
 
+		/* Ignore clause if we already proved these keys equal. */
+		if (pk_known_equal[ipk1])
+			continue;
+
 		/*
 		 * The clause allows partitionwise join only if it uses the same
 		 * operator family as that specified by the partition key.
 		 */
-		if (rel1->part_scheme->strategy == PARTITION_STRATEGY_HASH)
+		if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
 		{
 			if (!OidIsValid(rinfo->hashjoinoperator) ||
 				!op_in_opfamily(rinfo->hashjoinoperator,
@@ -1812,17 +1821,89 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 			continue;
 
 		/* Mark the partition key as having an equi-join clause. */
-		pk_has_clause[ipk1] = true;
+		pk_known_equal[ipk1] = true;
+
+		/* We can stop examining clauses once we prove all keys equal. */
+		if (++num_equal_pks == part_scheme->partnatts)
+			return true;
 	}
 
-	/* Check whether every partition key has an equi-join condition. */
-	for (cnt_pks = 0; cnt_pks < part_scheme->partnatts; cnt_pks++)
+	/*
+	 * Also check to see if any keys are known equal by equivclass.c.  In most
+	 * cases there would have been a join restriction clause generated from
+	 * any EC that had such knowledge, but there might be no such clause, or
+	 * it might happen to constrain other members of the ECs than the ones we
+	 * are looking for.
+	 */
+	for (int ipk = 0; ipk < part_scheme->partnatts; ipk++)
 	{
-		if (!pk_has_clause[cnt_pks])
-			return false;
+		Oid			btree_opfamily;
+
+		/* Ignore if we already proved these keys equal. */
+		if (pk_known_equal[ipk])
+			continue;
+
+		/*
+		 * We need a btree opfamily to ask equivclass.c about.  If the
+		 * partopfamily is a hash opfamily, look up its equality operator, and
+		 * select some btree opfamily that that operator is part of.  (Any
+		 * such opfamily should be good enough, since equivclass.c will track
+		 * multiple opfamilies as appropriate.)
+		 */
+		if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
+		{
+			Oid			eq_op;
+			List	   *eq_opfamilies;
+
+			eq_op = get_opfamily_member(part_scheme->partopfamily[ipk],
+										part_scheme->partopcintype[ipk],
+										part_scheme->partopcintype[ipk],
+										HTEqualStrategyNumber);
+			if (!OidIsValid(eq_op))
+				break;			/* we're not going to succeed */
+			eq_opfamilies = get_mergejoin_opfamilies(eq_op);
+			if (eq_opfamilies == NIL)
+				break;			/* we're not going to succeed */
+			btree_opfamily = linitial_oid(eq_opfamilies);
+		}
+		else
+			btree_opfamily = part_scheme->partopfamily[ipk];
+
+		/*
+		 * We consider only non-nullable partition keys here; nullable ones
+		 * would not be treated as part of the same equivalence classes as
+		 * non-nullable ones.
+		 */
+		foreach(lc, rel1->partexprs[ipk])
+		{
+			Node	   *expr1 = (Node *) lfirst(lc);
+			ListCell   *lc2;
+
+			foreach(lc2, rel2->partexprs[ipk])
+			{
+				Node	   *expr2 = (Node *) lfirst(lc2);
+
+				if (exprs_known_equal(root, expr1, expr2, btree_opfamily))
+				{
+					pk_known_equal[ipk] = true;
+					break;
+				}
+			}
+			if (pk_known_equal[ipk])
+				break;
+		}
+
+		if (pk_known_equal[ipk])
+		{
+			/* We can stop examining keys once we prove all keys equal. */
+			if (++num_equal_pks == part_scheme->partnatts)
+				return true;
+		}
+		else
+			break;				/* no chance to succeed, give up */
 	}
 
-	return true;
+	return false;
 }
 
 /*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 0c8c05f6c2..090ce8656e 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3265,10 +3265,11 @@ add_unique_group_var(PlannerInfo *root, List *varinfos,
 
 		/*
 		 * Drop known-equal vars, but only if they belong to different
-		 * relations (see comments for estimate_num_groups)
+		 * relations (see comments for estimate_num_groups).  We aren't too
+		 * fussy about the semantics of "equal" here.
 		 */
 		if (vardata->rel != varinfo->rel &&
-			exprs_known_equal(root, var, varinfo->var))
+			exprs_known_equal(root, var, varinfo->var, InvalidOid))
 		{
 			if (varinfo->ndistinct <= ndistinct)
 			{
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index f1d111063c..e6488d3f37 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -157,7 +157,8 @@ extern List *generate_join_implied_equalities_for_ecs(PlannerInfo *root,
 													  Relids join_relids,
 													  Relids outer_relids,
 													  RelOptInfo *inner_rel);
-extern bool exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2);
+extern bool exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2,
+							  Oid opfamily);
 extern EquivalenceClass *match_eclasses_to_foreign_key_col(PlannerInfo *root,
 														   ForeignKeyOptInfo *fkinfo,
 														   int colno);
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 27f7525b3e..2bdee1f804 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -62,6 +62,45 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
  450 | 0450 | 450 | 0450
 (4 rows)
 
+-- inner join with partially-redundant join clauses
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Sort
+   Sort Key: t1.a
+   ->  Append
+         ->  Merge Join
+               Merge Cond: (t1_1.a = t2_1.a)
+               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t2_1.b
+                     ->  Seq Scan on prt2_p1 t2_1
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_2.a = t2_2.a)
+               ->  Seq Scan on prt1_p2 t1_2
+               ->  Hash
+                     ->  Seq Scan on prt2_p2 t2_2
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_3.a = t2_3.a)
+               ->  Seq Scan on prt1_p3 t1_3
+               ->  Hash
+                     ->  Seq Scan on prt2_p3 t2_3
+                           Filter: (a = b)
+(22 rows)
+
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+ a  |  c   | b  |  c   
+----+------+----+------
+  0 | 0000 |  0 | 0000
+  6 | 0006 |  6 | 0006
+ 12 | 0012 | 12 | 0012
+ 18 | 0018 | 18 | 0018
+ 24 | 0024 | 24 | 0024
+(5 rows)
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index d97b5b69ff..6b7d6899dd 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -34,6 +34,11 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 
+-- inner join with partially-redundant join clauses
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
#30Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Richard Guo (#29)
Re: A problem about partitionwise join

On Wed, Jul 21, 2021 at 04:44:53PM +0800, Richard Guo wrote:

On Fri, Nov 27, 2020 at 8:05 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:

In the example you gave earlier, the equi join on partition key was
there but it was replaced by individual constant assignment clauses.
So if we keep the original restrictclause in there with a new flag
indicating that it's redundant, have_partkey_equi_join will still be
able to use it without much change. Depending upon where all we need
to use avoid restrictclauses with the redundant flag, this might be an
easier approach. However, with Tom's idea partition-wise join may be
used even when there is no equi-join between partition keys but there
are clauses like pk = const for all tables involved and const is the
same for all such tables.

Correct. So with Tom's idea partition-wise join can cope with clauses
such as 'foo.k1 = bar.k1 and foo.k2 = 16 and bar.k2 = 16'.

In the spirit of small improvement made to the performance of
have_partkey_equi_join(), pk_has_clause should be renamed as
pk_known_equal and pks_known_equal as num_equal_pks.

Thanks for the suggestion. Will do that in the new version of patch.

Hi Richard,

We are marking this CF entry as "Returned with Feedback", which means
you are encouraged to send a new patch (and create a new entry for a
future CF for it) with the suggested changes.

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

#31Richard Guo
guofenglinux@gmail.com
In reply to: Jaime Casanova (#30)
Re: A problem about partitionwise join

On Wed, Oct 6, 2021 at 1:19 AM Jaime Casanova <jcasanov@systemguards.com.ec>
wrote:

On Wed, Jul 21, 2021 at 04:44:53PM +0800, Richard Guo wrote:

On Fri, Nov 27, 2020 at 8:05 PM Ashutosh Bapat <

ashutosh.bapat.oss@gmail.com>

wrote:

In the example you gave earlier, the equi join on partition key was
there but it was replaced by individual constant assignment clauses.
So if we keep the original restrictclause in there with a new flag
indicating that it's redundant, have_partkey_equi_join will still be
able to use it without much change. Depending upon where all we need
to use avoid restrictclauses with the redundant flag, this might be an
easier approach. However, with Tom's idea partition-wise join may be
used even when there is no equi-join between partition keys but there
are clauses like pk = const for all tables involved and const is the
same for all such tables.

Correct. So with Tom's idea partition-wise join can cope with clauses
such as 'foo.k1 = bar.k1 and foo.k2 = 16 and bar.k2 = 16'.

In the spirit of small improvement made to the performance of
have_partkey_equi_join(), pk_has_clause should be renamed as
pk_known_equal and pks_known_equal as num_equal_pks.

Thanks for the suggestion. Will do that in the new version of patch.

Hi Richard,

We are marking this CF entry as "Returned with Feedback", which means
you are encouraged to send a new patch (and create a new entry for a
future CF for it) with the suggested changes.

Hi,

The suggested changes have already been included in v5 patch. Sorry for
the confusion.

Verified that the patch still applies and works on latest master. So I'm
moving it to the next CF (which is Commitfest 2022-01). Please correct
me if this is not the right thing to do.

Thanks
Richard

#32Richard Guo
guofenglinux@gmail.com
In reply to: Richard Guo (#31)
1 attachment(s)
Re: A problem about partitionwise join

On Mon, Nov 22, 2021 at 3:04 PM Richard Guo <guofenglinux@gmail.com> wrote:

The suggested changes have already been included in v5 patch. Sorry for
the confusion.

Verified that the patch still applies and works on latest master. So I'm
moving it to the next CF (which is Commitfest 2022-01). Please correct
me if this is not the right thing to do.

Rebased the patch with latest master. Appreciate any comments.

Thanks
Richard

Attachments:

v6-0001-Fix-up-partitionwise-join.patchapplication/octet-stream; name=v6-0001-Fix-up-partitionwise-join.patchDownload
From 90c04624f24cb147820e5db98f66fb3b48f58c9b Mon Sep 17 00:00:00 2001
From: pgsql-guo <richard.guo@openpie.com>
Date: Mon, 25 Apr 2022 07:03:42 +0000
Subject: [PATCH v6] Fix up partitionwise join

---
 src/backend/optimizer/path/equivclass.c      |  26 ++--
 src/backend/optimizer/util/relnode.c         | 125 +++++++++++++++----
 src/backend/utils/adt/selfuncs.c             |   5 +-
 src/include/optimizer/paths.h                |   3 +-
 src/test/regress/expected/partition_join.out |  39 ++++++
 src/test/regress/sql/partition_join.sql      |   5 +
 6 files changed, 171 insertions(+), 32 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 34c5ab1cb6..b034c95c9d 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -2357,15 +2357,17 @@ reconsider_full_join_clause(PlannerInfo *root, RestrictInfo *rinfo)
  *	  Detect whether two expressions are known equal due to equivalence
  *	  relationships.
  *
- * Actually, this only shows that the expressions are equal according
- * to some opfamily's notion of equality --- but we only use it for
- * selectivity estimation, so a fuzzy idea of equality is OK.
+ * If opfamily is given, the expressions must be known equal per the semantics
+ * of that opfamily (note it has to be a btree opfamily, since those are the
+ * only opfamilies equivclass.c deals with).  If opfamily is InvalidOid, we'll
+ * return true if they're equal according to any opfamily, which is fuzzy but
+ * OK for estimation purposes.
  *
  * Note: does not bother to check for "equal(item1, item2)"; caller must
  * check that case if it's possible to pass identical items.
  */
 bool
-exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
+exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2, Oid opfamily)
 {
 	ListCell   *lc1;
 
@@ -2380,6 +2382,17 @@ exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
 		if (ec->ec_has_volatile)
 			continue;
 
+		/*
+		 * It's okay to consider ec_broken ECs here.  Brokenness just means we
+		 * couldn't derive all the implied clauses we'd have liked to; it does
+		 * not invalidate our knowledge that the members are equal.
+		 */
+
+		/* Ignore if this EC doesn't use specified opfamily */
+		if (OidIsValid(opfamily) &&
+			!list_member_oid(ec->ec_opfamilies, opfamily))
+			continue;
+
 		foreach(lc2, ec->ec_members)
 		{
 			EquivalenceMember *em = (EquivalenceMember *) lfirst(lc2);
@@ -2408,8 +2421,7 @@ exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
  * (In principle there might be more than one matching eclass if multiple
  * collations are involved, but since collation doesn't matter for equality,
  * we ignore that fine point here.)  This is much like exprs_known_equal,
- * except that we insist on the comparison operator matching the eclass, so
- * that the result is definite not approximate.
+ * except for the format of the input.
  *
  * On success, we also set fkinfo->eclass[colno] to the matching eclass,
  * and set fkinfo->fk_eclass_member[colno] to the eclass member for the
@@ -2450,7 +2462,7 @@ match_eclasses_to_foreign_key_col(PlannerInfo *root,
 		/* Never match to a volatile EC */
 		if (ec->ec_has_volatile)
 			continue;
-		/* Note: it seems okay to match to "broken" eclasses here */
+		/* It's okay to consider "broken" ECs here, see exprs_known_equal */
 
 		foreach(lc2, ec->ec_members)
 		{
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 520409f4ba..7bca3074eb 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -56,10 +56,10 @@ static List *subbuild_joinrel_joinlist(RelOptInfo *joinrel,
 static void set_foreign_rel_properties(RelOptInfo *joinrel,
 									   RelOptInfo *outer_rel, RelOptInfo *inner_rel);
 static void add_join_rel(PlannerInfo *root, RelOptInfo *joinrel);
-static void build_joinrel_partition_info(RelOptInfo *joinrel,
+static void build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
 										 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
 										 List *restrictlist, JoinType jointype);
-static bool have_partkey_equi_join(RelOptInfo *joinrel,
+static bool have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 								   RelOptInfo *rel1, RelOptInfo *rel2,
 								   JoinType jointype, List *restrictlist);
 static int	match_expr_to_partition_keys(Expr *expr, RelOptInfo *rel,
@@ -720,8 +720,8 @@ build_join_rel(PlannerInfo *root,
 	joinrel->has_eclass_joins = has_relevant_eclass_joinclause(root, joinrel);
 
 	/* Store the partition information. */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
-								 sjinfo->jointype);
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel,
+								 restrictlist, sjinfo->jointype);
 
 	/*
 	 * Set estimates of the joinrel's size.
@@ -887,8 +887,8 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
 	joinrel->has_eclass_joins = parent_joinrel->has_eclass_joins;
 
 	/* Is the join between partitions itself partitioned? */
-	build_joinrel_partition_info(joinrel, outer_rel, inner_rel, restrictlist,
-								 jointype);
+	build_joinrel_partition_info(root, joinrel, outer_rel, inner_rel,
+								 restrictlist, jointype);
 
 	/* Child joinrel is parallel safe if parent is parallel safe. */
 	joinrel->consider_parallel = parent_joinrel->consider_parallel;
@@ -1645,9 +1645,9 @@ find_param_path_info(RelOptInfo *rel, Relids required_outer)
  *		partitioned join relation.
  */
 static void
-build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
-							 RelOptInfo *inner_rel, List *restrictlist,
-							 JoinType jointype)
+build_joinrel_partition_info(PlannerInfo *root, RelOptInfo *joinrel,
+							 RelOptInfo *outer_rel, RelOptInfo *inner_rel,
+							 List *restrictlist, JoinType jointype)
 {
 	PartitionScheme part_scheme;
 
@@ -1673,7 +1673,7 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
 		!outer_rel->consider_partitionwise_join ||
 		!inner_rel->consider_partitionwise_join ||
 		outer_rel->part_scheme != inner_rel->part_scheme ||
-		!have_partkey_equi_join(joinrel, outer_rel, inner_rel,
+		!have_partkey_equi_join(root, joinrel, outer_rel, inner_rel,
 								jointype, restrictlist))
 	{
 		Assert(!IS_PARTITIONED_REL(joinrel));
@@ -1716,15 +1716,14 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
  * partition keys.
  */
 static bool
-have_partkey_equi_join(RelOptInfo *joinrel,
+have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 					   RelOptInfo *rel1, RelOptInfo *rel2,
 					   JoinType jointype, List *restrictlist)
 {
 	PartitionScheme part_scheme = rel1->part_scheme;
+	bool		pk_known_equal[PARTITION_MAX_KEYS];
+	int			num_equal_pks;
 	ListCell   *lc;
-	int			cnt_pks;
-	bool		pk_has_clause[PARTITION_MAX_KEYS];
-	bool		strict_op;
 
 	/*
 	 * This function must only be called when the joined relations have same
@@ -1733,13 +1732,19 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 	Assert(rel1->part_scheme == rel2->part_scheme);
 	Assert(part_scheme);
 
-	memset(pk_has_clause, 0, sizeof(pk_has_clause));
+	/* We use a bool array to track which partkey columns are known equal */
+	memset(pk_known_equal, 0, sizeof(pk_known_equal));
+	/* ... as well as a count of how many are known equal */
+	num_equal_pks = 0;
+
+	/* First, look through the join's restriction clauses */
 	foreach(lc, restrictlist)
 	{
 		RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);
 		OpExpr	   *opexpr;
 		Expr	   *expr1;
 		Expr	   *expr2;
+		bool		strict_op;
 		int			ipk1;
 		int			ipk2;
 
@@ -1799,11 +1804,15 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 		if (ipk1 != ipk2)
 			continue;
 
+		/* Ignore clause if we already proved these keys equal. */
+		if (pk_known_equal[ipk1])
+			continue;
+
 		/*
 		 * The clause allows partitionwise join only if it uses the same
 		 * operator family as that specified by the partition key.
 		 */
-		if (rel1->part_scheme->strategy == PARTITION_STRATEGY_HASH)
+		if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
 		{
 			if (!OidIsValid(rinfo->hashjoinoperator) ||
 				!op_in_opfamily(rinfo->hashjoinoperator,
@@ -1815,17 +1824,89 @@ have_partkey_equi_join(RelOptInfo *joinrel,
 			continue;
 
 		/* Mark the partition key as having an equi-join clause. */
-		pk_has_clause[ipk1] = true;
+		pk_known_equal[ipk1] = true;
+
+		/* We can stop examining clauses once we prove all keys equal. */
+		if (++num_equal_pks == part_scheme->partnatts)
+			return true;
 	}
 
-	/* Check whether every partition key has an equi-join condition. */
-	for (cnt_pks = 0; cnt_pks < part_scheme->partnatts; cnt_pks++)
+	/*
+	 * Also check to see if any keys are known equal by equivclass.c.  In most
+	 * cases there would have been a join restriction clause generated from
+	 * any EC that had such knowledge, but there might be no such clause, or
+	 * it might happen to constrain other members of the ECs than the ones we
+	 * are looking for.
+	 */
+	for (int ipk = 0; ipk < part_scheme->partnatts; ipk++)
 	{
-		if (!pk_has_clause[cnt_pks])
-			return false;
+		Oid			btree_opfamily;
+
+		/* Ignore if we already proved these keys equal. */
+		if (pk_known_equal[ipk])
+			continue;
+
+		/*
+		 * We need a btree opfamily to ask equivclass.c about.  If the
+		 * partopfamily is a hash opfamily, look up its equality operator, and
+		 * select some btree opfamily that that operator is part of.  (Any
+		 * such opfamily should be good enough, since equivclass.c will track
+		 * multiple opfamilies as appropriate.)
+		 */
+		if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
+		{
+			Oid			eq_op;
+			List	   *eq_opfamilies;
+
+			eq_op = get_opfamily_member(part_scheme->partopfamily[ipk],
+										part_scheme->partopcintype[ipk],
+										part_scheme->partopcintype[ipk],
+										HTEqualStrategyNumber);
+			if (!OidIsValid(eq_op))
+				break;			/* we're not going to succeed */
+			eq_opfamilies = get_mergejoin_opfamilies(eq_op);
+			if (eq_opfamilies == NIL)
+				break;			/* we're not going to succeed */
+			btree_opfamily = linitial_oid(eq_opfamilies);
+		}
+		else
+			btree_opfamily = part_scheme->partopfamily[ipk];
+
+		/*
+		 * We consider only non-nullable partition keys here; nullable ones
+		 * would not be treated as part of the same equivalence classes as
+		 * non-nullable ones.
+		 */
+		foreach(lc, rel1->partexprs[ipk])
+		{
+			Node	   *expr1 = (Node *) lfirst(lc);
+			ListCell   *lc2;
+
+			foreach(lc2, rel2->partexprs[ipk])
+			{
+				Node	   *expr2 = (Node *) lfirst(lc2);
+
+				if (exprs_known_equal(root, expr1, expr2, btree_opfamily))
+				{
+					pk_known_equal[ipk] = true;
+					break;
+				}
+			}
+			if (pk_known_equal[ipk])
+				break;
+		}
+
+		if (pk_known_equal[ipk])
+		{
+			/* We can stop examining keys once we prove all keys equal. */
+			if (++num_equal_pks == part_scheme->partnatts)
+				return true;
+		}
+		else
+			break;				/* no chance to succeed, give up */
 	}
 
-	return true;
+	return false;
 }
 
 /*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 71cbc1c3d8..fc38a71ace 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3265,10 +3265,11 @@ add_unique_group_var(PlannerInfo *root, List *varinfos,
 
 		/*
 		 * Drop known-equal vars, but only if they belong to different
-		 * relations (see comments for estimate_num_groups)
+		 * relations (see comments for estimate_num_groups).  We aren't too
+		 * fussy about the semantics of "equal" here.
 		 */
 		if (vardata->rel != varinfo->rel &&
-			exprs_known_equal(root, var, varinfo->var))
+			exprs_known_equal(root, var, varinfo->var, InvalidOid))
 		{
 			if (varinfo->ndistinct <= ndistinct)
 			{
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 3d95e6bfc8..6b67181f42 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -157,7 +157,8 @@ extern List *generate_join_implied_equalities_for_ecs(PlannerInfo *root,
 													  Relids join_relids,
 													  Relids outer_relids,
 													  RelOptInfo *inner_rel);
-extern bool exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2);
+extern bool exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2,
+							  Oid opfamily);
 extern EquivalenceClass *match_eclasses_to_foreign_key_col(PlannerInfo *root,
 														   ForeignKeyOptInfo *fkinfo,
 														   int colno);
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 03926a8413..576ef85a96 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -62,6 +62,45 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
  450 | 0450 | 450 | 0450
 (4 rows)
 
+-- inner join with partially-redundant join clauses
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Sort
+   Sort Key: t1.a
+   ->  Append
+         ->  Merge Join
+               Merge Cond: (t1_1.a = t2_1.a)
+               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t2_1.b
+                     ->  Seq Scan on prt2_p1 t2_1
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_2.a = t2_2.a)
+               ->  Seq Scan on prt1_p2 t1_2
+               ->  Hash
+                     ->  Seq Scan on prt2_p2 t2_2
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_3.a = t2_3.a)
+               ->  Seq Scan on prt1_p3 t1_3
+               ->  Hash
+                     ->  Seq Scan on prt2_p3 t2_3
+                           Filter: (a = b)
+(22 rows)
+
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+ a  |  c   | b  |  c   
+----+------+----+------
+  0 | 0000 |  0 | 0000
+  6 | 0006 |  6 | 0006
+ 12 | 0012 | 12 | 0012
+ 18 | 0018 | 18 | 0018
+ 24 | 0024 | 24 | 0024
+(5 rows)
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index 67f506361f..8162ea4172 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -34,6 +34,11 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 
+-- inner join with partially-redundant join clauses
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+
 -- left outer join, with whole-row reference; partitionwise join does not apply
 EXPLAIN (COSTS OFF)
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-- 
2.25.1

#33Jacob Champion
jchampion@timescale.com
In reply to: Richard Guo (#32)
Re: A problem about partitionwise join

As discussed in [1]/messages/by-id/flat/0ab66589-2f71-69b3-2002-49e821740b0d@timescale.com, we're taking this opportunity to return some
patchsets that don't appear to be getting enough reviewer interest.

This is not a rejection, since we don't necessarily think there's
anything unacceptable about the entry, but it differs from a standard
"Returned with Feedback" in that there's probably not much actionable
feedback at all. Rather than code changes, what this patch needs is more
community interest. You might

- ask people for help with your approach,
- see if there are similar patches that your code could supplement,
- get interested parties to agree to review your patch in a CF, or
- possibly present the functionality in a way that's easier to review
overall.

(Doing these things is no guarantee that there will be interest, but
it's hopefully better than endlessly rebasing a patchset that is not
receiving any feedback from the community.)

Once you think you've built up some community support and the patchset
is ready for review, you (or any interested party) can resurrect the
patch entry by visiting

https://commitfest.postgresql.org/38/2266/

and changing the status to "Needs Review", and then changing the
status again to "Move to next CF". (Don't forget the second step;
hopefully we will have streamlined this in the near future!)

Thanks,
--Jacob

[1]: /messages/by-id/flat/0ab66589-2f71-69b3-2002-49e821740b0d@timescale.com
/messages/by-id/flat/0ab66589-2f71-69b3-2002-49e821740b0d@timescale.com

#34Richard Guo
guofenglinux@gmail.com
In reply to: Jacob Champion (#33)
1 attachment(s)
Re: A problem about partitionwise join

On Tue, Aug 2, 2022 at 4:24 AM Jacob Champion <jchampion@timescale.com>
wrote:

Once you think you've built up some community support and the patchset
is ready for review, you (or any interested party) can resurrect the
patch entry by visiting

https://commitfest.postgresql.org/38/2266/

and changing the status to "Needs Review", and then changing the
status again to "Move to next CF". (Don't forget the second step;
hopefully we will have streamlined this in the near future!)

This patch was returned due to 'lack of interest'. However, upon
verification, it appears that the reported issue still exists, and the
proposed fix in the thread remains valid. Hence, resurrect this patch
after rebasing it on master. I've also written a detailed commit
message which hopefully can help people review the changes more
effectively.

Thanks
Richard

Attachments:

v7-0001-Fix-partitionwise-join-with-partially-redundant-join-clauses.patchapplication/octet-stream; name=v7-0001-Fix-partitionwise-join-with-partially-redundant-join-clauses.patchDownload
From 5f722a6d31b166c8a0597081f9ff5c0e54effac6 Mon Sep 17 00:00:00 2001
From: Richard Guo <guofenglinux@gmail.com>
Date: Wed, 21 Feb 2024 16:54:43 +0800
Subject: [PATCH v7] Fix partitionwise join with partially-redundant join
 clauses

To determine if the two relations being joined can use partitionwise
join, we need to verify the existence of equi-join conditions involving
all the partition keys.  Currently we do that by looking through the
join's restriction clauses.  However, it has been discovered that this
approach is insufficient, because there might be partition keys known
equal by a specific EC, but they do not form a join clause because it
happens that other members of the EC than the partition keys are
constrained to become a join clause.

To address this issue, in addition to examining the join's restriction
clauses, we also check if any partition keys are known equal by ECs, by
leveraging function exprs_known_equal().  To accomplish this, we enhance
exprs_known_equal() to check equality per the semantics of the opfamily,
if provided.
---
 src/backend/optimizer/path/equivclass.c      |  26 +++--
 src/backend/optimizer/util/relnode.c         | 103 +++++++++++++++++--
 src/backend/utils/adt/selfuncs.c             |   5 +-
 src/include/optimizer/paths.h                |   3 +-
 src/test/regress/expected/partition_join.out |  39 +++++++
 src/test/regress/sql/partition_join.sql      |   5 +
 6 files changed, 160 insertions(+), 21 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 4bd60a09c6..1890dbb852 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -2439,15 +2439,17 @@ find_join_domain(PlannerInfo *root, Relids relids)
  *	  Detect whether two expressions are known equal due to equivalence
  *	  relationships.
  *
- * Actually, this only shows that the expressions are equal according
- * to some opfamily's notion of equality --- but we only use it for
- * selectivity estimation, so a fuzzy idea of equality is OK.
+ * If opfamily is given, the expressions must be known equal per the semantics
+ * of that opfamily (note it has to be a btree opfamily, since those are the
+ * only opfamilies equivclass.c deals with).  If opfamily is InvalidOid, we'll
+ * return true if they're equal according to any opfamily, which is fuzzy but
+ * OK for estimation purposes.
  *
  * Note: does not bother to check for "equal(item1, item2)"; caller must
  * check that case if it's possible to pass identical items.
  */
 bool
-exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
+exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2, Oid opfamily)
 {
 	ListCell   *lc1;
 
@@ -2462,6 +2464,17 @@ exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
 		if (ec->ec_has_volatile)
 			continue;
 
+		/*
+		 * It's okay to consider ec_broken ECs here.  Brokenness just means we
+		 * couldn't derive all the implied clauses we'd have liked to; it does
+		 * not invalidate our knowledge that the members are equal.
+		 */
+
+		/* Ignore if this EC doesn't use specified opfamily */
+		if (OidIsValid(opfamily) &&
+			!list_member_oid(ec->ec_opfamilies, opfamily))
+			continue;
+
 		foreach(lc2, ec->ec_members)
 		{
 			EquivalenceMember *em = (EquivalenceMember *) lfirst(lc2);
@@ -2490,8 +2503,7 @@ exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2)
  * (In principle there might be more than one matching eclass if multiple
  * collations are involved, but since collation doesn't matter for equality,
  * we ignore that fine point here.)  This is much like exprs_known_equal,
- * except that we insist on the comparison operator matching the eclass, so
- * that the result is definite not approximate.
+ * except for the format of the input.
  *
  * On success, we also set fkinfo->eclass[colno] to the matching eclass,
  * and set fkinfo->fk_eclass_member[colno] to the eclass member for the
@@ -2532,7 +2544,7 @@ match_eclasses_to_foreign_key_col(PlannerInfo *root,
 		/* Never match to a volatile EC */
 		if (ec->ec_has_volatile)
 			continue;
-		/* Note: it seems okay to match to "broken" eclasses here */
+		/* It's okay to consider "broken" ECs here, see exprs_known_equal */
 
 		foreach(lc2, ec->ec_members)
 		{
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e5f4062bfb..6e1cd9490b 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -2073,10 +2073,9 @@ have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 					   JoinType jointype, List *restrictlist)
 {
 	PartitionScheme part_scheme = rel1->part_scheme;
+	bool		pk_known_equal[PARTITION_MAX_KEYS];
+	int			num_equal_pks;
 	ListCell   *lc;
-	int			cnt_pks;
-	bool		pk_has_clause[PARTITION_MAX_KEYS];
-	bool		strict_op;
 
 	/*
 	 * This function must only be called when the joined relations have same
@@ -2085,13 +2084,19 @@ have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 	Assert(rel1->part_scheme == rel2->part_scheme);
 	Assert(part_scheme);
 
-	memset(pk_has_clause, 0, sizeof(pk_has_clause));
+	/* We use a bool array to track which partkey columns are known equal */
+	memset(pk_known_equal, 0, sizeof(pk_known_equal));
+	/* ... as well as a count of how many are known equal */
+	num_equal_pks = 0;
+
+	/* First, look through the join's restriction clauses */
 	foreach(lc, restrictlist)
 	{
 		RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);
 		OpExpr	   *opexpr;
 		Expr	   *expr1;
 		Expr	   *expr2;
+		bool		strict_op;
 		int			ipk1;
 		int			ipk2;
 
@@ -2169,11 +2174,15 @@ have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 		if (ipk1 != ipk2)
 			continue;
 
+		/* Ignore clause if we already proved these keys equal. */
+		if (pk_known_equal[ipk1])
+			continue;
+
 		/*
 		 * The clause allows partitionwise join only if it uses the same
 		 * operator family as that specified by the partition key.
 		 */
-		if (rel1->part_scheme->strategy == PARTITION_STRATEGY_HASH)
+		if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
 		{
 			if (!OidIsValid(rinfo->hashjoinoperator) ||
 				!op_in_opfamily(rinfo->hashjoinoperator,
@@ -2185,17 +2194,89 @@ have_partkey_equi_join(PlannerInfo *root, RelOptInfo *joinrel,
 			continue;
 
 		/* Mark the partition key as having an equi-join clause. */
-		pk_has_clause[ipk1] = true;
+		pk_known_equal[ipk1] = true;
+
+		/* We can stop examining clauses once we prove all keys equal. */
+		if (++num_equal_pks == part_scheme->partnatts)
+			return true;
 	}
 
-	/* Check whether every partition key has an equi-join condition. */
-	for (cnt_pks = 0; cnt_pks < part_scheme->partnatts; cnt_pks++)
+	/*
+	 * Also check to see if any keys are known equal by equivclass.c.  In most
+	 * cases there would have been a join restriction clause generated from
+	 * any EC that had such knowledge, but there might be no such clause, or
+	 * it might happen to constrain other members of the ECs than the ones we
+	 * are looking for.
+	 */
+	for (int ipk = 0; ipk < part_scheme->partnatts; ipk++)
 	{
-		if (!pk_has_clause[cnt_pks])
-			return false;
+		Oid			btree_opfamily;
+
+		/* Ignore if we already proved these keys equal. */
+		if (pk_known_equal[ipk])
+			continue;
+
+		/*
+		 * We need a btree opfamily to ask equivclass.c about.  If the
+		 * partopfamily is a hash opfamily, look up its equality operator, and
+		 * select some btree opfamily that that operator is part of.  (Any
+		 * such opfamily should be good enough, since equivclass.c will track
+		 * multiple opfamilies as appropriate.)
+		 */
+		if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
+		{
+			Oid			eq_op;
+			List	   *eq_opfamilies;
+
+			eq_op = get_opfamily_member(part_scheme->partopfamily[ipk],
+										part_scheme->partopcintype[ipk],
+										part_scheme->partopcintype[ipk],
+										HTEqualStrategyNumber);
+			if (!OidIsValid(eq_op))
+				break;			/* we're not going to succeed */
+			eq_opfamilies = get_mergejoin_opfamilies(eq_op);
+			if (eq_opfamilies == NIL)
+				break;			/* we're not going to succeed */
+			btree_opfamily = linitial_oid(eq_opfamilies);
+		}
+		else
+			btree_opfamily = part_scheme->partopfamily[ipk];
+
+		/*
+		 * We consider only non-nullable partition keys here; nullable ones
+		 * would not be treated as part of the same equivalence classes as
+		 * non-nullable ones.
+		 */
+		foreach(lc, rel1->partexprs[ipk])
+		{
+			Node	   *expr1 = (Node *) lfirst(lc);
+			ListCell   *lc2;
+
+			foreach(lc2, rel2->partexprs[ipk])
+			{
+				Node	   *expr2 = (Node *) lfirst(lc2);
+
+				if (exprs_known_equal(root, expr1, expr2, btree_opfamily))
+				{
+					pk_known_equal[ipk] = true;
+					break;
+				}
+			}
+			if (pk_known_equal[ipk])
+				break;
+		}
+
+		if (pk_known_equal[ipk])
+		{
+			/* We can stop examining keys once we prove all keys equal. */
+			if (++num_equal_pks == part_scheme->partnatts)
+				return true;
+		}
+		else
+			break;				/* no chance to succeed, give up */
 	}
 
-	return true;
+	return false;
 }
 
 /*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index cea777e9d4..d1365229f7 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3313,10 +3313,11 @@ add_unique_group_var(PlannerInfo *root, List *varinfos,
 
 		/*
 		 * Drop known-equal vars, but only if they belong to different
-		 * relations (see comments for estimate_num_groups)
+		 * relations (see comments for estimate_num_groups).  We aren't too
+		 * fussy about the semantics of "equal" here.
 		 */
 		if (vardata->rel != varinfo->rel &&
-			exprs_known_equal(root, var, varinfo->var))
+			exprs_known_equal(root, var, varinfo->var, InvalidOid))
 		{
 			if (varinfo->ndistinct <= ndistinct)
 			{
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0e8a9c94ba..2ad42ccadf 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -159,7 +159,8 @@ extern List *generate_join_implied_equalities_for_ecs(PlannerInfo *root,
 													  Relids join_relids,
 													  Relids outer_relids,
 													  RelOptInfo *inner_rel);
-extern bool exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2);
+extern bool exprs_known_equal(PlannerInfo *root, Node *item1, Node *item2,
+							  Oid opfamily);
 extern EquivalenceClass *match_eclasses_to_foreign_key_col(PlannerInfo *root,
 														   ForeignKeyOptInfo *fkinfo,
 														   int colno);
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 6560fe2416..909daf33dc 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -62,6 +62,45 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b =
  450 | 0450 | 450 | 0450
 (4 rows)
 
+-- inner join with partially-redundant join clauses
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+                          QUERY PLAN                           
+---------------------------------------------------------------
+ Sort
+   Sort Key: t1.a
+   ->  Append
+         ->  Merge Join
+               Merge Cond: (t1_1.a = t2_1.a)
+               ->  Index Scan using iprt1_p1_a on prt1_p1 t1_1
+               ->  Sort
+                     Sort Key: t2_1.b
+                     ->  Seq Scan on prt2_p1 t2_1
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_2.a = t2_2.a)
+               ->  Seq Scan on prt1_p2 t1_2
+               ->  Hash
+                     ->  Seq Scan on prt2_p2 t2_2
+                           Filter: (a = b)
+         ->  Hash Join
+               Hash Cond: (t1_3.a = t2_3.a)
+               ->  Seq Scan on prt1_p3 t1_3
+               ->  Hash
+                     ->  Seq Scan on prt2_p3 t2_3
+                           Filter: (a = b)
+(22 rows)
+
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+ a  |  c   | b  |  c   
+----+------+----+------
+  0 | 0000 |  0 | 0000
+  6 | 0006 |  6 | 0006
+ 12 | 0012 | 12 | 0012
+ 18 | 0018 | 18 | 0018
+ 24 | 0024 | 24 | 0024
+(5 rows)
+
 -- left outer join, 3-way
 EXPLAIN (COSTS OFF)
 SELECT COUNT(*) FROM prt1 t1
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index 48daf3aee3..eea6df580e 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -34,6 +34,11 @@ EXPLAIN (COSTS OFF)
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
 
+-- inner join with partially-redundant join clauses
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;
+
 -- left outer join, 3-way
 EXPLAIN (COSTS OFF)
 SELECT COUNT(*) FROM prt1 t1
-- 
2.31.0

#35Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Richard Guo (#34)
Re: A problem about partitionwise join

On Wed, Feb 21, 2024 at 4:55 PM Richard Guo <guofenglinux@gmail.com> wrote:

On Tue, Aug 2, 2022 at 4:24 AM Jacob Champion <jchampion@timescale.com> wrote:

Once you think you've built up some community support and the patchset
is ready for review, you (or any interested party) can resurrect the
patch entry by visiting

https://commitfest.postgresql.org/38/2266/

and changing the status to "Needs Review", and then changing the
status again to "Move to next CF". (Don't forget the second step;
hopefully we will have streamlined this in the near future!)

This patch was returned due to 'lack of interest'. However, upon
verification, it appears that the reported issue still exists, and the
proposed fix in the thread remains valid. Hence, resurrect this patch
after rebasing it on master. I've also written a detailed commit
message which hopefully can help people review the changes more
effectively.

The concept looks useful. The SQL statement added in the test looks
cooked though (it outputs data that has same value for two columns
which is equal to primary key of other table - when would somebody do
that?). Is there some real life example of this?

The patch uses restrictclauses as well as EC's. Tom has proposed to
make EC work with outer joins sensibly. Has that happened? Can this
patch leverage it rather than having two loops?

--
Best Wishes,
Ashutosh Bapat

#36Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Ashutosh Bapat (#35)
Re: A problem about partitionwise join

On Thu, Feb 22, 2024 at 2:56 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:

On Wed, Feb 21, 2024 at 4:55 PM Richard Guo <guofenglinux@gmail.com>
wrote:

On Tue, Aug 2, 2022 at 4:24 AM Jacob Champion <jchampion@timescale.com>

wrote:

Once you think you've built up some community support and the patchset
is ready for review, you (or any interested party) can resurrect the
patch entry by visiting

https://commitfest.postgresql.org/38/2266/

and changing the status to "Needs Review", and then changing the
status again to "Move to next CF". (Don't forget the second step;
hopefully we will have streamlined this in the near future!)

This patch was returned due to 'lack of interest'. However, upon
verification, it appears that the reported issue still exists, and the
proposed fix in the thread remains valid. Hence, resurrect this patch
after rebasing it on master. I've also written a detailed commit
message which hopefully can help people review the changes more
effectively.

I did a deeper review of the patch. Here are some comments

Approach
--------
The equijoin condition between partition keys doesn't appear in the join's
restrictilist because of 'best_score' strategy as you explained well in
[2]: /messages/by-id/CAN_9JTxucGdVY9tV6Uxq0CdhrW98bZtxPKFbF_75qdPi5wBaow@mail.gmail.com
give preference to equijoin between partition keys? Have you given it a
thought? I feel that having an equijoin clause involving partition keys has
more usages compared to a clause with any random column. E.g. nextloop may
be able to prune partitions from inner relation if the clause contains a
partition key.

Partition pruning requires equality clauses on partition keys as well.
create_append_plan() fetches those from best_path->param_info. If we
created and saved the clauses involving partition keys somewhere
separately, similar to the clauses involving index keys, it might help this
case as well as the partition pruning code. Have you considered this idea?

There was a proposal to use ECs for outer joins as well and then use only
ECs to decide whether equijoins between partition keys exist. I don't think
the proposal has materialized. So we have to continue looking at
restrictlist as well. I don't see a point waiting for it, but others might
feel differently.

I am just trying to find ways to avoid two loops in
have_partkey_equi_join(). If the alternatives are worse, I think the
current approach is fine.

Documentation
-------------
The patch does not modify any documentation. The only documentation I could
find about partitionwise join is the one for GUC
'enable_partitionwise_join'. It says
--- quote
"Partitionwise join currently applies only when the join conditions include
all the partition keys, which must be of the same data type and have
one-to-one matching sets of child partitions.".
--- unquote
This sentence is general and IMO covers the case this patch considers. But
in general I feel that partitionwise join and aggregation deserve separate
sections next to "partition pruning" in [1]; It should mention advanced
partition matching algorithm as well. Would you be willing to write one and
then expand it for the case in the patch?

Tests
-----
The patch adds a testcase for single column partitioning. I think we need
to do better like
1. Test for partitioning on expression, multilevel partitioning, advanced
partition matching. Those all might just work. Having tests helps us to
notice any future breakage.
2. Some negative test case e.g. equijoin clauses with disjunction, with
inequality operator, equality operators with operators from different
families etc.
3. The testcase added looks artificial. it outputs data that has same value
for two columns which is equal to the primary key of the other table - when
would somebody do that?. Is there some real life example where this change
will be useful?

Code
----
Minor comment for now. It will be better to increment num_equal_pks
immediately after setting pk_known_equal[ipk] = true. Otherwise the code
gets confusing around line 2269. I will spend more time reviewing the code
next week.

[1]: https://www.postgresql.org/docs/current/ddl-partitioning.html
[2]: /messages/by-id/CAN_9JTxucGdVY9tV6Uxq0CdhrW98bZtxPKFbF_75qdPi5wBaow@mail.gmail.com
/messages/by-id/CAN_9JTxucGdVY9tV6Uxq0CdhrW98bZtxPKFbF_75qdPi5wBaow@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat

#37Richard Guo
guofenglinux@gmail.com
In reply to: Ashutosh Bapat (#36)
Re: A problem about partitionwise join

(Sorry it takes me some time to get back to this thread.)

On Thu, Mar 7, 2024 at 7:13 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:

I did a deeper review of the patch. Here are some comments

Thank you for the review!

Approach
--------
The equijoin condition between partition keys doesn't appear in the join's
restrictilist because of 'best_score' strategy as you explained well in
[2]. What if we add an extra score for clauses between partition keys and
give preference to equijoin between partition keys? Have you given it a
thought? I feel that having an equijoin clause involving partition keys has
more usages compared to a clause with any random column. E.g. nextloop may
be able to prune partitions from inner relation if the clause contains a
partition key.

Hmm, I think this approach won't work in cases where one certain pair of
partition keys has formed an EC that contains pseudoconstants. In such
cases, the EC machinery will generate restriction clauses like 'pk =
const' rather than any join clauses.

Besides, it seems to me that it's not a cheap operation to check whether
a join clause is between partition keys when we generate join clauses
from ECs in generate_join_implied_equalities().

Documentation
-------------
The patch does not modify any documentation. The only documentation I
could find about partitionwise join is the one for GUC
'enable_partitionwise_join'. It says
--- quote
"Partitionwise join currently applies only when the join conditions
include all the partition keys, which must be of the same data type and
have one-to-one matching sets of child partitions.".
--- unquote
This sentence is general and IMO covers the case this patch considers. But
in general I feel that partitionwise join and aggregation deserve separate
sections next to "partition pruning" in [1]; It should mention advanced
partition matching algorithm as well. Would you be willing to write one and
then expand it for the case in the patch?

I don't think it should be part of this patch to add a new section in
the docs to explain partitionwise join and aggregation. Maybe that
deserves a separate patch?

Tests
-----
The patch adds a testcase for single column partitioning. I think we need
to do better like
1. Test for partitioning on expression, multilevel partitioning, advanced
partition matching. Those all might just work. Having tests helps us to
notice any future breakage.
2. Some negative test case e.g. equijoin clauses with disjunction, with
inequality operator, equality operators with operators from different
families etc.

Thanks for the suggestions. We can do that.

3. The testcase added looks artificial. it outputs data that has same
value for two columns which is equal to the primary key of the other table
- when would somebody do that?. Is there some real life example where this
change will be useful?

Hmm, I think the test case is good as long as it reveals the issue that
this patch fixes. It follows the same format as the existing test case
just above it. I'm not sure if there are real life examples, but I
think it may not always be necessary to derive test cases from them.

Code
----
Minor comment for now. It will be better to increment num_equal_pks
immediately after setting pk_known_equal[ipk] = true. Otherwise the code
gets confusing around line 2269. I will spend more time reviewing the code
next week.

Hmm, the increment of num_equal_pks on line 2272 is parallel to the one
in the first loop (around line 2200). Maybe it's better to keep them
consistent as the current patch does?

Thanks
Richard

#38Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Richard Guo (#37)
Re: A problem about partitionwise join

On Tue, Mar 19, 2024 at 8:18 AM Richard Guo <guofenglinux@gmail.com> wrote:

(Sorry it takes me some time to get back to this thread.)

On Thu, Mar 7, 2024 at 7:13 PM Ashutosh Bapat <
ashutosh.bapat.oss@gmail.com> wrote:

I did a deeper review of the patch. Here are some comments

Thank you for the review!

Approach
--------
The equijoin condition between partition keys doesn't appear in the
join's restrictilist because of 'best_score' strategy as you explained well
in [2]. What if we add an extra score for clauses between partition keys
and give preference to equijoin between partition keys? Have you given it a
thought? I feel that having an equijoin clause involving partition keys has
more usages compared to a clause with any random column. E.g. nextloop may
be able to prune partitions from inner relation if the clause contains a
partition key.

Hmm, I think this approach won't work in cases where one certain pair of
partition keys has formed an EC that contains pseudoconstants. In such
cases, the EC machinery will generate restriction clauses like 'pk =
const' rather than any join clauses.

That should be ok and more desirable. Clauses like pk = const will leave
only one partition around in each of the joining relations thus PWJ won't
be required OR it will be automatic - whichever way you see it.

Besides, it seems to me that it's not a cheap operation to check whether
a join clause is between partition keys when we generate join clauses
from ECs in generate_join_implied_equalities().

Why? The code would be the same as what we have
in have_partkey_equi_join().

Documentation

-------------
The patch does not modify any documentation. The only documentation I
could find about partitionwise join is the one for GUC
'enable_partitionwise_join'. It says
--- quote
"Partitionwise join currently applies only when the join conditions
include all the partition keys, which must be of the same data type and
have one-to-one matching sets of child partitions.".
--- unquote
This sentence is general and IMO covers the case this patch considers.
But in general I feel that partitionwise join and aggregation deserve
separate sections next to "partition pruning" in [1]; It should mention
advanced partition matching algorithm as well. Would you be willing to
write one and then expand it for the case in the patch?

I don't think it should be part of this patch to add a new section in
the docs to explain partitionwise join and aggregation. Maybe that
deserves a separate patch?

Yes.

3. The testcase added looks artificial. it outputs data that has same value

for two columns which is equal to the primary key of the other table - when
would somebody do that?. Is there some real life example where this change
will be useful?

Hmm, I think the test case is good as long as it reveals the issue that
this patch fixes. It follows the same format as the existing test case
just above it. I'm not sure if there are real life examples, but I
think it may not always be necessary to derive test cases from them.

Let's defer this to the committer.

Code
----
Minor comment for now. It will be better to increment num_equal_pks
immediately after setting pk_known_equal[ipk] = true. Otherwise the code
gets confusing around line 2269. I will spend more time reviewing the code
next week.

Hmm, the increment of num_equal_pks on line 2272 is parallel to the one
in the first loop (around line 2200). Maybe it's better to keep them
consistent as the current patch does?

In the first loop, setting pk_known_equal[ipk1] = true and ++num_equal_pks
happens on consecutive lines. That's not true in the second loop, where
there are at least some code line where num_equal_pks is inconsistent with
the number of "true" entries in pk_known_equal. We should avoid that.

--
Best Wishes,
Ashutosh Bapat

#39Richard Guo
guofenglinux@gmail.com
In reply to: Ashutosh Bapat (#38)
Re: A problem about partitionwise join

On Tue, Mar 19, 2024 at 3:40 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:

On Tue, Mar 19, 2024 at 8:18 AM Richard Guo <guofenglinux@gmail.com>
wrote:

On Thu, Mar 7, 2024 at 7:13 PM Ashutosh Bapat <
ashutosh.bapat.oss@gmail.com> wrote:

Approach
--------
The equijoin condition between partition keys doesn't appear in the
join's restrictilist because of 'best_score' strategy as you explained well
in [2]. What if we add an extra score for clauses between partition keys
and give preference to equijoin between partition keys? Have you given it a
thought? I feel that having an equijoin clause involving partition keys has
more usages compared to a clause with any random column. E.g. nextloop may
be able to prune partitions from inner relation if the clause contains a
partition key.

Hmm, I think this approach won't work in cases where one certain pair of
partition keys has formed an EC that contains pseudoconstants. In such
cases, the EC machinery will generate restriction clauses like 'pk =
const' rather than any join clauses.

That should be ok and more desirable. Clauses like pk = const will leave
only one partition around in each of the joining relations thus PWJ won't
be required OR it will be automatic - whichever way you see it.

No, that's not true. There could be multiple partition keys, and the
particular key involved in the pushed-down restriction 'pk = const' may
not be able to prune away any partitions. To be concrete, consider the
query:

create table p (k1 int, k2 int, val int) partition by range(k1, k2);
create table p_1 partition of p for values from (1,1) to (10,100);
create table p_2 partition of p for values from (10,100) to (20,200);

set enable_partitionwise_join to on;

explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 = bar.k2
and foo.k2 = 5;
QUERY PLAN
-----------------------------------------
Hash Join
Hash Cond: (foo.k1 = bar.k1)
-> Append
-> Seq Scan on p_1 foo_1
Filter: (k2 = 5)
-> Seq Scan on p_2 foo_2
Filter: (k2 = 5)
-> Hash
-> Append
-> Seq Scan on p_1 bar_1
Filter: (k2 = 5)
-> Seq Scan on p_2 bar_2
Filter: (k2 = 5)
(13 rows)

Thanks
Richard

#40Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Richard Guo (#39)
Re: A problem about partitionwise join

On Mon, Mar 25, 2024 at 9:01 AM Richard Guo <guofenglinux@gmail.com> wrote:

create table p (k1 int, k2 int, val int) partition by range(k1, k2);
create table p_1 partition of p for values from (1,1) to (10,100);
create table p_2 partition of p for values from (10,100) to (20,200);

set enable_partitionwise_join to on;

explain (costs off)
select * from p as foo join p as bar on foo.k1 = bar.k1 and foo.k2 =
bar.k2 and foo.k2 = 5;
QUERY PLAN
-----------------------------------------
Hash Join
Hash Cond: (foo.k1 = bar.k1)
-> Append
-> Seq Scan on p_1 foo_1
Filter: (k2 = 5)
-> Seq Scan on p_2 foo_2
Filter: (k2 = 5)
-> Hash
-> Append
-> Seq Scan on p_1 bar_1
Filter: (k2 = 5)
-> Seq Scan on p_2 bar_2
Filter: (k2 = 5)
(13 rows)

Thanks for the example. You are right.

I think we need some way to avoid two different ways of looking up
partition keys - if we can't teach the EC machinery to produce clauses with
partition keys (always), we need to teach EC to contain partition keys in
case of outer joins. Tom alluded to this but I haven't seen any proposal.
The potential danger with the current patch is that it will continue to
have two loops even if we fix one of the above cases in future.

--
Best Wishes,
Ashutosh Bapat

#41Robert Haas
robertmhaas@gmail.com
In reply to: Richard Guo (#34)
Re: A problem about partitionwise join

On Wed, Feb 21, 2024 at 6:25 AM Richard Guo <guofenglinux@gmail.com> wrote:

This patch was returned due to 'lack of interest'. However, upon
verification, it appears that the reported issue still exists, and the
proposed fix in the thread remains valid. Hence, resurrect this patch
after rebasing it on master. I've also written a detailed commit
message which hopefully can help people review the changes more
effectively.

I think it's slightly questionable whether this patch is worthwhile.
The case memorialized in the regression tests, t1.a = t2.a AND t1.a =
t2.b, is a very weird thing to do. The case mentioned in the original
email, foo.k1 = bar.k1 and foo.k2 = bar.k2 and foo.k2 = 16, seems like
something that could realistically happen, especially when there are
views in use (e.g. the view joins foo and bar, and then someone
queries the view for one of the join columns). In such a case, it's
possible that foo.k2 = 16 is selective enough that we really don't
care about partition-wise join any more, but it's also possible that
it's not too selective and we do care about partition-wise join. So I
don't think that the case that the patch fixes is something that can
ever happen, but I do think it's probably fairly rare that brings any
benefit, which is why I thought that EC-based matching was an OK
approach to this problem initially. Perhaps that was the wrong idea,
though.

Does the additional logic added by this patch have a noticeable
performance cost?

--
Robert Haas
EDB: http://www.enterprisedb.com

#42Richard Guo
guofenglinux@gmail.com
In reply to: Robert Haas (#41)
Re: A problem about partitionwise join

On Wed, May 1, 2024 at 1:31 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think it's slightly questionable whether this patch is worthwhile.
The case memorialized in the regression tests, t1.a = t2.a AND t1.a =
t2.b, is a very weird thing to do. The case mentioned in the original
email, foo.k1 = bar.k1 and foo.k2 = bar.k2 and foo.k2 = 16, seems like
something that could realistically happen, especially when there are
views in use (e.g. the view joins foo and bar, and then someone
queries the view for one of the join columns). In such a case, it's
possible that foo.k2 = 16 is selective enough that we really don't
care about partition-wise join any more, but it's also possible that
it's not too selective and we do care about partition-wise join. So I
don't think that the case that the patch fixes is something that can
ever happen, but I do think it's probably fairly rare that brings any
benefit, which is why I thought that EC-based matching was an OK
approach to this problem initially. Perhaps that was the wrong idea,
though.

Thank you for taking the time to review this patch! I think Ashutosh
also mentioned that the new added test case looks artificial. I must
admit that I'm not too sure how common we encounter queries with
partially-redundant join clauses in real-life scenarios. It is possible
that such cases are quite rare, and this patch will then not be of much
use.

I initially brought up this issue because I noticed an inconsistency
regarding the generation of a partition-wise join: with 't1.k = t2.k and
t1.k = t2.val' we are able to generate a partition-wise join, while its
equivalent form 't1.k = t2.val and t1.k = t2.k' does not result in a
partition-wise join. I think this inconsistency could be confusing.

The reason behind this is that with 't1.k = t2.val and t1.k = t2.k' it
happens to constrain other members (t1.k and t2.val) of the EC than the
ones we are looking for (t1.k and t2.k). Our current code looks through
the join's restriction clauses for matched keys. In addition to that,
this patch checks to see if any unmatched keys are known equal by ECs,
leveraging function exprs_known_equal().

Does the additional logic added by this patch have a noticeable
performance cost?

I think one concern regarding performance cost is that the function
exprs_known_equal() would be called O(N^2) times, where N is the number
of partition key expressions. But I think this might not be a problem.
The number of a joinrel's partition key expressions would only be equal
to the join degree, since each base relation within the join contributes
only one partition key expression, according to
set_joinrel_partition_key_exprs(). This number would not scale with the
number of partitions. But I have not measured the performance in
practice by running benchmarks. Maybe I'm just wrong.

Thanks
Richard

#43Robert Haas
robertmhaas@gmail.com
In reply to: Richard Guo (#42)
Re: A problem about partitionwise join

On Fri, May 3, 2024 at 7:47 AM Richard Guo <guofenglinux@gmail.com> wrote:

Does the additional logic added by this patch have a noticeable
performance cost?

I think one concern regarding performance cost is that the function
exprs_known_equal() would be called O(N^2) times, where N is the number
of partition key expressions. But I think this might not be a problem.
The number of a joinrel's partition key expressions would only be equal
to the join degree, since each base relation within the join contributes
only one partition key expression, according to
set_joinrel_partition_key_exprs(). This number would not scale with the
number of partitions. But I have not measured the performance in
practice by running benchmarks. Maybe I'm just wrong.

I don't know, but I do think you should do some benchmarking and see
if you can find cases where this regresses performance. In my opinion,
this feature is worth having only if it's basically free. There's lots
of things we could do in the planner that would give better (perhaps
much better) plans in certain cases, but which we don't do because in
all other cases we'd pay a certain number of CPU cycles to have them
and it just doesn't make sense given how often we'd actually get a
benefit. This might be another such case.

I agree with you that the number of partition key expressions is
likely to be small, but that doesn't mean there's no problem. A big
part of the value of equivalence classes is that they let us make
deductions cheaply. Replacing that with some more complex matching
mechanism probably has a cost, and maybe that cost is material. If
it's not quite material with one partition key expression, going up to
2 or 3 of them might make it matter.

--
Robert Haas
EDB: http://www.enterprisedb.com

#44Richard Guo
guofenglinux@gmail.com
In reply to: Robert Haas (#43)
Re: A problem about partitionwise join

On Fri, May 3, 2024 at 9:31 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, May 3, 2024 at 7:47 AM Richard Guo <guofenglinux@gmail.com> wrote:

I think one concern regarding performance cost is that the function
exprs_known_equal() would be called O(N^2) times, where N is the number
of partition key expressions. But I think this might not be a problem.
The number of a joinrel's partition key expressions would only be equal
to the join degree, since each base relation within the join contributes
only one partition key expression, according to
set_joinrel_partition_key_exprs(). This number would not scale with the
number of partitions. But I have not measured the performance in
practice by running benchmarks. Maybe I'm just wrong.

I don't know, but I do think you should do some benchmarking and see
if you can find cases where this regresses performance. In my opinion,
this feature is worth having only if it's basically free. There's lots
of things we could do in the planner that would give better (perhaps
much better) plans in certain cases, but which we don't do because in
all other cases we'd pay a certain number of CPU cycles to have them
and it just doesn't make sense given how often we'd actually get a
benefit. This might be another such case.

Thank you for the suggestion. In order to obtain a rough estimation of
how this patch affects planning time, I did the following benchmarking:

* create a partitioned table with 3 keys and 1000 partitions, which
looks like

Partitioned table "public.t1_parted"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
a | integer | | |
b | integer | | |
c | integer | | |
d | integer | | |
Partition key: RANGE (a, b, c)
Number of partitions: 1000 (Use \d+ to list them.)

* compose a query involving 5-way joins of this partitioned table, which
looks like:

select * from t1_parted t1
natural join t1_parted t2
natural join t1_parted t3
natural join t1_parted t4
natural join t1_parted t5
where t1.b = 1 and t1.c = 2;

This query is composed in such a way that it could actually generate
partitionwise join, because there exist equi-join condition for each
pair of matching partition keys; but currently on master it is not able
to generate partitionwise join, because of the filters 't1.b = 1 and
t1.c = 2', which is the issue fixed by this patch.

* run this query 5 times with enable_partitionwise_join set to on, and
collect the average planning time on master and on patched.

To ensure fairness, on master, a little hack is required to enable the
generation of partitionwise join for this query. This allows us to
eliminate any potential impact on planning partitionwise joins and
evaluate the effects of this patch accurately.

Below is what I got on my local machine.

-- on master

measurement | average | maximum | minimum | std_dev |
std_dev_as_perc_of_avg
---------------+----------+----------+----------+---------+------------------------
planning time | 30355.07 | 33148.47 | 29020.82 | 1681.23 | 5.54%

-- on patched

measurement | average | maximum | minimum | std_dev |
std_dev_as_perc_of_avg
---------------+----------+----------+----------+---------+------------------------
planning time | 30600.00 | 33523.23 | 28680.75 | 1861.90 | 6.08%

-- without partitionwise join

measurement | average | maximum | minimum | std_dev |
std_dev_as_perc_of_avg
---------------+---------+---------+---------+---------+------------------------
planning time | 4840.18 | 5184.05 | 4528.87 | 299.98 | 6.20%

So it seems that the planning time is not significantly affected by this
patch, particularly when compared to the impact caused by partitionwise
join.

BTW, I was using Ashutosh's script [1]/messages/by-id/CAExHW5s=bCLMMq8n_bN6iU+Pjau0DS3z_6Dn6iLE69ESmsPMJQ@mail.gmail.com for setting up the benchmarking.
I find the script very handy.

[1]: /messages/by-id/CAExHW5s=bCLMMq8n_bN6iU+Pjau0DS3z_6Dn6iLE69ESmsPMJQ@mail.gmail.com
/messages/by-id/CAExHW5s=bCLMMq8n_bN6iU+Pjau0DS3z_6Dn6iLE69ESmsPMJQ@mail.gmail.com

Thanks
Richard

#45Richard Guo
guofenglinux@gmail.com
In reply to: Richard Guo (#44)
Re: A problem about partitionwise join

On Wed, May 8, 2024 at 5:01 PM Richard Guo <guofenglinux@gmail.com> wrote:

Below is what I got on my local machine.

-- on master

measurement | average | maximum | minimum | std_dev | std_dev_as_perc_of_avg
---------------+----------+----------+----------+---------+------------------------
planning time | 30355.07 | 33148.47 | 29020.82 | 1681.23 | 5.54%

-- on patched

measurement | average | maximum | minimum | std_dev | std_dev_as_perc_of_avg
---------------+----------+----------+----------+---------+------------------------
planning time | 30600.00 | 33523.23 | 28680.75 | 1861.90 | 6.08%

-- without partitionwise join

measurement | average | maximum | minimum | std_dev | std_dev_as_perc_of_avg
---------------+---------+---------+---------+---------+------------------------
planning time | 4840.18 | 5184.05 | 4528.87 | 299.98 | 6.20%

So it seems that the planning time is not significantly affected by this
patch, particularly when compared to the impact caused by partitionwise
join.

This benchmark shows that the impact of this patch on planning time is
within the margin of error, particularly compared to the impact of
partitionwise joins. So I've pushed this patch after working a bit
more on the commit message.

Thanks
Richard

#46Richard Guo
guofenglinux@gmail.com
In reply to: Richard Guo (#1)
Re: A problem about partitionwise join

On Sat, Aug 10, 2024 at 6:22 AM Alexey Dvoichenkov
<alexey@hyperplane.net> wrote:

I haven't read the entire thread so I might be missing something, but
one interesting consequence of this patch is that it kind of breaks
the initial pruning of generic plans. Given a query such as SELECT
... WHERE A.PK = B.PK AND A.PK = $1 the planner will do the right
thing for custom plans, but not for GPs since the existing logic is
not capable of pruning anything more complex than a scan. See the
attached example.

Thanks for the report! I see what the problem is. Previously, for a
join with filter 'WHERE A.PK = B.PK AND A.PK = $1', the planner was
unable to generate partitionwise join, because it failed to realize
that there exists an equi-join condition between A.PK and B.PK. As a
result, the prepared statement 'ps' was planned as a join of two
Appends in generic mode:

Nested Loop
-> Append
-> Seq Scan on a0 a_1
Filter: (x = $1)
-> Seq Scan on a1 a_2
Filter: (x = $1)
-> Materialize
-> Append
-> Seq Scan on b0 b_1
Filter: (x = $1)
-> Seq Scan on b1 b_2
Filter: (x = $1)

... and then one of the subpaths for each Append node would be pruned
during initial pruning phase, so you'd get:

Nested Loop
-> Append
Subplans Removed: 1
-> Seq Scan on a0 a_1
Filter: (x = $1)
-> Materialize
-> Append
Subplans Removed: 1
-> Seq Scan on b0 b_1
Filter: (x = $1)

With this patch, the planner is able to generate partitionwise join,
as it can recognize the equi-join condition between A.PK and B.PK from
ECs. So the prepared statement 'ps' is planned as an Append of two
joins in generic mode:

Append
-> Nested Loop
-> Seq Scan on a0 a_1
Filter: (x = $1)
-> Seq Scan on b0 b_1
Filter: (x = $1)
-> Nested Loop
-> Seq Scan on a1 a_2
Filter: (x = $1)
-> Seq Scan on b1 b_2
Filter: (x = $1)

... and neither subpath of this Append can be pruned during the
initial pruning phase.

It seems to me that this is not the fault of this patch: it fixes the
partitionwise join as expected. The ideal fix to this issue is, IMO,
to take initial pruning into account when calculating costs, so we can
pick the non-partitionwise-join path and then apply the initial
pruning if that is cheaper. Of course we also need to fix
apply_scanjoin_target_to_paths to not drop old paths of partitioned
joinrels so that we can retain non-partitionwise-join paths if
the cheapest path happens to be among them. This work is being
discussed in [1]/messages/by-id/CAExHW5toze58+jL-454J3ty11sqJyU13Sz5rJPQZDmASwZgWiA@mail.gmail.com.

For now, I think you can work around this issue by setting
enable_partitionwise_join to off for this query, if that works for
you.

[1]: /messages/by-id/CAExHW5toze58+jL-454J3ty11sqJyU13Sz5rJPQZDmASwZgWiA@mail.gmail.com

Thanks
Richard

#47Richard Guo
guofenglinux@gmail.com
In reply to: Ashutosh Bapat (#40)
Re: A problem about partitionwise join

On Mon, Mar 25, 2024 at 7:09 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

I think we need some way to avoid two different ways of looking up partition keys - if we can't teach the EC machinery to produce clauses with partition keys (always), we need to teach EC to contain partition keys in case of outer joins. Tom alluded to this but I haven't seen any proposal. The potential danger with the current patch is that it will continue to have two loops even if we fix one of the above cases in future.

Sorry for not replying to this comment before pushing the patch. I
understand your concern and agree that it would be ideal if the
partitionwise join matching logic relied solely on ECs. However,
implementing that would require a lot of changes to the EC mechanism,
and I'm not sure if that will happen in the near future. And if we do
achieve this in the future, I believe many parts of the code, not just
the loops here, will need to be modified to leverage the new EC
mechanism.

Thanks
Richard

#48Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Richard Guo (#46)
Re: A problem about partitionwise join

On Mon, Aug 12, 2024 at 8:50 AM Richard Guo <guofenglinux@gmail.com> wrote:

On Sat, Aug 10, 2024 at 6:22 AM Alexey Dvoichenkov
<alexey@hyperplane.net> wrote:

I haven't read the entire thread so I might be missing something, but
one interesting consequence of this patch is that it kind of breaks
the initial pruning of generic plans. Given a query such as SELECT
... WHERE A.PK = B.PK AND A.PK = $1 the planner will do the right
thing for custom plans, but not for GPs since the existing logic is
not capable of pruning anything more complex than a scan. See the
attached example.

Thanks for the report! I see what the problem is. Previously, for a
join with filter 'WHERE A.PK = B.PK AND A.PK = $1', the planner was
unable to generate partitionwise join, because it failed to realize
that there exists an equi-join condition between A.PK and B.PK. As a
result, the prepared statement 'ps' was planned as a join of two
Appends in generic mode:

Nested Loop
-> Append
-> Seq Scan on a0 a_1
Filter: (x = $1)
-> Seq Scan on a1 a_2
Filter: (x = $1)
-> Materialize
-> Append
-> Seq Scan on b0 b_1
Filter: (x = $1)
-> Seq Scan on b1 b_2
Filter: (x = $1)

... and then one of the subpaths for each Append node would be pruned
during initial pruning phase, so you'd get:

Nested Loop
-> Append
Subplans Removed: 1
-> Seq Scan on a0 a_1
Filter: (x = $1)
-> Materialize
-> Append
Subplans Removed: 1
-> Seq Scan on b0 b_1
Filter: (x = $1)

With this patch, the planner is able to generate partitionwise join,
as it can recognize the equi-join condition between A.PK and B.PK from
ECs. So the prepared statement 'ps' is planned as an Append of two
joins in generic mode:

Append
-> Nested Loop
-> Seq Scan on a0 a_1
Filter: (x = $1)
-> Seq Scan on b0 b_1
Filter: (x = $1)
-> Nested Loop
-> Seq Scan on a1 a_2
Filter: (x = $1)
-> Seq Scan on b1 b_2
Filter: (x = $1)

... and neither subpath of this Append can be pruned during the
initial pruning phase.

It seems to me that this is not the fault of this patch: it fixes the
partitionwise join as expected. The ideal fix to this issue is, IMO,
to take initial pruning into account when calculating costs, so we can
pick the non-partitionwise-join path and then apply the initial
pruning if that is cheaper.

This will be fine if the number of surviving partitions is only 1 (or
at most a couple), but in case the number of surviving partitions
after pruning are more than a handful, partitionwise join + runtime
partition pruning will be required.

Of course we also need to fix
apply_scanjoin_target_to_paths to not drop old paths of partitioned
joinrels so that we can retain non-partitionwise-join paths if
the cheapest path happens to be among them. This work is being
discussed in [1].

Right.

--
Best Wishes,
Ashutosh Bapat