Removing LEFT JOINs in more cases

Started by David Rowleyabout 8 years ago22 messages

david.rowley@2ndquadrant.com

about 8 years ago

1 attachment(s)

Hackers,

Normally we'll only ever remove a LEFT JOIN relation if it's unused
and there's no possibility that the join would cause row duplication.
To check that the join wouldn't cause row duplicate we make use of
proofs, such as unique indexes, or for sub-queries, we make use of
DISTINCT and GROUP BY clauses.

There's another case that we don't handle, and it's VERY simple to test for.

Quite simply, it seems we could remove the join in cases such as:

create table t1 (id int primary key);
create table t2 (id int primary key, b int not null);

insert into t2 values(1,1),(2,1);
insert into t1 values(1);

select distinct t1.* from t1 left join t2 on t1.id=t2.b;

and

select t1.id from t1 left join t2 on t1.id=t2.b GROUP BY t1.id;

but not:

select t1.id,count(*) from t1 left join t2 on t1.id=t2.b GROUP BY t1.id;

In this case, the join *can* cause row duplicates, but the distinct or
group by would filter these out again anyway, so in these cases, we'd
not only get the benefit of not joining but also not having to remove
the duplicate rows caused by the join.

Given how simple the code is to support this, it seems to me to be
worth handling.

A patch to do this is attached.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Support-removing-LEFT-JOINs-with-DISTINCT-GROUP-BY.patchapplication/octet-stream; name=0001-Support-removing-LEFT-JOINs-with-DISTINCT-GROUP-BY.patchDownload

From 42b07076105aab296f3a18bf0d1f84f667973d89 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Wed, 1 Nov 2017 12:49:20 +1300
Subject: [PATCH] Support removing LEFT JOINs with DISTINCT/GROUP BY

Normally we only remove a LEFT JOIN if its unused in the query and there
is no possibility that the LEFT JOINed table will cause any row
duplication.  If there's a DISTINCT or GROUP BY clause in the query then
and none of the DISTINCT/GROUP BY expressions are part of the
to-be-removed relation, then any row duplication that the join would cause
would be removed anyway.  Of course, we're unable to apply this if any
aggregate functions are used, as those duplicates must be counted.
---
 src/backend/optimizer/plan/analyzejoins.c | 48 ++++++++++++++++++++++------
 src/test/regress/expected/join.out        | 53 +++++++++++++++++++++++++++++++
 src/test/regress/sql/join.sql             | 16 ++++++++++
 3 files changed, 107 insertions(+), 10 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 511603b..9a9617b 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -150,10 +150,12 @@ clause_sides_match_join(RestrictInfo *rinfo, Relids outerrelids,
  *	  it will just duplicate its left input.
  *
  * This is true for a left join for which the join condition cannot match
- * more than one inner-side row.  (There are other possibly interesting
- * cases, but we don't have the infrastructure to prove them.)  We also
- * have to check that the inner side doesn't generate any variables needed
- * above the join.
+ * more than one inner-side row.  We can also allow removal of joins to
+ * relations that may match more than one inner-side row if a DISTINCT or
+ * GROUP BY clause would subsequently remove any duplicates caused by the
+ * join. (There are other possibly interesting cases, but we don't have the
+ * infrastructure to prove them.)  We also have to check that the inner side
+ * doesn't generate any variables needed above the join.
  */
 static bool
 join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
@@ -237,6 +239,22 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	}
 
 	/*
+	 * When a DISTINCT or GROUP BY clause is present, the unreferenced
+	 * relation's join has no ability to duplicate rows in result set, as any
+	 * duplicate rows would been removed by the DISTINCT/GROUP BY clause
+	 * anyway.  However, we're unable to apply this when aggregate functions
+	 * are present as we must aggregate any duplicated rows.  We needn't
+	 * bother checking the actual distinct/grouping exprs here, as we already
+	 * know from the above checks that no Var is present from the relation
+	 * we're trying to remove.
+	 */
+	if (root->parse->distinctClause != NIL)
+		return true;
+
+	if (root->parse->groupClause != NIL && !root->parse->hasAggs)
+		return true;
+
+	/*
 	 * Search for mergejoinable clauses that constrain the inner rel against
 	 * either the outer rel or a pseudoconstant.  If an operator is
 	 * mergejoinable then it behaves like equality for some btree opclass, so
@@ -597,15 +615,25 @@ rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel)
 	if (rel->rtekind == RTE_RELATION)
 	{
 		/*
-		 * For a plain relation, we only know how to prove uniqueness by
-		 * reference to unique indexes.  Make sure there's at least one
-		 * suitable unique index.  It must be immediately enforced, and if
-		 * it's a partial index, it must match the query.  (Keep these
-		 * conditions in sync with relation_has_unique_index_for!)
+		 * For a plain relation, we can make use of DISTINCT or GROUP BY
+		 * clauses as unique proofs. We also know how to prove uniqueness by
+		 * reference to unique indexes.
 		 */
 		ListCell   *lc;
 
-		foreach(lc, rel->indexlist)
+		if (root->parse->distinctClause != NIL)
+			return true;
+
+		if (root->parse->groupClause != NIL && !root->parse->hasAggs)
+			return true;
+
+		/*
+		 * Make sure there's at least one suitable unique index.  It must be
+		 * immediately enforced, and if it's a partial index, it must match
+		 * the query.  (Keep these conditions in sync with
+		 * relation_has_unique_index_for!)
+		 */
+		 foreach(lc, rel->indexlist)
 		{
 			IndexOptInfo *ind = (IndexOptInfo *) lfirst(lc);
 
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index f47449b..cd25b45 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4051,6 +4051,59 @@ select d.* from d left join (select id from a union select id from b) s
  Seq Scan on d
 (1 row)
 
+-- check join removal works without a unique index on the left joined table
+-- when a DISTINCT or GROUP BY is present which would remove row duplication
+explain (costs off)
+select distinct a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a;
+               QUERY PLAN                
+-----------------------------------------
+ HashAggregate
+   Group Key: a.id, a.b_id, b.id, b.c_id
+   ->  Hash Left Join
+         Hash Cond: (a.b_id = b.id)
+         ->  Seq Scan on a
+         ->  Hash
+               ->  Seq Scan on b
+(7 rows)
+
+-- as above but with GROUP BY
+explain (costs off)
+select a.id,b.id from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+             QUERY PLAN             
+------------------------------------
+ HashAggregate
+   Group Key: a.id, b.id
+   ->  Hash Left Join
+         Hash Cond: (a.b_id = b.id)
+         ->  Seq Scan on a
+         ->  Hash
+               ->  Seq Scan on b
+(7 rows)
+
+-- ensure no join removal when we must aggregate any duplicated rows
+explain (costs off)
+select a.id,b.id,count(*) from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+                   QUERY PLAN                   
+------------------------------------------------
+ HashAggregate
+   Group Key: a.id, b.id
+   ->  Merge Left Join
+         Merge Cond: (a.b_id = d.a)
+         ->  Sort
+               Sort Key: a.b_id
+               ->  Hash Left Join
+                     Hash Cond: (a.b_id = b.id)
+                     ->  Seq Scan on a
+                     ->  Hash
+                           ->  Seq Scan on b
+         ->  Sort
+               Sort Key: d.a
+               ->  Seq Scan on d
+(14 rows)
+
 -- check join removal with a cross-type comparison operator
 explain (costs off)
 select i8.* from int8_tbl i8 left join (select f1 from int4_tbl group by f1) i4
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index d847d53..f14e2e8 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1331,6 +1331,22 @@ explain (costs off)
 select d.* from d left join (select id from a union select id from b) s
   on d.a = s.id;
 
+-- check join removal works without a unique index on the left joined table
+-- when a DISTINCT or GROUP BY is present which would remove row duplication
+explain (costs off)
+select distinct a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a;
+
+-- as above but with GROUP BY
+explain (costs off)
+select a.id,b.id from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+
+-- ensure no join removal when we must aggregate any duplicated rows
+explain (costs off)
+select a.id,b.id,count(*) from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+
 -- check join removal with a cross-type comparison operator
 explain (costs off)
 select i8.* from int8_tbl i8 left join (select f1 from int4_tbl group by f1) i4
-- 
1.9.5.msysgit.1

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 8 years ago

In reply to: David Rowley (#1)

Re: [HACKERS] Removing LEFT JOINs in more cases

On Wed, Nov 1, 2017 at 5:39 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

In this case, the join *can* cause row duplicates, but the distinct or
group by would filter these out again anyway, so in these cases, we'd
not only get the benefit of not joining but also not having to remove
the duplicate rows caused by the join.

+1.

Given how simple the code is to support this, it seems to me to be
worth handling.

I find this patch very simple and still useful.

@@ -597,15 +615,25 @@ rel_supports_distinctness(PlannerInfo *root,
RelOptInfo *rel)
+        if (root->parse->distinctClause != NIL)
+            return true;
+
+        if (root->parse->groupClause != NIL && !root->parse->hasAggs)
+            return true;
+

The other callers of rel_supports_distinctness() are looking for distinctness
of the given relation, whereas the code change here are applicable to any
relation, not just the given relation. I find that confusing. Looking at the
way we are calling rel_supports_distinctness() in join_is_removable() this
change looks inevitable, but still if we could reduce the confusion, that will
be good. Also if we could avoid duplication of comment about unique index, that
will be good.

DISTINCT ON allows only a subset of columns being selected to be listed in that
clause. I initially thought that specifying only a subset would be a problem
and we should check whether the DISTINCT applies to all columns being selected.
But that's not true, since DISTINCT ON would eliminate any duplicates in the
columns listed in that clause, effectively deduplicating the row being
selected. So, we are good there. May be you want to add a testcase with
DISTINCT ON.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Michael Paquier

michael.paquier@gmail.com

about 8 years ago

In reply to: Ashutosh Bapat (#2)

Re: [HACKERS] Removing LEFT JOINs in more cases

On Wed, Nov 22, 2017 at 10:30 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Nov 1, 2017 at 5:39 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

In this case, the join *can* cause row duplicates, but the distinct or
group by would filter these out again anyway, so in these cases, we'd
not only get the benefit of not joining but also not having to remove
the duplicate rows caused by the join.

+1.

Given how simple the code is to support this, it seems to me to be
worth handling.

I find this patch very simple and still useful.
@@ -597,15 +615,25 @@ rel_supports_distinctness(PlannerInfo *root,
RelOptInfo *rel)
+        if (root->parse->distinctClause != NIL)
+            return true;
+
+        if (root->parse->groupClause != NIL && !root->parse->hasAggs)
+            return true;
+
The other callers of rel_supports_distinctness() are looking for distinctness
of the given relation, whereas the code change here are applicable to any
relation, not just the given relation. I find that confusing. Looking at the
way we are calling rel_supports_distinctness() in join_is_removable() this
change looks inevitable, but still if we could reduce the confusion, that will
be good. Also if we could avoid duplication of comment about unique index, that
will be good.

DISTINCT ON allows only a subset of columns being selected to be listed in that
clause. I initially thought that specifying only a subset would be a problem
and we should check whether the DISTINCT applies to all columns being selected.
But that's not true, since DISTINCT ON would eliminate any duplicates in the
columns listed in that clause, effectively deduplicating the row being
selected. So, we are good there. May be you want to add a testcase with
DISTINCT ON.

I am counting that as a review, which got no replies yet. The thing is
somewhat fresh so I am moving it to next CF with waiting on author as
status.
--
Michael

David Rowley

david.rowley@2ndquadrant.com

about 8 years ago

In reply to: Ashutosh Bapat (#2)

1 attachment(s)

Re: [HACKERS] Removing LEFT JOINs in more cases

Thanks for looking over this and my apologies for the delay in my
response. I've been on leave and out of range of the internet for most
of that time.

On 23 November 2017 at 02:30, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

@@ -597,15 +615,25 @@ rel_supports_distinctness(PlannerInfo *root,
RelOptInfo *rel)
+        if (root->parse->distinctClause != NIL)
+            return true;
+
+        if (root->parse->groupClause != NIL && !root->parse->hasAggs)
+            return true;
+
The other callers of rel_supports_distinctness() are looking for distinctness
of the given relation, whereas the code change here are applicable to any
relation, not just the given relation. I find that confusing. Looking at the
way we are calling rel_supports_distinctness() in join_is_removable() this
change looks inevitable, but still if we could reduce the confusion, that will
be good. Also if we could avoid duplication of comment about unique index, that
will be good.

When I put this together I'd considered leaving
rel_supports_distinctness() alone since the other uses of the function
can't make use of the new checks. Since you mention this, then I think
it's likely better just to go with that and just special case the code
in join_is_removable() to check for rel_supports_distinctness() and
the DISTINCT/GROUP BY clauses too. This will mean less false positives
for usages such as in innerrel_is_unique() which would end up causing
a bit of extra work before possibly failing a bit later on. I've made
changes in the attached to do things this way.

May be you want to add a testcase with DISTINCT ON.

Good idea. I've also added a test case for this, and also one to
ensure non-RTE_RELATION relations can be removed too.

Updated patch attached.

Thanks again for the review.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

remove_left_join_distinct_2017-11-30.patchapplication/octet-stream; name=remove_left_join_distinct_2017-11-30.patchDownload

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 5783f90..615ae59 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -150,10 +150,12 @@ clause_sides_match_join(RestrictInfo *rinfo, Relids outerrelids,
  *	  it will just duplicate its left input.
  *
  * This is true for a left join for which the join condition cannot match
- * more than one inner-side row.  (There are other possibly interesting
- * cases, but we don't have the infrastructure to prove them.)  We also
- * have to check that the inner side doesn't generate any variables needed
- * above the join.
+ * more than one inner-side row.  We can also allow removal of joins to
+ * relations that may match more than one inner-side row if a DISTINCT or
+ * GROUP BY clause would subsequently remove any duplicates caused by the
+ * join. (There are other possibly interesting cases, but we don't have the
+ * infrastructure to prove them.)  We also have to check that the inner side
+ * doesn't generate any variables needed above the join.
  */
 static bool
 join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
@@ -181,9 +183,11 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	/*
 	 * Before we go to the effort of checking whether any innerrel variables
 	 * are needed above the join, make a quick check to eliminate cases in
-	 * which we will surely be unable to prove uniqueness of the innerrel.
+	 * which we will surely be unable to remove the join.
 	 */
-	if (!rel_supports_distinctness(root, innerrel))
+	if (!rel_supports_distinctness(root, innerrel) &&
+		root->parse->distinctClause == NIL &&
+		(root->parse->groupClause == NIL || root->parse->hasAggs))
 		return false;
 
 	/* Compute the relid set for the join we are considering */
@@ -237,6 +241,22 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	}
 
 	/*
+	 * When a DISTINCT, DISTINCT ON or GROUP BY clause is present, the
+	 * unreferenced relation's join has no ability to duplicate rows in final
+	 * result set, as any duplicate rows would be removed by the
+	 * DISTINCT/GROUP BY clause anyway.  However, we're unable to apply this
+	 * when aggregate functions are present as we must aggregate any
+	 * duplicated rows.  We needn't bother checking the actual distinct or
+	 * grouping expressions here, as we already know from the above checks
+	 * that no Var is present from the relation we're trying to remove.
+	 */
+	if (root->parse->distinctClause != NIL)
+		return true;
+
+	if (root->parse->groupClause != NIL && !root->parse->hasAggs)
+		return true;
+
+	/*
 	 * Search for mergejoinable clauses that constrain the inner rel against
 	 * either the outer rel or a pseudoconstant.  If an operator is
 	 * mergejoinable then it behaves like equality for some btree opclass, so
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index f02ef3f..75ad4e1 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4051,6 +4051,93 @@ select d.* from d left join (select id from a union select id from b) s
  Seq Scan on d
 (1 row)
 
+-- check join removal works without a unique index on the left joined table
+-- when a DISTINCT or GROUP BY is present which would remove row duplication
+explain (costs off)
+select distinct a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a;
+               QUERY PLAN                
+-----------------------------------------
+ HashAggregate
+   Group Key: a.id, a.b_id, b.id, b.c_id
+   ->  Hash Left Join
+         Hash Cond: (a.b_id = b.id)
+         ->  Seq Scan on a
+         ->  Hash
+               ->  Seq Scan on b
+(7 rows)
+
+-- as above but with DISTINCT ON
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+                QUERY PLAN                
+------------------------------------------
+ Unique
+   ->  Sort
+         Sort Key: a.id, b.id, a.b_id
+         ->  Hash Left Join
+               Hash Cond: (a.b_id = b.id)
+               ->  Seq Scan on a
+               ->  Hash
+                     ->  Seq Scan on b
+(8 rows)
+
+-- as above but with GROUP BY
+explain (costs off)
+select a.id,b.id from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+             QUERY PLAN             
+------------------------------------
+ HashAggregate
+   Group Key: a.id, b.id
+   ->  Hash Left Join
+         Hash Cond: (a.b_id = b.id)
+         ->  Seq Scan on a
+         ->  Hash
+               ->  Seq Scan on b
+(7 rows)
+
+-- also ensure the removal works with other relation types.
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a
+left join b on a.b_id = b.id
+left join (values(1),(1)) d(a)
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+                QUERY PLAN                
+------------------------------------------
+ Unique
+   ->  Sort
+         Sort Key: a.id, b.id, a.b_id
+         ->  Hash Left Join
+               Hash Cond: (a.b_id = b.id)
+               ->  Seq Scan on a
+               ->  Hash
+                     ->  Seq Scan on b
+(8 rows)
+
+-- ensure no join removal when we must aggregate any duplicated rows
+explain (costs off)
+select a.id,b.id,count(*) from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+                   QUERY PLAN                   
+------------------------------------------------
+ HashAggregate
+   Group Key: a.id, b.id
+   ->  Merge Left Join
+         Merge Cond: (a.b_id = d.a)
+         ->  Sort
+               Sort Key: a.b_id
+               ->  Hash Left Join
+                     Hash Cond: (a.b_id = b.id)
+                     ->  Seq Scan on a
+                     ->  Hash
+                           ->  Seq Scan on b
+         ->  Sort
+               Sort Key: d.a
+               ->  Seq Scan on d
+(14 rows)
+
 -- check join removal with a cross-type comparison operator
 explain (costs off)
 select i8.* from int8_tbl i8 left join (select f1 from int4_tbl group by f1) i4
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index cbb71c5..31d6ff1 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1331,6 +1331,34 @@ explain (costs off)
 select d.* from d left join (select id from a union select id from b) s
   on d.a = s.id;
 
+-- check join removal works without a unique index on the left joined table
+-- when a DISTINCT or GROUP BY is present which would remove row duplication
+explain (costs off)
+select distinct a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a;
+
+-- as above but with DISTINCT ON
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+
+-- as above but with GROUP BY
+explain (costs off)
+select a.id,b.id from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+
+-- also ensure the removal works with other relation types.
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a
+left join b on a.b_id = b.id
+left join (values(1),(1)) d(a)
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+
+-- ensure no join removal when we must aggregate any duplicated rows
+explain (costs off)
+select a.id,b.id,count(*) from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+
 -- check join removal with a cross-type comparison operator
 explain (costs off)
 select i8.* from int8_tbl i8 left join (select f1 from int4_tbl group by f1) i4

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 8 years ago

In reply to: David Rowley (#4)

Re: [HACKERS] Removing LEFT JOINs in more cases

On Thu, Nov 30, 2017 at 12:20 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

Thanks for looking over this and my apologies for the delay in my
response. I've been on leave and out of range of the internet for most
of that time.

On 23 November 2017 at 02:30, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
@@ -597,15 +615,25 @@ rel_supports_distinctness(PlannerInfo *root,
RelOptInfo *rel)
+        if (root->parse->distinctClause != NIL)
+            return true;
+
+        if (root->parse->groupClause != NIL && !root->parse->hasAggs)
+            return true;
+
The other callers of rel_supports_distinctness() are looking for distinctness
of the given relation, whereas the code change here are applicable to any
relation, not just the given relation. I find that confusing. Looking at the
way we are calling rel_supports_distinctness() in join_is_removable() this
change looks inevitable, but still if we could reduce the confusion, that will
be good. Also if we could avoid duplication of comment about unique index, that
will be good.
When I put this together I'd considered leaving
rel_supports_distinctness() alone since the other uses of the function
can't make use of the new checks. Since you mention this, then I think
it's likely better just to go with that and just special case the code
in join_is_removable() to check for rel_supports_distinctness() and
the DISTINCT/GROUP BY clauses too. This will mean less false positives
for usages such as in innerrel_is_unique() which would end up causing
a bit of extra work before possibly failing a bit later on. I've made
changes in the attached to do things this way.

May be you want to add a testcase with DISTINCT ON.

Good idea. I've also added a test case for this, and also one to
ensure non-RTE_RELATION relations can be removed too.

Updated patch attached.

Thanks again for the review.

Thanks for considering those suggestions. The patch looks good to me.
It applies cleanly, compiles on my laptop without warnings or errors.
make check does not show any failures. Marking it as ready for
committer.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

David Rowley

david.rowley@2ndquadrant.com

about 8 years ago

In reply to: Ashutosh Bapat (#5)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 1 December 2017 at 01:40, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The patch looks good to me.
It applies cleanly, compiles on my laptop without warnings or errors.
make check does not show any failures. Marking it as ready for
committer.

Great. Many thanks for the review!

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Tom Lane

tgl@sss.pgh.pa.us

about 8 years ago

In reply to: David Rowley (#4)

Re: [HACKERS] Removing LEFT JOINs in more cases

David Rowley <david.rowley@2ndquadrant.com> writes:

[ remove_left_join_distinct_2017-11-30.patch ]

I looked at this, and it seems like an interesting idea, but I'm
unconvinced that the specific conditions here are OK.

One clear problem is that you consider a DISTINCT to be an automatic
get-out-of-jail-free card, but that doesn't work in the presence of
aggregates:

select distinct count(*) from a left join b on ...

So at the very least it needs to be more like

if ((root->parse->distinctClause != NIL ||
root->parse->groupClause != NIL) && !root->parse->hasAggs)

I believe it is probably necessary to reject this optimization if window
functions are present, as well, because DISTINCT would not happen till
after the wfunc calculation and wfuncs are definitely sensitive to the
number of rows they see. (But maybe wfuncs + GROUP BY are OK.)

Another case that seems less than safe is if the proposed-to-be-dropped
left join is within the LHS of any LATERAL join. For instance,

select distinct ... from (a left join b ...), lateral nextval('foo');

The row duplication from b would affect how often nextval gets called,
and the presence of the DISTINCT later on doesn't entitle us to change
that. (Maybe it'd be all right to perform the optimization if the
lateral function is known stable/immutable, but I think that's getting
too far afield; I'd be inclined to just drop the idea as soon as we
see a lateral dependency.)

Actually, that same case occurs for volatile functions in the tlist,
ie

select distinct nextval('foo') from a left join b ...

The presence of the DISTINCT again doesn't excuse changing how often
nextval() gets called.

I kinda doubt this list of counterexamples is exhaustive, either;
it's just what occurred to me in five or ten minutes thought.
So maybe you can make this idea work, but you need to think much
harder about what the counterexamples are.

As far as code structure goes, it's surely not going to be sane to
have two copies of the checking logic, much less try to cram one
of them into a single if. I'd suggest factoring it out into a
separate function that can be called from both places.

I'll set the patch back to Waiting on Author.

regards, tom lane

David Rowley

david.rowley@2ndquadrant.com

about 8 years ago

In reply to: Tom Lane (#7)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 10 January 2018 at 08:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'll set the patch back to Waiting on Author.

Many thanks for looking at this. I'll try to resolve the things you've
mentioned this coming weekend.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Andres Freund

andres@anarazel.de

almost 8 years ago

In reply to: David Rowley (#8)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 2018-01-10 11:14:27 +1300, David Rowley wrote:

On 10 January 2018 at 08:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'll set the patch back to Waiting on Author.

Many thanks for looking at this. I'll try to resolve the things you've
mentioned this coming weekend.

This hasn't happened yet. As the last CF is already in progress I'm
unfortunately inclined to mark this returned with feedback?

Andres Freund

#10

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Andres Freund (#9)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 2 March 2018 at 09:49, Andres Freund <andres@anarazel.de> wrote:

On 2018-01-10 11:14:27 +1300, David Rowley wrote:

On 10 January 2018 at 08:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'll set the patch back to Waiting on Author.

Many thanks for looking at this. I'll try to resolve the things you've
mentioned this coming weekend.

This hasn't happened yet. As the last CF is already in progress I'm
unfortunately inclined to mark this returned with feedback?

I'm planning on making the required changes at the weekend.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11

Andres Freund

andres@anarazel.de

almost 8 years ago

In reply to: David Rowley (#10)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 2018-03-02 09:53:38 +1300, David Rowley wrote:

On 2 March 2018 at 09:49, Andres Freund <andres@anarazel.de> wrote:

On 2018-01-10 11:14:27 +1300, David Rowley wrote:

On 10 January 2018 at 08:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'll set the patch back to Waiting on Author.

Many thanks for looking at this. I'll try to resolve the things you've
mentioned this coming weekend.

This hasn't happened yet. As the last CF is already in progress I'm
unfortunately inclined to mark this returned with feedback?

I'm planning on making the required changes at the weekend.

Sorry if I'm grumpy, but why shouldn't this just mean it's slipped to
the next fest?

- Andres

#12

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Andres Freund (#11)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 2 March 2018 at 09:56, Andres Freund <andres@anarazel.de> wrote:

On 2018-03-02 09:53:38 +1300, David Rowley wrote:

On 2 March 2018 at 09:49, Andres Freund <andres@anarazel.de> wrote:

On 2018-01-10 11:14:27 +1300, David Rowley wrote:

On 10 January 2018 at 08:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'll set the patch back to Waiting on Author.

Many thanks for looking at this. I'll try to resolve the things you've
mentioned this coming weekend.

This hasn't happened yet. As the last CF is already in progress I'm
unfortunately inclined to mark this returned with feedback?

I'm planning on making the required changes at the weekend.

Sorry if I'm grumpy, but why shouldn't this just mean it's slipped to
the next fest?

Because in most of the world it's day 1 of the commitfest.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#13

Andres Freund

andres@anarazel.de

almost 8 years ago

In reply to: David Rowley (#12)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 2018-03-02 09:57:35 +1300, David Rowley wrote:

On 2 March 2018 at 09:56, Andres Freund <andres@anarazel.de> wrote:

On 2018-03-02 09:53:38 +1300, David Rowley wrote:

I'm planning on making the required changes at the weekend.

Sorry if I'm grumpy, but why shouldn't this just mean it's slipped to
the next fest?

Because in most of the world it's day 1 of the commitfest.

If a patch hasn't been updated after moved waiting-on-author from the
last CF, and the next CF started, that seems too late. Particularly in
the last CF.

Greetings,

Andres Freund

#14

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Andres Freund (#13)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 2 March 2018 at 09:59, Andres Freund <andres@anarazel.de> wrote:

If a patch hasn't been updated after moved waiting-on-author from the
last CF, and the next CF started, that seems too late. Particularly in
the last CF.

I understand that I should have resolved these issues before the
commitfest, so please do what you see fit, but...

One observation is that it appears you're already running damage
control to triage the patches that will certainly not make it. I don't
think this particular patch is overly complex, so is that really a
good thing to do on the first of the month? It's certainly not that
encouraging for patch authors... I'd have imagined encouragement to be
better tactic at this stage, but will defer to your better judgment.

For the patch, it's a weekender patch for me, not related to 2q work.
Time is just harder to find, but I happen to have some this weekend
which I'd really like to dedicate to the patch. If you're willing to
hold fire until Sunday, if I've not done the work before then I'll
personally return it with feedback.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15

Andres Freund

andres@anarazel.de

almost 8 years ago

In reply to: David Rowley (#14)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 2018-03-02 10:14:52 +1300, David Rowley wrote:

One observation is that it appears you're already running damage
control to triage the patches that will certainly not make it. I don't
think this particular patch is overly complex, so is that really a
good thing to do on the first of the month? It's certainly not that
encouraging for patch authors... I'd have imagined encouragement to be
better tactic at this stage, but will defer to your better judgment.

Well, we have a 250 patch fest, with relatively few people doing active
review and commit work. We're going to make unpleasant choice :(

I definitely agree it'd be a lot nicer if weren't as pressed for serious
review and committer cycles :(

For the patch, it's a weekender patch for me, not related to 2q work.
Time is just harder to find, but I happen to have some this weekend
which I'd really like to dedicate to the patch. If you're willing to
hold fire until Sunday, if I've not done the work before then I'll
personally return it with feedback.

Ok.

Greetings,

Andres Freund

#16

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Tom Lane (#7)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 10 January 2018 at 08:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

select distinct nextval('foo') from a left join b ...

The presence of the DISTINCT again doesn't excuse changing how often
nextval() gets called.

I kinda doubt this list of counterexamples is exhaustive, either;
it's just what occurred to me in five or ten minutes thought.
So maybe you can make this idea work, but you need to think much
harder about what the counterexamples are.

While working on the cases where the join removal should be disallowed
I discovered that the existing code is not too careful about this
either:

drop table if exists t1;

create table t1 (a int);
insert into t1 values(1);

create or replace function notice(pn int) returns int as $$
begin
raise notice '%', pn;
return pn;
end;
$$ volatile language plpgsql;

create unique index t1_a_uidx on t1(a);

explain (costs off, analyze, timing off, summary off)
select t1.a from t1 left join t1 t2 on t1.a = t2.a and notice(t2.a) = t1.a;
QUERY PLAN
----------------------------------------
Seq Scan on t1 (actual rows=1 loops=1)
(1 row)

drop index t1_a_uidx; -- drop the index to disallow left join removal.

explain (costs off, analyze, timing off, summary off)
select t1.a from t1 left join t1 t2 on t1.a = t2.a and notice(t2.a) = t1.a;
NOTICE: 1
QUERY PLAN
----------------------------------------------------------
Nested Loop Left Join (actual rows=1 loops=1)
Join Filter: ((t1.a = t2.a) AND (notice(t2.a) = t1.a))
-> Seq Scan on t1 (actual rows=1 loops=1)
-> Seq Scan on t1 t2 (actual rows=1 loops=1)
(4 rows)

Should this be fixed? or is this case somehow not worth worrying about?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: David Rowley (#16)

2 attachment(s)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 4 March 2018 at 18:35, David Rowley <david.rowley@2ndquadrant.com> wrote:

drop table if exists t1;

create table t1 (a int);
insert into t1 values(1);

create or replace function notice(pn int) returns int as $$
begin
raise notice '%', pn;
return pn;
end;
$$ volatile language plpgsql;

create unique index t1_a_uidx on t1(a);

explain (costs off, analyze, timing off, summary off)
select t1.a from t1 left join t1 t2 on t1.a = t2.a and notice(t2.a) = t1.a;
QUERY PLAN
----------------------------------------
Seq Scan on t1 (actual rows=1 loops=1)
(1 row)

drop index t1_a_uidx; -- drop the index to disallow left join removal.

explain (costs off, analyze, timing off, summary off)
select t1.a from t1 left join t1 t2 on t1.a = t2.a and notice(t2.a) = t1.a;
NOTICE: 1
QUERY PLAN
----------------------------------------------------------
Nested Loop Left Join (actual rows=1 loops=1)
Join Filter: ((t1.a = t2.a) AND (notice(t2.a) = t1.a))
-> Seq Scan on t1 (actual rows=1 loops=1)
-> Seq Scan on t1 t2 (actual rows=1 loops=1)
(4 rows)

Should this be fixed? or is this case somehow not worth worrying about?

Please find attached two patches. The first of which is intended to
resolve the issue mentioned above with consideration that it may need
to be back-patched to where LEFT JOIN removals where introduced.

Patch two is intended to implement LEFT JOIN removal for cases that
any duplicates rows that the join causes would be subsequently removed
again via a GROUP BY or DISTINCT clause.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v2-0001-Disallow-LEFT-JOIN-removal-when-join-or-base-qual.patchapplication/octet-stream; name=v2-0001-Disallow-LEFT-JOIN-removal-when-join-or-base-qual.patchDownload

From afaa0ccea09fddf7bd1114d0aaba9febaff3b245 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Sun, 4 Mar 2018 21:15:57 +1300
Subject: [PATCH v2 1/2] Disallow LEFT JOIN removal when join or base quals
 have volatile functions

Removing joins in such cases can cause volatile functions not to be executed
at all.
---
 src/backend/optimizer/plan/analyzejoins.c | 18 ++++++++++++++++
 src/test/regress/expected/join.out        | 34 +++++++++++++++++++++++++++++++
 src/test/regress/sql/join.sql             | 17 ++++++++++++++++
 3 files changed, 69 insertions(+)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index ef25fefa455..40f284b29c9 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -28,6 +28,7 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
+#include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "optimizer/var.h"
 #include "utils/lsyscache.h"
@@ -236,6 +237,16 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 			return false;		/* it does reference innerrel */
 	}
 
+	/*
+	 * Join removal must be disallowed if the baserestrictinfos contain any
+	 * volatile functions.  If we removed this join then there's no chance of
+	 * these functions being executed and some side affects of these may be
+	 * required.
+	 */
+	if (contain_volatile_functions((Node *)
+				extract_actual_clauses(innerrel->baserestrictinfo, false)))
+		return false;
+
 	/*
 	 * Search for mergejoinable clauses that constrain the inner rel against
 	 * either the outer rel or a pseudoconstant.  If an operator is
@@ -267,6 +278,13 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 			continue;			/* else, ignore; not useful here */
 		}
 
+		/*
+		 * Disallow the join removal completely if the join clause contains
+		 * any volatile functions.
+		 */
+		if (contain_volatile_functions((Node *) restrictinfo->clause))
+			return false;
+
 		/* Ignore if it's not a mergejoinable clause */
 		if (!restrictinfo->can_join ||
 			restrictinfo->mergeopfamilies == NIL)
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 4d5931d67ef..76f31fad103 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4053,6 +4053,12 @@ INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
 INSERT INTO d VALUES (1,3), (2,2), (3,1);
+CREATE OR REPLACE FUNCTION notice(pn INT) RETURNS INT AS $$
+BEGIN
+	RAISE NOTICE '%', pn;
+	RETURN pn;
+END;
+$$ VOLATILE LANGUAGE plpgsql;
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
   QUERY PLAN   
@@ -4162,6 +4168,34 @@ select i8.* from int8_tbl i8 left join (select f1 from int4_tbl group by f1) i4
  Seq Scan on int8_tbl i8
 (1 row)
 
+-- ensure the join removal is disallowed when there are any volatile functions
+-- in the base clauses to the removal rel.
+explain (costs off)
+select a.* from a left join b on a.b_id = b.id and notice(b.c_id) = 1;
+             QUERY PLAN             
+------------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.b_id)
+   ->  Seq Scan on b
+         Filter: (notice(c_id) = 1)
+   ->  Hash
+         ->  Seq Scan on a
+(6 rows)
+
+-- ensure the join removal is disallowed when there are any volatile functions
+-- in the join clauses.
+explain (costs off)
+select a.* from a left join b on a.b_id = b.id and a.b_id = notice(b.id);
+               QUERY PLAN               
+----------------------------------------
+ Hash Left Join
+   Hash Cond: (a.b_id = b.id)
+   Join Filter: (a.b_id = notice(b.id))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+(6 rows)
+
 -- check join removal with lateral references
 explain (costs off)
 select 1 from (select a.id FROM a left join b on a.b_id = b.id) q,
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 30dfde223ef..4be64eec293 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1328,6 +1328,13 @@ INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
 INSERT INTO d VALUES (1,3), (2,2), (3,1);
 
+CREATE OR REPLACE FUNCTION notice(pn INT) RETURNS INT AS $$
+BEGIN
+	RAISE NOTICE '%', pn;
+	RETURN pn;
+END;
+$$ VOLATILE LANGUAGE plpgsql;
+
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
 explain (costs off) SELECT b.* FROM b LEFT JOIN c ON b.c_id = c.id;
@@ -1376,6 +1383,16 @@ explain (costs off)
 select i8.* from int8_tbl i8 left join (select f1 from int4_tbl group by f1) i4
   on i8.q1 = i4.f1;
 
+-- ensure the join removal is disallowed when there are any volatile functions
+-- in the base clauses to the removal rel.
+explain (costs off)
+select a.* from a left join b on a.b_id = b.id and notice(b.c_id) = 1;
+
+-- ensure the join removal is disallowed when there are any volatile functions
+-- in the join clauses.
+explain (costs off)
+select a.* from a left join b on a.b_id = b.id and a.b_id = notice(b.id);
+
 -- check join removal with lateral references
 explain (costs off)
 select 1 from (select a.id FROM a left join b on a.b_id = b.id) q,
-- 
2.16.2.windows.1

v2-0002-Allow-LEFT-JOINs-to-be-removed-in-more-cases.patchapplication/octet-stream; name=v2-0002-Allow-LEFT-JOINs-to-be-removed-in-more-cases.patchDownload

From 31edfa296a81431e2c68401da633c4a3df4eff98 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Sun, 4 Mar 2018 21:32:57 +1300
Subject: [PATCH v2 2/2] Allow LEFT JOINs to be removed in more cases

Normally, LEFT JOINs to base rels can only be removed if none of the columns
from the joined relation are used in the query and if the relation has a
unique index which proves the join cannot duplicate any rows on the other side
of the join.

This commit relaxes this a little to no longer require the unique index to be
present.  Alternative proofs are used in the form of DISTINCT or GROUP BY
clauses.  The idea here being that it does not matter if the join causes row
duplicate if some DISTINCT/GROUP BY will subsequently remove those duplicate
rows again.  However, there are still cases that we need to beware of and not
allow this optimization to take place.  For example, if there are aggregate
functions in the query then these will be sensitive to these additional
duplicated rows, the case is the same for windowing functions.

On successful join removal of such a case the benefits are three-fold:

1. The join search becomes more simple.
2. No waste of effort joining the relation in the first place.
3. No waste of effort removing the additional duplicate rows caused by the
   join.
---
 src/backend/optimizer/plan/analyzejoins.c | 144 +++++++++++++++++++++-
 src/test/regress/expected/join.out        | 193 ++++++++++++++++++++++++++++++
 src/test/regress/sql/join.sql             |  71 +++++++++++
 3 files changed, 402 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 40f284b29c9..1ca6f5b125a 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -38,6 +38,8 @@ static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
+static bool query_distinctifies_rows(Query *query);
+static bool rel_distinctified_after_join(PlannerInfo *root, RelOptInfo *rel);
 static bool rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel);
 static bool rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel,
 					List *clause_list);
@@ -151,10 +153,12 @@ clause_sides_match_join(RestrictInfo *rinfo, Relids outerrelids,
  *	  it will just duplicate its left input.
  *
  * This is true for a left join for which the join condition cannot match
- * more than one inner-side row.  (There are other possibly interesting
- * cases, but we don't have the infrastructure to prove them.)  We also
- * have to check that the inner side doesn't generate any variables needed
- * above the join.
+ * more than one inner-side row.  We can also allow removal of joins to
+ * relations that may match more than one inner-side row if a DISTINCT or
+ * GROUP BY clause would subsequently remove any duplicates caused by the
+ * join. (There are other possibly interesting cases, but we don't have the
+ * infrastructure to prove them.)  We also have to check that the inner side
+ * doesn't generate any variables needed above the join.
  */
 static bool
 join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
@@ -182,9 +186,10 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	/*
 	 * Before we go to the effort of checking whether any innerrel variables
 	 * are needed above the join, make a quick check to eliminate cases in
-	 * which we will surely be unable to prove uniqueness of the innerrel.
+	 * which we will surely be unable to remove the join.
 	 */
-	if (!rel_supports_distinctness(root, innerrel))
+	if (!rel_supports_distinctness(root, innerrel) &&
+		!query_distinctifies_rows(root->parse))
 		return false;
 
 	/* Compute the relid set for the join we are considering */
@@ -302,6 +307,14 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 		clause_list = lappend(clause_list, restrictinfo);
 	}
 
+	/*
+	 * Check to see if the relation cannot possibly cause any duplication in
+	 * the result set due to the query having a DISTINCT or GROUP BY clause
+	 * that would eliminate any duplicate rows produced by the join anyway.
+	 */
+	if (rel_distinctified_after_join(root, innerrel))
+		return true;
+
 	/*
 	 * Now that we have the relevant equality join clauses, try to prove the
 	 * innerrel distinct.
@@ -594,6 +607,125 @@ reduce_unique_semijoins(PlannerInfo *root)
 	}
 }
 
+/*
+ * query_distinctifies_rows
+ *		Returns true if the query has any properties that allow duplicate
+ *		results to be removed.
+ *
+ * This function is just a pre-check function to check if the query possibly
+ * has some properties that disallow duplicate rows from appearing in the
+ * result set.  More specifically we're interested in knowing if we can
+ * possibly exercise any means to have fewer duplicate rows in the result set
+ * prior to the de-duplication process.
+ *
+ * It's not critical that we always return false when this is not possible, as
+ * this function just serves as a pre-check to prevent further checks going
+ * ahead with expensive processing if we have no chance of success.
+ *
+ * If this function returns false then rel_tuples_needed must also return
+ * false.
+ */
+static bool
+query_distinctifies_rows(Query *query)
+{
+	/* a query with aggs or wfuncs needs all input rows from the join */
+	if (query->hasAggs || query->hasWindowFuncs)
+		return false;
+
+	/* no distinctification possible if there's no DISTINCT or GROUP BY */
+	if (query->distinctClause == NIL && query->groupClause == NIL)
+		return false;
+
+	/*
+	 * We shouldn't waste too much effort here as rel_distinctified_after_join
+	 * serves as the final check, but perhaps there's some other things common
+	 * enough and cheap enough to check here?
+	 */
+	return true;
+}
+
+/*
+ * rel_distinctified_after_join
+ *		Returns false if tuples from 'rel' are required as input for any part
+ *		of the plan.
+ *
+ * Our aim here is to determine if we can remove the join to 'rel' without
+ * having any affect on the final results.  If 'rel's join partner can match
+ * any single row to more than one of rel's output rows, then the rel would
+ * normally be required to maintain the correct result set.  However, if the
+ * query has a GROUP BY or DISTINCT clause then any row duplication caused
+ * will be removed again.  To complicate the matter we also need to ensure
+ * that the additional rows are not required for anything before the final
+ * resultset is returned.  For example, aggregate functions will see any
+ * duplicate rows, as will windowing functions.  We must also be careful not
+ * to change the number of times a volatile function is evaulated.
+ *
+ * We must only return true if we're completely certain that the tuples are
+ * not needed. If uncertain we'll play it safe and return false.
+ *
+ * Note: If making changes here then query_distinctifies_rows may need updated
+ */
+static bool
+rel_distinctified_after_join(PlannerInfo *root, RelOptInfo *rel)
+{
+	Query	   *query = root->parse;
+	Index		rti;
+
+	/* Aggs and window funcs are sensitive to row duplication from joins */
+	if (query->hasAggs || query->hasWindowFuncs)
+		return false;
+
+	/*
+	 * Ensure we have a DISTINCT or GROUP BY clause to remove duplicates. If
+	 * we have a GROUP BY clause then we'd better make sure there are not any
+	 * volatile functions in it.  If we have both a GROUP BY and a DISTINCT
+	 * clause and we have no aggregates then we shouldn't care too much about
+	 * volatile functions in the distinctClauses as GROUP BY will have done
+	 * all the distinctification work and all the columns in the targetlist
+	 * must be functionally dependant on the GROUP BY clauses anyway.  In the
+	 * case for DISTINCT ON the remaining targetlist items will be evaulated
+	 * after the results have been made distinct, so are only going to be
+	 * evaulated once whether we remove the join or not.
+	 */
+	if (query->groupClause == NIL)
+	{
+		if (query->distinctClause == NIL)
+			return false;
+		else if (contain_volatile_functions((Node *) query->distinctClause))
+			return false;
+	}
+	else if (contain_volatile_functions((Node *) query->groupClause))
+		return false;
+
+	/*
+	 * We must disallow the removal of the join if there are any LATERAL joins
+	 * to a volatile function using Vars from any relation at this level.  If
+	 * we removed these then it could reduce the number of times that the
+	 * volatile function is evaluated.
+	 */
+	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	{
+		RelOptInfo *brel = root->simple_rel_array[rti];
+		RangeTblEntry *rte;
+
+		/* there may be empty slots corresponding to non-baserel RTEs */
+		if (brel == NULL)
+			continue;
+
+		Assert(brel->relid == rti); /* sanity check on array */
+
+		/* ignore RTEs that are "other rels" */
+		if (brel->reloptkind != RELOPT_BASEREL)
+			continue;
+
+		rte = root->simple_rte_array[rti];
+		if (rte->rtekind == RTE_FUNCTION && brel->lateral_vars != NIL &&
+			contain_volatile_functions((Node *) rte->functions))
+			return false;
+	}
+
+	return true;		/* relation not needed */
+}
 
 /*
  * rel_supports_distinctness
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 76f31fad103..f921a1f0d14 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4208,6 +4208,199 @@ select 1 from (select a.id FROM a left join b on a.b_id = b.id) q,
          Filter: (a.id = i)
 (4 rows)
 
+-- check join removal works without a unique index on the left joined table
+-- when a DISTINCT or GROUP BY is present which would remove row duplication
+explain (costs off)
+select distinct a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a;
+               QUERY PLAN                
+-----------------------------------------
+ HashAggregate
+   Group Key: a.id, a.b_id, b.id, b.c_id
+   ->  Hash Left Join
+         Hash Cond: (a.b_id = b.id)
+         ->  Seq Scan on a
+         ->  Hash
+               ->  Seq Scan on b
+(7 rows)
+
+-- as above but with DISTINCT ON
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+                QUERY PLAN                
+------------------------------------------
+ Unique
+   ->  Sort
+         Sort Key: a.id, b.id, a.b_id
+         ->  Hash Left Join
+               Hash Cond: (a.b_id = b.id)
+               ->  Seq Scan on a
+               ->  Hash
+                     ->  Seq Scan on b
+(8 rows)
+
+-- as above but with GROUP BY
+explain (costs off)
+select a.id,b.id from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+             QUERY PLAN             
+------------------------------------
+ HashAggregate
+   Group Key: a.id, b.id
+   ->  Hash Left Join
+         Hash Cond: (a.b_id = b.id)
+         ->  Seq Scan on a
+         ->  Hash
+               ->  Seq Scan on b
+(7 rows)
+
+-- also ensure the removal works with other relation types.
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a
+left join b on a.b_id = b.id
+left join (values(1),(1)) d(a)
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+                QUERY PLAN                
+------------------------------------------
+ Unique
+   ->  Sort
+         Sort Key: a.id, b.id, a.b_id
+         ->  Hash Left Join
+               Hash Cond: (a.b_id = b.id)
+               ->  Seq Scan on a
+               ->  Hash
+                     ->  Seq Scan on b
+(8 rows)
+
+-- ensure no join removal when we must aggregate any duplicated rows
+explain (costs off)
+select a.id,b.id,count(*) from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+                   QUERY PLAN                   
+------------------------------------------------
+ HashAggregate
+   Group Key: a.id, b.id
+   ->  Merge Left Join
+         Merge Cond: (a.b_id = d.a)
+         ->  Sort
+               Sort Key: a.b_id
+               ->  Hash Left Join
+                     Hash Cond: (a.b_id = b.id)
+                     ->  Seq Scan on a
+                     ->  Hash
+                           ->  Seq Scan on b
+         ->  Sort
+               Sort Key: d.a
+               ->  Seq Scan on d
+(14 rows)
+
+-- removal of left joins using distinct or group by instead of a unique index
+-- matching the join condition of the left joined table must be disallowed if
+-- the query contains a lateral joined volatile function.  A reduction in the
+-- number of rows produced by the join could affect the number of evaluations
+-- of the volatile function.
+create or replace function vseries(pstart int, pend int) returns setof int as $$
+begin
+	while pstart <= pend
+	loop
+		raise notice '%', pstart;
+		return next pstart;
+		pstart := pstart + 1;
+	end loop;
+end;
+$$ volatile language plpgsql;
+-- ensure join to "d" is not removed (DISTINCT)
+explain (costs off)
+select distinct a.* from a left join d on a.id = d.a, lateral vseries(a.id, a.id);
+              QUERY PLAN               
+---------------------------------------
+ HashAggregate
+   Group Key: a.id, a.b_id
+   ->  Nested Loop
+         ->  Hash Right Join
+               Hash Cond: (d.a = a.id)
+               ->  Seq Scan on d
+               ->  Hash
+                     ->  Seq Scan on a
+         ->  Function Scan on vseries
+(9 rows)
+
+-- ensure join to "d" is not removed (GROUP BY)
+explain (costs off)
+select a.id from a left join d on a.id = d.a, lateral vseries(a.id, a.id) group by a.id;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Group
+   Group Key: a.id
+   ->  Nested Loop
+         ->  Merge Left Join
+               Merge Cond: (a.id = d.a)
+               ->  Index Only Scan using a_pkey on a
+               ->  Sort
+                     Sort Key: d.a
+                     ->  Seq Scan on d
+         ->  Function Scan on vseries
+(10 rows)
+
+-- ensure that the join to "d" is removed when the function is not volatile (DISTINCT)
+explain (costs off)
+select distinct a.* from a left join d on a.id = d.a, lateral generate_series(a.id, a.id);
+                  QUERY PLAN                  
+----------------------------------------------
+ HashAggregate
+   Group Key: a.id, a.b_id
+   ->  Nested Loop
+         ->  Seq Scan on a
+         ->  Function Scan on generate_series
+(5 rows)
+
+-- ensure that the join to "d" is removed when the function is not volatile (GROUP BY)
+explain (costs off)
+select a.id from a left join d on a.id = d.a, lateral generate_series(a.id, a.id) group by a.id;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Group
+   Group Key: a.id
+   ->  Nested Loop
+         ->  Index Only Scan using a_pkey on a
+         ->  Function Scan on generate_series
+(5 rows)
+
+-- ensure join to "d" can be removed despite the volatile function in the target list.
+explain (costs off)
+select distinct on (a.id) notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Result
+   ->  Unique
+         ->  Index Only Scan using a_pkey on a
+(3 rows)
+
+-- ensure the volatile function is just executed once per a.id
+select distinct on (a.id) notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+NOTICE:  0
+NOTICE:  1
+ notice 
+--------
+      0
+      1
+(2 rows)
+
+-- and it should be evaulated multiple times when there's no distinct on.
+select notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+NOTICE:  0
+NOTICE:  0
+NOTICE:  0
+NOTICE:  1
+ notice 
+--------
+      0
+      0
+      0
+      1
+(4 rows)
+
 rollback;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 4be64eec293..4a53f9112de 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1398,6 +1398,77 @@ explain (costs off)
 select 1 from (select a.id FROM a left join b on a.b_id = b.id) q,
 			  lateral generate_series(1, q.id) gs(i) where q.id = gs.i;
 
+-- check join removal works without a unique index on the left joined table
+-- when a DISTINCT or GROUP BY is present which would remove row duplication
+explain (costs off)
+select distinct a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a;
+
+-- as above but with DISTINCT ON
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+
+-- as above but with GROUP BY
+explain (costs off)
+select a.id,b.id from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+
+-- also ensure the removal works with other relation types.
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a
+left join b on a.b_id = b.id
+left join (values(1),(1)) d(a)
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+
+-- ensure no join removal when we must aggregate any duplicated rows
+explain (costs off)
+select a.id,b.id,count(*) from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+
+-- removal of left joins using distinct or group by instead of a unique index
+-- matching the join condition of the left joined table must be disallowed if
+-- the query contains a lateral joined volatile function.  A reduction in the
+-- number of rows produced by the join could affect the number of evaluations
+-- of the volatile function.
+
+create or replace function vseries(pstart int, pend int) returns setof int as $$
+begin
+	while pstart <= pend
+	loop
+		raise notice '%', pstart;
+		return next pstart;
+		pstart := pstart + 1;
+	end loop;
+end;
+$$ volatile language plpgsql;
+
+-- ensure join to "d" is not removed (DISTINCT)
+explain (costs off)
+select distinct a.* from a left join d on a.id = d.a, lateral vseries(a.id, a.id);
+
+-- ensure join to "d" is not removed (GROUP BY)
+explain (costs off)
+select a.id from a left join d on a.id = d.a, lateral vseries(a.id, a.id) group by a.id;
+
+-- ensure that the join to "d" is removed when the function is not volatile (DISTINCT)
+explain (costs off)
+select distinct a.* from a left join d on a.id = d.a, lateral generate_series(a.id, a.id);
+
+-- ensure that the join to "d" is removed when the function is not volatile (GROUP BY)
+explain (costs off)
+select a.id from a left join d on a.id = d.a, lateral generate_series(a.id, a.id) group by a.id;
+
+-- ensure join to "d" can be removed despite the volatile function in the target list.
+explain (costs off)
+select distinct on (a.id) notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+
+-- ensure the volatile function is just executed once per a.id
+select distinct on (a.id) notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+
+-- and it should be evaulated multiple times when there's no distinct on.
+select notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+
 rollback;
 
 create temp table parent (k int primary key, pd int);
-- 
2.16.2.windows.1

#18

Tom Lane

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: David Rowley (#16)

Re: [HACKERS] Removing LEFT JOINs in more cases

David Rowley <david.rowley@2ndquadrant.com> writes:

On 10 January 2018 at 08:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

select distinct nextval('foo') from a left join b ...
The presence of the DISTINCT again doesn't excuse changing how often
nextval() gets called.

While working on the cases where the join removal should be disallowed
I discovered that the existing code is not too careful about this
either:
[ a volatile function can be removed in a case like this ]
select t1.a from t1 left join t1 t2 on t1.a = t2.a and notice(t2.a) = t1.a;
Should this be fixed? or is this case somehow not worth worrying about?

I don't find that example troubling. The execution of functions in WHERE
has never been particularly constrained. Would you insist, say, that
notice() be evaluated for every pair of rows in the cross product of
t1 and t2, even the ones that don't pass the t1.a = t2.a condition?
Or for a more interesting example, consider these two cases:

select * from t1, t2 where notice(t2.a) = t1.a;
select * from t1, t2 where notice(t2.a) = t2.b;

With our current implementation, the first will result in executing
notice() for every row pair in the cross product, while the second
will evaluate it only once per row of t2, because the condition is
pushed down to the scan level. Should we stop doing that?

In short, the number of executions of functions in WHERE or JOIN/ON
isn't at all implementation-independent. We document this in
https://www.postgresql.org/docs/devel/static/sql-expressions.html#SYNTAX-EXPRESS-EVAL
where it says

It is particularly dangerous to rely on side effects or evaluation
order in WHERE and HAVING clauses, since those clauses are
extensively reprocessed as part of developing an execution
plan.

Maybe we ought to be more verbose there, but I don't care to abandon
the principle that we can reorder WHERE clauses, or skip the evaluation
of unnecessary clauses, as much as we want.

The case I was complaining about upthread involved volatiles in
the SELECT target list, which *does* have a well-defined number
of executions, ie once per row produced by the FROM/WHERE clause.

regards, tom lane

#19

David Rowley

david.rowley@2ndquadrant.com

almost 8 years ago

In reply to: Tom Lane (#18)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 5 March 2018 at 04:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

select * from t1, t2 where notice(t2.a) = t1.a;
select * from t1, t2 where notice(t2.a) = t2.b;

With our current implementation, the first will result in executing
notice() for every row pair in the cross product, while the second
will evaluate it only once per row of t2, because the condition is
pushed down to the scan level. Should we stop doing that?

I think we'd get complaints.

I admit to finding it difficult to know what the rules are here. For
example, qual_is_pushdown_safe() refuses to push volatile quals down
into subqueries. I'm a bit unsure why that's disallowed, but
reordering WHERE clause items containing volatile functions is
allowed.

Volatile functions in a subquery targetlist are not guaranteed to be
evaluated as we may push down a non-volatile qual into the subquery
causing fewer rows to be produced by the subquery resulting in fewer
evaluations of the targetlist expressions.

Here's an example of that happening:

EXPLAIN (COSTS OFF, ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT * FROM (SELECT id,notice(id) FROM t1 GROUP BY id) t1 WHERE t1.id = -10;
QUERY PLAN
----------------------------------------------
Group (actual rows=0 loops=1)
Group Key: t1.id
-> Seq Scan on t1 (actual rows=0 loops=1)
Filter: (id = '-10'::integer)
Rows Removed by Filter: 1
(5 rows)

-- use OFFSET 0 to block the pushdown of the non-volatile qual.
EXPLAIN (COSTS OFF, ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT * FROM (SELECT id,notice(id) FROM t1 GROUP BY id OFFSET 0) t1
WHERE t1.id = -10;
NOTICE: 1
QUERY PLAN
---------------------------------------------------------
Subquery Scan on t1 (actual rows=0 loops=1)
Filter: (t1.id = '-10'::integer)
Rows Removed by Filter: 1
-> HashAggregate (actual rows=1 loops=1)
Group Key: t1_1.id
-> Seq Scan on t1 t1_1 (actual rows=1 loops=1)
(6 rows)

In short, the number of executions of functions in WHERE or JOIN/ON
isn't at all implementation-independent. We document this in
https://www.postgresql.org/docs/devel/static/sql-expressions.html#SYNTAX-EXPRESS-EVAL
where it says

It is particularly dangerous to rely on side effects or evaluation
order in WHERE and HAVING clauses, since those clauses are
extensively reprocessed as part of developing an execution
plan.

Maybe we ought to be more verbose there, but I don't care to abandon
the principle that we can reorder WHERE clauses, or skip the evaluation
of unnecessary clauses, as much as we want.

The case I was complaining about upthread involved volatiles in
the SELECT target list, which *does* have a well-defined number
of executions, ie once per row produced by the FROM/WHERE clause.

I'm not so sure that's completely true. If you add an SRF then we
seem to get an extra evaluation per row.

# select generate_Series(1,2),notice(id) from t1;
NOTICE: 1
NOTICE: 1
NOTICE: 1
generate_series | notice
-----------------+--------
1 | 1
2 | 1
(2 rows)

You also complained about LATERAL joins to volatile functions, but
that too seems not guaranteed and seems to depend on the query plan
chosen, as highlighted in:

CREATE OR REPLACE FUNCTION notice(pn INT) RETURNS INT AS $$
BEGIN
RAISE NOTICE '%', pn;
RETURN pn;
END;
$$ VOLATILE LANGUAGE plpgsql;

CREATE TABLE t1 (id int);
CREATE TABLE t2 (a int);

INSERT INTO t1 values(1);
INSERT INTO t2 values(1),(1);

EXPLAIN (COSTS OFF, ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.a, LATERAL notice(t1.id);
NOTICE: 1
QUERY PLAN
-------------------------------------------------------------------
Merge Left Join (actual rows=2 loops=1)
Merge Cond: (t1.id = t2.a)
-> Sort (actual rows=1 loops=1)
Sort Key: t1.id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (actual rows=1 loops=1)
-> Seq Scan on t1 (actual rows=1 loops=1)
-> Function Scan on notice (actual rows=1 loops=1)
-> Sort (actual rows=2 loops=1)
Sort Key: t2.a
Sort Method: quicksort Memory: 25kB
-> Seq Scan on t2 (actual rows=2 loops=1)
(12 rows)

SET from_collapse_limit = 1;

EXPLAIN (COSTS OFF, ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.a, LATERAL notice(t1.id);
NOTICE: 1
NOTICE: 1
QUERY PLAN
----------------------------------------------------------
Nested Loop (actual rows=2 loops=1)
-> Merge Left Join (actual rows=2 loops=1)
Merge Cond: (t1.id = t2.a)
-> Sort (actual rows=1 loops=1)
Sort Key: t1.id
Sort Method: quicksort Memory: 25kB
-> Seq Scan on t1 (actual rows=1 loops=1)
-> Sort (actual rows=2 loops=1)
Sort Key: t2.a
Sort Method: quicksort Memory: 25kB
-> Seq Scan on t2 (actual rows=2 loops=1)
-> Function Scan on notice (actual rows=1 loops=2)
(12 rows)

RESET from_collapse_limit;

In short, I'm just a little confused at our guarantees. The best I
can see is that we guarantee a volatile function will be evaluated if
it's in the outer query's target list, providing the targetlist does
not have any set-returning functions present, which seems a long way
short of what the documents warn users about.

For now, if we ignore patch 0001, as neither of us is particularly
concerned about that. Patch 0002 might need to have the volatility
checks removed from the GROUP BY clause, but I'm a bit uncertain about
that given the examples I showed above.

Thanks for your feedback so far.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#20

David Rowley

david.rowley@2ndquadrant.com

over 7 years ago

In reply to: David Rowley (#17)

1 attachment(s)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 4 March 2018 at 21:43, David Rowley <david.rowley@2ndquadrant.com> wrote:

Please find attached two patches. The first of which is intended to
resolve the issue mentioned above with consideration that it may need
to be back-patched to where LEFT JOIN removals where introduced.

Patch two is intended to implement LEFT JOIN removal for cases that
any duplicates rows that the join causes would be subsequently removed
again via a GROUP BY or DISTINCT clause.

I've attached an updated version of this patch. This time there's only
1 patch. Per the discussion above about volatile functions, there's no
need to disallow the join removal when there are volatile functions in
the join clauses.

The only changes to the 0002 patch are now that it's named 0001 and
the tests were changed slightly as a plpgsql function was defined in
the old 0001 which we need for the new 0001.

http://commitfest.cputube.org was also complaining about the tests
failing. This was due to a plan changed caused by 1007b0a1. The plan
change was in the old 0001 patch, which I'm now dropping, so
hopefully, a new email with just the 1 patch will let cputube know
about me dropping the other one.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v3-0001-Allow-LEFT-JOINs-to-be-removed-in-more-cases.patchapplication/octet-stream; name=v3-0001-Allow-LEFT-JOINs-to-be-removed-in-more-cases.patchDownload

From a4df75a65fc51dd4b1aa31149ae68540de91f847 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Mon, 16 Jul 2018 01:30:56 +1200
Subject: [PATCH v3] Allow LEFT JOINs to be removed in more cases

Normally, LEFT JOINs to base rels can only be removed if none of the
columns from the joined relation are used in the query and if the relation
has a unique index which proves the join cannot duplicate any rows on the
other side of the join.

This commit relaxes this a little to no longer require the unique index to
be present.  Alternative proofs are used in the form of DISTINCT or GROUP
BY clauses.  The idea here being that it does not matter if the join causes
row duplicate if some DISTINCT/GROUP BY will subsequently remove those
duplicate rows again.  However, there are still cases that we need to
beware of and not allow this optimization to take place.  For example, if
there are aggregate functions in the query then these will be sensitive to
these additional duplicated rows, the case is the same for windowing
functions.

On successful join removal of such a case the benefits are three-fold:

1. The join search becomes more simple.
2. No waste of effort joining the relation in the first place.
3. No waste of effort removing the additional duplicate rows caused by the
   join.
---
 src/backend/optimizer/plan/analyzejoins.c | 144 ++++++++++++++++++++-
 src/test/regress/expected/join.out        | 199 ++++++++++++++++++++++++++++++
 src/test/regress/sql/join.sql             |  78 ++++++++++++
 3 files changed, 415 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 0e73f9cf4c..ff807b6e77 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -37,6 +37,8 @@ static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
+static bool query_distinctifies_rows(Query *query);
+static bool rel_distinctified_after_join(PlannerInfo *root, RelOptInfo *rel);
 static bool rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel);
 static bool rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel,
 					List *clause_list);
@@ -151,10 +153,12 @@ clause_sides_match_join(RestrictInfo *rinfo, Relids outerrelids,
  *	  it will just duplicate its left input.
  *
  * This is true for a left join for which the join condition cannot match
- * more than one inner-side row.  (There are other possibly interesting
- * cases, but we don't have the infrastructure to prove them.)  We also
- * have to check that the inner side doesn't generate any variables needed
- * above the join.
+ * more than one inner-side row.  We can also allow removal of joins to
+ * relations that may match more than one inner-side row if a DISTINCT or
+ * GROUP BY clause would subsequently remove any duplicates caused by the
+ * join. (There are other possibly interesting cases, but we don't have the
+ * infrastructure to prove them.)  We also have to check that the inner side
+ * doesn't generate any variables needed above the join.
  */
 static bool
 join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
@@ -182,9 +186,10 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	/*
 	 * Before we go to the effort of checking whether any innerrel variables
 	 * are needed above the join, make a quick check to eliminate cases in
-	 * which we will surely be unable to prove uniqueness of the innerrel.
+	 * which we will surely be unable to remove the join.
 	 */
-	if (!rel_supports_distinctness(root, innerrel))
+	if (!rel_supports_distinctness(root, innerrel) &&
+		!query_distinctifies_rows(root->parse))
 		return false;
 
 	/* Compute the relid set for the join we are considering */
@@ -284,6 +289,14 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 		clause_list = lappend(clause_list, restrictinfo);
 	}
 
+	/*
+	 * Check to see if the relation cannot possibly cause any duplication in
+	 * the result set due to the query having a DISTINCT or GROUP BY clause
+	 * that would eliminate any duplicate rows produced by the join anyway.
+	 */
+	if (rel_distinctified_after_join(root, innerrel))
+		return true;
+
 	/*
 	 * Now that we have the relevant equality join clauses, try to prove the
 	 * innerrel distinct.
@@ -576,6 +589,125 @@ reduce_unique_semijoins(PlannerInfo *root)
 	}
 }
 
+/*
+ * query_distinctifies_rows
+ *		Returns true if the query has any properties that allow duplicate
+ *		results to be removed.
+ *
+ * This function is just a pre-check function to check if the query possibly
+ * has some properties that disallow duplicate rows from appearing in the
+ * result set.  More specifically we're interested in knowing if we can
+ * possibly exercise any means to have fewer duplicate rows in the result set
+ * prior to the de-duplication process.
+ *
+ * It's not critical that we always return false when this is not possible, as
+ * this function just serves as a pre-check to prevent further checks going
+ * ahead with expensive processing if we have no chance of success.
+ *
+ * If this function returns false then rel_tuples_needed must also return
+ * false.
+ */
+static bool
+query_distinctifies_rows(Query *query)
+{
+	/* a query with aggs or wfuncs needs all input rows from the join */
+	if (query->hasAggs || query->hasWindowFuncs)
+		return false;
+
+	/* no distinctification possible if there's no DISTINCT or GROUP BY */
+	if (query->distinctClause == NIL && query->groupClause == NIL)
+		return false;
+
+	/*
+	 * We shouldn't waste too much effort here as rel_distinctified_after_join
+	 * serves as the final check, but perhaps there's some other things common
+	 * enough and cheap enough to check here?
+	 */
+	return true;
+}
+
+/*
+ * rel_distinctified_after_join
+ *		Returns false if tuples from 'rel' are required as input for any part
+ *		of the plan.
+ *
+ * Our aim here is to determine if we can remove the join to 'rel' without
+ * having any affect on the final results.  If 'rel's join partner can match
+ * any single row to more than one of rel's output rows, then the rel would
+ * normally be required to maintain the correct result set.  However, if the
+ * query has a GROUP BY or DISTINCT clause then any row duplication caused
+ * will be removed again.  To complicate the matter we also need to ensure
+ * that the additional rows are not required for anything before the final
+ * resultset is returned.  For example, aggregate functions will see any
+ * duplicate rows, as will windowing functions.  We must also be careful not
+ * to change the number of times a volatile function is evaulated.
+ *
+ * We must only return true if we're completely certain that the tuples are
+ * not needed. If uncertain we'll play it safe and return false.
+ *
+ * Note: If making changes here then query_distinctifies_rows may need updated
+ */
+static bool
+rel_distinctified_after_join(PlannerInfo *root, RelOptInfo *rel)
+{
+	Query	   *query = root->parse;
+	Index		rti;
+
+	/* Aggs and window funcs are sensitive to row duplication from joins */
+	if (query->hasAggs || query->hasWindowFuncs)
+		return false;
+
+	/*
+	 * Ensure we have a DISTINCT or GROUP BY clause to remove duplicates. If
+	 * we have a GROUP BY clause then we'd better make sure there are not any
+	 * volatile functions in it.  If we have both a GROUP BY and a DISTINCT
+	 * clause and we have no aggregates then we shouldn't care too much about
+	 * volatile functions in the distinctClauses as GROUP BY will have done
+	 * all the distinctification work and all the columns in the targetlist
+	 * must be functionally dependant on the GROUP BY clauses anyway.  In the
+	 * case for DISTINCT ON the remaining targetlist items will be evaulated
+	 * after the results have been made distinct, so are only going to be
+	 * evaulated once whether we remove the join or not.
+	 */
+	if (query->groupClause == NIL)
+	{
+		if (query->distinctClause == NIL)
+			return false;
+		else if (contain_volatile_functions((Node *) query->distinctClause))
+			return false;
+	}
+	else if (contain_volatile_functions((Node *) query->groupClause))
+		return false;
+
+	/*
+	 * We must disallow the removal of the join if there are any LATERAL joins
+	 * to a volatile function using Vars from any relation at this level.  If
+	 * we removed these then it could reduce the number of times that the
+	 * volatile function is evaluated.
+	 */
+	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	{
+		RelOptInfo *brel = root->simple_rel_array[rti];
+		RangeTblEntry *rte;
+
+		/* there may be empty slots corresponding to non-baserel RTEs */
+		if (brel == NULL)
+			continue;
+
+		Assert(brel->relid == rti); /* sanity check on array */
+
+		/* ignore RTEs that are "other rels" */
+		if (brel->reloptkind != RELOPT_BASEREL)
+			continue;
+
+		rte = root->simple_rte_array[rti];
+		if (rte->rtekind == RTE_FUNCTION && brel->lateral_vars != NIL &&
+			contain_volatile_functions((Node *) rte->functions))
+			return false;
+	}
+
+	return true;		/* relation not needed */
+}
 
 /*
  * rel_supports_distinctness
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index dc6262be43..b63b1166c2 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4231,6 +4231,205 @@ select 1 from (select a.id FROM a left join b on a.b_id = b.id) q,
          Filter: (a.id = i)
 (4 rows)
 
+-- check join removal works without a unique index on the left joined table
+-- when a DISTINCT or GROUP BY is present which would remove row duplication
+explain (costs off)
+select distinct a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a;
+               QUERY PLAN                
+-----------------------------------------
+ HashAggregate
+   Group Key: a.id, a.b_id, b.id, b.c_id
+   ->  Hash Left Join
+         Hash Cond: (a.b_id = b.id)
+         ->  Seq Scan on a
+         ->  Hash
+               ->  Seq Scan on b
+(7 rows)
+
+-- as above but with DISTINCT ON
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+                QUERY PLAN                
+------------------------------------------
+ Unique
+   ->  Sort
+         Sort Key: a.id, b.id, a.b_id
+         ->  Hash Left Join
+               Hash Cond: (a.b_id = b.id)
+               ->  Seq Scan on a
+               ->  Hash
+                     ->  Seq Scan on b
+(8 rows)
+
+-- as above but with GROUP BY
+explain (costs off)
+select a.id,b.id from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+             QUERY PLAN             
+------------------------------------
+ HashAggregate
+   Group Key: a.id, b.id
+   ->  Hash Left Join
+         Hash Cond: (a.b_id = b.id)
+         ->  Seq Scan on a
+         ->  Hash
+               ->  Seq Scan on b
+(7 rows)
+
+-- also ensure the removal works with other relation types.
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a
+left join b on a.b_id = b.id
+left join (values(1),(1)) d(a)
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+                QUERY PLAN                
+------------------------------------------
+ Unique
+   ->  Sort
+         Sort Key: a.id, b.id, a.b_id
+         ->  Hash Left Join
+               Hash Cond: (a.b_id = b.id)
+               ->  Seq Scan on a
+               ->  Hash
+                     ->  Seq Scan on b
+(8 rows)
+
+-- ensure no join removal when we must aggregate any duplicated rows
+explain (costs off)
+select a.id,b.id,count(*) from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+                   QUERY PLAN                   
+------------------------------------------------
+ HashAggregate
+   Group Key: a.id, b.id
+   ->  Merge Left Join
+         Merge Cond: (a.b_id = d.a)
+         ->  Sort
+               Sort Key: a.b_id
+               ->  Hash Left Join
+                     Hash Cond: (a.b_id = b.id)
+                     ->  Seq Scan on a
+                     ->  Hash
+                           ->  Seq Scan on b
+         ->  Sort
+               Sort Key: d.a
+               ->  Seq Scan on d
+(14 rows)
+
+-- removal of left joins using distinct or group by instead of a unique index
+-- matching the join condition of the left joined table must be disallowed if
+-- the query contains a lateral joined volatile function.  A reduction in the
+-- number of rows produced by the join could affect the number of evaluations
+-- of the volatile function.
+create or replace function vseries(pstart int, pend int) returns setof int as $$
+begin
+	while pstart <= pend
+	loop
+		raise notice '%', pstart;
+		return next pstart;
+		pstart := pstart + 1;
+	end loop;
+end;
+$$ volatile language plpgsql;
+-- ensure join to "d" is not removed (DISTINCT)
+explain (costs off)
+select distinct a.* from a left join d on a.id = d.a, lateral vseries(a.id, a.id);
+              QUERY PLAN               
+---------------------------------------
+ HashAggregate
+   Group Key: a.id, a.b_id
+   ->  Nested Loop
+         ->  Hash Right Join
+               Hash Cond: (d.a = a.id)
+               ->  Seq Scan on d
+               ->  Hash
+                     ->  Seq Scan on a
+         ->  Function Scan on vseries
+(9 rows)
+
+-- ensure join to "d" is not removed (GROUP BY)
+explain (costs off)
+select a.id from a left join d on a.id = d.a, lateral vseries(a.id, a.id) group by a.id;
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Group
+   Group Key: a.id
+   ->  Nested Loop
+         ->  Merge Left Join
+               Merge Cond: (a.id = d.a)
+               ->  Index Only Scan using a_pkey on a
+               ->  Sort
+                     Sort Key: d.a
+                     ->  Seq Scan on d
+         ->  Function Scan on vseries
+(10 rows)
+
+-- ensure that the join to "d" is removed when the function is not volatile (DISTINCT)
+explain (costs off)
+select distinct a.* from a left join d on a.id = d.a, lateral generate_series(a.id, a.id);
+                  QUERY PLAN                  
+----------------------------------------------
+ HashAggregate
+   Group Key: a.id, a.b_id
+   ->  Nested Loop
+         ->  Seq Scan on a
+         ->  Function Scan on generate_series
+(5 rows)
+
+-- ensure that the join to "d" is removed when the function is not volatile (GROUP BY)
+explain (costs off)
+select a.id from a left join d on a.id = d.a, lateral generate_series(a.id, a.id) group by a.id;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Group
+   Group Key: a.id
+   ->  Nested Loop
+         ->  Index Only Scan using a_pkey on a
+         ->  Function Scan on generate_series
+(5 rows)
+
+create or replace function notice(pn int) returns int as $$
+begin
+	raise notice '%', pn;
+	return pn;
+end;
+$$ volatile language plpgsql;
+-- ensure join to "d" can be removed despite the volatile function in the target list.
+explain (costs off)
+select distinct on (a.id) notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+                  QUERY PLAN                   
+-----------------------------------------------
+ Result
+   ->  Unique
+         ->  Index Only Scan using a_pkey on a
+(3 rows)
+
+-- ensure the volatile function is just executed once per a.id
+select distinct on (a.id) notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+NOTICE:  0
+NOTICE:  1
+ notice 
+--------
+      0
+      1
+(2 rows)
+
+-- and it should be evaulated multiple times when there's no distinct on.
+select notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+NOTICE:  0
+NOTICE:  0
+NOTICE:  0
+NOTICE:  1
+ notice 
+--------
+      0
+      0
+      0
+      1
+(4 rows)
+
 rollback;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index d3ba2a1c33..b0a9b9b2d6 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1412,6 +1412,84 @@ explain (costs off)
 select 1 from (select a.id FROM a left join b on a.b_id = b.id) q,
 			  lateral generate_series(1, q.id) gs(i) where q.id = gs.i;
 
+-- check join removal works without a unique index on the left joined table
+-- when a DISTINCT or GROUP BY is present which would remove row duplication
+explain (costs off)
+select distinct a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a;
+
+-- as above but with DISTINCT ON
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+
+-- as above but with GROUP BY
+explain (costs off)
+select a.id,b.id from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+
+-- also ensure the removal works with other relation types.
+explain (costs off)
+select distinct on (a.id,b.id) a.*,b.* from a
+left join b on a.b_id = b.id
+left join (values(1),(1)) d(a)
+	on a.b_id = d.a order by a.id,b.id,a.b_id;
+
+-- ensure no join removal when we must aggregate any duplicated rows
+explain (costs off)
+select a.id,b.id,count(*) from a left join b on a.b_id = b.id left join d
+	on a.b_id = d.a group by a.id,b.id;
+
+-- removal of left joins using distinct or group by instead of a unique index
+-- matching the join condition of the left joined table must be disallowed if
+-- the query contains a lateral joined volatile function.  A reduction in the
+-- number of rows produced by the join could affect the number of evaluations
+-- of the volatile function.
+
+create or replace function vseries(pstart int, pend int) returns setof int as $$
+begin
+	while pstart <= pend
+	loop
+		raise notice '%', pstart;
+		return next pstart;
+		pstart := pstart + 1;
+	end loop;
+end;
+$$ volatile language plpgsql;
+
+-- ensure join to "d" is not removed (DISTINCT)
+explain (costs off)
+select distinct a.* from a left join d on a.id = d.a, lateral vseries(a.id, a.id);
+
+-- ensure join to "d" is not removed (GROUP BY)
+explain (costs off)
+select a.id from a left join d on a.id = d.a, lateral vseries(a.id, a.id) group by a.id;
+
+-- ensure that the join to "d" is removed when the function is not volatile (DISTINCT)
+explain (costs off)
+select distinct a.* from a left join d on a.id = d.a, lateral generate_series(a.id, a.id);
+
+-- ensure that the join to "d" is removed when the function is not volatile (GROUP BY)
+explain (costs off)
+select a.id from a left join d on a.id = d.a, lateral generate_series(a.id, a.id) group by a.id;
+
+create or replace function notice(pn int) returns int as $$
+begin
+	raise notice '%', pn;
+	return pn;
+end;
+$$ volatile language plpgsql;
+
+-- ensure join to "d" can be removed despite the volatile function in the target list.
+explain (costs off)
+select distinct on (a.id) notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+
+-- ensure the volatile function is just executed once per a.id
+select distinct on (a.id) notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+
+-- and it should be evaulated multiple times when there's no distinct on.
+select notice(a.id) from a left join d on a.id = d.a - d.a order by a.id;
+
 rollback;
 
 create temp table parent (k int primary key, pd int);
-- 
2.16.2.windows.1

#21

Antonin Houska

ah@cybertec.at

over 7 years ago

In reply to: David Rowley (#20)

Re: [HACKERS] Removing LEFT JOINs in more cases

David Rowley <david.rowley@2ndquadrant.com> wrote:

I've attached an updated version of this patch.

Following are the findings from my review.

On the LATERAL references:

This query (proposed upthread by Tom and adjusted by me so it can be executed
on the your test tables)

select distinct t1.id from t1 left join t2 on t1.id=t2.b, lateral nextval('foo');

is subject to join removal because in rel_distinctified_after_join() you check
brel->lateral_vars, which is NIL in this case.

On the other hand I'm not sure that user should expect any number of
executions of the function in this particular case because the actual join
order can change: it can happen that the function scan is first joined to t1
and t2 is joined to the result. The case that requires more caution is that
the function references both t1 and t2, i.e. *all* tables of the left join
from which you want to remove the inner relation. Maybe that's the reason you
introduced the test for brel->lateral_vars, but you forgot to check the actual
variables in the list.

On the volatile function in the targetlist:

Volatile function in the DISTINCT ON expression isn't necessarily noticed by
your checks:

select distinct on (nextval('foo')) t1.id from t1 left join t2 on t1.id=t2.b;

Perhaps you should simply check the whole targetlist if there's no GROUP BY
clause.

On coding:

* join_is_removable(): Just like rel_distinctified_after_join() is called
before rel_is_distinct_for() - obviously because it's cheaper - I suggest
to call query_distinctifies_rows() before rel_supports_distinctness().

* header comment of query_distinctifies_rows() mentions rel_tuples_needed,
but I can't find such a function.

* rel_distinctified_after_join() does not really use the "rel"
argument. Besides removing it you probably should rename the function.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

#22

David Rowley

david.rowley@2ndquadrant.com

over 7 years ago

In reply to: Antonin Houska (#21)

Re: [HACKERS] Removing LEFT JOINs in more cases

On 18 September 2018 at 20:02, Antonin Houska <ah@cybertec.at> wrote:

Following are the findings from my review.

Thanks for reviewing this. Time is short in this commitfest to make
any changes, so I'm going to return this with feedback with the
intention of addressing the items from your review for the next 'fest.

Cheers

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services