Allowing join removals for more join types

Started by David Rowley over 11 years ago, 56 messages

dgrowleyml@gmail.com

over 11 years ago

I'm currently in the early stages of looking into expanding join removals.

Currently left outer joins can be removed if none of the columns of the
table are required for anything and the table being joined is a base table
that contains a unique index on all columns in the join clause.
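
As a concrete illustration of the existing behavior, here is a minimal sketch. The table names customers and orders are hypothetical, chosen only for this example; the primary key on customers.id supplies the unique index that the join clause needs.

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE customers (id integer PRIMARY KEY, name text);
CREATE TABLE orders (orderid integer PRIMARY KEY, customerid integer);

-- No column of customers is referenced outside the join clause, and
-- customers.id is covered by a unique index, so each orders row can match
-- at most one customers row. The planner is allowed to drop the join.
EXPLAIN (COSTS OFF)
SELECT o.orderid
FROM orders o
LEFT JOIN customers c ON o.customerid = c.id;

-- Referencing a customers column makes the join necessary again.
EXPLAIN (COSTS OFF)
SELECT o.orderid, c.name
FROM orders o
LEFT JOIN customers c ON o.customerid = c.id;
```

The first plan should contain only a scan of orders, while the second should still contain a join.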

The case I would like to work on is to allow sub queries where the query is
grouped by or distinct on all of the join columns.

Take the following as an example:

CREATE TABLE products (productid integer NOT NULL, code character
varying(32) NOT NULL);
CREATE TABLE sales (saleid integer NOT NULL, productid integer NOT NULL,
qty integer NOT NULL);

CREATE VIEW product_sales AS
SELECT p.productid,
p.code,
s.qty
FROM (products p
LEFT JOIN ( SELECT sales.productid,
sum(sales.qty) AS qty
FROM sales
GROUP BY sales.productid) s ON ((p.productid = s.productid)));

If a user does:
SELECT productid,code FROM product_sales;
Then, if I'm correct, the join on sales can be removed.
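
The reasoning can be checked by hand: because the subquery is grouped by productid, it yields at most one row per product, so the left join can neither add nor remove rows of products. A sketch of the equivalence, using only the tables and view defined above:

```sql
-- Query against the view, touching no column that comes from sales.
SELECT productid, code FROM product_sales;

-- What the proposed optimization would reduce it to. Every products row
-- appears exactly once in both results: the LEFT JOIN keeps unmatched rows,
-- and GROUP BY sales.productid rules out duplicate matches.
SELECT productid, code FROM products;

-- Comparing plans shows whether the join was actually removed.
EXPLAIN (COSTS OFF) SELECT productid, code FROM product_sales;
```

Without the proposed change, the EXPLAIN output is expected to still show the join and the aggregate over sales.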

As I said above, I'm in the early stages of looking at this and I'm
currently a bit confused. Basically I've put a breakpoint at the top of
the join_is_removable function and I can see that innerrel->rtekind
is RTE_SUBQUERY for my query, so the function is going to return false. So
what I need to do is get access to innerrel->subroot->parse so that I can
look at groupClause and distinctClause. The thing is innerrel->subroot is
NULL in this case, but I see a comment for subroot saying "subroot -
PlannerInfo for subquery (NULL if it's not a subquery)" so I guess this
does not also mean "subroot - PlannerInfo for subquery (NOT NULL if it's a
subquery)"?

Has anyone got any pointers to where I might be able to get the Query
details for the subquery? These structures are quite new to me.

Regards

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: David Rowley (#1)

Re: Allowing join removals for more join types

On Sat, May 17, 2014 at 8:57 PM, David Rowley <dgrowleyml@gmail.com> wrote:

I'm currently in the early stages of looking into expanding join removals.

As I said above, I'm in the early stages of looking at this and I'm
currently a bit confused. Basically I've put a breakpoint at the top of
the join_is_removable function and I can see that innerrel->rtekind
is RTE_SUBQUERY for my query, so the function is going to return false. So
what I need to do is get access to innerrel->subroot->parse so that I can
look at groupClause and distinctClause. The thing is innerrel->subroot is
NULL in this case, but I see a comment for subroot saying "subroot -
PlannerInfo for subquery (NULL if it's not a subquery)" so I guess this
does not also mean "subroot - PlannerInfo for subquery (NOT NULL if it's a
subquery)"?

Has anyone got any pointers to where I might be able to get the Query
details for the subquery? These structures are quite new to me.

I think I've managed to answer my own question here. But please someone
correct me if this sounds wrong.
It looks like the existing join removals are done quite early in the
planning and redundant joins are removed before any subqueries from that
query are planned. So this innerrel->subroot->parse has not been done yet.
It seems to be done later in query_planner() when make_one_rel() is called.

The best I can come up with on how to implement this is to have 2 stages of
join removals. Stage 1 would be the existing stage that attempts to remove
joins from non subqueries. Stage 2 would happen just after make_one_rel()
is called from query_planner(), this would be to attempt to remove any
subqueries that are not needed, and if it managed to remove any it would
force a 2nd call to make_one_rel().

Does this sound reasonable or does it sound like complete nonsense?


Regards

David Rowley

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#2)

Re: Allowing join removals for more join types

David Rowley <dgrowleyml@gmail.com> writes:

It looks like the existing join removals are done quite early in the
planning and redundant joins are removed before any subqueries from that
query are planned. So this innerrel->subroot->parse has not been done yet.
It seems to be done later in query_planner() when make_one_rel() is called.

It's true that we don't plan the subquery till later, but I don't see why
that's important here. Everything you want to know is available from the
subquery parsetree; so just look at the RTE, don't worry about how much
of the RelOptInfo has been filled in.

The best I can come up with on how to implement this is to have 2 stages of
join removals. Stage 1 would be the existing stage that attempts to remove
joins from non subqueries. Stage 2 would happen just after make_one_rel()
is called from query_planner(), this would be to attempt to remove any
subqueries that are not needed, and if it managed to remove any it would
force a 2nd call to make_one_rel().

That sounds like a seriously bad idea. For one thing, it blows the
opportunity to not plan the subquery in the first place. For another,
most of these steps happen in a carefully chosen order because there
are interdependencies. You can't just go back and re-run some earlier
processing step. A large fraction of the complexity of analyzejoins.c
right now arises from the fact that it has to undo some earlier
processing; that would get enormously worse if you delayed it further.

BTW, just taking one step back ... this seems like a pretty specialized
requirement. Are you sure it wouldn't be easier to fix your app to
not generate such silly queries?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Tom Lane (#3)

Re: Allowing join removals for more join types

On Sun, May 18, 2014 at 2:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

It looks like the existing join removals are done quite early in the
planning and redundant joins are removed before any subqueries from that
query are planned. So this innerrel->subroot->parse has not been done yet.
It seems to be done later in query_planner() when make_one_rel() is called.

It's true that we don't plan the subquery till later, but I don't see why
that's important here. Everything you want to know is available from the
subquery parsetree; so just look at the RTE, don't worry about how much
of the RelOptInfo has been filled in.

Thanks. I think I've found what you're talking about in PlannerInfo
simple_rte_array.
That's got the ball rolling.

The best I can come up with on how to implement this is to have 2 stages of
join removals. Stage 1 would be the existing stage that attempts to remove
joins from non subqueries. Stage 2 would happen just after make_one_rel()
is called from query_planner(), this would be to attempt to remove any
subqueries that are not needed, and if it managed to remove any it would
force a 2nd call to make_one_rel().

That sounds like a seriously bad idea. For one thing, it blows the
opportunity to not plan the subquery in the first place. For another,
most of these steps happen in a carefully chosen order because there
are interdependencies. You can't just go back and re-run some earlier
processing step. A large fraction of the complexity of analyzejoins.c
right now arises from the fact that it has to undo some earlier
processing; that would get enormously worse if you delayed it further.

Agreed, but at the time I didn't know that the subquery information was
available elsewhere.

BTW, just taking one step back ... this seems like a pretty specialized
requirement. Are you sure it wouldn't be easier to fix your app to
not generate such silly queries?

Well, couldn't you say the same about any join removal? Of course the
query could be rewritten to not reference that relation, but there are
cases where manually removing the redundant join would be sillier still. For
example, a fairly complex view may exist, and many use cases for the view
don't require all of its columns. It might be a bit of a pain to maintain a
whole series of views, one for each required subset of columns, instead of
maintaining a single view and allowing callers to use what they need from it.

I came across the need for this at work this week, where we have a grid in a
UI in which users can select the columns they need to see. The data in each
grid is supplied by a single view that contains all the possible columns
they might need. If the user is only using a narrow subset of those columns,
it seems quite wasteful to do more than is required.
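
The "one wide view, narrow selections" pattern can be sketched as follows. The names items, suppliers and item_grid are hypothetical and stand in for the real application schema; the sketch only uses the join removal that already exists for base tables.

```sql
-- Hypothetical schema: one wide view feeding a UI grid.
CREATE TABLE suppliers (supplierid integer PRIMARY KEY, supplier_name text);
CREATE TABLE items (itemid integer PRIMARY KEY, title text,
                    supplierid integer);

CREATE VIEW item_grid AS
SELECT i.itemid, i.title, s.supplier_name
FROM items i
LEFT JOIN suppliers s ON i.supplierid = s.supplierid;

-- Grid showing every column: the join is needed.
EXPLAIN (COSTS OFF) SELECT itemid, title, supplier_name FROM item_grid;

-- Grid showing a narrow subset: suppliers is joined on its primary key and
-- none of its columns are requested, so the existing join removal applies.
EXPLAIN (COSTS OFF) SELECT itemid, title FROM item_grid;
```

The proposal in this thread is about getting the same effect when the joined side of such a view is a grouped or distinct subquery rather than a base table.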

Regards

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: David Rowley (#1)

1 attachment(s)

Re: Allowing join removals for more join types

On Sat, May 17, 2014 at 8:57 PM, David Rowley <dgrowleyml@gmail.com> wrote:

I'm currently in the early stages of looking into expanding join removals.

Currently left outer joins can be removed if none of the columns of the
table are required for anything and the table being joined is a base table
that contains a unique index on all columns in the join clause.

The case I would like to work on is to allow sub queries where the query
is grouped by or distinct on all of the join columns.

Take the following as an example:

CREATE TABLE products (productid integer NOT NULL, code character
varying(32) NOT NULL);
CREATE TABLE sales (saleid integer NOT NULL, productid integer NOT NULL,
qty integer NOT NULL);

CREATE VIEW product_sales AS
SELECT p.productid,
p.code,
s.qty
FROM (products p
LEFT JOIN ( SELECT sales.productid,
sum(sales.qty) AS qty
FROM sales
GROUP BY sales.productid) s ON ((p.productid = s.productid)));

If a user does:
SELECT productid,code FROM product_sales;
Then, if I'm correct, the join on sales can be removed.

Attached is a patch which implements this. It's still a bit rough around
the edges and some names could likely do with being improved, but it at
least seems to work with all of the test cases that I've thrown at it so
far.

Comments are welcome, but the main purpose of the email is so I can
register the patch for the June commitfest.

Regards

David Rowley

Attachments:

subquery_leftjoin_removal_v0.5.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 129fc3d..e65c21b 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -27,9 +27,13 @@
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
 #include "optimizer/var.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool	sortclause_is_unique_on_restrictinfo(Query *query,
+												 List *clause_list, List *sortclause);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
@@ -154,11 +158,13 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 
 	/*
 	 * Currently, we only know how to remove left joins to a baserel with
-	 * unique indexes.  We can check most of these criteria pretty trivially
-	 * to avoid doing useless extra work.  But checking whether any of the
-	 * indexes are unique would require iterating over the indexlist, so for
-	 * now we just make sure there are indexes of some sort or other.  If none
-	 * of them are unique, join removal will still fail, just slightly later.
+	 * unique indexes and left joins to a subquery where the subquery is
+	 * unique on the join condition. We can check most of these criteria
+	 * pretty trivially to avoid doing useless extra work.  But checking
+	 * whether any of the indexes are unique would require iterating over
+	 * the indexlist, so for now, if we're joining to a relation, we'll just
+	 * ensure that we have at least 1 index, it won't matter if that index
+	 * is unique at this stage, we'll check those details later.
 	 */
 	if (sjinfo->jointype != JOIN_LEFT ||
 		sjinfo->delay_upper_joins ||
@@ -168,11 +174,17 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	innerrelid = bms_singleton_member(sjinfo->min_righthand);
 	innerrel = find_base_rel(root, innerrelid);
 
-	if (innerrel->reloptkind != RELOPT_BASEREL ||
-		innerrel->rtekind != RTE_RELATION ||
-		innerrel->indexlist == NIL)
+	if (innerrel->reloptkind != RELOPT_BASEREL)
 		return false;
 
+	if (innerrel->rtekind == RTE_RELATION)
+	{
+		if (innerrel->indexlist == NIL)
+			return false; /* no possibility of a unique index */
+	}
+	else if (innerrel->rtekind != RTE_SUBQUERY)
+		return false; /* unsupported rtekind */
+
 	/* Compute the relid set for the join we are considering */
 	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
 
@@ -276,16 +288,128 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 */
 
 	/* Now examine the indexes to see if we have a matching unique index */
-	if (relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
+	if (innerrel->rtekind == RTE_RELATION &&
+		relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
 		return true;
 
 	/*
+	 * We can be certain that the sub query contains no duplicate values for the join
+	 * clause if item in the sub query's GROUP BY clause is also used in the join clause
+	 * using equality. This works the same way for the DISTINCT clause. We need not pay
+	 * any attention to WHERE or HAVING clauses as these just restrict the results more
+	 * and could not be the cause of duplication in the result set.
+	 */
+	if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		Query *query = root->simple_rte_array[innerrelid]->subquery;
+
+		if (sortclause_is_unique_on_restrictinfo(query, clause_list, query->groupClause) ||
+			sortclause_is_unique_on_restrictinfo(query, clause_list, query->distinctClause))
+			return true;
+	}
+	/* XXX is this comment still needed??
 	 * Some day it would be nice to check for other methods of establishing
 	 * distinctness.
 	 */
 	return false;
 }
 
+/*
+ * sortclause_is_unique_on_restrictinfo
+ *
+ * Checks to see if all items in sortclause also exist in clause_list.
+ * The function will return true if clause_list is the same as or a superset
+ * of the sortclause. If the sortclause has columns that don't exist in the
+ * clause_list then the query can't be guaranteed unique on the clause_list
+ * columns. Also if the targetList expression contains any volatile functions
+ * then we return false as:
+ * SELECT DISTINCT a+random() FROM (VALUES(1),(1)) a(a);
+ * will most likely return more than 1 row.
+ */
+static bool
+sortclause_is_unique_on_restrictinfo(Query *query, List *clause_list, List *sortclause)
+{
+	ListCell *l;
+
+	/*
+	 * if this sortclause is empty then the query can't be unique
+	 * on the clause list.
+	 */
+	if (sortclause == NIL)
+		return false;
+
+	/*
+	 * Loop over each sort clause to ensure that we have
+	 * an item in the join conditions that matches it.
+	 * It does not matter if we have more items in the join
+	 * condition than we have in the sort clause
+	 */
+	foreach(l, sortclause)
+	{
+		ListCell		*ri;
+		SortGroupClause *scl = (SortGroupClause *) lfirst(l);
+		TargetEntry		*sortTarget = NULL;
+		bool			 matched = false;
+		ListCell   *l1;
+
+		/* search the targetlist for the TargetEntry for this sort clause */
+		/* XXX Surely there is a function to do this for us?? */
+		foreach(l1, query->targetList)
+		{
+			TargetEntry *tle = (TargetEntry *) lfirst(l1);
+
+			if (tle->ressortgroupref == scl->tleSortGroupRef)
+			{
+				sortTarget = tle;
+				break;
+			}
+		}
+
+		if (sortTarget == NULL)
+			elog(ERROR, "Unable to find sort target in targetlist");
+
+		/*
+		 * Since a constant only has 1 value the existence of one here will
+		 * not cause any duplication of the results. We'll simply ignore it!
+		 */
+		if (IsA(sortTarget->expr, Const))
+			continue;
+
+		foreach(ri, clause_list)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(ri);
+			Node	   *rexpr;
+
+			if (rinfo->outer_is_left)
+				rexpr = get_rightop(rinfo->clause);
+			else
+				rexpr = get_leftop(rinfo->clause);
+
+			if (IsA(rexpr, Var))
+			{
+				Var *var = (Var *)rexpr;
+				TargetEntry *tle = get_tle_by_resno(query->targetList, var->varattno);
+
+				/* Can't remove join if the expression contains a volatile function */
+				if (contain_volatile_functions((Node *) tle->expr))
+					return false;
+
+				if (tle->resorigtbl == sortTarget->resorigtbl &&
+					tle->resno == sortTarget->resno)
+				{
+					matched = true;
+					break; /* match found */
+				}
+			}
+			else /* XXX what else could it be? */
+				return false;
+		}
+
+		if (!matched)
+			return false;
+	}
+	return true;
+}
 
 /*
  * Remove the target relid from the planner's data structures, having
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 934488a..1555aed 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3098,6 +3098,123 @@ select id from a where id in (
          ->  Seq Scan on b
 (5 rows)
 
+-- check optimization of outer join when joining a unique sub query using group by
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+                                        QUERY PLAN                                        
+------------------------------------------------------------------------------------------
+ Hash Left Join
+   Hash Cond: ((a.id)::double precision = ((((d.a + d.b))::double precision + random())))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: (((d.a + d.b))::double precision + random())
+               ->  Seq Scan on d
+(7 rows)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+-- and a constant expression.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- and contains a redundant join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- optimization is not possible when the group by contains a column which is not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when distinct clause contains an item that is not in the join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id;
+            QUERY PLAN             
+-----------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, random()
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+                   QUERY PLAN                    
+-------------------------------------------------
+ Hash Left Join
+   Hash Cond: (a.id = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Subquery Scan on b
+               Filter: (b.r = random())
+               ->  HashAggregate
+                     Group Key: b_1.id, random()
+                     ->  Seq Scan on b b_1
+(9 rows)
+
 rollback;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 275cb11..153a0bc 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -861,9 +861,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
@@ -878,6 +880,53 @@ select id from a where id in (
 	select b.id from b left join c on b.id = c.id
 );
 
+-- check optimization of outer join when joining a unique sub query using group by
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+-- and a constant expression.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- and contains a redundant join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+
+-- optimization is not possible when the group by contains a column which is not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+
+-- optimization is not possible when distinct clause contains an item that is not in the join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id;
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+
 rollback;
 
 create temp table parent (k int primary key, pd int);

Dilip kumar

dilip.kumar@huawei.com

over 11 years ago

In reply to: David Rowley (#5)

Re: Allowing join removals for more join types

On 18 May 2014 16:38 David Rowley Wrote

Sounds like a good idea to me.

I have one doubt regarding the implementation. Consider the query below:

Create table t1 (a int, b int);
Create table t2 (a int, b int);

Create unique index on t2(b);

select x.a from t1 x left join (select distinct t2.a a1, t2.b b1 from t2) as y on x.a=y.b1;
(because of the distinct clause, the subquery will not be pulled up)

In this case the DISTINCT clause includes t2.a, but only t2.b is used for the left join (t2.b has a unique index, so this left join can be removed).

So I think that, now that you are considering join removal for subqueries, this could also cover other cases, such as a unique index inside the subquery, because in the attached patch a unique index is considered only if the rel is an RTE_RELATION:

+          if (innerrel->rtekind == RTE_RELATION &&
+                      relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
                       return true;

Correct me if I am missing something..


David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Dilip kumar (#6)

Re: Allowing join removals for more join types

On Mon, May 19, 2014 at 5:47 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:

On 18 May 2014 16:38 David Rowley Wrote

Sounds like a good idea to me.

I have one doubt regarding the implementation. Consider the query below:

Create table t1 (a int, b int);
Create table t2 (a int, b int);

Create unique index on t2(b);

select x.a from t1 x left join (select distinct t2.a a1, t2.b b1 from t2)
as y on x.a=y.b1;
(because of the distinct clause, the subquery will not be pulled up)

In this case the DISTINCT clause includes t2.a, but only t2.b is used for
the left join (t2.b has a unique index, so this left join can be removed).

So I think that, now that you are considering join removal for subqueries,
this could also cover other cases, such as a unique index inside the
subquery, because in the attached patch a unique index is considered only
if the rel is an RTE_RELATION:

+ if (innerrel->rtekind == RTE_RELATION &&
+     relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
      return true;

Correct me if I am missing something.

I think you are right here, it would be correct to remove that join, but I
also think that the query in question could quite easily be written as:

select t1.a from t1 left join t2 on t1.a=t2.b;

Where the join WILL be removed. The distinct clause here technically is a
no-op due to all the columns of a unique index being present in the clause.
Can you think of a use case for this where the sub query couldn't have been
written out as a direct join to the relation?

What would be the reason to make it a sub query with the distinct? Or have
I gotten something wrong here?

I'm also thinking here that if we made the join removal code remove these
joins, then the join removal code would end up smarter than the rest of the
code as the current code seems not to remove the distinct clause for single
table queries where a subset of the columns of a distinct clause match all
the columns of a unique index.

create table pktest (id int primary key);
explain (costs off) select distinct id from pktest;
        QUERY PLAN
--------------------------
 HashAggregate
   Group Key: id
   ->  Seq Scan on pktest

This could have been rewritten to become: select id from pktest

I feel that if we added that sort of optimisation to the join removal code,
we had better put it into other parts of the code too. Perhaps something
that could do this should be in the rewriter; then, once the join removal
code gets to it, the join could be removed.

Can you think of a similar example where the subquery could not have been
written as a direct join to the relation?

Regards

David Rowley

Dilip kumar

dilip.kumar@huawei.com

over 11 years ago

In reply to: David Rowley (#7)

Re: Allowing join removals for more join types

On 19 May 2014 12:15 David Rowley Wrote,

I think you are right here, it would be correct to remove that join, but I also think that the query in question could quite easily be written as:

select t1.a from t1 left join t2 on t1.a=t2.b;

Where the join WILL be removed. The distinct clause here technically is a no-op due to all the columns of a unique index being present in the clause. Can you think of a use case for this where the sub query couldn't have been written out as a direct join to the relation?

What would be the reason to make it a sub query with the distinct? or have I gotten something wrong here?

I'm also thinking here that if we made the join removal code remove these joins, then the join removal code would end up smarter than the rest of the code as the current code seems not to remove the distinct clause for single table queries where a subset of the columns of a distinct clause match all the columns of a unique index.

Can you think of a similar example where the subquery could not have been written as a direct join to the relation?

I think you are right that the query given above can be written as a very simple join.

But my point is this: in any case where the optimizer cannot pull up the subquery (because it has an aggregate, GROUP BY, ORDER BY, LIMIT, DISTINCT, etc.), the patch checks whether the join is removable only when a DISTINCT or GROUP BY clause is present. If the subquery is instead unique because of a unique index, that is not checked. Is there no scenario where this would be useful?

Maybe we can convert my example above as shown below. In this case we have a unique index on column a and we limit the result to the first 100 tuples (records are already ordered because of the index).

Create table t1 (a int, b int);
Create table t2 (a int, b int);
Create unique index on t2(a);

create view v1 as
select x.a, y.b
from t1 x left join (select t2.a a1, b from t2 limit 100) as y on x.a=y.a1;

select a from v1; --> For this query I think the left join can be removed, but because the view also projects the non-join column (b), the view definition itself cannot be simplified.

Your patch already checks for DISTINCT and GROUP BY clauses inside the subquery; can't we check for a unique index as well?

Regards,
Dilip

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Dilip kumar (#8)

Re: Allowing join removals for more join types

On Mon, May 19, 2014 at 9:22 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:

On 19 May 2014 12:15 David Rowley Wrote,

Maybe we can convert my example above as shown below --> in this case we
have a unique index on field a and we limit the result to the first 100
tuples (the records are already ordered because of the index)

Create table t1 (a int, b int);
Create table t2 (a int, b int);
Create unique index on t2(a);

create view v1 as
select x.a, y.b
from t1 x left join (select t2.a a1, b from t2 limit 100) as y on x.a=y.a1;

select a from v1; --> for this query I think the left join can be removed. But
inside the view the non-join field (b) is also projected, so it cannot be
simplified there.

Ok I see what you mean.
I guess then that if we did that then we should also support removals of
join in subqueries of subqueries. e.g:

select t1.* from t1 left join (select t2.uniquecol from (select
t2.uniquecol from t2 limit 1000) t2 limit 100) t2 on t1.id = t2.uniquecol

On my first round of thoughts on this I thought that we could keep looking
into the sub queries until we find that the sub query only queries a single
table or it is not a base relation. If we find one with a single table and
the sub query has no distinct or group bys then I thought we could just
look at the unique indexes similar to how it's done now for a direct table
join. But after giving this more thought, I'm not quite sure if a lack of
DISTINCT and GROUP BY clause is enough for us to permit removing the join.
Would it matter if the sub query did a FOR UPDATE?
I started looking at is_simple_subquery() in prepjointree.c but if all
those conditions were met then the subquery would have been pulled up to a
direct join anyway.

I'm also now wondering if I need to do some extra tests in the existing
code to ensure that the subquery would have had no side effects.

For example:

SELECT t1.* FROM t1
LEFT OUTER JOIN (SELECT id,some_function_that_does_something(id) FROM t2
GROUP BY id) t2 ON t1.id = t2.id;

Regards

David Rowley

#10

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#9)

Re: Allowing join removals for more join types

David Rowley <dgrowleyml@gmail.com> writes:

I'm also now wondering if I need to do some extra tests in the existing
code to ensure that the subquery would have had no side effects.

You should probably at least refuse the optimization if the subquery's
tlist contains volatile functions.

Functions that return sets might be problematic too [ experiments... ]
Yeah, they are. This behavior is actually a bit odd:

regression=# select q1 from int8_tbl;
q1
------------------
123
123
4567890123456789
4567890123456789
4567890123456789
(5 rows)

regression=# select q1 from int8_tbl group by 1;
q1
------------------
4567890123456789
123
(2 rows)

regression=# select q1,unnest(array[1,2]) as u from int8_tbl;
q1 | u
------------------+---
123 | 1
123 | 2
123 | 1
123 | 2
4567890123456789 | 1
4567890123456789 | 2
4567890123456789 | 1
4567890123456789 | 2
4567890123456789 | 1
4567890123456789 | 2
(10 rows)

regression=# select q1,unnest(array[1,2]) as u from int8_tbl group by 1;
q1 | u
------------------+---
4567890123456789 | 1
4567890123456789 | 2
123 | 1
123 | 2
(4 rows)

EXPLAIN shows that the reason the last case behaves like that is that
the SRF is expanded *after* the grouping step. I'm not entirely sure if
that's a bug --- given the lack of complaints, perhaps not. But it shows
you can't apply this optimization without changing the existing behavior.
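
The effect can be reproduced outside the database with a toy model. The sketch below is not PostgreSQL code: it assumes, as the EXPLAIN output suggests, that grouping runs first and the set-returning function is expanded once per group afterwards. The names toy_group_then_expand and toy_is_unique are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model (not PostgreSQL code): group an array of keys, then expand a
 * set-returning function that yields srf_rows rows per group.  The key
 * column of the result is written to out.  Returns the number of rows
 * produced, or 0 if out is too small to hold them.
 */
size_t
toy_group_then_expand(const long long *keys, size_t nkeys, size_t srf_rows,
                      long long *out, size_t out_cap)
{
    size_t nout = 0;

    for (size_t i = 0; i < nkeys; i++)
    {
        bool seen = false;

        /* grouping step: skip keys that already formed a group */
        for (size_t j = 0; j < i; j++)
        {
            if (keys[j] == keys[i])
            {
                seen = true;
                break;
            }
        }
        if (seen)
            continue;

        /* SRF expansion step: happens after grouping, once per group */
        for (size_t r = 0; r < srf_rows; r++)
        {
            if (nout == out_cap)
                return 0;
            out[nout++] = keys[i];
        }
    }
    return nout;
}

/* True if no value occurs more than once in vals. */
bool
toy_is_unique(const long long *vals, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (vals[i] == vals[j])
                return false;
    return true;
}
```

Feeding it the five q1 values from int8_tbl with srf_rows = 2 gives four rows in which each key appears twice, so the grouped column is no longer unique and the join could not safely be dropped.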

I doubt you should drop a subquery containing FOR UPDATE, either.
That's a side effect, just as much as a volatile function would be.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Tom Lane (#10)

1 attachment(s)

Re: Allowing join removals for more join types

On Tue, May 20, 2014 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

I'm also now wondering if I need to do some extra tests in the existing
code to ensure that the subquery would have had no side effects.

You should probably at least refuse the optimization if the subquery's
tlist contains volatile functions.

Functions that return sets might be problematic too [ experiments... ]
Yeah, they are. This behavior is actually a bit odd:

...

regression=# select q1,unnest(array[1,2]) as u from int8_tbl group by 1;
q1 | u
------------------+---
4567890123456789 | 1
4567890123456789 | 2
123 | 1
123 | 2
(4 rows)

EXPLAIN shows that the reason the last case behaves like that is that
the SRF is expanded *after* the grouping step. I'm not entirely sure if
that's a bug --- given the lack of complaints, perhaps not. But it shows
you can't apply this optimization without changing the existing behavior.

I doubt you should drop a subquery containing FOR UPDATE, either.
That's a side effect, just as much as a volatile function would be.

regards, tom lane

Yeah, that is strange indeed.
I've updated the patch to add extra checks for volatile functions and
set-returning functions in the target list.
FOR UPDATE does not currently need an explicit check, as I'm only
supporting removal of subqueries that have either a GROUP BY or a
DISTINCT clause, neither of which allows FOR UPDATE anyway.
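
For readers who don't want to dig through the diff, the order of the checks can be sketched roughly as follows. This is a simplified stand-in, not the patch itself: ToySubquery and both functions are hypothetical names, columns are plain integers rather than planner nodes, and the patch's special handling of constants in the sort clause is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of what the check reads from a subquery. */
typedef struct ToySubquery
{
    bool        tlist_returns_set;   /* target list has a set-returning function */
    bool        tlist_has_volatile;  /* target list has a volatile function */
    const int  *group_cols;          /* output columns in GROUP BY or DISTINCT */
    size_t      ngroup_cols;         /* 0 means neither clause is present */
} ToySubquery;

/* True if every grouping column also appears among the join columns. */
bool
toy_grouping_covered_by_join(const int *group_cols, size_t ngroup,
                             const int *join_cols, size_t njoin)
{
    assert(ngroup > 0);

    for (size_t g = 0; g < ngroup; g++)
    {
        bool matched = false;

        for (size_t j = 0; j < njoin; j++)
        {
            if (group_cols[g] == join_cols[j])
            {
                matched = true;
                break;
            }
        }
        /* a grouping column missing from the join can cause duplicates */
        if (!matched)
            return false;
    }
    return true;
}

/* Pre-checks first, then the uniqueness proof. */
bool
toy_subquery_join_is_removable(const ToySubquery *sq,
                               const int *join_cols, size_t njoin)
{
    if (sq->tlist_returns_set)
        return false;           /* SRFs can break uniqueness */
    if (sq->tlist_has_volatile)
        return false;           /* dropping the join would drop side effects */
    if (sq->ngroup_cols == 0)
        return false;           /* nothing to prove uniqueness with */
    return toy_grouping_covered_by_join(sq->group_cols, sq->ngroup_cols,
                                        join_cols, njoin);
}
```

Extra join columns beyond the grouping columns are harmless here; only a grouping column that the join does not constrain defeats the proof.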

Regards

David Rowley

Attachments:

subquery_leftjoin_removal_v0.6.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 129fc3d..83cb70c 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -27,9 +27,15 @@
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
 #include "optimizer/var.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "optimizer/tlist.h"
+#include "nodes/nodeFuncs.h"
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool	sortclause_is_unique_on_restrictinfo(Query *query,
+												 List *clause_list, List *sortclause);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
@@ -154,11 +160,13 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 
 	/*
 	 * Currently, we only know how to remove left joins to a baserel with
-	 * unique indexes.  We can check most of these criteria pretty trivially
-	 * to avoid doing useless extra work.  But checking whether any of the
-	 * indexes are unique would require iterating over the indexlist, so for
-	 * now we just make sure there are indexes of some sort or other.  If none
-	 * of them are unique, join removal will still fail, just slightly later.
+	 * unique indexes and left joins to a subquery where the subquery is
+	 * unique on the join condition. We can check most of these criteria
+	 * pretty trivially to avoid doing useless extra work.  But checking
+	 * whether any of the indexes are unique would require iterating over
+	 * the indexlist, so for now, if we're joining to a relation, we'll just
+	 * ensure that we have at least 1 index, it won't matter if that index
+	 * is unique at this stage, we'll check those details later.
 	 */
 	if (sjinfo->jointype != JOIN_LEFT ||
 		sjinfo->delay_upper_joins ||
@@ -168,11 +176,17 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	innerrelid = bms_singleton_member(sjinfo->min_righthand);
 	innerrel = find_base_rel(root, innerrelid);
 
-	if (innerrel->reloptkind != RELOPT_BASEREL ||
-		innerrel->rtekind != RTE_RELATION ||
-		innerrel->indexlist == NIL)
+	if (innerrel->reloptkind != RELOPT_BASEREL)
 		return false;
 
+	if (innerrel->rtekind == RTE_RELATION)
+	{
+		if (innerrel->indexlist == NIL)
+			return false; /* no possibility of a unique index */
+	}
+	else if (innerrel->rtekind != RTE_SUBQUERY)
+		return false; /* unsupported rtekind */
+
 	/* Compute the relid set for the join we are considering */
 	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
 
@@ -276,16 +290,143 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 */
 
 	/* Now examine the indexes to see if we have a matching unique index */
-	if (relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
+	if (innerrel->rtekind == RTE_RELATION &&
+		relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
 		return true;
 
 	/*
+	 * We can be certain that the sub query contains no duplicate values for
+	 * the join clause if every item in the sub query's GROUP BY clause is also used
+	 * in the join clause using equality. This works the same way for the
+	 * DISTINCT clause. We need not pay any attention to WHERE or HAVING
+	 * clauses as these just restrict the results more and could not be the
+	 * cause of duplication in the result set. However there are a number of
+	 * pre-checks we must perform before we even look at the GROUP BY or DISTINCT
+	 * clauses. These are described below.
+	 */
+	if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		Query *subquery = root->simple_rte_array[innerrelid]->subquery;
+
+		/*
+		 * We cannot remove the subquery if the target list contains any set returning
+		 * functions as these may cause the query not to be unique on the grouping
+		 * columns, as per the following example:
+		 * select a.a,generate_series(1,10) from (values(1)) a(a) group by a
+		 */
+		if (expression_returns_set((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * Don't remove the join if the target list contains any volatile functions.
+		 * Doing so may remove desired side effects that calls to the function may
+		 * cause.
+		 */
+		if (contain_volatile_functions((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * It should be safe to remove the join if all the group by expressions have matching
+		 * items in the join condition.
+		 */
+		if (subquery->groupClause != NIL &&
+			sortclause_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
+			return true;
+
+		/*
+		 * It should be safe to remove the join if all the distinct column list have matching
+		 * items in the join condition.
+		 */
+		if (subquery->distinctClause != NIL &&
+			sortclause_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
+			return true;
+
+		/*
+		 * Note that we must also not remove the join if the subquery contains
+		 * a FOR UPDATE. We can actually skip this check as GROUP BY or DISTINCT
+		 * cannot be used at the same time as FOR UPDATE.
+		 */
+	}
+	/* XXX is this comment still needed??
 	 * Some day it would be nice to check for other methods of establishing
 	 * distinctness.
 	 */
 	return false;
 }
 
+/*
+ * sortclause_is_unique_on_restrictinfo
+ *
+ * Checks to see if all items in sortclause also exist in clause_list.
+ * The function will return true if clause_list is the same as or a superset
+ * of the sortclause. If the sortclause has columns that don't exist in the
+ * clause_list then the query can't be guaranteed unique on the clause_list
+ * columns. Also if the targetList expression contains any volatile functions
+ * then we return false as something like:
+ * SELECT DISTINCT a+random() FROM (VALUES(1),(1)) a(a);
+ * will almost always return more than 1 row.
+ *
+ * Note: The calling function must ensure that sortclause is not NIL.
+ */
+static bool
+sortclause_is_unique_on_restrictinfo(Query *query, List *clause_list, List *sortclause)
+{
+	ListCell *l;
+
+	Assert(sortclause != NIL);
+
+	/*
+	 * Loop over each sort clause to ensure that we have
+	 * an item in the join conditions that matches it.
+	 * It does not matter if we have more items in the join
+	 * condition than we have in the sort clause.
+	 */
+	foreach(l, sortclause)
+	{
+		ListCell		*ri;
+		SortGroupClause *scl = (SortGroupClause *) lfirst(l);
+		TargetEntry		*sortTarget;
+		bool			 matched = false;
+
+		/* look up the target list entry for the current sort group ref */
+		sortTarget = get_sortgroupref_tle(scl->tleSortGroupRef, query->targetList);
+
+		/*
+		 * Since a constant only has 1 value the existence of one here will
+		 * not cause any duplication of the results. We'll simply ignore it!
+		 */
+		if (IsA(sortTarget->expr, Const))
+			continue;
+
+		foreach(ri, clause_list)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(ri);
+			Node	   *rexpr;
+
+			if (rinfo->outer_is_left)
+				rexpr = get_rightop(rinfo->clause);
+			else
+				rexpr = get_leftop(rinfo->clause);
+
+			if (IsA(rexpr, Var))
+			{
+				Var *var = (Var *)rexpr;
+
+				if (var->varattno == sortTarget->resno)
+				{
+					matched = true;
+					break; /* match found */
+				}
+			}
+			else /* XXX what else could it be? */
+				return false;
+		}
+
+		if (!matched)
+			return false;
+	}
+	return true;
+}
 
 /*
  * Remove the target relid from the planner's data structures, having
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 934488a..ff13c76 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3060,9 +3060,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
   QUERY PLAN   
@@ -3098,6 +3100,151 @@ select id from a where id in (
          ->  Seq Scan on b
 (5 rows)
 
+-- check optimization of outer join when joining a unique sub query using group by
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+                                        QUERY PLAN                                        
+------------------------------------------------------------------------------------------
+ Hash Left Join
+   Hash Cond: ((a.id)::double precision = ((((d.a + d.b))::double precision + random())))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: (((d.a + d.b))::double precision + random())
+               ->  Seq Scan on d
+(7 rows)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+-- and a constant expression.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- and contains a redundant join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- optimization is not possible when the group by contains a column which is not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when distinct clause contains an item that is not in the join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+                   QUERY PLAN                    
+-------------------------------------------------
+ Hash Left Join
+   Hash Cond: (a.id = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Subquery Scan on b
+               Filter: (b.r = random())
+               ->  HashAggregate
+                     Group Key: b_1.id, random()
+                     ->  Seq Scan on b b_1
+(9 rows)
+
+-- optimization is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
 rollback;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 275cb11..d00e7fe 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -861,9 +861,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
@@ -878,6 +880,61 @@ select id from a where id in (
 	select b.id from b left join c on b.id = c.id
 );
 
+-- check optimization of outer join when joining a unique sub query using group by
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+-- and a constant expression.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- and contains a redundant join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+
+-- optimization is not possible when the group by contains a column which is not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+
+-- optimization is not possible when distinct clause contains an item that is not in the join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+
+-- optimization is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+
+-- optimization is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+
 rollback;
 
 create temp table parent (k int primary key, pd int);

#12

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Dilip kumar (#6)

Re: Allowing join removals for more join types

On Mon, May 19, 2014 at 5:47 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:

So I think now when you are considering this join removal for subqueries
then this can consider other case also like unique index inside subquery,

because in attached patch unique index is considered only if its
RTE_RELATION

+ if (innerrel->rtekind == RTE_RELATION &&

+ relation_has_unique_index_for(root, innerrel,
clause_list, NIL, NIL))

return true;

I've just had a bit of a look at implementing checks allowing subqueries
with unique indexes on the join cols being removed, but I'm hitting a bit
of a problem and I'm not quite sure if this is possible at this stage of
planning.

In the function join_is_removable() the variable innerrel is set to the
RelOptInfo of the relation which we're checking if we can remove. In the
case of removing subqueries the innerrel->rtekind will be RTE_SUBQUERY. I
started going over the pre-conditions that the sub query will need to meet
for this to be possible and the list so far looks something like:

1. Only a single base table referenced in the sub query.
2. No FOR UPDATE clause.
3. No GROUP BY or DISTINCT clause.
4. No set-returning functions.
5. No volatile functions.
6. Has a unique index that covers the join conditions, or a subset of them.
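
Written as a single predicate, the list above might look like the sketch below. Everything in it is hypothetical: ToySubqueryShape and its fields simply name the six conditions, and nothing here says how the planner would compute them.

```c
#include <stdbool.h>

/* Hypothetical summary of a subquery, one field per precondition. */
typedef struct ToySubqueryShape
{
    int  base_table_count;              /* 1: tables referenced in the subquery */
    bool has_for_update;                /* 2 */
    bool has_group_by_or_distinct;      /* 3 */
    bool has_set_returning_function;    /* 4 */
    bool has_volatile_function;         /* 5 */
    bool join_covers_unique_index;      /* 6: some unique index's columns all
                                         * appear in the join conditions */
} ToySubqueryShape;

/* True only when all six preconditions hold. */
bool
toy_unique_index_removal_candidate(const ToySubqueryShape *s)
{
    return s->base_table_count == 1 &&
           !s->has_for_update &&
           !s->has_group_by_or_distinct &&
           !s->has_set_returning_function &&
           !s->has_volatile_function &&
           s->join_covers_unique_index;
}
```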

I'm hitting a bit of a roadblock on point 1. Here's a snippet from my
latest attempt:

if (bms_membership(innerrel->relids) == BMS_SINGLETON)
{
int subqueryrelid = bms_singleton_member(innerrel->relids);
RelOptInfo *subqueryrel = find_base_rel(innerrel->subroot, subqueryrelid);
if (relation_has_unique_index_for(root, subqueryrel, clause_list, NIL,
NIL))
return true;
}

But it seems that innerrel->subroot is still NULL at this stage of planning.
From what I can tell, it does not exist anywhere else yet and is not
generated until make_one_rel() is called from query_planner().

Am I missing something major here, or does this sound about right?

Regards

David Rowley

#13

Dilip kumar

dilip.kumar@huawei.com

over 11 years ago

In reply to: David Rowley (#12)

Re: Allowing join removals for more join types

On 23 May 2014 12:43 David Rowley Wrote,

I'm hitting a bit of a roadblock on point 1. Here's a snippet from my latest attempt:

if (bms_membership(innerrel->relids) == BMS_SINGLETON)
{
int subqueryrelid = bms_singleton_member(innerrel->relids);
RelOptInfo *subqueryrel = find_base_rel(innerrel->subroot, subqueryrelid);

if (relation_has_unique_index_for(root, subqueryrel, clause_list, NIL, NIL))
return true;
}

But it seems that innerrel->subroot is still NULL at this stage of planning and from what I can tell does not exist anywhere else yet and is not generated until make_one_rel() is called from query_planner().

Am I missing something major here, or does this sound about right?

It’s true that at this point we haven’t yet prepared the base relation list for the subquery; that is done from make_one_rel() while generating the SUBQUERY path list.

I can think of one solution, but I think it will be messy…

We could get the base relation info directly from the subquery.
In your patch (see the snippet below) we currently get the DISTINCT and GROUP BY clauses from the sub Query; similarly, we could get the base relation info from Query->jointree.

if (innerrel->rtekind == RTE_SUBQUERY)
{
Query *query = root->simple_rte_array[innerrelid]->subquery;

if (sortclause_is_unique_on_restrictinfo(query, clause_list, query->groupClause) ||
sortclause_is_unique_on_restrictinfo(query, clause_list, query->distinctClause))
return true;
}

Regards,
Dilip

#14

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Dilip kumar (#13)

Re: Allowing join removals for more join types

On Fri, May 23, 2014 at 8:28 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:

On 23 May 2014 12:43 David Rowley Wrote,

I'm hitting a bit of a roadblock on point 1. Here's a snippet from my
latest attempt:

if (bms_membership(innerrel->relids) == BMS_SINGLETON)
{
int subqueryrelid = bms_singleton_member(innerrel->relids);
RelOptInfo *subqueryrel = find_base_rel(innerrel->subroot, subqueryrelid);

if (relation_has_unique_index_for(root, subqueryrel, clause_list, NIL, NIL))
return true;
}

But it seems that innerrel->subroot is still NULL at this stage of
planning and from what I can tell does not exist anywhere else yet and is
not generated until make_one_rel() is called from query_planner()

Am I missing something major here, or does this sound about right?

It’s true that, till this point of time we haven’t prepared the base
relation list for the subquery, and that will be done from make_one_rel
while generating the SUBQUERY path list.

I can think of one solution but I think it will be messy…

We get the base relation info directly from subquery

Like currently in your patch (shown in below snippet) we are getting the
distinct and groupby clause from sub Query, similarly we can get base
relation info from (Query->jointree)

if (innerrel->rtekind == RTE_SUBQUERY)
{
Query *query = root->simple_rte_array[innerrelid]->subquery;

if (sortclause_is_unique_on_restrictinfo(query, clause_list, query->groupClause) ||
sortclause_is_unique_on_restrictinfo(query, clause_list, query->distinctClause))
return true;
}

I'm getting the idea that this is just not the right place in planning to
do this for subqueries.
You seem to be right about the messy part too.

Here's a copy and paste of the kludge I've ended up with while testing this
out:

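if (list_length(subquery->jointree->fromlist) == 1)
{
RangeTblEntry *base_rte;
RelOptInfo *subqueryrelid;
RangeTblRef *rtr = (RangeTblRef *) linitial(subquery->jointree->fromlist);

/* the single FROM item must be a plain range table reference */
if (!IsA(rtr, RangeTblRef))
return false;

/* and it must refer to a real table, not a subquery, function, etc. */
base_rte = rt_fetch(rtr->rtindex, subquery->rtable);
if (base_rte->rtekind != RTE_RELATION)
return false;

/* no PlannerInfo exists for the subquery yet, so this call cannot be made */
subqueryrelid = build_simple_rel(<would have to fake this>, rtr->rtindex,
RELOPT_BASEREL);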

I don't have a PlannerInfo to pass to build_simple_rel and it just seems
like a horrid hack to create one that we're not going to be keeping.
Plus it would be a real shame to have to call build_simple_rel() for the
same relation again when we plan the subquery later.

I'm getting the idea that looking for unique indexes on the sub query is
not worth the hassle for now. Don't get me wrong, they'd be nice to have,
but I just think that it's a less common use case and these are more likely
to have been pulled up anyway.

Unless there's a better way, I think I'm going to spend the time looking
into inner joins instead.

Regards

David Rowley

#15

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#12)

Re: Allowing join removals for more join types

David Rowley <dgrowleyml@gmail.com> writes:

I've just had a bit of a look at implementing checks allowing subqueries
with unique indexes on the join cols being removed,

I'm a bit confused by this statement of the problem. I thought the idea
was to recognize that subqueries with DISTINCT or GROUP BY clauses produce
known-unique output column(s), which permits join removal in the same way
that unique indexes on a base table allow us to deduce that certain
columns are known-unique and hence can offer no more than one match for
a join. That makes it primarily a syntactic check, which you can perform
despite the fact that the subquery hasn't been planned yet (since the
parser has done sufficient analysis to determine the semantics of
DISTINCT/GROUP BY).
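
The "no more than one match" argument can be checked with a small row-counting model. This is only an illustrative sketch with a made-up function name: it counts the rows a LEFT JOIN on one equality key would produce, ignoring NULL semantics and everything else the executor does.

```c
#include <stddef.h>

/*
 * Toy model: number of rows produced by
 *   outer LEFT JOIN inner ON outer.key = inner.key
 * Each outer row appears once per matching inner row, or exactly once
 * (null-extended) when it has no match.
 */
size_t
toy_left_join_row_count(const int *outer_keys, size_t nouter,
                        const int *inner_keys, size_t ninner)
{
    size_t rows = 0;

    for (size_t i = 0; i < nouter; i++)
    {
        size_t matches = 0;

        for (size_t j = 0; j < ninner; j++)
            if (outer_keys[i] == inner_keys[j])
                matches++;

        rows += (matches > 0) ? matches : 1;
    }
    return rows;
}
```

When the inner keys are unique the result always has exactly as many rows as the outer side, so if no inner column is referenced the join changes nothing. With a duplicated inner key the count grows, which is why uniqueness has to be proved before the join is dropped.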

Drilling down into the subquery is a whole different matter. For one
thing, there's no point in targeting cases in which the subquery would be
eligible to be flattened into the parent query, and your proposed list of
restrictions seems to eliminate most cases in which it couldn't be
flattened. For another, you don't have access to any planning results for
the subquery yet, which is the immediate problem you're complaining of.
Duplicating the work of looking up a relation's indexes seems like a
pretty high price to pay for whatever improvement you might get here.

regards, tom lane

#16

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Tom Lane (#15)

Re: Allowing join removals for more join types

On Sat, May 24, 2014 at 3:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

I've just had a bit of a look at implementing checks allowing subqueries
with unique indexes on the join cols being removed,

I'm a bit confused by this statement of the problem. I thought the idea
was to recognize that subqueries with DISTINCT or GROUP BY clauses produce
known-unique output column(s), which permits join removal in the same way
that unique indexes on a base table allow us to deduce that certain
columns are known-unique and hence can offer no more than one match for
a join. That makes it primarily a syntactic check, which you can perform
despite the fact that the subquery hasn't been planned yet (since the
parser has done sufficient analysis to determine the semantics of
DISTINCT/GROUP BY).

A little upthread, Dilip suggested that, in addition to checking whether the
sub query can be proved unique on the join condition using DISTINCT/GROUP BY,
we might also check unique indexes in the subquery to see if they can prove
the query is unique on the join condition.

For example a query such as:

SELECT a.* FROM a LEFT JOIN (SELECT b.* FROM b LIMIT 1) b ON a.column =
b.colwithuniqueidx

The presence of the LIMIT would be enough to stop the subquery being pulled
up, but there'd be no reason why the join couldn't be removed.

I think the use case for this is likely a bit narrower than the GROUP
BY/DISTINCT case, so I'm planning on spending the time looking into more
common cases such as INNER JOINs where we can prove the existence of the
row using a foreign key.

Drilling down into the subquery is a whole different matter. For one
thing, there's no point in targeting cases in which the subquery would be
eligible to be flattened into the parent query, and your proposed list of
restrictions seems to eliminate most cases in which it couldn't be
flattened. For another, you don't have access to any planning results for
the subquery yet, which is the immediate problem you're complaining of.
Duplicating the work of looking up a relation's indexes seems like a
pretty high price to pay for whatever improvement you might get here.

I agree that not many cases where the join could be removed remain once
is_simple_subquery() has decided not to pull up the subquery. Some of
the perhaps more common cases would be window functions in the
subquery, as this is what you need to do if you want to filter on the result
of a window function in a WHERE clause. Another case, though
I can't imagine it would be common, is ORDER BY in the subquery... But for
that one I can't quite understand why is_simple_subquery() stops that being
flattened in the first place.

Regards

David Rowley

#17

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#16)

Re: Allowing join removals for more join types

David Rowley <dgrowleyml@gmail.com> writes:

I agree that not many cases where the join could be removed remain once
is_simple_subquery() has decided not to pull up the subquery. Some of
the perhaps more common cases would be window functions in the
subquery, as this is what you need to do if you want to filter on the result
of a window function in a WHERE clause. Another case, though
I can't imagine it would be common, is ORDER BY in the subquery... But for
that one I can't quite understand why is_simple_subquery() stops that being
flattened in the first place.

The problem there is that (in general) pushing qual conditions to below a
window function will change the window function's results. If we flatten
such a subquery then the outer query's quals can get evaluated before
the window function, so that's no good. Another issue is that flattening
might cause the window function call to get copied to places in the outer
query where it can't legally go, such as the WHERE clause.
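
To illustrate the first point, here is a minimal sketch; the table t and its contents are made up for illustration and do not come from the thread. The same qual gives different row numbers depending on whether it is evaluated above or below the window function.

```sql
CREATE TEMP TABLE t (id int);
INSERT INTO t VALUES (1), (2), (3);

-- Qual evaluated above the window function: row_number() sees all three
-- rows, then the filter keeps ids 2 and 3 with rn = 2 and rn = 3.
SELECT *
FROM (SELECT id, row_number() OVER (ORDER BY id) AS rn FROM t) s
WHERE id >= 2;

-- Same qual evaluated below the window function: row_number() sees only
-- ids 2 and 3, so they get rn = 1 and rn = 2 instead.
SELECT id, row_number() OVER (ORDER BY id) AS rn
FROM t
WHERE id >= 2;
```

Flattening the subquery in the first statement would turn it into the second, which is why the two forms cannot be treated as interchangeable.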

regards, tom lane


#18

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: David Rowley (#14)

1 attachment(s)

Re: Allowing join removals for more join types

On Fri, May 23, 2014 at 11:45 PM, David Rowley <dgrowleyml@gmail.com> wrote:

I'm getting the idea that looking for unique indexes on the sub query is
not worth the hassle for now. Don't get me wrong, they'd be nice to have,
but I just think that it's a less common use case and these are more likely
to have been pulled up anyway.

Unless there's a better way, I think I'm going to spend the time looking
into inner joins instead.

I've been working on adding join removal for join types other than left
outer joins.

The attached patch allows join removals for both sub queries with left
joins and also semi joins where a foreign key can prove the existence of
the record.

My longer term plan is to include inner joins too, but now that I have
something to show for semi joins, I thought this would be a good time to
post the patch just in case anyone can see any show stoppers with using
foreign keys this way.

So with the attached you can do:

CREATE TABLE b (id INT NOT NULL PRIMARY KEY);
CREATE TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT NOT NULL REFERENCES
b(id));

EXPLAIN (COSTS OFF)
SELECT id FROM a WHERE b_id IN(SELECT id FROM b);
QUERY PLAN
---------------
Seq Scan on a
(1 row)
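
The reasoning behind that plan can be checked by hand. The following is only a sketch: the table names fk_parent and fk_child and the sample rows are hypothetical, chosen to mirror tables b and a above.

```sql
CREATE TABLE fk_parent (id int NOT NULL PRIMARY KEY);
CREATE TABLE fk_child (id int NOT NULL PRIMARY KEY,
                       parent_id int NOT NULL REFERENCES fk_parent(id));
INSERT INTO fk_parent VALUES (1), (2);
INSERT INTO fk_child VALUES (10, 1), (11, 2), (12, 2);

-- The foreign key guarantees that every parent_id has a matching
-- fk_parent row, so the IN filter cannot reject any fk_child row and
-- both queries return ids 10, 11 and 12.
SELECT id FROM fk_child WHERE parent_id IN (SELECT id FROM fk_parent);
SELECT id FROM fk_child;
```

Note that the equivalence in this sketch leans on parent_id being NOT NULL: a row with a NULL parent_id would satisfy the foreign key yet fail the IN test, so the two queries would then differ.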

I think anti joins could use the same infrastructure but I'm not quite sure
yet how to go about replacing the join with something like WHERE false.

I do think semi and anti joins are a far less useful case for join removals
than inner joins are, but if we're already loading the foreign key
constraints at plan time, then it seems like something that might be
worthwhile checking.

Oh, quite likely the code that loads the foreign key constraints needs more
work and should probably be included in the rel cache, but I don't want to go
and do that until I get some feedback on the work so far.

Any comments are welcome.

Thanks

David Rowley

Attachments:

join_removal_793f19f_2014-05-28.patch (application/octet-stream)

diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 42dcb11..93a8750 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -21,6 +21,7 @@
 #include "access/sysattr.h"
 #include "catalog/pg_am.h"
 #include "catalog/pg_collation.h"
+#include "catalog/pg_constraint.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_opfamily.h"
 #include "catalog/pg_type.h"
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 129fc3d..d2e8e7a 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -27,9 +27,23 @@
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
 #include "optimizer/var.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "optimizer/tlist.h"
+#include "nodes/nodeFuncs.h"
+#include "nodes/pg_list.h"
+#include "utils/lsyscache.h"
 
 /* local functions */
-static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool leftjoin_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool semijoin_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool sortclause_is_unique_on_restrictinfo(Query *query,
+					  List *clause_list, List *sortclause);
+static bool relation_has_foreign_key_for(PlannerInfo *root, RelOptInfo *rel,
+					  RelOptInfo *referencedrel, List *referencing_exprs,
+					  List *index_exprs, List *operator_list);
+static bool expressions_match_foreign_key(ForeignKeyInfo *fk, IndexOptInfo *ind,
+					  List *exprlist, List *index_exprs, List *operator_list);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
@@ -59,10 +73,20 @@ restart:
 		int			innerrelid;
 		int			nremoved;
 
-		/* Skip if not removable */
-		if (!join_is_removable(root, sjinfo))
-			continue;
-
+		if (sjinfo->jointype == JOIN_LEFT)
+		{
+			/* Skip if not removable */
+			if (!leftjoin_is_removable(root, sjinfo))
+				continue;
+		}
+		else if (sjinfo->jointype == JOIN_SEMI)
+		{
+			/* Skip if not removable */
+			if (!semijoin_is_removable(root, sjinfo))
+				continue;
+		}
+		else
+			continue; /* we don't support this join type */
 		/*
 		 * Currently, join_is_removable can only succeed when the sjinfo's
 		 * righthand is a single baserel.  Remove that rel from the query and
@@ -132,47 +156,75 @@ clause_sides_match_join(RestrictInfo *rinfo, Relids outerrelids,
 }
 
 /*
- * join_is_removable
- *	  Check whether we need not perform this special join at all, because
+ * leftjoin_is_removable
+ *	  Check whether we need not perform this left join at all, because
  *	  it will just duplicate its left input.
  *
  * This is true for a left join for which the join condition cannot match
- * more than one inner-side row.  (There are other possibly interesting
- * cases, but we don't have the infrastructure to prove them.)  We also
- * have to check that the inner side doesn't generate any variables needed
- * above the join.
+ * more than one inner-side row. To prove the join will be unique on the
+ * join condition we must analyze the unique indexes on the right side of
+ * the join to ensure that no more than 1 record can exist for the join
+ * condition.
+ *
+ * We can also remove sub queries if we can prove the query will not produce
+ * more than 1 record for the join condition, to do this we currently look at
+ * the GROUP BY and DISTINCT clause of the query.
  */
 static bool
-join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
+leftjoin_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 {
 	int			innerrelid;
 	RelOptInfo *innerrel;
+	Query	   *subquery;
 	Relids		joinrelids;
 	List	   *clause_list = NIL;
 	ListCell   *l;
 	int			attroff;
+	List	   *fklist = NIL;
+
+	Assert(sjinfo->jointype == JOIN_LEFT);
 
 	/*
 	 * Currently, we only know how to remove left joins to a baserel with
-	 * unique indexes.  We can check most of these criteria pretty trivially
-	 * to avoid doing useless extra work.  But checking whether any of the
-	 * indexes are unique would require iterating over the indexlist, so for
-	 * now we just make sure there are indexes of some sort or other.  If none
-	 * of them are unique, join removal will still fail, just slightly later.
+	 * unique indexes and left joins to a subquery where the subquery is
+	 * unique on the join condition. We can check most of these criteria
+	 * pretty trivially to avoid doing useless extra work.  But checking
+	 * whether any of the indexes are unique would require iterating over
+	 * the indexlist, so for now, if we're joining to a relation, we'll just
+	 * ensure that we have at least 1 index, it won't matter if that index
+	 * is unique at this stage, we'll check those details later.
 	 */
-	if (sjinfo->jointype != JOIN_LEFT ||
-		sjinfo->delay_upper_joins ||
+	if (sjinfo->delay_upper_joins ||
 		bms_membership(sjinfo->min_righthand) != BMS_SINGLETON)
 		return false;
 
 	innerrelid = bms_singleton_member(sjinfo->min_righthand);
 	innerrel = find_base_rel(root, innerrelid);
 
-	if (innerrel->reloptkind != RELOPT_BASEREL ||
-		innerrel->rtekind != RTE_RELATION ||
-		innerrel->indexlist == NIL)
+	if (innerrel->reloptkind != RELOPT_BASEREL)
 		return false;
 
+	if (innerrel->rtekind == RTE_RELATION)
+	{
+		if (innerrel->indexlist == NIL)
+			return false; /* no possibility of a unique index */
+	}
+	else if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		subquery = root->simple_rte_array[innerrelid]->subquery;
+
+		/*
+		 * The only means we currently use to check if the subquery is unique
+		 * are the GROUP BY and DISTINCT clause. If both of these are empty
+		 * then there's no point in going any further.
+		 */
+		if (subquery->groupClause == NIL &&
+			subquery->distinctClause == NIL)
+			return false;
+	}
+	else
+		return false; /* unsupported rtekind */
+
 	/* Compute the relid set for the join we are considering */
 	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
 
@@ -275,17 +327,437 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 * clauses for the innerrel, so we needn't do that here.
 	 */
 
-	/* Now examine the indexes to see if we have a matching unique index */
-	if (relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
-		return true;
+	/*
+	 * Now examine the indexes to see if we have a matching unique index.*/
+	if (innerrel->rtekind == RTE_RELATION)
+		return relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL);
 
 	/*
+	 * We can be certain that the sub query contains no duplicate values for
+	 * the join clause if every item in the sub query's GROUP BY clause is also
+	 * used in the join clause using equality. This works the same way for the
+	 * DISTINCT clause. We need not pay any attention to WHERE or HAVING
+	 * clauses as these just restrict the results more and could not be the
+	 * cause of duplication in the result set. However there are a number of
+	 * pre-checks we must perform which could cause duplicate values even if
+	 * all the required columns are in the GROUP BY or DISTINCT clause.
+	 *
+	 * NB: We must also not remove the join if the subquery contains a
+	 * FOR UPDATE clause, but we can actually skip this check as GROUP BY and
+	 * DISTINCT cannot be used at the same time as FOR UPDATE.
+	 */
+	if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		Assert(subquery == root->simple_rte_array[innerrelid]->subquery);
+
+		/*
+		 * We cannot remove the subquery if the target list contains any set
+		 * returning functions as these may cause the query not to be unique
+		 * on the grouping columns, as per the following example:
+		 * "SELECT a.a,generate_series(1,2) FROM (VALUES(1)) a(a) GROUP BY a"
+		 */
+		if (expression_returns_set((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * Don't remove the join if the target list contains any volatile
+		 * functions. Doing so may remove desired side effects that calls
+		 * to the function may cause.
+		 */
+		if (contain_volatile_functions((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * It should be safe to remove the join if all the GROUP BY expressions
+		 * have matching items in the join condition.
+		 */
+		if (subquery->groupClause != NIL &&
+			sortclause_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
+			return true;
+
+		/*
+		 * It should be safe to remove the join if all the DISTINCT columns have
+		 * matching items in the join condition.
+		 */
+		if (subquery->distinctClause != NIL &&
+			sortclause_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
+			return true;
+	}
+	/* XXX is this comment still needed??
 	 * Some day it would be nice to check for other methods of establishing
 	 * distinctness.
 	 */
 	return false;
 }
 
+/*
+ * semijoin_is_removable
+ *	  Check if we can remove this semi join.
+ *
+ * To prove that a semi join is redundant we have to ensure that a foreign key
+ * exists on the left side of the join which references the table at the right
+ * side of the join. This means that we can only support a single table on
+ * either side of the join. We must also ensure that the join condition matches
+ * all the foreign key columns to each index column on the referenced table. If
+ * any columns are missing then we cannot be sure we'll get exactly 1 record back,
+ * and if there are any extra conditions that are not in the foreign key then we
+ * cannot be sure that the join condition will produce at least 1 matching row.
+ */
+static bool
+semijoin_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
+{
+	int			innerrelid;
+	int			outerrelid;
+	RelOptInfo *innerrel;
+	RelOptInfo *outerrel;
+	ListCell   *lc;
+	List	   *referencing_exprs;
+	List	   *index_exprs;
+	List	   *operator_list;
+
+	Assert(sjinfo->jointype == JOIN_SEMI);
+
+	if (sjinfo->delay_upper_joins ||
+		bms_membership(sjinfo->min_righthand) != BMS_SINGLETON)
+		return false;
+
+	innerrelid = bms_singleton_member(sjinfo->min_righthand);
+	innerrel = find_base_rel(root, innerrelid);
+
+	if (innerrel->reloptkind != RELOPT_BASEREL ||
+		innerrel->rtekind != RTE_RELATION ||
+		innerrel->indexlist == NIL ||
+		bms_membership(sjinfo->min_lefthand) != BMS_SINGLETON)
+		return false;
+
+	/*
+	 * To allow the removal of a SEMI JOIN we must analyze the foreign
+	 * keys of the relation on the left side of the join, for this to work
+	 * we'll need to ensure that there is only 1 relation on the left side
+	 * of the join, otherwise there's no possibility of foreign keys.
+	 * If the relation on the left side has no foreign keys then there's
+	 * no possibility that the join can be removed.
+	 */
+
+	outerrelid = bms_singleton_member(sjinfo->min_lefthand);
+	outerrel = find_base_rel(root, outerrelid);
+
+	/* No possibility to remove the join if there's no foreign keys */
+	if (outerrel->fklist == NIL)
+		return false;
+
+	referencing_exprs = NIL;
+	index_exprs = NIL;
+	operator_list = NIL;
+
+	foreach(lc, sjinfo->join_quals)
+	{
+		OpExpr	   *op = (OpExpr *) lfirst(lc);
+		Oid			opno;
+		Node	   *left_expr;
+		Node	   *right_expr;
+		Relids		left_varnos;
+		Relids		right_varnos;
+		Relids		all_varnos;
+		Oid			opinputtype;
+
+		/* Is it a binary opclause? */
+		if (!IsA(op, OpExpr) ||
+			list_length(op->args) != 2)
+		{
+			/* No, but does it reference both sides? */
+			all_varnos = pull_varnos((Node *) op);
+			if (!bms_overlap(all_varnos, sjinfo->syn_righthand) ||
+				bms_is_subset(all_varnos, sjinfo->syn_righthand))
+			{
+				/*
+				 * Clause refers to only one rel, so ignore it --- unless it
+				 * contains volatile functions, in which case we'd better
+				 * punt.
+				 */
+				if (contain_volatile_functions((Node *) op))
+					return false;
+				continue;
+			}
+			/* Non-operator clause referencing both sides, must punt */
+			return false;
+		}
+
+		/* Extract data from binary opclause */
+		opno = op->opno;
+		left_expr = linitial(op->args);
+		right_expr = lsecond(op->args);
+		left_varnos = pull_varnos(left_expr);
+		right_varnos = pull_varnos(right_expr);
+		all_varnos = bms_union(left_varnos, right_varnos);
+		opinputtype = exprType(left_expr);
+
+		/* Does it reference both sides? */
+		if (!bms_overlap(all_varnos, sjinfo->syn_righthand) ||
+			bms_is_subset(all_varnos, sjinfo->syn_righthand))
+		{
+			/*
+			 * Clause refers to only one rel, so ignore it --- unless it
+			 * contains volatile functions, in which case we'd better punt.
+			 */
+			if (contain_volatile_functions((Node *) op))
+				return false;
+			continue;
+		}
+
+		/* check rel membership of arguments */
+		if (!bms_is_empty(right_varnos) &&
+			bms_is_subset(right_varnos, sjinfo->syn_righthand) &&
+			!bms_overlap(left_varnos, sjinfo->syn_righthand))
+		{
+			/* typical case, right_expr is RHS variable */
+		}
+		else if (!bms_is_empty(left_varnos) &&
+				 bms_is_subset(left_varnos, sjinfo->syn_righthand) &&
+				 !bms_overlap(right_varnos, sjinfo->syn_righthand))
+		{
+			Node *tmp;
+			/* flipped case, left_expr is RHS variable */
+			opno = get_commutator(opno);
+			if (!OidIsValid(opno))
+				return false;
+
+			/* swap the operands */
+			tmp = left_expr;
+			left_expr = right_expr;
+			right_expr = tmp;
+		}
+		else
+			return false;
+
+		/* so far so good, keep building lists */
+		referencing_exprs = lappend(referencing_exprs, copyObject(left_expr));
+		operator_list = lappend_oid(operator_list, opno);
+		index_exprs = lappend(index_exprs, copyObject(right_expr));
+	}
+
+	if (referencing_exprs == NIL)
+		return false;
+
+	/* The expressions mustn't be volatile. */
+	if (contain_volatile_functions((Node *) referencing_exprs))
+		return false;
+
+	if (contain_volatile_functions((Node *) index_exprs))
+		return false;
+
+	return relation_has_foreign_key_for(root, outerrel, innerrel,
+			referencing_exprs, index_exprs, operator_list);
+}
+
+/*
+ * relation_has_foreign_key_for
+ *	  Checks if rel has a foreign key which references referencedrel with the
+ *	  given list of expressions.
+ *
+ *	For the match to succeed:
+ *	  referencing_exprs must match the columns defined in the foreign key
+ *	  index_exprs must match the columns defined in the index for the foreign key.
+ */
+static bool
+relation_has_foreign_key_for(PlannerInfo *root, RelOptInfo *rel,
+			RelOptInfo *referencedrel, List *referencing_exprs,
+			List *index_exprs, List *operator_list)
+{
+	ListCell *lc;
+
+	Assert(list_length(referencing_exprs) == list_length(index_exprs));
+	Assert(list_length(referencing_exprs) == list_length(operator_list));
+
+	/*
+	 * Short-circuit if no foreign keys exist on the relation or
+	 * there are no indexes on the referenced relation. Remember that
+	 * it is possible for the fklist to not be empty and the indexlist
+	 * to be empty as the foreign keys may be for some completely other
+	 * relation.
+	 */
+	if (rel->fklist == NIL || referencedrel->indexlist == NIL)
+		return false;
+
+	/*
+	 * Here we must look at each foreign key which is defined and see if we
+	 * can find that foreign key's index in the index list of the referenced
+	 * table. When we find a match we do some quick pre-checks on the index
+	 * then we try to see if all of the expressions can be matched to that
+	 * foreign key and index. If we don't match then we'll keep trying to
+	 * find another matching foreign key and index list.
+	 */
+	foreach(lc, rel->fklist)
+	{
+		ForeignKeyInfo *fk = (ForeignKeyInfo *) lfirst(lc);
+		ListCell *ic;
+
+		/*
+		 * We need to ensure that if the foreign key has more than one column
+		 * that the foreign key is of type MATCH FULL. XXX is this overly strict??
+		 */
+		if (fk->conncols > 1 && fk->confmatchtype != FKCONSTR_MATCH_FULL)
+			continue;
+
+		foreach(ic, referencedrel->indexlist)
+		{
+			IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic);
+			if (fk->conindid == ind->indexoid)
+			{
+				/* Sanity check? XXX Should we complain or just skip this one? */
+				if (fk->conncols != ind->ncolumns)
+					elog(ERROR, "Number of columns in foreign key does not match number of indexed columns");
+
+				/* Index not ready? XXX Perhaps this should be an error as we
+				 * should only have fks that have been validated.
+				 */
+				if (!ind->unique || !ind->immediate ||
+					(ind->indpred != NIL && !ind->predOK))
+					continue;
+
+				if (expressions_match_foreign_key(fk, ind, referencing_exprs, index_exprs, operator_list))
+					return true;
+			}
+		}
+	}
+
+	return false;
+}
+
+static bool
+expressions_match_foreign_key(ForeignKeyInfo *fk, IndexOptInfo *ind,
+			List *exprlist, List *index_exprs, List *operator_list)
+{
+	ListCell *lc;
+	ListCell *lc2;
+	ListCell *lc3;
+	int col;
+
+	Assert(list_length(exprlist) == list_length(index_exprs));
+	Assert(list_length(exprlist) == list_length(operator_list));
+
+	/*
+	 * For each column defined in the foreign key we must ensure that we find
+	 * a qual in the expression list which matches the foreign key on one side
+	 * of the expression and the index on the other side of the expression. It
+	 * does not matter if the same expression appears many times, we just need
+	 * to ensure all exist at least once and no extra non matching expressions
+	 * exist.
+	 */
+
+	/*
+	 * Fast path out if there's not enough conditions to match
+	 * each column in the foreign key. Note that we cannot check
+	 * that the number of expressions is equal here since it would
+	 * cause duplicate expressions to not match.
+	 */
+	if (list_length(exprlist) < fk->conncols)
+		return false;
+
+	forthree(lc, exprlist, lc2, index_exprs, lc3, operator_list)
+	{
+		Node	*expr = (Node *) lfirst(lc);
+		Node	*idxexpr = (Node *) lfirst(lc2);
+		Oid		opr = lfirst_oid(lc3);
+		bool matched = false;
+
+		/* if anything is NULL or not a Var then it's not a match */
+		if (!expr || !IsA(expr, Var) || !idxexpr || !IsA(idxexpr, Var))
+			return false;
+
+		for (col = 0; col < fk->conncols; col++)
+		{
+			if (fk->conkey[col] == ((Var *) expr)->varattno &&
+				fk->confkey[col] == ((Var *) idxexpr)->varattno &&
+				opr == fk->conpfeqop[col])
+			{
+				matched = true;
+				break;
+			}
+		}
+
+		/*
+		 * Did we find anything matching the fk col? If not then we'll
+		 * return a no match.
+		 */
+		if (!matched)
+			return false;
+	}
+
+	return true;
+}
+
+
+/*
+ * sortclause_is_unique_on_restrictinfo
+ *
+ * Checks to see if all items in sortclause also exist in clause_list.
+ * The function will return true if clause_list is the same as or a superset
+ * of the sortclause. If the sortclause has columns that don't exist in the
+ * clause_list then the query can't be guaranteed unique on the clause_list
+ * columns.
+ *
+ * Note: The calling function must ensure that sortclause is not NIL.
+ */
+static bool
+sortclause_is_unique_on_restrictinfo(Query *query, List *clause_list, List *sortclause)
+{
+	ListCell *l;
+
+	Assert(sortclause != NIL);
+
+	/*
+	 * Loop over each sort clause to ensure that we have
+	 * an item in the join conditions that matches it.
+	 * It does not matter if we have more items in the join
+	 * condition than we have in the sort clause.
+	 */
+	foreach(l, sortclause)
+	{
+		ListCell		*ri;
+		SortGroupClause *scl = (SortGroupClause *) lfirst(l);
+		TargetEntry		*sortTarget;
+		bool			 matched = false;
+
+		/* look up the target list entry for the current sort group ref */
+		sortTarget = get_sortgroupref_tle(scl->tleSortGroupRef, query->targetList);
+
+		/*
+		 * Since a constant only has 1 value the existence of one here will
+		 * not cause any duplication of the results. We'll simply ignore it!
+		 */
+		if (IsA(sortTarget->expr, Const))
+			continue;
+
+		foreach(ri, clause_list)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(ri);
+			Node	   *rexpr;
+
+			if (rinfo->outer_is_left)
+				rexpr = get_rightop(rinfo->clause);
+			else
+				rexpr = get_leftop(rinfo->clause);
+
+			if (IsA(rexpr, Var))
+			{
+				Var *var = (Var *)rexpr;
+
+				if (var->varattno == sortTarget->resno)
+				{
+					matched = true;
+					break; /* match found */
+				}
+			}
+			else /* XXX what else could it be? */
+				return false;
+		}
+
+		if (!matched)
+			return false;
+	}
+	return true;
+}
 
 /*
  * Remove the target relid from the planner's data structures, having
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index b2becfa..ac7b38b 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -26,6 +26,8 @@
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "catalog/heap.h"
+#include "catalog/pg_constraint.h"
+#include "catalog/pg_type.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -38,6 +40,7 @@
 #include "parser/parsetree.h"
 #include "rewrite/rewriteManip.h"
 #include "storage/bufmgr.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
@@ -384,6 +387,123 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 
 	heap_close(relation, NoLock);
 
+
+	{
+		List	   *result = NIL;
+		Relation	fkeyRel;
+		Relation	fkeyRelIdx;
+		ScanKeyData fkeyScankey;
+		SysScanDesc fkeyScan;
+		HeapTuple	tuple;
+
+		/*
+		 * Scan pg_constraint for constraints on this relation.  This is an
+		 * ordered index scan on conrelid using ConstraintRelidIndexId.
+		 */
+		ScanKeyInit(&fkeyScankey,
+			Anum_pg_constraint_conrelid,
+			BTEqualStrategyNumber, F_OIDEQ,
+			ObjectIdGetDatum(relationObjectId));
+
+		fkeyRel = heap_open(ConstraintRelationId, AccessShareLock);
+		fkeyRelIdx = index_open(ConstraintRelidIndexId, AccessShareLock);
+		fkeyScan = systable_beginscan_ordered(fkeyRel, fkeyRelIdx, NULL, 1, &fkeyScankey);
+
+		while ((tuple = systable_getnext_ordered(fkeyScan, ForwardScanDirection)) != NULL)
+		{
+			Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
+			ForeignKeyInfo *fkinfo;
+			Datum		adatum;
+			bool		isNull;
+			ArrayType  *arr;
+			int			numkeys;
+
+			/* Not a foreign key */
+			if (con->contype != CONSTRAINT_FOREIGN)
+				continue;
+
+			/* we're not interested unless the fk has been validated */
+			if (!con->convalidated)
+				continue;
+
+			fkinfo = (ForeignKeyInfo *) palloc(sizeof(ForeignKeyInfo));
+			fkinfo->conindid = con->conindid;
+			fkinfo->confrelid = con->confrelid;
+			fkinfo->convalidated = con->convalidated;
+			fkinfo->conrelid = con->conrelid;
+			fkinfo->confupdtype = con->confupdtype;
+			fkinfo->confdeltype = con->confdeltype;
+			fkinfo->confmatchtype = con->confmatchtype;
+
+			adatum = heap_getattr(tuple, Anum_pg_constraint_conkey,
+				RelationGetDescr(fkeyRel), &isNull);
+			if (isNull)
+				elog(ERROR, "null conkey for constraint %u",
+				HeapTupleGetOid(tuple));
+			arr = DatumGetArrayTypeP(adatum);		/* ensure not toasted */
+			numkeys = ARR_DIMS(arr)[0];
+			if (ARR_NDIM(arr) != 1 ||
+				numkeys < 0 ||
+				ARR_HASNULL(arr) ||
+				ARR_ELEMTYPE(arr) != INT2OID)
+				elog(ERROR, "conkey is not a 1-D smallint array");
+
+			fkinfo->conkey = (int16 *) ARR_DATA_PTR(arr);
+
+			fkinfo->conncols = numkeys;
+
+			adatum = heap_getattr(tuple, Anum_pg_constraint_confkey,
+				RelationGetDescr(fkeyRel), &isNull);
+			if (isNull)
+				elog(ERROR, "null confkey for constraint %u",
+				HeapTupleGetOid(tuple));
+			arr = DatumGetArrayTypeP(adatum);		/* ensure not toasted */
+			numkeys = ARR_DIMS(arr)[0];
+
+			/* sanity check */
+			if (numkeys != fkinfo->conncols)
+				elog(ERROR, "number of confkey elements does not equal conkey elements");
+
+			if (ARR_NDIM(arr) != 1 ||
+				numkeys < 0 ||
+				ARR_HASNULL(arr) ||
+				ARR_ELEMTYPE(arr) != INT2OID)
+				elog(ERROR, "confkey is not a 1-D smallint array");
+
+			fkinfo->confkey = (int16 *) ARR_DATA_PTR(arr);
+
+			adatum = heap_getattr(tuple, Anum_pg_constraint_conpfeqop,
+				RelationGetDescr(fkeyRel), &isNull);
+			if (isNull)
+				elog(ERROR, "null conpfeqop for constraint %u",
+				HeapTupleGetOid(tuple));
+			arr = DatumGetArrayTypeP(adatum);		/* ensure not toasted */
+			numkeys = ARR_DIMS(arr)[0];
+
+			/* sanity check */
+			if (numkeys != fkinfo->conncols)
+				elog(ERROR, "number of conpfeqop elements does not equal conkey elements");
+
+			if (ARR_NDIM(arr) != 1 ||
+				numkeys < 0 ||
+				ARR_HASNULL(arr) ||
+				ARR_ELEMTYPE(arr) != OIDOID)
+				elog(ERROR, "conpfeqop is not a 1-D Oid array");
+
+			fkinfo->conpfeqop = (Oid *) ARR_DATA_PTR(arr);
+
+			result = lappend(result, fkinfo);
+		}
+
+		rel->fklist = result;
+
+		systable_endscan_ordered(fkeyScan);
+		index_close(fkeyRelIdx, AccessShareLock);
+		heap_close(fkeyRel, AccessShareLock);
+	}
+
+
+
 	/*
 	 * Allow a plugin to editorialize on the info we obtained from the
 	 * catalogs.  Actions might include altering the assumed relation size,
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index c938c27..a0fb8eb 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -115,6 +115,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptKind reloptkind)
 	rel->lateral_relids = NULL;
 	rel->lateral_referencers = NULL;
 	rel->indexlist = NIL;
+	rel->fklist = NIL;
 	rel->pages = 0;
 	rel->tuples = 0;
 	rel->allvisfrac = 0;
@@ -377,6 +378,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->lateral_relids = NULL;
 	joinrel->lateral_referencers = NULL;
 	joinrel->indexlist = NIL;
+	joinrel->fklist = NIL;
 	joinrel->pages = 0;
 	joinrel->tuples = 0;
 	joinrel->allvisfrac = 0;
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 300136e..3deb59b 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -445,6 +445,7 @@ typedef struct RelOptInfo
 	Relids		lateral_relids; /* minimum parameterization of rel */
 	Relids		lateral_referencers;	/* rels that reference me laterally */
 	List	   *indexlist;		/* list of IndexOptInfo */
+	List	   *fklist;			/* list of ForeignKeyInfo */
 	BlockNumber pages;			/* size estimates derived from pg_class */
 	double		tuples;
 	double		allvisfrac;
@@ -1643,4 +1644,20 @@ typedef struct JoinCostWorkspace
 	int			numbatches;
 } JoinCostWorkspace;
 
+typedef struct ForeignKeyInfo
+{
+	Oid			conindid;		/* index supporting this constraint */
+	Oid			confrelid;		/* relation referenced by foreign key */
+	bool		convalidated;	/* constraint has been validated? */
+	Oid			conrelid;		/* relation this constraint constrains */
+	char		confupdtype;	/* foreign key's ON UPDATE action */
+	char		confdeltype;	/* foreign key's ON DELETE action */
+	char		confmatchtype;	/* foreign key's match type */
+	int			conncols;		/* number of columns referenced */
+	int16	   *conkey;			/* Columns of conrelid that the constraint applies to */
+	int16	   *confkey;		/* columns of confrelid that foreign key references */
+	Oid		   *conpfeqop;		/* Operator list for comparing PK to FK */
+} ForeignKeyInfo;
+
+
 #endif   /* RELATION_H */
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 934488a..fd8e048 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3060,9 +3060,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
   QUERY PLAN   
@@ -3098,7 +3100,331 @@ select id from a where id in (
          ->  Seq Scan on b
 (5 rows)
 
+-- check optimization of outer join when joining a unique sub query using group by
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+                                        QUERY PLAN                                        
+------------------------------------------------------------------------------------------
+ Hash Left Join
+   Hash Cond: ((a.id)::double precision = ((((d.a + d.b))::double precision + random())))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: (((d.a + d.b))::double precision + random())
+               ->  Seq Scan on d
+(7 rows)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+-- and a constant expression.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- and contains a redundant join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- optimization is not possible when the group by contains a column which is not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when distinct clause contains an item that is not in the join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+                   QUERY PLAN                    
+-------------------------------------------------
+ Hash Left Join
+   Hash Cond: (a.id = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Subquery Scan on b
+               Filter: (b.r = random())
+               ->  HashAggregate
+                     Group Key: b_1.id, random()
+                     ->  Seq Scan on b b_1
+(9 rows)
+
+-- optimization is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
 rollback;
+BEGIN;
+-- Test join removals for semi joins
+CREATE TEMP TABLE b (id INT NOT NULL PRIMARY KEY);
+CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT NOT NULL REFERENCES b(id));
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id IN(SELECT id FROM b);
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id = id);
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- should remove semi join to b (swapped condition order)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE id = a.b_id);
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- should not remove semi join (since not using equals)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE id >= a.b_id);
+               QUERY PLAN                
+-----------------------------------------
+ Nested Loop Semi Join
+   ->  Seq Scan on a
+   ->  Index Only Scan using b_pkey on b
+         Index Cond: (id >= a.b_id)
+(4 rows)
+
+-- should not remove semi join
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id+0 IN(SELECT id FROM b);
+             QUERY PLAN             
+------------------------------------
+ Hash Semi Join
+   Hash Cond: ((a.b_id + 0) = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+(5 rows)
+
+-- should not remove semi join
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id IN(SELECT id+0 FROM b);
+             QUERY PLAN             
+------------------------------------
+ Hash Semi Join
+   Hash Cond: (a.b_id = (b.id + 0))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+(5 rows)
+
+-- should not remove semi join (wrong column)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE id IN(SELECT id FROM b);
+         QUERY PLAN         
+----------------------------
+ Hash Join
+   Hash Cond: (b.id = a.id)
+   ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(5 rows)
+
+ROLLBACK;
+BEGIN;
+-- Semi join removal code with 2 column foreign keys
+CREATE TEMP TABLE b (id1 INT NOT NULL, id2 INT NOT NULL, PRIMARY KEY(id1,id2));
+CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id1 INT NOT NULL, b_id2 INT NOT NULL);
+ALTER TABLE a ADD CONSTRAINT a_b_id1_b_id2_fkey FOREIGN KEY (b_id1,b_id2) REFERENCES b(id1,id2) MATCH FULL;
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id2 = id2);
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- should not remove semi join to b (extra condition)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id2 = id2 AND a.b_id2 >= id2);
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Hash Semi Join
+   Hash Cond: ((a.b_id1 = b.id1) AND (a.b_id2 = b.id2))
+   Join Filter: (a.b_id2 >= b.id2)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+(6 rows)
+
+-- should not remove semi join to b (wrong operator)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 > id1 AND a.b_id2 < id2);
+                        QUERY PLAN                         
+-----------------------------------------------------------
+ Nested Loop Semi Join
+   ->  Seq Scan on a
+   ->  Index Only Scan using b_pkey on b
+         Index Cond: ((id1 < a.b_id1) AND (id2 > a.b_id2))
+(4 rows)
+
+-- should not remove semi join (only checking id1)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1);
+           QUERY PLAN            
+---------------------------------
+ Hash Join
+   Hash Cond: (a.b_id1 = b.id1)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: b.id1
+               ->  Seq Scan on b
+(7 rows)
+
+-- should not remove semi join (only checking id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id2);
+           QUERY PLAN            
+---------------------------------
+ Hash Join
+   Hash Cond: (a.b_id2 = b.id2)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: b.id2
+               ->  Seq Scan on b
+(7 rows)
+
+-- should not remove semi join (checking wrong columns)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id1 AND a.b_id1 = id2);
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Hash Join
+   Hash Cond: ((a.b_id2 = b.id1) AND (a.b_id1 = b.id2))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+(5 rows)
+
+-- should not remove semi join (no check for id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id1 AND a.b_id1 = id1);
+               QUERY PLAN                
+-----------------------------------------
+ Nested Loop Semi Join
+   ->  Seq Scan on a
+         Filter: (b_id2 = b_id1)
+   ->  Index Only Scan using b_pkey on b
+         Index Cond: (id1 = a.b_id2)
+(5 rows)
+
+-- should not remove semi join (no check for b_id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id1 = id2);
+            QUERY PLAN             
+-----------------------------------
+ Hash Join
+   Hash Cond: (a.b_id1 = b.id1)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: (id1 = id2)
+(6 rows)
+
+ROLLBACK;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
 insert into parent values (1, 10), (2, 20), (3, 30);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 275cb11..984a24f 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -861,9 +861,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
@@ -878,8 +880,142 @@ select id from a where id in (
 	select b.id from b left join c on b.id = c.id
 );
 
+-- check optimization of outer join when joining a unique sub query using group by
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+-- and a constant expression.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- and contains a redundant join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+
+-- optimization is not possible when the group by contains a column which is not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+
+-- optimization is not possible when distinct clause contains an item that is not in the join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+
+-- optimization is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+
+-- optimization is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+
 rollback;
 
+BEGIN;
+
+-- Test join removals for semi joins
+CREATE TEMP TABLE b (id INT NOT NULL PRIMARY KEY);
+CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT NOT NULL REFERENCES b(id));
+
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id IN(SELECT id FROM b);
+
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id = id);
+
+-- should remove semi join to b (swapped condition order)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE id = a.b_id);
+
+-- should not remove semi join (since not using equals)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE id >= a.b_id);
+
+-- should not remove semi join
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id+0 IN(SELECT id FROM b);
+
+-- should not remove semi join
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id IN(SELECT id+0 FROM b);
+
+-- should not remove semi join (wrong column)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE id IN(SELECT id FROM b);
+
+ROLLBACK;
+
+BEGIN;
+
+-- Semi join removal code with 2 column foreign keys
+
+CREATE TEMP TABLE b (id1 INT NOT NULL, id2 INT NOT NULL, PRIMARY KEY(id1,id2));
+CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id1 INT NOT NULL, b_id2 INT NOT NULL);
+
+ALTER TABLE a ADD CONSTRAINT a_b_id1_b_id2_fkey FOREIGN KEY (b_id1,b_id2) REFERENCES b(id1,id2) MATCH FULL;
+
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id2 = id2);
+
+-- should not remove semi join to b (extra condition)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id2 = id2 AND a.b_id2 >= id2);
+
+-- should not remove semi join to b (wrong operator)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 > id1 AND a.b_id2 < id2);
+
+-- should not remove semi join (only checking id1)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1);
+
+-- should not remove semi join (only checking id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id2);
+
+-- should not remove semi join (checking wrong columns)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id1 AND a.b_id1 = id2);
+
+-- should not remove semi join (no check for id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id1 AND a.b_id1 = id1);
+
+-- should not remove semi join (no check for b_id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id1 = id2);
+
+ROLLBACK;
+
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
 insert into parent values (1, 10), (2, 20), (3, 30);

#19

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Tom Lane (#17)

Re: Allowing join removals for more join types

On Sun, May 25, 2014 at 5:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

I agree that there are not many cases left to remove the join that remain
after is_simple_subquery() has decided not to pullup the subquery. Some of
the perhaps more common cases would be having windowing functions in the
subquery as this is what you need to do if you want to include the results
of a windowing function from within the where clause. Another case, though
I can't imagine it would be common, is ORDER BY in the subquery... But for
that one I can't quite understand why is_simple_subquery() stops that being
flattened in the first place.

The problem there is that (in general) pushing qual conditions to below a
window function will change the window function's results. If we flatten
such a subquery then the outer query's quals can get evaluated before
the window function, so that's no good. Another issue is that flattening
might cause the window function call to get copied to places in the outer
query where it can't legally go, such as the WHERE clause.
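
To illustrate the point about qual pushdown, here is a minimal sketch in plain C, not planner code. It models LAG(id) OVER (ORDER BY id) on a small sorted array and applies the qual "id >= min_id" either before or after the window function. The LagRow type and the lag_with_qual function are hypothetical names made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical output row: id plus LAG(id), which is NULL for the first row. */
typedef struct
{
    int  id;
    bool prev_is_null;
    int  prev_id;
} LagRow;

/*
 * Evaluate LAG(id) OVER (ORDER BY id) on ids[] (assumed sorted ascending).
 * If qual_first is true, the qual "id >= min_id" is applied before the
 * window function sees the rows; otherwise it is applied to its output.
 * Returns the number of rows written to out[].
 */
size_t
lag_with_qual(const int *ids, size_t n, int min_id, bool qual_first,
              LagRow *out)
{
    size_t nout = 0;
    bool   have_prev = false;
    int    prev = 0;

    for (size_t i = 0; i < n; i++)
    {
        bool passes = ids[i] >= min_id;

        if (qual_first && !passes)
            continue;           /* row never reaches the window function */

        if (passes)
        {
            out[nout].id = ids[i];
            out[nout].prev_is_null = !have_prev;
            out[nout].prev_id = have_prev ? prev : 0;
            nout++;
        }

        /* this row becomes the "previous" row for the next one */
        prev = ids[i];
        have_prev = true;
    }
    return nout;
}
```

With ids 1, 2, 3 and min_id 2, applying the qual first gives a NULL LAG value for id 2, while applying it afterwards gives 1. The two orders return the same rows but different window function results, which is why such a subquery cannot simply be flattened.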

I should have explained more clearly. I meant that a query such as
this:

SELECT a.* FROM a LEFT OUTER JOIN (SELECT id,LAG(id) OVER (ORDER BY id) AS
prev_id FROM b) b ON a.id=b.id;

assuming that id is the primary key, could have the join removed.
I was just commenting on this as it's probably a fairly common thing to
have a subquery with windowing functions in order to perform some sort of
filtering of window function columns in the outer query.
Other use cases, for example:

SELECT a.* FROM a LEFT OUTER JOIN (SELECT id FROM b LIMIT 10) b ON a.id=b.id;

are likely less common.

Regards

David Rowley

#20

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: David Rowley (#18)

1 attachment(s)

Re: Allowing join removals for more join types

On Wed, May 28, 2014 at 8:39 PM, David Rowley <dgrowleyml@gmail.com> wrote:

I've been working on adding join removal for join types other than left
outer joins.

The attached patch allows join removals for both sub queries with left
joins and also semi joins where a foreign key can prove the existence of
the record.

My longer term plan is to include inner joins too, but now that I have
something to show for semi joins, I thought this would be a good time to
post the patch just in case anyone can see any show stoppers with using
foreign keys this way.

So with the attached you can do:

CREATE TABLE b (id INT NOT NULL PRIMARY KEY);
CREATE TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT NOT NULL REFERENCES
b(id));

EXPLAIN (COSTS OFF)
SELECT id FROM a WHERE b_id IN(SELECT id FROM b);
QUERY PLAN
---------------
Seq Scan on a
(1 row)

I think anti joins could use the same infrastructure but I'm not quite
sure yet how to go about replacing the join with something like WHERE false.

I do think semi and anti joins are a far less useful case for join
removals than inner joins are, but if we're already loading the foreign key
constraints at plan time, then it seems like something that might be
worthwhile checking.

Oh, quite likely the code that loads the foreign key constraints needs
more work and should probably be included in the rel cache, but I don't want
to go and do that until I get some feedback on the work so far.

Any comments are welcome.

The attached patch fixes a problem with SEMI join removal where I had
failed to add a WHERE col IS NOT NULL check after a successful join
removal. This filter is required to keep the query equivalent, as the semi
join would have filtered out any rows with a NULL in the join condition
columns on the left side of the join.

In the attached I've also added support for ANTI joins. When the join can
be removed, it is replaced with a WHERE col IS NULL on the relation on the
left side of the join. This is required as the only rows that could have
passed the anti join are those with NULL values in the columns that are
part of the foreign key.
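
As a rough illustration of why the NULL tests keep the results equivalent, here is a small standalone C sketch. It is not code from the patch. It models a single-column foreign key a.b_id referencing b.id, and the NullableInt type and the function names are hypothetical, invented for this example. It assumes the input data already satisfies the foreign key, so every non-NULL b_id has a match in b.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical nullable integer column value. */
typedef struct
{
    bool is_null;
    int  value;
} NullableInt;

/* Does any b.id equal the key?  A NULL key never matches anything. */
static bool
has_match(NullableInt key, const int *b_ids, size_t nb)
{
    if (key.is_null)
        return false;
    for (size_t j = 0; j < nb; j++)
    {
        if (b_ids[j] == key.value)
            return true;
    }
    return false;
}

/*
 * Count the rows of a kept by actually performing the join:
 * a semi join (anti = false) keeps rows that have a match in b,
 * an anti join (anti = true) keeps rows that have no match.
 */
size_t
count_by_join(const NullableInt *a_bid, size_t na,
              const int *b_ids, size_t nb, bool anti)
{
    size_t kept = 0;

    for (size_t i = 0; i < na; i++)
    {
        if (has_match(a_bid[i], b_ids, nb) != anti)
            kept++;
    }
    return kept;
}

/*
 * Count the rows of a kept by the replacement filter alone:
 * b_id IS NOT NULL for a semi join, b_id IS NULL for an anti join.
 * Only valid under the assumption that the foreign key holds.
 */
size_t
count_by_nulltest(const NullableInt *a_bid, size_t na, bool anti)
{
    size_t kept = 0;

    for (size_t i = 0; i < na; i++)
    {
        if (a_bid[i].is_null == anti)
            kept++;
    }
    return kept;
}
```

For b_id values (1, NULL, 2) and b ids (1, 2, 3), both functions keep two rows for the semi join and one row for the anti join. The sketch only covers the single-column case; multi-column keys need more care about which columns may be NULL.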

I'm not quite there with inner joins yet. I'm still getting my head around
just where the join quals are actually stored.

This area of the code is quite new to me, so I'm not quite sure I'm going
about things in the correct way.
To make my intentions clear with this patch I've marked the file name with
WIP.

Comments are welcome.

Regards

David Rowley

Attachments:

join_removal_382e741_2014-06-02_WIP.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 129fc3d..a62122d 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -27,14 +27,33 @@
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
 #include "optimizer/var.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "optimizer/restrictinfo.h"
+#include "optimizer/tlist.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "nodes/pg_list.h"
+#include "utils/lsyscache.h"
+
 
 /* local functions */
-static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool leftjoin_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool semiorantijoin_is_removable(PlannerInfo *root,
+					  SpecialJoinInfo *sjinfo, List **leftrelcolumns,
+					  RelOptInfo **leftrel);
+static bool sortclause_is_unique_on_restrictinfo(Query *query,
+					  List *clause_list, List *sortclause);
+static bool relation_has_foreign_key_for(PlannerInfo *root, JoinType jointype,
+					  RelOptInfo *rel, RelOptInfo *referencedrel,
+					  List *referencing_exprs, List *index_exprs,
+					  List *operator_list);
+static bool expressions_match_foreign_key(ForeignKeyInfo *fk, IndexOptInfo *ind,
+					  List *exprlist, List *index_exprs, List *operator_list);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
 
-
 /*
  * remove_useless_joins
  *		Check for relations that don't actually need to be joined at all,
@@ -48,10 +67,6 @@ remove_useless_joins(PlannerInfo *root, List *joinlist)
 {
 	ListCell   *lc;
 
-	/*
-	 * We are only interested in relations that are left-joined to, so we can
-	 * scan the join_info_list to find them easily.
-	 */
 restart:
 	foreach(lc, root->join_info_list)
 	{
@@ -59,10 +74,54 @@ restart:
 		int			innerrelid;
 		int			nremoved;
 
-		/* Skip if not removable */
-		if (!join_is_removable(root, sjinfo))
-			continue;
+		if (sjinfo->jointype == JOIN_LEFT)
+		{
+			/* Skip if not removable */
+			if (!leftjoin_is_removable(root, sjinfo))
+				continue;
+		}
+		else if (sjinfo->jointype == JOIN_SEMI || sjinfo->jointype == JOIN_ANTI)
+		{
+			List		 *columnlist;
+			ListCell	 *lc2;
+			RelOptInfo	 *rel;
+			NullTestType nulltestype;
+
+			/* Skip if not removable */
+			if (!semiorantijoin_is_removable(root, sjinfo, &columnlist, &rel))
+				continue;
+
+			Assert(columnlist != NIL);
+
+			/*
+			 * If a semi join is removable then we still must ensure that
+			 * we don't show any records from the left hand relation that have
+			 * a NULL value in any of the columns which were in the join
+			 * clause. For anti joins the only possible records that could have
+			 * matched are ones which have NULL values.
+			 */
 
+			if (sjinfo->jointype == JOIN_SEMI)
+				nulltestype = IS_NOT_NULL;
+			else
+				nulltestype = IS_NULL;
+
+			foreach(lc2, columnlist)
+			{
+				RestrictInfo *rinfo;
+				Node *node = (Node *) lfirst(lc2);
+				NullTest *ntest = makeNode(NullTest);
+				ntest->nulltesttype = nulltestype;
+				ntest->arg = (Expr *) node;
+				ntest->argisrow = false;
+
+				rinfo = make_restrictinfo((Expr *)ntest, true, false, false,
+							NULL, NULL, NULL);
+				rel->baserestrictinfo = lappend(rel->baserestrictinfo, rinfo);
+			}
+		}
+		else
+			continue; /* we don't support this join type */
 		/*
 		 * Currently, join_is_removable can only succeed when the sjinfo's
 		 * righthand is a single baserel.  Remove that rel from the query and
@@ -132,47 +191,75 @@ clause_sides_match_join(RestrictInfo *rinfo, Relids outerrelids,
 }
 
 /*
- * join_is_removable
- *	  Check whether we need not perform this special join at all, because
+ * leftjoin_is_removable
+ *	  Check whether we need not perform this left join at all, because
  *	  it will just duplicate its left input.
  *
  * This is true for a left join for which the join condition cannot match
- * more than one inner-side row.  (There are other possibly interesting
- * cases, but we don't have the infrastructure to prove them.)  We also
- * have to check that the inner side doesn't generate any variables needed
- * above the join.
+ * more than one inner-side row. To prove the join will be unique on the
+ * join condition we must analyze the unique indexes on the right side of
+ * the join to ensure that no more than 1 record can exist for the join
+ * condition.
+ *
+ * We can also remove sub queries if we can prove the query will not produce
+ * more than 1 record for the join condition, to do this we currently look at
+ * the GROUP BY and DISTINCT clause of the query.
  */
 static bool
-join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
+leftjoin_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 {
 	int			innerrelid;
 	RelOptInfo *innerrel;
+	Query	   *subquery;
 	Relids		joinrelids;
 	List	   *clause_list = NIL;
 	ListCell   *l;
 	int			attroff;
+	List	   *fklist = NIL;
+
+	Assert(sjinfo->jointype == JOIN_LEFT);
 
 	/*
 	 * Currently, we only know how to remove left joins to a baserel with
-	 * unique indexes.  We can check most of these criteria pretty trivially
-	 * to avoid doing useless extra work.  But checking whether any of the
-	 * indexes are unique would require iterating over the indexlist, so for
-	 * now we just make sure there are indexes of some sort or other.  If none
-	 * of them are unique, join removal will still fail, just slightly later.
+	 * unique indexes and left joins to a subquery where the subquery is
+	 * unique on the join condition. We can check most of these criteria
+	 * pretty trivially to avoid doing useless extra work.  But checking
+	 * whether any of the indexes are unique would require iterating over
+	 * the indexlist, so for now, if we're joining to a relation, we'll just
+	 * ensure that we have at least 1 index, it won't matter if that index
+	 * is unique at this stage, we'll check those details later.
 	 */
-	if (sjinfo->jointype != JOIN_LEFT ||
-		sjinfo->delay_upper_joins ||
+	if (sjinfo->delay_upper_joins ||
 		bms_membership(sjinfo->min_righthand) != BMS_SINGLETON)
 		return false;
 
 	innerrelid = bms_singleton_member(sjinfo->min_righthand);
 	innerrel = find_base_rel(root, innerrelid);
 
-	if (innerrel->reloptkind != RELOPT_BASEREL ||
-		innerrel->rtekind != RTE_RELATION ||
-		innerrel->indexlist == NIL)
+	if (innerrel->reloptkind != RELOPT_BASEREL)
 		return false;
 
+	if (innerrel->rtekind == RTE_RELATION)
+	{
+		if (innerrel->indexlist == NIL)
+			return false; /* no possibility of a unique index */
+	}
+	else if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		subquery = root->simple_rte_array[innerrelid]->subquery;
+
+		/*
+		 * The only means we currently use to check if the subquery is unique
+		 * are the GROUP BY and DISTINCT clause. If both of these are empty
+		 * then there's no point in going any further.
+		 */
+		if (subquery->groupClause == NIL &&
+			subquery->distinctClause == NIL)
+			return false;
+	}
+	else
+		return false; /* unsupported rtekind */
+
 	/* Compute the relid set for the join we are considering */
 	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
 
@@ -275,17 +362,468 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 * clauses for the innerrel, so we needn't do that here.
 	 */
 
-	/* Now examine the indexes to see if we have a matching unique index */
-	if (relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
-		return true;
+	/*
+	 * Now examine the indexes to see if we have a matching unique index.*/
+	if (innerrel->rtekind == RTE_RELATION)
+		return relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL);
 
 	/*
+	 * We can be certain that the sub query contains no duplicate values for
+	 * the join clause if every item in the sub query's GROUP BY clause is also used
+	 * in the join clause using equality. This works the same way for the
+	 * DISTINCT clause. We need not pay any attention to WHERE or HAVING
+	 * clauses as these just restrict the results more and could not be the
+	 * cause of duplication in the result set. However there are a number of
+	 * pre-checks we must perform which could cause duplicate values even if
+	 * all the required columns are in the GROUP BY or DISTINCT clause.
+	 *
+	 * NB: We must also not remove the join if the subquery contains a
+	 * FOR UPDATE clause, but we can actually skip this check as GROUP BY and
+	 * DISTINCT cannot be used at the same time as FOR UPDATE.
+	 */
+	if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		Assert(subquery == root->simple_rte_array[innerrelid]->subquery);
+
+		/*
+		 * We cannot remove the subquery if the target list contains any set
+		 * returning functions as these may cause the query not to be unique
+		 * on the grouping columns, as per the following example:
+		 * "SELECT a.a,generate_series(1,2) FROM (VALUES(1)) a(a) GROUP BY a"
+		 */
+		if (expression_returns_set((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * Don't remove the join if the target list contains any volatile
+		 * functions. Doing so may remove desired side effects that calls
+		 * to the function may cause.
+		 */
+		if (contain_volatile_functions((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * It should be safe to remove the join if all the GROUP BY expressions
+		 * have matching items in the join condition.
+		 */
+		if (subquery->groupClause != NIL &&
+			sortclause_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
+			return true;
+
+		/*
+		 * It should be safe to remove the join if all the DISTINCT columns
+		 * have matching items in the join condition.
+		 */
+		if (subquery->distinctClause != NIL &&
+			sortclause_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
+			return true;
+	}
+	/* XXX is this comment still needed??
 	 * Some day it would be nice to check for other methods of establishing
 	 * distinctness.
 	 */
 	return false;
 }
 
+/*
+ * semiorantijoin_is_removable
+ *	  Check if we can remove this semi join or anti join
+ *
+ * To prove that a semi or anti join is redundant we must ensure that a foreign
+ * key exists on the left side of the join which references the table on the
+ * right side of the join. This means that we can only support a single table
+ * on either side of the join. We must also ensure that the join condition
+ * matches all the foreign key columns to each index column on the referenced
+ * table. If any columns are missing then we cannot be sure we'll get exactly
+ * 1 record back, and if there are any extra conditions that are not in the
+ * foreign key then we cannot be sure that the join condition will produce at
+ * least 1 matching row.
+ *
+ * If we manage to find a foreign key which will allow the join to be removed
+ * then the calling code must add NULL checking to the query in place of the
+ * join. For example if we determine that the join to the table b is not needed
+ * due to the existence of a foreign key on a.b_id referencing b.id in the
+ * following query:
+ *
+ * SELECT * FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id = b.id);
+ *
+ * Then the only possible records that could be returned from a are the ones
+ * where b_id is not NULL.
+ */
+static bool
+semiorantijoin_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo,
+		List **leftrelcolumns, RelOptInfo **leftrel)
+{
+	int			innerrelid;
+	int			outerrelid;
+	RelOptInfo *innerrel;
+	RelOptInfo *outerrel;
+	ListCell   *lc;
+	List	   *referencing_exprs;
+	List	   *index_exprs;
+	List	   *operator_list;
+
+	Assert(sjinfo->jointype == JOIN_SEMI || sjinfo->jointype == JOIN_ANTI);
+
+	if (sjinfo->delay_upper_joins ||
+		bms_membership(sjinfo->min_righthand) != BMS_SINGLETON)
+		return false;
+
+	innerrelid = bms_singleton_member(sjinfo->min_righthand);
+	innerrel = find_base_rel(root, innerrelid);
+
+	if (innerrel->reloptkind != RELOPT_BASEREL ||
+		innerrel->rtekind != RTE_RELATION ||
+		innerrel->indexlist == NIL ||
+		bms_membership(sjinfo->min_lefthand) != BMS_SINGLETON)
+		return false;
+
+	/*
+	 * To allow the removal of a SEMI JOIN we must analyze the foreign
+	 * keys of the relation on the left side of the join, for this to work
+	 * we'll need to ensure that there is only 1 relation on the left side
+	 * of the joins, otherwise there's no possibility of foreign keys.
+	 * If the relation on the left side has no foreign keys then there's
+	 * no possibility that the join can be removed.
+	 */
+
+	outerrelid = bms_singleton_member(sjinfo->min_lefthand);
+	outerrel = find_base_rel(root, outerrelid);
+	*leftrel = outerrel;
+
+	/* No possibility to remove the join if there are no foreign keys */
+	if (outerrel->fklist == NIL)
+		return false;
+
+	referencing_exprs = NIL;
+	index_exprs = NIL;
+	operator_list = NIL;
+
+	foreach(lc, sjinfo->join_quals)
+	{
+		OpExpr	   *op = (OpExpr *) lfirst(lc);
+		Oid			opno;
+		Node	   *left_expr;
+		Node	   *right_expr;
+		Relids		left_varnos;
+		Relids		right_varnos;
+		Relids		all_varnos;
+		Oid			opinputtype;
+
+		/* Is it a binary opclause? */
+		if (!IsA(op, OpExpr) ||
+			list_length(op->args) != 2)
+		{
+			/* No, but does it reference both sides? */
+			all_varnos = pull_varnos((Node *) op);
+			if (!bms_overlap(all_varnos, sjinfo->syn_righthand) ||
+				bms_is_subset(all_varnos, sjinfo->syn_righthand))
+			{
+				/*
+				 * Clause refers to only one rel, so ignore it --- unless it
+				 * contains volatile functions, in which case we'd better
+				 * punt.
+				 */
+				if (contain_volatile_functions((Node *) op))
+					return false;
+				continue;
+			}
+			/* Non-operator clause referencing both sides, must punt */
+			return false;
+		}
+
+		/* Extract data from binary opclause */
+		opno = op->opno;
+		left_expr = linitial(op->args);
+		right_expr = lsecond(op->args);
+		left_varnos = pull_varnos(left_expr);
+		right_varnos = pull_varnos(right_expr);
+		all_varnos = bms_union(left_varnos, right_varnos);
+		opinputtype = exprType(left_expr);
+
+		/* Does it reference both sides? */
+		if (!bms_overlap(all_varnos, sjinfo->syn_righthand) ||
+			bms_is_subset(all_varnos, sjinfo->syn_righthand))
+		{
+			/*
+			 * Clause refers to only one rel, so ignore it --- unless it
+			 * contains volatile functions, in which case we'd better punt.
+			 */
+			if (contain_volatile_functions((Node *) op))
+				return false;
+			continue;
+		}
+
+		/* check rel membership of arguments */
+		if (!bms_is_empty(right_varnos) &&
+			bms_is_subset(right_varnos, sjinfo->syn_righthand) &&
+			!bms_overlap(left_varnos, sjinfo->syn_righthand))
+		{
+			/* typical case, right_expr is RHS variable */
+		}
+		else if (!bms_is_empty(left_varnos) &&
+				 bms_is_subset(left_varnos, sjinfo->syn_righthand) &&
+				 !bms_overlap(right_varnos, sjinfo->syn_righthand))
+		{
+			Node *tmp;
+			/* flipped case, left_expr is RHS variable */
+			opno = get_commutator(opno);
+			if (!OidIsValid(opno))
+				return false;
+
+			/* swap the operands */
+			tmp = left_expr;
+			left_expr = right_expr;
+			right_expr = tmp;
+		}
+		else
+			return false;
+
+		/* so far so good, keep building lists */
+		referencing_exprs = lappend(referencing_exprs, copyObject(left_expr));
+		operator_list = lappend_oid(operator_list, opno);
+		index_exprs = lappend(index_exprs, copyObject(right_expr));
+	}
+
+	if (referencing_exprs == NIL)
+		return false;
+
+	/* The expressions mustn't be volatile. */
+	if (contain_volatile_functions((Node *) referencing_exprs))
+		return false;
+
+	if (contain_volatile_functions((Node *) index_exprs))
+		return false;
+
+	if (relation_has_foreign_key_for(root, sjinfo->jointype, outerrel,
+			innerrel, referencing_exprs, index_exprs, operator_list))
+	{
+		*leftrelcolumns = referencing_exprs;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * relation_has_foreign_key_for
+ *	  Checks if rel has a foreign key which references referencedrel with the
+ *	  given list of expressions.
+ *
+ *	For the match to succeed:
+ *	  referencing_exprs must match the columns defined in the foreign key
+ *	  index_exprs must match the columns defined in the index for the foreign key.
+ */
+static bool
+relation_has_foreign_key_for(PlannerInfo *root, JoinType jointype,
+			RelOptInfo *rel, RelOptInfo *referencedrel,
+			List *referencing_exprs, List *index_exprs, List *operator_list)
+{
+	ListCell *lc;
+
+	Assert(list_length(referencing_exprs) == list_length(index_exprs));
+	Assert(list_length(referencing_exprs) == list_length(operator_list));
+
+	/*
+	 * Short-circuit if no foreign keys exist on the relation or
+	 * there are no indexes on the referenced relation. Remember that
+	 * it is possible for the fklist to not be empty and the indexlist
+	 * to be empty as the foreign keys may reference some entirely
+	 * different relation.
+	 */
+	if (rel->fklist == NIL || referencedrel->indexlist == NIL)
+		return false;
+
+	/*
+	 * Here we must look at each foreign key which is defined and see if we
+	 * can find that foreign key's index in the index list of the referenced
+	 * table. When we find a match we do some quick pre-checks on the index
+	 * then we try to see if all of the expressions can be matched to that
+	 * foreign key and index. If we don't match then we'll keep trying to
+	 * find another matching foreign key and index list.
+	 */
+	foreach(lc, rel->fklist)
+	{
+		ForeignKeyInfo *fk = (ForeignKeyInfo *) lfirst(lc);
+		ListCell *ic;
+
+		/*
+		 * For ANTI Joins, when a foreign key has more than 1 referencing
+		 * column we currently only allow the join to be removed if the
+		 * foreign key has been defined with MATCH FULL. The reason for this is
+		 * that if we do manage to remove this join then we'll need to add some
+		 * quals to the joining rel to ensure the query remains equivalent. For
+		 * the case of ANTI joins, we need to ensure that we'd only show
+		 * rows that have a NULL in any of the columns defined in the
+		 * foreign key. With ANTI joins this seems a bit sloppy as we'd need to
+		 * do something like, WHERE col1 IS NULL OR col2 IS NULL.
+		 *
+		 * We allow this case for SEMI joins as we're building a WHERE clause
+		 * such as WHERE col1 IS NOT NULL AND col2 IS NOT NULL.
+		 */
+		if (fk->conncols > 1 && jointype == JOIN_ANTI &&
+			fk->confmatchtype != FKCONSTR_MATCH_FULL)
+			continue;
+
+		foreach(ic, referencedrel->indexlist)
+		{
+			IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic);
+			if (fk->conindid == ind->indexoid)
+			{
+				/* Sanity check? XXX Should we complain or just skip this one? */
+				if (fk->conncols != ind->ncolumns)
+					elog(ERROR, "Number of columns in foreign key does not match number of indexed columns");
+
+				/* Index not ready? XXX Perhaps this should be an error as we
+				 * should only have fks that have been validated.
+				 */
+				if (!ind->unique || !ind->immediate ||
+					(ind->indpred != NIL && !ind->predOK))
+					continue;
+
+				if (expressions_match_foreign_key(fk, ind, referencing_exprs, index_exprs, operator_list))
+					return true;
+			}
+		}
+	}
+
+	return false;
+}
+
+static bool
+expressions_match_foreign_key(ForeignKeyInfo *fk, IndexOptInfo *ind,
+			List *exprlist, List *index_exprs, List *operator_list)
+{
+	ListCell *lc;
+	ListCell *lc2;
+	ListCell *lc3;
+	int		 col;
+
+	Assert(list_length(exprlist) == list_length(index_exprs));
+	Assert(list_length(exprlist) == list_length(operator_list));
+
+	/*
+	 * For each column defined in the foreign key we must ensure that we find
+	 * a qual in the expression list which matches the foreign key on one side
+	 * of the expression and the index on the other side of the expression. It
+	 * does not matter if the same expression appears many times, we just need
+	 * to ensure each exists at least once and no extra non-matching expressions
+	 * exist.
+	 */
+
+	/*
+	 * Fast path out if there aren't enough conditions to match
+	 * each column in the foreign key. Note that we cannot check
+	 * that the number of expressions is equal here since it would
+	 * cause duplicate expressions to not match.
+	 */
+	if (list_length(exprlist) < fk->conncols)
+		return false;
+
+	forthree(lc, exprlist, lc2, index_exprs, lc3, operator_list)
+	{
+		Node	*expr = (Node *) lfirst(lc);
+		Node	*idxexpr = (Node *) lfirst(lc2);
+		Oid		opr = lfirst_oid(lc3);
+		bool matched = false;
+
+		/* if anything is NULL or not a Var then it's not a match */
+		if (!expr || !IsA(expr, Var) || !idxexpr || !IsA(idxexpr, Var))
+			return false;
+
+		for (col = 0; col < fk->conncols; col++)
+		{
+			if (fk->conkey[col] == ((Var *) expr)->varattno &&
+				fk->confkey[col] == ((Var *) idxexpr)->varattno &&
+				opr == fk->conpfeqop[col])
+			{
+				matched = true;
+				break;
+			}
+		}
+
+		/*
+		 * Did we find anything matching the fk col? If not then we'll
+		 * return a no match.
+		 */
+		if (!matched)
+			return false;
+	}
+
+	return true;
+}
+
+
+/*
+ * sortclause_is_unique_on_restrictinfo
+ *
+ * Checks to see if all items in sortclause also exist in clause_list.
+ * The function will return true if clause_list is the same as or a superset
+ * of the sortclause. If the sortclause has columns that don't exist in the
+ * clause_list then the query can't be guaranteed unique on the clause_list
+ * columns.
+ *
+ * Note: The calling function must ensure that sortclause is not NIL.
+ */
+static bool
+sortclause_is_unique_on_restrictinfo(Query *query, List *clause_list, List *sortclause)
+{
+	ListCell *l;
+
+	Assert(sortclause != NIL);
+
+	/*
+	 * Loop over each sort clause to ensure that we have
+	 * an item in the join conditions that matches it.
+	 * It does not matter if we have more items in the join
+	 * condition than we have in the sort clause.
+	 */
+	foreach(l, sortclause)
+	{
+		ListCell		*ri;
+		SortGroupClause *scl = (SortGroupClause *) lfirst(l);
+		TargetEntry		*sortTarget;
+		bool			 matched = false;
+
+		/* look up the target list entry for the current sort group ref */
+		sortTarget = get_sortgroupref_tle(scl->tleSortGroupRef, query->targetList);
+
+		/*
+		 * Since a constant only has 1 value the existence of one here will
+		 * not cause any duplication of the results. We'll simply ignore it!
+		 */
+		if (IsA(sortTarget->expr, Const))
+			continue;
+
+		foreach(ri, clause_list)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(ri);
+			Node	   *rexpr;
+
+			if (rinfo->outer_is_left)
+				rexpr = get_rightop(rinfo->clause);
+			else
+				rexpr = get_leftop(rinfo->clause);
+
+			if (IsA(rexpr, Var))
+			{
+				Var *var = (Var *)rexpr;
+
+				if (var->varattno == sortTarget->resno)
+				{
+					matched = true;
+					break; /* match found */
+				}
+			}
+			else /* XXX what else could it be? */
+				return false;
+		}
+
+		if (!matched)
+			return false;
+	}
+	return true;
+}
 
 /*
  * Remove the target relid from the planner's data structures, having
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index b2becfa..ac7b38b 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -26,6 +26,8 @@
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "catalog/heap.h"
+#include "catalog/pg_constraint.h"
+#include "catalog/pg_type.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -38,6 +40,7 @@
 #include "parser/parsetree.h"
 #include "rewrite/rewriteManip.h"
 #include "storage/bufmgr.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
@@ -384,6 +387,123 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 
 	heap_close(relation, NoLock);
 
+
+	{
+		List	   *result = NIL;
+		Relation	fkeyRel;
+		Relation	fkeyRelIdx;
+		ScanKeyData fkeyScankey;
+		SysScanDesc fkeyScan;
+		HeapTuple	tuple;
+
+		/*
+		 * Must scan pg_constraint. We use the index on conrelid
+		 * (ConstraintRelidIndexId) to find this relation's constraints.
+		 */
+		ScanKeyInit(&fkeyScankey,
+			Anum_pg_constraint_conrelid,
+			BTEqualStrategyNumber, F_OIDEQ,
+			ObjectIdGetDatum(relationObjectId));
+
+		fkeyRel = heap_open(ConstraintRelationId, AccessShareLock);
+		fkeyRelIdx = index_open(ConstraintRelidIndexId, AccessShareLock);
+		fkeyScan = systable_beginscan_ordered(fkeyRel, fkeyRelIdx, NULL, 1, &fkeyScankey);
+
+		while ((tuple = systable_getnext_ordered(fkeyScan, ForwardScanDirection)) != NULL)
+		{
+			Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
+			ForeignKeyInfo *fkinfo;
+			Datum		adatum;
+			bool		isNull;
+			ArrayType  *arr;
+			int			numkeys;
+
+			/* Not a foreign key */
+			if (con->contype != CONSTRAINT_FOREIGN)
+				continue;
+
+			/* we're not interested unless the fk has been validated */
+			if (!con->convalidated)
+				continue;
+
+			fkinfo = (ForeignKeyInfo *) palloc(sizeof(ForeignKeyInfo));
+			fkinfo->conindid = con->conindid;
+			fkinfo->confrelid = con->confrelid;
+			fkinfo->convalidated = con->convalidated;
+			fkinfo->conrelid = con->conrelid;
+			fkinfo->confupdtype = con->confupdtype;
+			fkinfo->confdeltype = con->confdeltype;
+			fkinfo->confmatchtype = con->confmatchtype;
+
+			adatum = heap_getattr(tuple, Anum_pg_constraint_conkey,
+				RelationGetDescr(fkeyRel), &isNull);
+			if (isNull)
+				elog(ERROR, "null conkey for constraint %u",
+				HeapTupleGetOid(tuple));
+			arr = DatumGetArrayTypeP(adatum);		/* ensure not toasted */
+			numkeys = ARR_DIMS(arr)[0];
+			if (ARR_NDIM(arr) != 1 ||
+				numkeys < 0 ||
+				ARR_HASNULL(arr) ||
+				ARR_ELEMTYPE(arr) != INT2OID)
+				elog(ERROR, "conkey is not a 1-D smallint array");
+
+			fkinfo->conkey = (int16 *) ARR_DATA_PTR(arr);
+
+			fkinfo->conncols = numkeys;
+
+			adatum = heap_getattr(tuple, Anum_pg_constraint_confkey,
+				RelationGetDescr(fkeyRel), &isNull);
+			if (isNull)
+				elog(ERROR, "null confkey for constraint %u",
+				HeapTupleGetOid(tuple));
+			arr = DatumGetArrayTypeP(adatum);		/* ensure not toasted */
+			numkeys = ARR_DIMS(arr)[0];
+
+			/* sanity check */
+			if (numkeys != fkinfo->conncols)
+				elog(ERROR, "number of confkey elements does not equal conkey elements");
+
+			if (ARR_NDIM(arr) != 1 ||
+				numkeys < 0 ||
+				ARR_HASNULL(arr) ||
+				ARR_ELEMTYPE(arr) != INT2OID)
+				elog(ERROR, "confkey is not a 1-D smallint array");
+
+			fkinfo->confkey = (int16 *) ARR_DATA_PTR(arr);
+
+			adatum = heap_getattr(tuple, Anum_pg_constraint_conpfeqop,
+				RelationGetDescr(fkeyRel), &isNull);
+			if (isNull)
+				elog(ERROR, "null conpfeqop for constraint %u",
+				HeapTupleGetOid(tuple));
+			arr = DatumGetArrayTypeP(adatum);		/* ensure not toasted */
+			numkeys = ARR_DIMS(arr)[0];
+
+			/* sanity check */
+			if (numkeys != fkinfo->conncols)
+				elog(ERROR, "number of conpfeqop elements does not equal conkey elements");
+
+			if (ARR_NDIM(arr) != 1 ||
+				numkeys < 0 ||
+				ARR_HASNULL(arr) ||
+				ARR_ELEMTYPE(arr) != OIDOID)
+				elog(ERROR, "conpfeqop is not a 1-D Oid array");
+
+			fkinfo->conpfeqop = (Oid *) ARR_DATA_PTR(arr);
+
+			result = lappend(result, fkinfo);
+		}
+
+		rel->fklist = result;
+
+		systable_endscan_ordered(fkeyScan);
+		index_close(fkeyRelIdx, AccessShareLock);
+		heap_close(fkeyRel, AccessShareLock);
+	}
+
+
+
 	/*
 	 * Allow a plugin to editorialize on the info we obtained from the
 	 * catalogs.  Actions might include altering the assumed relation size,
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index c938c27..a0fb8eb 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -115,6 +115,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptKind reloptkind)
 	rel->lateral_relids = NULL;
 	rel->lateral_referencers = NULL;
 	rel->indexlist = NIL;
+	rel->fklist = NIL;
 	rel->pages = 0;
 	rel->tuples = 0;
 	rel->allvisfrac = 0;
@@ -377,6 +378,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->lateral_relids = NULL;
 	joinrel->lateral_referencers = NULL;
 	joinrel->indexlist = NIL;
+	joinrel->fklist = NIL;
 	joinrel->pages = 0;
 	joinrel->tuples = 0;
 	joinrel->allvisfrac = 0;
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 300136e..3deb59b 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -445,6 +445,7 @@ typedef struct RelOptInfo
 	Relids		lateral_relids; /* minimum parameterization of rel */
 	Relids		lateral_referencers;	/* rels that reference me laterally */
 	List	   *indexlist;		/* list of IndexOptInfo */
+	List	   *fklist;			/* list of ForeignKeyInfo */
 	BlockNumber pages;			/* size estimates derived from pg_class */
 	double		tuples;
 	double		allvisfrac;
@@ -1643,4 +1644,20 @@ typedef struct JoinCostWorkspace
 	int			numbatches;
 } JoinCostWorkspace;
 
+typedef struct ForeignKeyInfo
+{
+	Oid			conindid;		/* index supporting this constraint */
+	Oid			confrelid;		/* relation referenced by foreign key */
+	bool		convalidated;	/* constraint has been validated? */
+	Oid			conrelid;		/* relation this constraint constrains */
+	char		confupdtype;	/* foreign key's ON UPDATE action */
+	char		confdeltype;	/* foreign key's ON DELETE action */
+	char		confmatchtype;	/* foreign key's match type */
+	int			conncols;		/* number of columns referenced */
+	int16	   *conkey;			/* columns of conrelid that the constraint applies to */
+	int16	   *confkey;		/* columns of confrelid that foreign key references */
+	Oid		   *conpfeqop;		/* Operator list for comparing PK to FK */
+} ForeignKeyInfo;
+
+
 #endif   /* RELATION_H */
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 934488a..951d13f 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3060,9 +3060,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
   QUERY PLAN   
@@ -3098,7 +3100,353 @@ select id from a where id in (
          ->  Seq Scan on b
 (5 rows)
 
+-- check optimization of outer join when joining a unique sub query using group by
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+                                        QUERY PLAN                                        
+------------------------------------------------------------------------------------------
+ Hash Left Join
+   Hash Cond: ((a.id)::double precision = ((((d.a + d.b))::double precision + random())))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: (((d.a + d.b))::double precision + random())
+               ->  Seq Scan on d
+(7 rows)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+-- and a constant expression.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- and contains a redundant join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- optimization is not possible when the group by contains a column which is not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when distinct clause contains an item that is not in the join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+                   QUERY PLAN                    
+-------------------------------------------------
+ Hash Left Join
+   Hash Cond: (a.id = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Subquery Scan on b
+               Filter: (b.r = random())
+               ->  HashAggregate
+                     Group Key: b_1.id, random()
+                     ->  Seq Scan on b b_1
+(9 rows)
+
+-- optimization is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
 rollback;
+BEGIN;
+-- Test join removals for semi and anti joins
+CREATE TEMP TABLE b (id INT NOT NULL PRIMARY KEY);
+CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT NOT NULL REFERENCES b(id));
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id IN(SELECT id FROM b);
+          QUERY PLAN          
+------------------------------
+ Seq Scan on a
+   Filter: (b_id IS NOT NULL)
+(2 rows)
+
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id = id);
+          QUERY PLAN          
+------------------------------
+ Seq Scan on a
+   Filter: (b_id IS NOT NULL)
+(2 rows)
+
+-- should remove anti join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE NOT EXISTS(SELECT 1 FROM b WHERE a.b_id = id);
+        QUERY PLAN        
+--------------------------
+ Seq Scan on a
+   Filter: (b_id IS NULL)
+(2 rows)
+
+-- should remove semi join to b (swapped condition order)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE id = a.b_id);
+          QUERY PLAN          
+------------------------------
+ Seq Scan on a
+   Filter: (b_id IS NOT NULL)
+(2 rows)
+
+-- should not remove semi join (since not using equals)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE id >= a.b_id);
+               QUERY PLAN                
+-----------------------------------------
+ Nested Loop Semi Join
+   ->  Seq Scan on a
+   ->  Index Only Scan using b_pkey on b
+         Index Cond: (id >= a.b_id)
+(4 rows)
+
+-- should not remove semi join
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id+0 IN(SELECT id FROM b);
+             QUERY PLAN             
+------------------------------------
+ Hash Semi Join
+   Hash Cond: ((a.b_id + 0) = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+(5 rows)
+
+-- should not remove semi join
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id IN(SELECT id+0 FROM b);
+             QUERY PLAN             
+------------------------------------
+ Hash Semi Join
+   Hash Cond: (a.b_id = (b.id + 0))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+(5 rows)
+
+-- should not remove semi join (wrong column)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE id IN(SELECT id FROM b);
+         QUERY PLAN         
+----------------------------
+ Hash Join
+   Hash Cond: (b.id = a.id)
+   ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(5 rows)
+
+ROLLBACK;
+BEGIN;
+-- Semi join removal code with 2 column foreign keys
+CREATE TEMP TABLE b (id1 INT NOT NULL, id2 INT NOT NULL, PRIMARY KEY(id1,id2));
+CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id1 INT NOT NULL, b_id2 INT NOT NULL);
+ALTER TABLE a ADD CONSTRAINT a_b_id1_b_id2_fkey FOREIGN KEY (b_id1,b_id2) REFERENCES b(id1,id2) MATCH FULL;
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id2 = id2);
+                       QUERY PLAN                        
+---------------------------------------------------------
+ Seq Scan on a
+   Filter: ((b_id1 IS NOT NULL) AND (b_id2 IS NOT NULL))
+(2 rows)
+
+-- should remove anti join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE NOT EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id2 = id2);
+                   QUERY PLAN                    
+-------------------------------------------------
+ Seq Scan on a
+   Filter: ((b_id1 IS NULL) AND (b_id2 IS NULL))
+(2 rows)
+
+-- should not remove semi join to b (extra condition)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id2 = id2 AND a.b_id2 >= id2);
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Hash Semi Join
+   Hash Cond: ((a.b_id1 = b.id1) AND (a.b_id2 = b.id2))
+   Join Filter: (a.b_id2 >= b.id2)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+(6 rows)
+
+-- should not remove semi join to b (wrong operator)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 > id1 AND a.b_id2 < id2);
+                        QUERY PLAN                         
+-----------------------------------------------------------
+ Nested Loop Semi Join
+   ->  Seq Scan on a
+   ->  Index Only Scan using b_pkey on b
+         Index Cond: ((id1 < a.b_id1) AND (id2 > a.b_id2))
+(4 rows)
+
+-- should not remove semi join (only checking id1)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1);
+           QUERY PLAN            
+---------------------------------
+ Hash Join
+   Hash Cond: (a.b_id1 = b.id1)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: b.id1
+               ->  Seq Scan on b
+(7 rows)
+
+-- should not remove semi join (only checking id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id2);
+           QUERY PLAN            
+---------------------------------
+ Hash Join
+   Hash Cond: (a.b_id2 = b.id2)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: b.id2
+               ->  Seq Scan on b
+(7 rows)
+
+-- should not remove semi join (checking wrong columns)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id1 AND a.b_id1 = id2);
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Hash Join
+   Hash Cond: ((a.b_id2 = b.id1) AND (a.b_id1 = b.id2))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+(5 rows)
+
+-- should not remove semi join (no check for id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id1 AND a.b_id1 = id1);
+               QUERY PLAN                
+-----------------------------------------
+ Nested Loop Semi Join
+   ->  Seq Scan on a
+         Filter: (b_id2 = b_id1)
+   ->  Index Only Scan using b_pkey on b
+         Index Cond: (id1 = a.b_id2)
+(5 rows)
+
+-- should not remove semi join (no check for b_id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id1 = id2);
+            QUERY PLAN             
+-----------------------------------
+ Hash Join
+   Hash Cond: (a.b_id1 = b.id1)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: (id1 = id2)
+(6 rows)
+
+ROLLBACK;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
 insert into parent values (1, 10), (2, 20), (3, 30);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 275cb11..f314a03 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -861,9 +861,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
@@ -878,8 +880,150 @@ select id from a where id in (
 	select b.id from b left join c on b.id = c.id
 );
 
+-- check optimization of outer join when joining a unique sub query using group by
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+-- and a constant expression.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- and contains a redundant join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+
+-- optimization is not possible when the group by contains a column which is not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+
+-- optimization is not possible when distinct clause contains an item that is not in the join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+
+-- optimization is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+
+-- optimization is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+
 rollback;
 
+BEGIN;
+
+-- Test join removals for semi and anti joins
+CREATE TEMP TABLE b (id INT NOT NULL PRIMARY KEY);
+CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT NOT NULL REFERENCES b(id));
+
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id IN(SELECT id FROM b);
+
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id = id);
+
+-- should remove anti join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE NOT EXISTS(SELECT 1 FROM b WHERE a.b_id = id);
+
+-- should remove semi join to b (swapped condition order)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE id = a.b_id);
+
+-- should not remove semi join (since not using equals)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE id >= a.b_id);
+
+-- should not remove semi join
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id+0 IN(SELECT id FROM b);
+
+-- should not remove semi join
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE b_id IN(SELECT id+0 FROM b);
+
+-- should not remove semi join (wrong column)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE id IN(SELECT id FROM b);
+
+ROLLBACK;
+
+BEGIN;
+
+-- Semi join removal code with 2 column foreign keys
+
+CREATE TEMP TABLE b (id1 INT NOT NULL, id2 INT NOT NULL, PRIMARY KEY(id1,id2));
+CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id1 INT NOT NULL, b_id2 INT NOT NULL);
+
+ALTER TABLE a ADD CONSTRAINT a_b_id1_b_id2_fkey FOREIGN KEY (b_id1,b_id2) REFERENCES b(id1,id2) MATCH FULL;
+
+-- should remove semi join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id2 = id2);
+
+-- should remove anti join to b
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE NOT EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id2 = id2);
+
+-- should not remove semi join to b (extra condition)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id2 = id2 AND a.b_id2 >= id2);
+
+-- should not remove semi join to b (wrong operator)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 > id1 AND a.b_id2 < id2);
+
+-- should not remove semi join (only checking id1)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1);
+
+-- should not remove semi join (only checking id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id2);
+
+-- should not remove semi join (checking wrong columns)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id1 AND a.b_id1 = id2);
+
+-- should not remove semi join (no check for id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id2 = id1 AND a.b_id1 = id1);
+
+-- should not remove semi join (no check for b_id2)
+EXPLAIN (COSTS OFF)
+SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id1 = id1 AND a.b_id1 = id2);
+
+ROLLBACK;
+
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
 insert into parent values (1, 10), (2, 20), (3, 30);

#21

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#20)

Re: Allowing join removals for more join types

David Rowley <dgrowleyml@gmail.com> writes:

I'm not quite there with inner joins yet. I'm still getting my head around
just where the join quals are actually stored.

TBH I think that trying to do anything at all for inner joins is probably
a bad idea. The cases where the optimization could succeed are so narrow
that it's unlikely to be worth adding cycles to every query to check.

The planning cost of all this is likely to be a concern anyway; but
if you can show that you don't add anything unless there are some outer
joins in the query, you can at least overcome objections about possibly
adding significant overhead to trivial queries.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22

Stephen Frost

sfrost@snowman.net

over 11 years ago

In reply to: Tom Lane (#21)

Re: Allowing join removals for more join types

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

David Rowley <dgrowleyml@gmail.com> writes:

I'm not quite there with inner joins yet. I'm still getting my head around
just where the join quals are actually stored.

TBH I think that trying to do anything at all for inner joins is probably
a bad idea. The cases where the optimization could succeed are so narrow
that it's unlikely to be worth adding cycles to every query to check.

I agree that we don't want to add too many cycles to trivial queries but
I don't think it's at all fair to say that FK-check joins are a narrow
use-case and avoiding that join could be a very nice win.

The planning cost of all this is likely to be a concern anyway; but
if you can show that you don't add anything unless there are some outer
joins in the query, you can at least overcome objections about possibly
adding significant overhead to trivial queries.

I'm not quite buying this. We're already beyond really trivial queries
since we're talking about joins and then considering how expensive joins
can be, putting in a bit of effort to eliminate one would be time well
worth spending, imv.

In any case, I'd certainly suggest David continue to develop this and
then we can look at measuring the cost on cases where it was time wasted
and on cases where it helps. We may also be able to come up with ways
to short-circuit the test and thereby minimize the cost in cases where
it won't help.

Thanks,

Stephen

#23

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Stephen Frost (#22)

Re: Allowing join removals for more join types

Stephen Frost <sfrost@snowman.net> writes:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

TBH I think that trying to do anything at all for inner joins is probably
a bad idea. The cases where the optimization could succeed are so narrow
that it's unlikely to be worth adding cycles to every query to check.

I agree that we don't want to add too many cycles to trivial queries but
I don't think it's at all fair to say that FK-check joins are a narrow
use-case and avoiding that join could be a very nice win.

[ thinks for a bit... ] OK, I'd been thinking that to avoid a join the
otherwise-unreferenced table would have to have a join column that is both
unique and the referencing side of an FK to the other table's join column.
But after consuming more caffeine I see I got that backwards and it would
need to be the *referenced* side of the FK, which is indeed a whole lot
more plausible case.

regards, tom lane

#24

Noah Misch

noah@leadboat.com

over 11 years ago

In reply to: David Rowley (#18)

Re: Allowing join removals for more join types

On Wed, May 28, 2014 at 08:39:32PM +1200, David Rowley wrote:

The attached patch allows join removals for both sub queries with left
joins and also semi joins where a foreign key can prove the existence of
the record.

When a snapshot can see modifications that queued referential integrity
triggers for some FK constraint, that constraint is not guaranteed to hold
within the snapshot until those triggers have fired. For example, a query
running within a VOLATILE function f() in a statement like "UPDATE t SET c =
f(c)" may read data that contradicts FK constraints involving table "t".
Deferred UNIQUE constraints, which we also do not yet use for deductions in
the planner, have the same problem; see commit 0f39d50. This project will
need a design accounting for that hazard.

As a point of procedure, I recommend separating the semijoin support into its
own patch. Your patch is already not small; delaying non-essential parts will
make the essential parts more accessible to reviewers.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

#25

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Noah Misch (#24)

Re: Allowing join removals for more join types

On Wed, Jun 4, 2014 at 11:50 AM, Noah Misch <noah@leadboat.com> wrote:

On Wed, May 28, 2014 at 08:39:32PM +1200, David Rowley wrote:

The attached patch allows join removals for both sub queries with left
joins and also semi joins where a foreign key can prove the existence of
the record.

When a snapshot can see modifications that queued referential integrity
triggers for some FK constraint, that constraint is not guaranteed to hold
within the snapshot until those triggers have fired. For example, a query
running within a VOLATILE function f() in a statement like "UPDATE t SET c =
f(c)" may read data that contradicts FK constraints involving table "t".
Deferred UNIQUE constraints, which we also do not yet use for deductions in
the planner, have the same problem; see commit 0f39d50. This project will
need a design accounting for that hazard.

I remember reading about some concerns with that here:
/messages/by-id/51E2D505.5010705@2ndQuadrant.com
But I didn't quite understand the situation where the triggers are delayed.
I just imagined that the triggers would have fired by the time the command
had completed. If that's not the case, when do the triggers fire? on
commit? Right now I've no idea how to check for this in order to disable
the join removal.

For the deferred unique constraints I'm protecting against that the same
way as the left join removal does... It's in the
relation_has_foreign_key_for() function where I'm matching the foreign keys
up to the indexes on the other relation.

As a point of procedure, I recommend separating the semijoin support into its
own patch. Your patch is already not small; delaying non-essential parts will
make the essential parts more accessible to reviewers.

That's a good idea. I think the left join additions would be realistic to
get in early in the 9.5 cycle, but the semi and anti joins stuff I know
that I'm going to need some more advice for. It makes sense to split them
out and get what I can in sooner rather than delaying it for no good reason.

Regards

David Rowley

#26

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#25)

Re: Allowing join removals for more join types

David Rowley <dgrowleyml@gmail.com> writes:

On Wed, Jun 4, 2014 at 11:50 AM, Noah Misch <noah@leadboat.com> wrote:

When a snapshot can see modifications that queued referential integrity
triggers for some FK constraint, that constraint is not guaranteed to hold
within the snapshot until those triggers have fired.

I remember reading about some concerns with that here:
/messages/by-id/51E2D505.5010705@2ndQuadrant.com
But I didn't quite understand the situation where the triggers are delayed.
I just imagined that the triggers would have fired by the time the command
had completed. If that's not the case, when do the triggers fire? on
commit? Right now I've no idea how to check for this in order to disable
the join removal.

I'm afraid that this point destroys your entire project :-( ... even
without deferred constraints, there's no good way to be sure that you're
not planning a query that will be run inside some function that can see
the results of partially-completed updates. The equivalent problem for
unique indexes is tolerable because the default choice is immediate
uniqueness enforcement, but there's no similar behavior for FKs.

It's possible that we could apply the optimization only to queries that
have been issued directly by a client, but that seems rather ugly and
surprise-filled.

regards, tom lane

#27

Noah Misch

noah@leadboat.com

over 11 years ago

In reply to: Tom Lane (#26)

Re: Allowing join removals for more join types

On Wed, Jun 04, 2014 at 10:14:42AM -0400, Tom Lane wrote:

David Rowley <dgrowleyml@gmail.com> writes:

On Wed, Jun 4, 2014 at 11:50 AM, Noah Misch <noah@leadboat.com> wrote:

When a snapshot can see modifications that queued referential integrity
triggers for some FK constraint, that constraint is not guaranteed to hold
within the snapshot until those triggers have fired.

I remember reading about some concerns with that here:
/messages/by-id/51E2D505.5010705@2ndQuadrant.com
But I didn't quite understand the situation where the triggers are delayed.
I just imagined that the triggers would have fired by the time the command
had completed. If that's not the case, when do the triggers fire? on

Normally, they fire in AfterTriggerEndQuery(), which falls at the end of a
command. The trouble arises there when commands nest. (If the constraint is
deferred, they fire just before COMMIT.)

commit? Right now I've no idea how to check for this in order to disable
the join removal.

I'm afraid that this point destroys your entire project :-( ... even
without deferred constraints, there's no good way to be sure that you're
not planning a query that will be run inside some function that can see
the results of partially-completed updates. The equivalent problem for
unique indexes is tolerable because the default choice is immediate
uniqueness enforcement, but there's no similar behavior for FKs.

Let's not give up just yet. There's considerable middle ground between
ignoring the hazard and ignoring all FK constraints in the planner, ...

It's possible that we could apply the optimization only to queries that
have been issued directly by a client, but that seems rather ugly and
surprise-filled.

... such as this idea. It's a good start to a fairly-hard problem. FKs are
also always valid when afterTriggers->query_depth == -1, such as when all
ongoing queries qualified for EXEC_FLAG_SKIP_TRIGGERS. What else? We could
teach trigger.c to efficiently report whether a given table has a queued RI
trigger. Having done that, when plancache.c is building a custom plan, the
planner could ignore FKs with pending RI checks and use the rest. At that
point, the surprises will be reasonably-isolated.
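
A toy model of that last rule, for illustration only. Nothing here is PostgreSQL code: ToyTriggerState, toy_table_has_pending_ri() and toy_fk_usable() are hypothetical names, and checking both the referencing and the referenced table is an assumption rather than something the thread settles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the after-trigger queue; not PostgreSQL's struct. */
typedef struct ToyTriggerState
{
    const int  *pending_ri_tables;  /* ids of tables with queued RI triggers */
    size_t      n_pending;          /* number of entries in the array above */
} ToyTriggerState;

/* Report whether the given table still has an RI trigger waiting to fire. */
bool
toy_table_has_pending_ri(const ToyTriggerState *st, int table_id)
{
    for (size_t i = 0; i < st->n_pending; i++)
    {
        if (st->pending_ri_tables[i] == table_id)
            return true;
    }
    return false;
}

/*
 * Assumed rule: an FK can back a planner deduction only when neither of
 * its two tables has an RI check still queued.
 */
bool
toy_fk_usable(const ToyTriggerState *st, int referencing, int referenced)
{
    return !toy_table_has_pending_ri(st, referencing) &&
           !toy_table_has_pending_ri(st, referenced);
}
```

With one table in the pending list, any FK touching that table is ignored while FKs between other tables stay usable, which is the "ignore FKs with pending RI checks and use the rest" behaviour described above.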

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

#28

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Noah Misch (#27)

Re: Allowing join removals for more join types

On 2014-06-04 20:04:07 -0400, Noah Misch wrote:

On Wed, Jun 04, 2014 at 10:14:42AM -0400, Tom Lane wrote:

It's possible that we could apply the optimization only to queries that
have been issued directly by a client, but that seems rather ugly and
surprise-filled.

... such as this idea. It's a good start to a fairly-hard problem. FKs are
also always valid when afterTriggers->query_depth == -1, such as when all
ongoing queries qualified for EXEC_FLAG_SKIP_TRIGGERS. What else? We could
teach trigger.c to efficiently report whether a given table has a queued RI
trigger. Having done that, when plancache.c is building a custom plan, the
planner could ignore FKs with pending RI checks and use the rest. At that
point, the surprises will be reasonably-isolated.

A bit more crazy, but how about trying to plan joins with an added
one-time qual that checks the size of the deferred trigger queue? Then
we wouldn't even need special case plans.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#29

Noah Misch

noah@leadboat.com

over 11 years ago

In reply to: Andres Freund (#28)

Re: Allowing join removals for more join types

On Thu, Jun 05, 2014 at 02:12:33AM +0200, Andres Freund wrote:

On 2014-06-04 20:04:07 -0400, Noah Misch wrote:

On Wed, Jun 04, 2014 at 10:14:42AM -0400, Tom Lane wrote:

It's possible that we could apply the optimization only to queries that
have been issued directly by a client, but that seems rather ugly and
surprise-filled.

... such as this idea. It's a good start to a fairly-hard problem. FKs are
also always valid when afterTriggers->query_depth == -1, such as when all
ongoing queries qualified for EXEC_FLAG_SKIP_TRIGGERS. What else? We could
teach trigger.c to efficiently report whether a given table has a queued RI
trigger. Having done that, when plancache.c is building a custom plan, the
planner could ignore FKs with pending RI checks and use the rest. At that
point, the surprises will be reasonably-isolated.

A bit more crazy, but how about trying to plan joins with an added
one-time qual that checks the size of the deferred trigger queue? Then
we wouldn't even need special case plans.

That, too, sounds promising to investigate.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

#30

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Noah Misch (#29)

Re: Allowing join removals for more join types

Noah Misch <noah@leadboat.com> writes:

On Thu, Jun 05, 2014 at 02:12:33AM +0200, Andres Freund wrote:

A bit more crazy, but how about trying to plan joins with an added
one-time qual that checks the size of the deferred trigger queue? Then
we wouldn't even need special case plans.

That, too, sounds promising to investigate.

Not terribly. You can't actually do join removal in such a case, so it's
not clear to me that there's much win to be had. The planner would be at
a loss as to what cost to assign such a construct, either.

Moreover, what happens if the trigger queue gets some entries after the
query starts?

regards, tom lane

#31

Noah Misch

noah@leadboat.com

over 11 years ago

In reply to: Tom Lane (#30)

Re: Allowing join removals for more join types

On Thu, Jun 05, 2014 at 07:44:31PM -0400, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Thu, Jun 05, 2014 at 02:12:33AM +0200, Andres Freund wrote:

A bit more crazy, but how about trying to plan joins with an added
one-time qual that checks the size of the deferred trigger queue? Then
we wouldn't even need special case plans.

That, too, sounds promising to investigate.

Not terribly. You can't actually do join removal in such a case, so it's
not clear to me that there's much win to be had. The planner would be at
a loss as to what cost to assign such a construct, either.

Yes, those are noteworthy points against it.

Moreover, what happens if the trigger queue gets some entries after the
query starts?

Normally, the query's snapshot will hide modifications that prompted those
entries. Searching for exceptions to that rule should be part of this
development effort.

A related special case came to mind: queries running in the WHEN condition of
an AFTER ROW trigger. If the trigger in question precedes the RI triggers,
the queue will not yet evidence the triggering modification. Nonetheless,
queries in the WHEN clause will see that modification.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

#32

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Tom Lane (#30)

Re: Allowing join removals for more join types

On Fri, Jun 6, 2014 at 11:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Noah Misch <noah@leadboat.com> writes:

On Thu, Jun 05, 2014 at 02:12:33AM +0200, Andres Freund wrote:

A bit more crazy, but how about trying to plan joins with an added
one-time qual that checks the size of the deferred trigger queue? Then
we wouldn't even need special case plans.

That, too, sounds promising to investigate.

Not terribly. You can't actually do join removal in such a case, so it's
not clear to me that there's much win to be had. The planner would be at
a loss as to what cost to assign such a construct, either.

Moreover, what happens if the trigger queue gets some entries after the
query starts?

In the scripts below I've created a scenario (scenario 1) in which the inner
query, which I've put in a trigger function, does see the referenced
table before the RI triggers execute, so the SELECT j2_id FROM j1 WHERE NOT
EXISTS(SELECT 1 FROM j2 WHERE j2_id = j2.id) query gives 1 row. This
works, and I agree it's a problem that needs to be looked at in the patch.

I'm also trying to create the situation that you describe where the RI
trigger queue gets added to during the query. I'm likely doing it wrong
somehow, but I can't see what I'm doing wrong.

Here are both scripts. I need help with scenario 2 to create the problem you
describe; I can't get my version to give me any stale non-cascaded records.

-- Scenario 1: Outer command causes a foreign key trigger to be queued
-- and this results in a window of time where we have records
-- in the referencing table which don't yet exist in the
-- referenced table.

DROP TABLE IF EXISTS j1;
DROP TABLE IF EXISTS j2;
DROP TABLE IF EXISTS records_violating_fkey;

CREATE TABLE j2 (id INT NOT NULL PRIMARY KEY);
CREATE TABLE j1 (
id INT PRIMARY KEY,
j2_id INT NOT NULL REFERENCES j2 (id) MATCH FULL ON DELETE CASCADE ON
UPDATE CASCADE
);

INSERT INTO j2 VALUES(10),(20);
INSERT INTO j1 VALUES(1,10),(2,20);

-- create a table to store records that 'violate' the fkey.
CREATE TABLE records_violating_fkey (j2_id INT NOT NULL);

CREATE OR REPLACE FUNCTION j1_update() RETURNS TRIGGER AS $$
BEGIN
RAISE notice 'Trigger fired';
INSERT INTO records_violating_fkey SELECT j2_id FROM j1 WHERE NOT
EXISTS(SELECT 1 FROM j2 WHERE j2_id = j2.id);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER j1_update_trigger BEFORE UPDATE ON j2 FOR EACH ROW EXECUTE
PROCEDURE j1_update();

UPDATE j2 SET id = id+1;

-- returns 1 row.
SELECT * FROM records_violating_fkey;

------------------------------------------------------------------------------
-- Scenario 2: Inner command causes a foreign key trigger to be queued.

DROP TABLE IF EXISTS j1;
DROP TABLE IF EXISTS j2;

CREATE TABLE j2 (id INT NOT NULL PRIMARY KEY);

CREATE TABLE j1 (
id INT PRIMARY KEY,
j2_id INT NOT NULL REFERENCES j2 (id) MATCH FULL ON DELETE CASCADE ON
UPDATE CASCADE
);

INSERT INTO j2 VALUES(10),(20);
INSERT INTO j1 VALUES(1,10),(2,20);

CREATE OR REPLACE FUNCTION update_j2(p_id int) RETURNS int AS $$
BEGIN
RAISE NOTICE 'Updating j2 id = % to %', p_id, p_id + 1;
UPDATE j2 SET id = id + 1 WHERE id = p_id;
RETURN 1;
END;
$$ LANGUAGE plpgsql;

-- try and get some records to be returned by causing an update on the
-- record that is not the current record.
SELECT * FROM j1 WHERE NOT EXISTS(SELECT 1 FROM j2 WHERE j2_id = id) AND
update_j2((SELECT MIN(j2_id) FROM j1 ij1 WHERE ij1.j2_id <> j1.j2_id)) = 1;

Regards

David Rowley

#33

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Tom Lane (#23)

Re: Allowing join removals for more join types

On Mon, Jun 2, 2014 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Stephen Frost <sfrost@snowman.net> writes:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

TBH I think that trying to do anything at all for inner joins is probably
a bad idea. The cases where the optimization could succeed are so narrow
that it's unlikely to be worth adding cycles to every query to check.

I agree that we don't want to add too many cycles to trivial queries but
I don't think it's at all fair to say that FK-check joins are a narrow
use-case and avoiding that join could be a very nice win.

[ thinks for a bit... ] OK, I'd been thinking that to avoid a join the
otherwise-unreferenced table would have to have a join column that is both
unique and the referencing side of an FK to the other table's join column.
But after consuming more caffeine I see I got that backwards and it would
need to be the *referenced* side of the FK, which is indeed a whole lot
more plausible case.

Back when I did web development, this came up for me all the time.
I'd create a fact table with lots of id columns referencing dimension
tables, and then make a view over it that joined to all of those
tables so that it was easy, when reporting, to select whatever bits of
information were needed. But the problem was that if the report
didn't need all the columns, it still had to pay the cost of computing
them, which eventually got to be problematic. That was what inspired
me to develop the patch for LEFT JOIN removal, but to really solve the
problems I had back then, removing INNER joins as well would have been
essential. So, while I do agree that we have to keep the planning
cost under control, I'm quite positive about the general concept. I
think a lot of users will benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#34

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Noah Misch (#24)

1 attachment(s)

Re: Allowing join removals for more join types

On Wed, Jun 4, 2014 at 12:50 AM, Noah Misch <noah@leadboat.com> wrote:

As a point of procedure, I recommend separating the semijoin support into its
own patch. Your patch is already not small; delaying non-essential parts will
make the essential parts more accessible to reviewers.

In the attached patch I've removed all the SEMI and ANTI join removal code
and left only support for LEFT JOIN removal of sub-queries that can be
proved to be unique on the join condition by looking at the GROUP BY and
DISTINCT clauses.

Example:

SELECT t1.* FROM t1 LEFT OUTER JOIN (SELECT value,COUNT(*) FROM t2 GROUP BY
value) t2 ON t1.id = t2.value;
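
The core of the uniqueness test can be sketched as follows. This is a simplified stand-alone model, not the code in the attached patch: columns are plain integers, toy_group_cols_covered_by_join() is a hypothetical name, and the sketch ignores the constant-expression, volatile-function and set-returning-function cases that the patch's regression tests also exercise. Returning false when there is no grouping column is a conservative assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if every GROUP BY/DISTINCT column of the subquery is also
 * matched by an equality clause in the join condition.  When that holds,
 * the subquery yields at most one row per join key.
 */
bool
toy_group_cols_covered_by_join(const int *group_cols, size_t n_group,
                               const int *join_cols, size_t n_join)
{
    /* Without GROUP BY or DISTINCT this sketch proves nothing. */
    if (n_group == 0)
        return false;

    for (size_t i = 0; i < n_group; i++)
    {
        bool found = false;

        for (size_t j = 0; j < n_join; j++)
        {
            if (group_cols[i] == join_cols[j])
            {
                found = true;
                break;
            }
        }

        /* A grouping column missing from the join clauses allows duplicates. */
        if (!found)
            return false;
    }
    return true;
}
```

In the example above the subquery groups by value and the join clause is t1.id = t2.value, so the single grouping column is covered. A subquery with GROUP BY id, c_id joined only on id would not be covered, matching the "group by contains a column which is not in the join condition" regression test.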

Regards

David Rowley

Attachments:

subquery_leftjoin_removal_v1.1.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 129fc3d..8c5714a 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -27,9 +27,15 @@
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
 #include "optimizer/var.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "optimizer/tlist.h"
+#include "nodes/nodeFuncs.h"
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool	sortclause_is_unique_on_restrictinfo(Query *query,
+					  List *clause_list, List *sortclause);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
@@ -147,6 +153,7 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 {
 	int			innerrelid;
 	RelOptInfo *innerrel;
+	Query	   *subquery;
 	Relids		joinrelids;
 	List	   *clause_list = NIL;
 	ListCell   *l;
@@ -154,11 +161,13 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 
 	/*
 	 * Currently, we only know how to remove left joins to a baserel with
-	 * unique indexes.  We can check most of these criteria pretty trivially
-	 * to avoid doing useless extra work.  But checking whether any of the
-	 * indexes are unique would require iterating over the indexlist, so for
-	 * now we just make sure there are indexes of some sort or other.  If none
-	 * of them are unique, join removal will still fail, just slightly later.
+	 * unique indexes and left joins to a subquery where the subquery is
+	 * unique on the join condition. We can check most of these criteria
+	 * pretty trivially to avoid doing useless extra work.  But checking
+	 * whether any of the indexes are unique would require iterating over
+	 * the indexlist, so for now, if we're joining to a relation, we'll just
+	 * ensure that we have at least 1 index, it won't matter if that index
+	 * is unique at this stage, we'll check those details later.
 	 */
 	if (sjinfo->jointype != JOIN_LEFT ||
 		sjinfo->delay_upper_joins ||
@@ -168,11 +177,30 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	innerrelid = bms_singleton_member(sjinfo->min_righthand);
 	innerrel = find_base_rel(root, innerrelid);
 
-	if (innerrel->reloptkind != RELOPT_BASEREL ||
-		innerrel->rtekind != RTE_RELATION ||
-		innerrel->indexlist == NIL)
+	if (innerrel->reloptkind != RELOPT_BASEREL)
 		return false;
 
+	if (innerrel->rtekind == RTE_RELATION)
+	{
+		if (innerrel->indexlist == NIL)
+			return false; /* no possibility of a unique index */
+	}
+	else if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		subquery = root->simple_rte_array[innerrelid]->subquery;
+
+		/*
+		 * The only means we currently use to check if the subquery is unique
+		 * are the GROUP BY and DISTINCT clause. If both of these are empty
+		 * then there's no point in going any further.
+		 */
+		if (subquery->groupClause == NIL &&
+			subquery->distinctClause == NIL)
+			return false;
+	}
+	else
+		return false; /* unsupported rtekind */
+
 	/* Compute the relid set for the join we are considering */
 	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
 
@@ -276,16 +304,137 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 */
 
 	/* Now examine the indexes to see if we have a matching unique index */
-	if (relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
+	if (innerrel->rtekind == RTE_RELATION &&
+		relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
 		return true;
 
 	/*
+	 * We can be certain that the sub query contains no duplicate values for
+	 * the join clause if item in the sub query's GROUP BY clause is also used
+	 * in the join clause using equality. This works the same way for the
+	 * DISTINCT clause. We need not pay any attention to WHERE or HAVING
+	 * clauses as these just restrict the results more and could not be the
+	 * cause of duplication in the result set. However there are a number of
+	 * pre-checks we must perform which could cause duplicate values even if
+	 * all the required columns are in the GROUP BY or DISTINCT clause.
+	 *
+	 * NB: We must also not remove the join if the subquery contains a
+	 * FOR UPDATE clause, but we can actually skip this check as GROUP BY and
+	 * DISTINCT cannot be used at the same time as FOR UPDATE.
+	 */
+	if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		Assert(subquery == root->simple_rte_array[innerrelid]->subquery);
+
+		/*
+		 * We cannot remove the subquery if the target list contains any set
+		 * returning functions as these may cause the query not to be unique
+		 * on the grouping columns, as per the following example:
+		 * "SELECT a.a,generate_series(1,2) FROM (VALUES(1)) a(a) GROUP BY a"
+		 */
+		if (expression_returns_set((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * Don't remove the join if the target list contains any volatile
+		 * functions. Doing so may remove desired side effects that calls
+		 * to the function may cause.
+		 */
+		if (contain_volatile_functions((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * It should be safe to remove the join if all the GROUP BY expressions
+		 * have matching items in the join condition.
+		 */
+		if (subquery->groupClause != NIL &&
+			sortclause_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
+			return true;
+
+		/*
+		 * It should be safe to remove the join if all the DISTINCT column list have matching
+		 * items in the join condition.
+		 */
+		if (subquery->distinctClause != NIL &&
+			sortclause_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
+			return true;
+	}
+	/*
 	 * Some day it would be nice to check for other methods of establishing
-	 * distinctness.
+	 * distinctness.  XXX is this comment still true??
 	 */
 	return false;
 }
 
+/*
+ * sortclause_is_unique_on_restrictinfo
+ *
+ * Checks to see if all items in sortclause also exist in clause_list.
+ * The function will return true if clause_list is the same as or a superset
+ * of the sortclause. If the sortclause has columns that don't exist in the
+ * clause_list then the query can't be guaranteed unique on the clause_list
+ * columns.
+ *
+ * Note: The calling function must ensure that sortclause is not NIL.
+ */
+static bool
+sortclause_is_unique_on_restrictinfo(Query *query, List *clause_list, List *sortclause)
+{
+	ListCell *l;
+
+	Assert(sortclause != NIL);
+
+	/*
+	 * Loop over each sort clause to ensure that we have an item in the join
+	 * conditions that matches it. Note that it does not matter if we have more
+	 * items in the join condition than we have in the sort clause.
+	 */
+	foreach(l, sortclause)
+	{
+		ListCell		*ri;
+		SortGroupClause *scl = (SortGroupClause *) lfirst(l);
+		TargetEntry		*sortTarget;
+		bool			 matched = false;
+
+		/* look up the target list entry for the current sort group ref */
+		sortTarget = get_sortgroupref_tle(scl->tleSortGroupRef, query->targetList);
+
+		/*
+		 * Since a constant only has 1 value the existence of one here will
+		 * not cause any duplication of the results. We'll simply ignore it!
+		 */
+		if (IsA(sortTarget->expr, Const))
+			continue;
+
+		foreach(ri, clause_list)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(ri);
+			Node	   *rexpr;
+
+			if (rinfo->outer_is_left)
+				rexpr = get_rightop(rinfo->clause);
+			else
+				rexpr = get_leftop(rinfo->clause);
+
+			if (IsA(rexpr, Var))
+			{
+				Var *var = (Var *)rexpr;
+
+				if (var->varattno == sortTarget->resno)
+				{
+					matched = true;
+					break; /* match found */
+				}
+			}
+			else
+				return false;
+		}
+
+		if (!matched)
+			return false;
+	}
+	return true;
+}
 
 /*
  * Remove the target relid from the planner's data structures, having
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index c62a63f..202ac0a 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3131,9 +3131,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
   QUERY PLAN   
@@ -3169,6 +3171,151 @@ select id from a where id in (
          ->  Seq Scan on b
 (5 rows)
 
+-- check optimization of outer join when joining a unique sub query using group by
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+                                        QUERY PLAN                                        
+------------------------------------------------------------------------------------------
+ Hash Left Join
+   Hash Cond: ((a.id)::double precision = ((((d.a + d.b))::double precision + random())))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: (((d.a + d.b))::double precision + random())
+               ->  Seq Scan on d
+(7 rows)
+
+-- check optimization of outer join when joining a unique sub query using distinct
+-- and a constant expression.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- and contains a redundant join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- optimization is not possible when the group by contains a column which is not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when distinct clause contains an item that is not in the join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+                   QUERY PLAN                    
+-------------------------------------------------
+ Hash Left Join
+   Hash Cond: (a.id = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Subquery Scan on b
+               Filter: (b.r = random())
+               ->  HashAggregate
+                     Group Key: b_1.id, random()
+                     ->  Seq Scan on b b_1
+(9 rows)
+
+-- optimization is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- optimization is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
 rollback;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 1031f26..ba75912 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -919,9 +919,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
@@ -936,6 +938,61 @@ select id from a where id in (
 	select b.id from b left join c on b.id = c.id
 );
 
+-- check optimization of outer join when joining a unique sub query using group by
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+
+-- check optimization of outer join when joining a unique sub query using distinct
+-- and a constant expression.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+
+-- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- and contains a redundant join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+
+-- optimization is not possible when the group by contains a column which is not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+
+-- optimization is not possible when distinct clause contains an item that is not in the join clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+
+-- optimization is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+
+-- optimization is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+
+-- optimization is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+
 rollback;
 
 create temp table parent (k int primary key, pd int);

#35

Simon Riggs

simon@2ndQuadrant.com

over 11 years ago

In reply to: David Rowley (#34)

Re: Allowing join removals for more join types

On 17 June 2014 11:04, David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, Jun 4, 2014 at 12:50 AM, Noah Misch <noah@leadboat.com> wrote:

As a point of procedure, I recommend separating the semijoin support into
its own patch. Your patch is already not small; delaying non-essential
parts will make the essential parts more accessible to reviewers.

In the attached patch I've removed all the SEMI and ANTI join removal code
and left only support for LEFT JOIN removal of sub-queries that can be
proved to be unique on the join condition by looking at the GROUP BY and
DISTINCT clause.

Good advice, we can come back for the others later.

Example:

SELECT t1.* FROM t1 LEFT OUTER JOIN (SELECT value,COUNT(*) FROM t2 GROUP BY
value) t2 ON t1.id = t2.value;

Looks good on initial look.

This gets optimized...

EXPLAIN (COSTS OFF)
SELECT a.id FROM a
LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id
GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;

Does it work with transitive closure like this?

EXPLAIN (COSTS OFF)
SELECT a.id FROM a
LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id
GROUP BY c.id) b ON a.id = b.id AND b.dummy = 1;

i.e. c.id is not in the join, but we know from subselect that c.id =
b.id and b.id is in the join

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#36

Simon Riggs

simon@2ndQuadrant.com

over 11 years ago

In reply to: Simon Riggs (#35)

Re: Allowing join removals for more join types

On 22 June 2014 12:51, Simon Riggs <simon@2ndquadrant.com> wrote:

Looks good on initial look.

Tests 2 and 3 seem to test the same thing.

There are no tests which have multiple column clauselist/sortlists,
nor tests for cases where the clauselist is a superset of the
sortlist.

Test comments should refer to "join removal" rather than
"optimization" because we may forget which optimization they are there
to test.

It's not clear to me where you get the term "sortclause" from. This is
either the groupclause or distinctclause, but the test cases you
provide show that this has nothing at all to do with sorting, since
there is neither an order by nor a sorted aggregate anywhere near those
queries. Can we think of a better name that won't confuse us in the
future?

The comment "Since a constant only has 1 value the existence of one here
will not cause any duplication of the results. We'll simply ignore it!"
would be better as "We can ignore constants since they have only one
value and don't affect uniqueness of results".

The comment "XXX is this comment still true??" can be removed since
it's just a discussion point.

The comment beginning "Currently, we only know how to remove left..."
has rewritten a whole block of text just to add a few words in the
middle. We should rewrite the comment so it changes as few lines as
possible. Especially when that comment is going to be changed again
with your later patches. Better to have it provide a bullet point list
of things we know how to remove, so we can just add to it later.

Still looks good, other than the above.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#37

David Rowley

dgrowley@gmail.com

over 11 years ago

In reply to: Simon Riggs (#36)

2 attachment(s)

Re: Allowing join removals for more join types

On 23 June 2014 09:31, Simon Riggs <simon@2ndquadrant.com> wrote:

On 22 June 2014 12:51, Simon Riggs <simon@2ndquadrant.com> wrote:

Looks good on initial look.

Tests 2 and 3 seem to test the same thing.

Ok, I've removed test 2 and kept test 3 which is the DISTINCT a+b test.

There are no tests which have multiple column clauselist/sortlists,
nor tests for cases where the clauselist is a superset of the
sortlist.

I've added:
SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id)
b ON a.b_id = b.id AND a.id = b.c_id
but I'm half tempted to just add 2 new tables that allow this to be done in
a more sensible way, since c_id is really intended to reference c.id in the
defined tables.

I've also added one where the join condition is a superset of the GROUP BY
clause. I had intended the one with the constant to be this, but you're
probably right: it should be an actual column, since constants are treated
differently.
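
To make the distinction concrete, here is a small standalone C sketch of the two cases. It is a simplified model under stated assumptions, not the patch's implementation: `GroupItem` and `group_items_covered_by_join` are hypothetical names, and columns are plain integers rather than planner nodes. Constant items are skipped because they carry a single value and cannot introduce duplicates, while real columns must each appear in the join condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one GROUP BY / DISTINCT item. */
typedef struct GroupItem
{
    int  colno;     /* output column number in the subquery */
    bool is_const;  /* true if the item is a constant expression */
} GroupItem;

/*
 * Toy model, NOT PostgreSQL's planner API.
 *
 * Returns true when every non-constant grouping item appears among the
 * columns compared by equality in the join condition.  The join column
 * list may be a superset of the grouping items.
 */
bool
group_items_covered_by_join(const GroupItem *items, size_t nitems,
                            const int *join_cols, size_t njoin)
{
    assert(items != NULL || nitems == 0);
    assert(join_cols != NULL || njoin == 0);

    for (size_t i = 0; i < nitems; i++)
    {
        bool matched = false;

        /* A constant has one value, so it cannot break uniqueness. */
        if (items[i].is_const)
            continue;

        for (size_t j = 0; j < njoin; j++)
        {
            if (items[i].colno == join_cols[j])
            {
                matched = true;
                break;
            }
        }

        if (!matched)
            return false;
    }
    return true;
}
```

With items {column 1, constant} and a join on column 1 the model returns true through the constant shortcut, whereas the superset test exercises the matching loop itself: items {column 1} joined on columns 1 and 2.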

Test comments should refer to "join removal" rather than
"optimization" because we may forget which optimization they are there
to test.

Good idea...Fixed.

It's not clear to me where you get the term "sortclause" from. This is
either the groupclause or distinctclause, but the test cases you
provide show that this has nothing at all to do with sorting, since
there is neither an order by nor a sorted aggregate anywhere near those
queries. Can we think of a better name that won't confuse us in the
future?

I probably got the word "sort" from the function targetIsInSortList, which
expects a list of SortGroupClause. I've renamed the function to
sortlist_is_unique_on_restrictinfo() and renamed the sortclause parameter
to sortlist. Hopefully this will reduce the confusion with an ORDER BY
clause a bit. I think sortgroupclauselist might be just a bit too
long. What do you think?

The comment "Since a constant only has 1 value the existence of one here
will not cause any duplication of the results. We'll simply ignore it!"
would be better as "We can ignore constants since they have only one
value and don't affect uniqueness of results".

Ok, changed.

The comment "XXX is this comment still true??" can be removed since
it's just a discussion point.

Removed.

The comment beginning "Currently, we only know how to remove left..."
has rewritten a whole block of text just to add a few words in the
middle. We should rewrite the comment so it changes as few lines as
possible. Especially when that comment is going to be changed again
with your later patches. Better to have it provide a bullet point list
of things we know how to remove, so we can just add to it later.

I had thought that I'd put the code for other join types in their own
functions as not all will have a SpecialJoinInfo. In the patch that
contained ANTI and SEMI join support I had renamed the function
join_is_removable() to leftjoin_is_removable() and added a new function for
semi/anti joins.

I've refactored this comment, although it will likely still need some
small updates around the part where it talks about "left join" when I
start working on support for other join types. The previous comment gave
some clarification about returning early when there are no indexes on the
relation. I decided to move this out of that comment and just include a
more general note at the bottom, and also to add some more detail about
why we're fast pathing out when indexlist is empty.

Still looks good, other than the above.

Great. Thanks for reviewing!

I've attached an updated patch and also a delta patch of the changes I've
made since the last version.

Regards

David Rowley


Attachments:

subquery_leftjoin_removal_v1.2.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 129fc3d..ea4a9e0 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -27,9 +27,15 @@
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
 #include "optimizer/var.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "optimizer/tlist.h"
+#include "nodes/nodeFuncs.h"
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool	sortlist_is_unique_on_restrictinfo(Query *query,
+					  List *clause_list, List *sortlist);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
@@ -147,19 +153,33 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 {
 	int			innerrelid;
 	RelOptInfo *innerrel;
+	Query	   *subquery;
 	Relids		joinrelids;
 	List	   *clause_list = NIL;
 	ListCell   *l;
 	int			attroff;
 
 	/*
-	 * Currently, we only know how to remove left joins to a baserel with
-	 * unique indexes.  We can check most of these criteria pretty trivially
-	 * to avoid doing useless extra work.  But checking whether any of the
-	 * indexes are unique would require iterating over the indexlist, so for
-	 * now we just make sure there are indexes of some sort or other.  If none
-	 * of them are unique, join removal will still fail, just slightly later.
+	 * Assuming none of the variables from the join are needed by the query,
+	 * it is possible here to remove a left join providing we can determine
+	 * that the join will never produce more than 1 row that matches the join
+	 * condition.
+	 *
+	 * There are a few ways that we can do this:
+	 *
+	 * 1. When joining to a baserel we can check if a unique index exists
+	 *    where all of the columns of the index are seen in the join condition
+	 *    with equality operators.
+	 *
+	 * 2. When joining to a subquery we can check if the subquery contains a
+	 *    GROUP BY or DISTINCT clause where all of the columns of the clause
+	 *    appear in the join condition with equality operators.
+	 *
+	 * The code below is written with the assumption that join removal is more
+	 * likely not to happen, for this reason there are fast paths for both of
+	 * the cases above to try to save on unnecessary processing.
 	 */
+
 	if (sjinfo->jointype != JOIN_LEFT ||
 		sjinfo->delay_upper_joins ||
 		bms_membership(sjinfo->min_righthand) != BMS_SINGLETON)
@@ -168,11 +188,34 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	innerrelid = bms_singleton_member(sjinfo->min_righthand);
 	innerrel = find_base_rel(root, innerrelid);
 
-	if (innerrel->reloptkind != RELOPT_BASEREL ||
-		innerrel->rtekind != RTE_RELATION ||
-		innerrel->indexlist == NIL)
+	if (innerrel->reloptkind != RELOPT_BASEREL)
 		return false;
 
+	if (innerrel->rtekind == RTE_RELATION)
+	{
+		/*
+		 * If there are no indexes then there's certainly no unique indexes
+		 * so there's no need to go any further.
+		 */
+		if (innerrel->indexlist == NIL)
+			return false;
+	}
+	else if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		subquery = root->simple_rte_array[innerrelid]->subquery;
+
+		/*
+		 * The only means we currently use to check if the subquery is unique
+		 * are the GROUP BY and DISTINCT clause. If both of these are empty
+		 * then there's no point in going any further.
+		 */
+		if (subquery->groupClause == NIL &&
+			subquery->distinctClause == NIL)
+			return false;
+	}
+	else
+		return false; /* unsupported rtekind */
+
 	/* Compute the relid set for the join we are considering */
 	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
 
@@ -276,16 +319,137 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 */
 
 	/* Now examine the indexes to see if we have a matching unique index */
-	if (relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
+	if (innerrel->rtekind == RTE_RELATION &&
+		relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
 		return true;
 
 	/*
+	 * We can be certain that the sub query contains no duplicate values for
+	 * the join clause if item in the sub query's GROUP BY clause is also used
+	 * in the join clause using equality. This works the same way for the
+	 * DISTINCT clause. We need not pay any attention to WHERE or HAVING
+	 * clauses as these just restrict the results more and could not be the
+	 * cause of duplication in the result set. However there are a number of
+	 * pre-checks we must perform which could cause duplicate values even if
+	 * all the required columns are in the GROUP BY or DISTINCT clause.
+	 *
+	 * NB: We must also not remove the join if the subquery contains a
+	 * FOR UPDATE clause, but we can actually skip this check as GROUP BY and
+	 * DISTINCT cannot be used at the same time as FOR UPDATE.
+	 */
+	if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		Assert(subquery == root->simple_rte_array[innerrelid]->subquery);
+
+		/*
+		 * We cannot remove the subquery if the target list contains any set
+		 * returning functions as these may cause the query not to be unique
+		 * on the grouping columns, as per the following example:
+		 * "SELECT a.a,generate_series(1,2) FROM (VALUES(1)) a(a) GROUP BY a"
+		 */
+		if (expression_returns_set((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * Don't remove the join if the target list contains any volatile
+		 * functions. Doing so may remove desired side effects that calls
+		 * to the function may cause.
+		 */
+		if (contain_volatile_functions((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * It should be safe to remove the join if all the GROUP BY expressions
+		 * have matching items in the join condition.
+		 */
+		if (subquery->groupClause != NIL &&
+			sortlist_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
+			return true;
+
+		/*
+		 * It should be safe to remove the join if all the DISTINCT column list have matching
+		 * items in the join condition.
+		 */
+		if (subquery->distinctClause != NIL &&
+			sortlist_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
+			return true;
+	}
+
+	/*
 	 * Some day it would be nice to check for other methods of establishing
 	 * distinctness.
 	 */
 	return false;
 }
 
+/*
+ * sortlist_is_unique_on_restrictinfo
+ *
+ * Checks to see if all items in sortlist also exist in clause_list.
+ * The function will return true if clause_list is the same as or a superset
+ * of sortlist. If the sortlist has Vars that don't exist in the clause_list
+ * then the query can't be guaranteed unique on the clause_list columns.
+ *
+ * Note: The calling function must ensure that sortlist is not NIL.
+ */
+static bool
+sortlist_is_unique_on_restrictinfo(Query *query, List *clause_list, List *sortlist)
+{
+	ListCell *l;
+
+	Assert(sortlist != NIL);
+
+	/*
+	 * Loop over each sortlist item to ensure that we have an item in the
+	 * clause_list that matches it. Note that it does not matter if we have
+	 * more items in the clause_list than we have in the sortlist.
+	 */
+	foreach(l, sortlist)
+	{
+		ListCell		*ri;
+		SortGroupClause *scl = (SortGroupClause *) lfirst(l);
+		TargetEntry		*sortTarget;
+		bool			 matched = false;
+
+		/* look up the target list entry for the current sort group ref */
+		sortTarget = get_sortgroupref_tle(scl->tleSortGroupRef, query->targetList);
+
+		/*
+		 * We can ignore constants since they have only one value and don't
+		 * affect uniqueness of results.
+		 */
+		if (IsA(sortTarget->expr, Const))
+			continue;
+
+		foreach(ri, clause_list)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(ri);
+			Node	   *rexpr;
+
+			if (rinfo->outer_is_left)
+				rexpr = get_rightop(rinfo->clause);
+			else
+				rexpr = get_leftop(rinfo->clause);
+
+			if (IsA(rexpr, Var))
+			{
+				Var *var = (Var *)rexpr;
+
+				if (var->varattno == sortTarget->resno)
+				{
+					matched = true;
+					break; /* match found */
+				}
+			}
+			else
+				return false;
+		}
+
+		if (!matched)
+			return false;
+	}
+	return true;
+}
 
 /*
  * Remove the target relid from the planner's data structures, having
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index c62a63f..4959e5f 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3131,9 +3131,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
   QUERY PLAN   
@@ -3169,6 +3171,161 @@ select id from a where id in (
          ->  Seq Scan on b
 (5 rows)
 
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- join removal is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+                                        QUERY PLAN                                        
+------------------------------------------------------------------------------------------
+ Hash Left Join
+   Hash Cond: ((a.id)::double precision = ((((d.a + d.b))::double precision + random())))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: (((d.a + d.b))::double precision + random())
+               ->  Seq Scan on d
+(7 rows)
+
+-- check that join removal works for a left join when joining a subquery that
+-- is guaranteed to be unique on the join condition even if it contains a Const.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join. 
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when DISTINCT contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+                   QUERY PLAN                    
+-------------------------------------------------
+ Hash Left Join
+   Hash Cond: (a.id = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Subquery Scan on b
+               Filter: (b.r = random())
+               ->  HashAggregate
+                     Group Key: b_1.id, random()
+                     ->  Seq Scan on b b_1
+(9 rows)
+
+-- join removal is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
 rollback;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 1031f26..21e29d2 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -919,9 +919,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
@@ -936,6 +938,71 @@ select id from a where id in (
 	select b.id from b left join c on b.id = c.id
 );
 
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
+
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+
+-- join removal is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+
+-- check that join removal works for a left join when joining a subquery that
+-- is guaranteed to be unique on the join condition even if it contains a Const.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+
+-- join removal is not possible when DISTINCT contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+
+-- join removal is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+
+-- join removal is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+
 rollback;
 
 create temp table parent (k int primary key, pd int);

Attachment: subquery_leftjoin_removal_v1.2_delta.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 8c5714a..ea4a9e0 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -34,8 +34,8 @@
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
-static bool	sortclause_is_unique_on_restrictinfo(Query *query,
-					  List *clause_list, List *sortclause);
+static bool	sortlist_is_unique_on_restrictinfo(Query *query,
+					  List *clause_list, List *sortlist);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
@@ -160,15 +160,26 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	int			attroff;
 
 	/*
-	 * Currently, we only know how to remove left joins to a baserel with
-	 * unique indexes and left joins to a subquery where the subquery is
-	 * unique on the join condition. We can check most of these criteria
-	 * pretty trivially to avoid doing useless extra work.  But checking
-	 * whether any of the indexes are unique would require iterating over
-	 * the indexlist, so for now, if we're joining to a relation, we'll just
-	 * ensure that we have at least 1 index, it won't matter if that index
-	 * is unique at this stage, we'll check those details later.
+	 * Assuming none of the variables from the join are needed by the query,
+	 * it is possible here to remove a left join providing we can determine
+	 * that the join will never produce more than 1 row that matches the join
+	 * condition.
+	 *
+	 * There are a few ways that we can do this:
+	 *
+	 * 1. When joining to a baserel we can check if a unique index exists
+	 *    where all of the columns of the index are seen in the join condition
+	 *    with equality operators.
+	 *
+	 * 2. When joining to a subquery we can check if the subquery contains a
+	 *    GROUP BY or DISTINCT clause where all of the columns of the clause
+	 *    appear in the join condition with equality operators.
+	 *
+	 * The code below is written with the assumption that join removal is more
+	 * likely not to happen, for this reason there are fast paths for both of
+	 * the cases above to try to save on unnecessary processing.
 	 */
+
 	if (sjinfo->jointype != JOIN_LEFT ||
 		sjinfo->delay_upper_joins ||
 		bms_membership(sjinfo->min_righthand) != BMS_SINGLETON)
@@ -182,8 +193,12 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 
 	if (innerrel->rtekind == RTE_RELATION)
 	{
+		/*
+		 * If there are no indexes then there are certainly no unique indexes
+		 * so there's no need to go any further.
+		 */
 		if (innerrel->indexlist == NIL)
-			return false; /* no possibility of a unique index */
+			return false;
 	}
 	else if (innerrel->rtekind == RTE_SUBQUERY)
 	{
@@ -348,7 +363,7 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 		 * have matching items in the join condition.
 		 */
 		if (subquery->groupClause != NIL &&
-			sortclause_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
+			sortlist_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
 			return true;
 
 		/*
@@ -356,40 +371,40 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 		 * items in the join condition.
 		 */
 		if (subquery->distinctClause != NIL &&
-			sortclause_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
+			sortlist_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
 			return true;
 	}
+
 	/*
 	 * Some day it would be nice to check for other methods of establishing
-	 * distinctness.  XXX is this comment still true??
+	 * distinctness.
 	 */
 	return false;
 }
 
 /*
- * sortclause_is_unique_on_restrictinfo
+ * sortlist_is_unique_on_restrictinfo
  *
- * Checks to see if all items in sortclause also exist in clause_list.
+ * Checks to see if all items in sortlist also exist in clause_list.
  * The function will return true if clause_list is the same as or a superset
- * of the sortclause. If the sortclause has columns that don't exist in the
- * clause_list then the query can't be guaranteed unique on the clause_list
- * columns.
+ * of sortlist. If the sortlist has Vars that don't exist in the clause_list
+ * then the query can't be guaranteed unique on the clause_list columns.
  *
- * Note: The calling function must ensure that sortclause is not NIL.
+ * Note: The calling function must ensure that sortlist is not NIL.
  */
 static bool
-sortclause_is_unique_on_restrictinfo(Query *query, List *clause_list, List *sortclause)
+sortlist_is_unique_on_restrictinfo(Query *query, List *clause_list, List *sortlist)
 {
 	ListCell *l;
 
-	Assert(sortclause != NIL);
+	Assert(sortlist != NIL);
 
 	/*
-	 * Loop over each sort clause to ensure that we have an item in the join
-	 * conditions that matches it. Note that it does not matter if we have more
-	 * items in the join condition than we have in the sort clause.
+	 * Loop over each sortlist item to ensure that we have an item in the
+	 * clause_list that matches it. Note that it does not matter if we have
+	 * more items in the clause_list than we have in the sortlist.
 	 */
-	foreach(l, sortclause)
+	foreach(l, sortlist)
 	{
 		ListCell		*ri;
 		SortGroupClause *scl = (SortGroupClause *) lfirst(l);
@@ -400,8 +415,8 @@ sortclause_is_unique_on_restrictinfo(Query *query, List *clause_list, List *sort
 		sortTarget = get_sortgroupref_tle(scl->tleSortGroupRef, query->targetList);
 
 		/*
-		 * Since a constant only has 1 value the existence of one here will
-		 * not cause any duplication of the results. We'll simply ignore it!
+		 * We can ignore constants since they have only one value and don't
+		 * affect uniqueness of results.
 		 */
 		if (IsA(sortTarget->expr, Const))
 			continue;
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 202ac0a..4959e5f 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3171,23 +3171,37 @@ select id from a where id in (
          ->  Seq Scan on b
 (5 rows)
 
--- check optimization of outer join when joining a unique sub query using group by
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
 EXPLAIN (COSTS OFF)
-SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
   QUERY PLAN   
 ---------------
  Seq Scan on a
 (1 row)
 
--- check optimization of outer join when joining a unique sub query using distinct
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
 EXPLAIN (COSTS OFF)
-SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
   QUERY PLAN   
 ---------------
  Seq Scan on a
 (1 row)
 
--- check optimization of outer join when joining a unique sub query using distinct
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
   QUERY PLAN   
@@ -3195,7 +3209,7 @@ SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab
  Seq Scan on a
 (1 row)
 
--- optimization is not possible when distinct contains a volatile function
+-- join removal is not possible when distinct contains a volatile function
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
                                         QUERY PLAN                                        
@@ -3209,8 +3223,8 @@ SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a
                ->  Seq Scan on d
 (7 rows)
 
--- check optimization of outer join when joining a unique sub query using distinct
--- and a constant expression.
+-- check that join removal works for a left join when joining a subquery that
+-- is guaranteed to be unique on the join condition even if it contains a Const.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
   QUERY PLAN   
@@ -3218,15 +3232,8 @@ SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.i
  Seq Scan on a
 (1 row)
 
--- check optimization of outer join when joining a unique sub query which contains 2 tables
-EXPLAIN (COSTS OFF)
-SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
-  QUERY PLAN   
----------------
- Seq Scan on a
-(1 row)
-
--- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join. 
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
   QUERY PLAN   
@@ -3234,8 +3241,9 @@ SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GRO
  Seq Scan on a
 (1 row)
 
--- check optimization of outer join when joining a unique sub query which contains 2 tables
--- and contains a redundant join clause
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a
 LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
@@ -3244,7 +3252,8 @@ LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.
  Seq Scan on a
 (1 row)
 
--- optimization is not possible when the group by contains a column which is not in the join condition.
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
            QUERY PLAN            
@@ -3258,7 +3267,8 @@ SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id
          ->  Seq Scan on a
 (7 rows)
 
--- optimization is not possible when distinct clause contains an item that is not in the join clause
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
            QUERY PLAN            
@@ -3272,7 +3282,7 @@ SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id
          ->  Seq Scan on a
 (7 rows)
 
--- optimization is not possible when distinct contains a volatile function
+-- join removal is not possible when DISTINCT contains a volatile function
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
                    QUERY PLAN                    
@@ -3288,7 +3298,7 @@ SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.
                      ->  Seq Scan on b b_1
 (9 rows)
 
--- optimization is not possible when there are any volatile functions in the target list.
+-- join removal is not possible when there are any volatile functions in the target list.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
          QUERY PLAN         
@@ -3302,7 +3312,7 @@ SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY
          ->  Seq Scan on a
 (7 rows)
 
--- optimization is not possible when there are set returning functions in the target list.
+-- join removal is not possible when there are set returning functions in the target list.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
          QUERY PLAN         
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index ba75912..21e29d2 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -938,58 +938,68 @@ select id from a where id in (
 	select b.id from b left join c on b.id = c.id
 );
 
--- check optimization of outer join when joining a unique sub query using group by
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
 EXPLAIN (COSTS OFF)
-SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.id = b.id;
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
 
--- check optimization of outer join when joining a unique sub query using distinct
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
 EXPLAIN (COSTS OFF)
-SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id FROM b) b ON a.id = b.c_id;
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
 
--- check optimization of outer join when joining a unique sub query using distinct
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
 
--- optimization is not possible when distinct contains a volatile function
+-- join removal is not possible when distinct contains a volatile function
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
 
--- check optimization of outer join when joining a unique sub query using distinct
--- and a constant expression.
+-- check that join removal works for a left join when joining a subquery that
+-- is guaranteed to be unique on the join condition even if it contains a Const.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
 
--- check optimization of outer join when joining a unique sub query which contains 2 tables
-EXPLAIN (COSTS OFF)
-SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id;
-
--- check optimization of outer join when joining a unique sub query which contains 2 tables
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
 
--- check optimization of outer join when joining a unique sub query which contains 2 tables
--- and contains a redundant join clause
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a
 LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
 
--- optimization is not possible when the group by contains a column which is not in the join condition.
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
 
--- optimization is not possible when distinct clause contains an item that is not in the join clause
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
 
--- optimization is not possible when distinct contains a volatile function
+-- join removal is not possible when DISTINCT contains a volatile function
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
 
--- optimization is not possible when there are any volatile functions in the target list.
+-- join removal is not possible when there are any volatile functions in the target list.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
 
--- optimization is not possible when there are set returning functions in the target list.
+-- join removal is not possible when there are set returning functions in the target list.
 EXPLAIN (COSTS OFF)
 SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
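
For readers who want the core idea of sortlist_is_unique_on_restrictinfo() without the planner data structures, here is a minimal standalone sketch. It is not the patch's code: GroupItem and group_items_covered_by_join() are hypothetical names invented for illustration, grouping items are reduced to an output column number plus an "is constant" flag, and the join condition is reduced to a plain array of subquery column numbers. The sketch also leaves out the patch's bail-out when a join operand is not a plain Var, and the separate set-returning and volatile function checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for one GROUP BY / DISTINCT item: the subquery
 * output column it refers to, and whether its expression is a constant.
 */
typedef struct GroupItem
{
	int			resno;		/* output column number in the subquery */
	bool		is_const;	/* true if the expression is a constant */
} GroupItem;

/*
 * Returns true when every non-constant grouping item is matched by one of
 * the subquery columns used in the join condition.  Extra join columns do
 * not matter; an unmatched grouping item means the subquery may return
 * several rows for one set of join column values.
 */
bool
group_items_covered_by_join(const GroupItem *items, size_t nitems,
							const int *join_cols, size_t njoin)
{
	/* Mirrors the Assert(sortlist != NIL) precondition in the patch. */
	assert(nitems > 0);

	for (size_t i = 0; i < nitems; i++)
	{
		bool		matched = false;

		/* A constant has only one value, so it cannot add duplicates. */
		if (items[i].is_const)
			continue;

		for (size_t j = 0; j < njoin; j++)
		{
			if (join_cols[j] == items[i].resno)
			{
				matched = true;
				break;
			}
		}

		if (!matched)
			return false;
	}
	return true;
}
```

Mapped onto the regression tests above: a subquery grouped by b.id and joined on b.id is covered, so the join can go; a subquery grouped by b.id,b.c_id and joined only on b.id is not covered, so the join has to stay.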

#38

Simon Riggs

simon@2ndQuadrant.com

over 11 years ago

In reply to: David Rowley (#37)

Re: Allowing join removals for more join types

On 23 June 2014 12:06, David Rowley <dgrowley@gmail.com> wrote:

It's not clear to me where you get the term "sortclause" from. This is
either the groupclause or distinctclause, but in the test cases you
provide this shows this has nothing at all to do with sorting since
there is neither an order by or a sorted aggregate anywhere near those
queries. Can we think of a better name that won't confuse us in the
future?

I probably got the word "sort" from the function targetIsInSortList, which
expects a list of SortGroupClause. I've renamed the function to
sortlist_is_unique_on_restrictinfo() and renamed the sortclause parameter to
sortlist. Hopefully will reduce confusion about it being an ORDER BY clause
a bit more. I think sortgroupclauselist might be just a bit too long. What
do you think?

OK, perhaps I should be clearer. The word "sort" here seems completely
misplaced and we should be using a more accurately descriptive term.
It's slightly more than editing to rename things like that, so I'd
prefer you came up with a better name.

Did you comment on the transitive closure question? Should we add a
test for that, whether or not it works yet?

Other than that it looks pretty good to commit, so I'll wait a week
for other objections then commit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Simon Riggs (#38)

Re: Allowing join removals for more join types

Simon Riggs <simon@2ndQuadrant.com> writes:

Other than that it looks pretty good to commit, so I'll wait a week
for other objections then commit.

I'd like to review this before it goes in. I've been waiting for it to
get marked "ready for committer" though.

regards, tom lane

#40

Simon Riggs

simon@2ndQuadrant.com

over 11 years ago

In reply to: Tom Lane (#39)

Re: Allowing join removals for more join types

On 24 June 2014 23:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Other than that it looks pretty good to commit, so I'll wait a week
for other objections then commit.

I'd like to review this before it goes in. I've been waiting for it to
get marked "ready for committer" though.

I'll leave it for you then once I'm happy.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#41

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Simon Riggs (#35)

Re: Allowing join removals for more join types

On Sun, Jun 22, 2014 at 11:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 17 June 2014 11:04, David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, Jun 4, 2014 at 12:50 AM, Noah Misch <noah@leadboat.com> wrote:

As a point of procedure, I recommend separating the semijoin support into its
own patch. Your patch is already not small; delaying non-essential parts will
make the essential parts more accessible to reviewers.

In the attached patch I've removed all the SEMI and ANTI join removal code
and left only support for LEFT JOIN removal of sub-queries that can be
proved to be unique on the join condition by looking at the GROUP BY and
DISTINCT clause.

Good advice, we can come back for the others later.

Example:

SELECT t1.* FROM t1 LEFT OUTER JOIN (SELECT value,COUNT(*) FROM t2 GROUP BY
value) t2 ON t1.id = t2.value;

Looks good on initial look.

This gets optimized...

EXPLAIN (COSTS OFF)
SELECT a.id FROM a
LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id
GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;

does it work with transitive closure like this..

EXPLAIN (COSTS OFF)
SELECT a.id FROM a
LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id
GROUP BY c.id) b ON a.id = b.id AND b.dummy = 1;

i.e. c.id is not in the join, but we know from subselect that c.id =
b.id and b.id is in the join

Well, there's no code that looks at equivalence of the columns in the
query, but I'm not quite sure if there would have to be or not as I can't
quite think of a way to write that query in a valid way that would cause it
not to remove the join.

The example query will fail with: ERROR: column "b.id" must appear in the
GROUP BY clause or be used in an aggregate function

And if we rewrite it to use c.id in the target list

EXPLAIN (COSTS OFF)
SELECT a.id FROM a
LEFT JOIN (SELECT c.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id
GROUP BY c.id) b ON a.id = b.id AND b.dummy = 1;

With this one c.id becomes b.id, since we've given the subquery the alias
'b', so I don't think there's a case here to optimise anything else.
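
To illustrate why it makes no difference whether the subquery exposes b.id or
c.id, here is a minimal, self-contained C sketch of the column-matching idea in
the patch. The outer query only sees the subquery's output column positions, so
the check compares those positions against the columns used in the join
condition. The function name and the plain int arrays are hypothetical
simplifications, not PostgreSQL's actual structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified model: grouping columns and join columns are both
 * given as 1-based output-column positions of the subquery.
 *
 * Returns true when every grouping column also appears in the join
 * condition, i.e. the join columns are a superset of the grouping columns.
 */
static bool
grouping_cols_covered_by_join(const int *group_cols, size_t ngroup,
                              const int *join_cols, size_t njoin)
{
    size_t g;
    size_t j;

    /* Like the patch, require a non-empty grouping list. */
    assert(ngroup > 0);

    for (g = 0; g < ngroup; g++)
    {
        bool matched = false;

        for (j = 0; j < njoin; j++)
        {
            if (group_cols[g] == join_cols[j])
            {
                matched = true;
                break;          /* this grouping column is covered */
            }
        }

        if (!matched)
            return false;       /* a grouping column is missing from the join */
    }
    return true;
}
```

For a subquery grouped on its first output column and joined on output columns
1 and 2, group_cols = {1} and join_cols = {1, 2}, so the check passes. Extra
join columns are harmless; a missing one is not.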

Regards

David Rowley

#42

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Simon Riggs (#38)

2 attachment(s)

Re: Allowing join removals for more join types

On Wed, Jun 25, 2014 at 10:03 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 23 June 2014 12:06, David Rowley <dgrowley@gmail.com> wrote:

It's not clear to me where you get the term "sortclause" from. This is
either the groupclause or distinctclause, but in the test cases you
provide this shows this has nothing at all to do with sorting since
there is neither an order by or a sorted aggregate anywhere near those
queries. Can we think of a better name that won't confuse us in the
future?

I probably got the word "sort" from the function targetIsInSortList, which
expects a list of SortGroupClause. I've renamed the function to
sortlist_is_unique_on_restrictinfo() and renamed the sortclause parameter to
sortlist. Hopefully will reduce confusion about it being an ORDER BY clause
a bit more. I think sortgroupclauselist might be just a bit too long. What
do you think?

OK, perhaps I should be clearer. The word "sort" here seems completely
misplaced and we should be using a more accurately descriptive term.
It's slightly more than editing to rename things like that, so I'd
prefer you came up with a better name.

hmm, I do see what you mean and understand the concern, but I was a bit
stuck on the fact it is a list of SortGroupClause after all. After a bit of
looking around the source I found a function called grouping_is_sortable
which seems to be getting given ->groupClause and ->distinctClause in a few
places around the source. I've ended up naming the
function groupinglist_is_unique_on_restrictinfo, but I can drop the word
"list" off of that if that's any better? I did have it named
clauselist_is_unique_on_restictinfo for a few minutes, but then I noticed
that this was not very suitable since the calling function uses the
variable name clause_list for the restrictinfo list, which made it even
more confusing.

Attached is a delta patch between version 1.2 and 1.3, and also a
completely updated patch.

Did you comment on the transitive closure question? Should we add a
test for that, whether or not it works yet?

In my previous email.

I could change the following to use c.id in the targetlist and group by
clause, but I'm not really sure it's testing anything new or different.

EXPLAIN (COSTS OFF)
SELECT a.id FROM a
LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP
BY b.id) b ON a.id = b.id AND b.dummy = 1;

Regards

David Rowley

Other than that it looks pretty good to commit, so I'll wait a week
for other objections then commit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

subquery_leftjoin_removal_v1.3.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 129fc3d..34f41d7 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -27,9 +27,15 @@
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
 #include "optimizer/var.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "optimizer/tlist.h"
+#include "nodes/nodeFuncs.h"
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool	groupinglist_is_unique_on_restrictinfo(Query *query,
+					  List *clause_list, List *sortlist);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
@@ -147,19 +153,33 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 {
 	int			innerrelid;
 	RelOptInfo *innerrel;
+	Query	   *subquery;
 	Relids		joinrelids;
 	List	   *clause_list = NIL;
 	ListCell   *l;
 	int			attroff;
 
 	/*
-	 * Currently, we only know how to remove left joins to a baserel with
-	 * unique indexes.  We can check most of these criteria pretty trivially
-	 * to avoid doing useless extra work.  But checking whether any of the
-	 * indexes are unique would require iterating over the indexlist, so for
-	 * now we just make sure there are indexes of some sort or other.  If none
-	 * of them are unique, join removal will still fail, just slightly later.
+	 * Assuming none of the variables from the join are needed by the query,
+	 * it is possible here to remove a left join providing we can determine
+	 * that the join will never produce more than 1 row that matches the join
+	 * condition.
+	 *
+	 * There are a few ways that we can do this:
+	 *
+	 * 1. When joining to a baserel we can check if a unique index exists
+	 *    where all of the columns of the index are seen in the join condition
+	 *    with equality operators.
+	 *
+	 * 2. When joining to a subquery we can check if the subquery contains a
+	 *    GROUP BY or DISTINCT clause where all of the columns of the clause
+	 *    appear in the join condition with equality operators.
+	 *
+	 * The code below is written with the assumption that join removal is more
+	 * likely not to happen, for this reason there are fast paths for both of
+	 * the cases above to try to save on unnecessary processing.
 	 */
+
 	if (sjinfo->jointype != JOIN_LEFT ||
 		sjinfo->delay_upper_joins ||
 		bms_membership(sjinfo->min_righthand) != BMS_SINGLETON)
@@ -168,11 +188,34 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	innerrelid = bms_singleton_member(sjinfo->min_righthand);
 	innerrel = find_base_rel(root, innerrelid);
 
-	if (innerrel->reloptkind != RELOPT_BASEREL ||
-		innerrel->rtekind != RTE_RELATION ||
-		innerrel->indexlist == NIL)
+	if (innerrel->reloptkind != RELOPT_BASEREL)
 		return false;
 
+	if (innerrel->rtekind == RTE_RELATION)
+	{
+		/*
+		 * If there are no indexes then there's certainly no unique indexes
+		 * so there's no need to go any further.
+		 */
+		if (innerrel->indexlist == NIL)
+			return false;
+	}
+	else if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		subquery = root->simple_rte_array[innerrelid]->subquery;
+
+		/*
+		 * The only means we currently use to check if the subquery is unique
+		 * are the GROUP BY and DISTINCT clause. If both of these are empty
+		 * then there's no point in going any further.
+		 */
+		if (subquery->groupClause == NIL &&
+			subquery->distinctClause == NIL)
+			return false;
+	}
+	else
+		return false; /* unsupported rtekind */
+
 	/* Compute the relid set for the join we are considering */
 	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
 
@@ -276,16 +319,137 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 */
 
 	/* Now examine the indexes to see if we have a matching unique index */
-	if (relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
+	if (innerrel->rtekind == RTE_RELATION &&
+		relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
 		return true;
 
 	/*
+	 * We can be certain that the subquery contains no duplicate values for
+	 * the join clause if every item in the subquery's GROUP BY clause is also
+	 * used in the join clause using equality. This works the same way for the
+	 * DISTINCT clause. We need not pay any attention to WHERE or HAVING
+	 * clauses as these just restrict the results more and could not be the
+	 * cause of duplication in the result set. However, there are a number of
+	 * pre-checks we must perform for cases which could cause duplicate values
+	 * even if all the required columns are in the GROUP BY or DISTINCT clause.
+	 *
+	 * NB: We must also not remove the join if the subquery contains a
+	 * FOR UPDATE clause, but we can actually skip this check as GROUP BY and
+	 * DISTINCT cannot be used at the same time as FOR UPDATE.
+	 */
+	if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		Assert(subquery == root->simple_rte_array[innerrelid]->subquery);
+
+		/*
+		 * We cannot remove the subquery if the target list contains any set
+		 * returning functions as these may cause the query not to be unique
+		 * on the grouping columns, as per the following example:
+		 * "SELECT a.a,generate_series(1,2) FROM (VALUES(1)) a(a) GROUP BY a"
+		 */
+		if (expression_returns_set((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * Don't remove the join if the target list contains any volatile
+		 * functions. Doing so may remove desired side effects that calls
+		 * to the function may cause.
+		 */
+		if (contain_volatile_functions((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * It should be safe to remove the join if all the GROUP BY expressions
+		 * have matching items in the join condition.
+		 */
+		if (subquery->groupClause != NIL &&
+			groupinglist_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
+			return true;
+
+		/*
+		 * It should be safe to remove the join if all the DISTINCT column list have matching
+		 * items in the join condition.
+		 */
+		if (subquery->distinctClause != NIL &&
+			groupinglist_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
+			return true;
+	}
+
+	/*
 	 * Some day it would be nice to check for other methods of establishing
 	 * distinctness.
 	 */
 	return false;
 }
 
+/*
+ * groupinglist_is_unique_on_restrictinfo
+ *
+ * Checks to see if all items in groupinglist also exist in rinfolist.
+ * The function will return true if rinfolist is the same as or a superset
+ * of groupinglist. If the groupinglist has Vars that don't exist in the rinfolist
+ * then the query can't be guaranteed unique on the rinfolist columns.
+ *
+ * Note: The calling function must ensure that groupinglist is not NIL.
+ */
+static bool
+groupinglist_is_unique_on_restrictinfo(Query *query, List *rinfolist, List *groupinglist)
+{
+	ListCell *l;
+
+	Assert(groupinglist != NIL);
+
+	/*
+	 * Loop over each groupinglist item to ensure that we have an item in the
+	 * rinfolist that matches it. Note that it does not matter if we have
+	 * more items in the rinfolist than we have in the groupinglist.
+	 */
+	foreach(l, groupinglist)
+	{
+		ListCell		*ri;
+		SortGroupClause *scl = (SortGroupClause *) lfirst(l);
+		TargetEntry		*sortTarget;
+		bool			 matched = false;
+
+		/* look up the target list entry for the current sort group ref */
+		sortTarget = get_sortgroupref_tle(scl->tleSortGroupRef, query->targetList);
+
+		/*
+		 * We can ignore constants since they have only one value and don't
+		 * affect uniqueness of results.
+		 */
+		if (IsA(sortTarget->expr, Const))
+			continue;
+
+		foreach(ri, rinfolist)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(ri);
+			Node	   *rexpr;
+
+			if (rinfo->outer_is_left)
+				rexpr = get_rightop(rinfo->clause);
+			else
+				rexpr = get_leftop(rinfo->clause);
+
+			if (IsA(rexpr, Var))
+			{
+				Var *var = (Var *)rexpr;
+
+				if (var->varattno == sortTarget->resno)
+				{
+					matched = true;
+					break; /* match found */
+				}
+			}
+			else
+				return false;
+		}
+
+		if (!matched)
+			return false;
+	}
+	return true;
+}
 
 /*
  * Remove the target relid from the planner's data structures, having
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index c62a63f..4959e5f 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3131,9 +3131,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
   QUERY PLAN   
@@ -3169,6 +3171,161 @@ select id from a where id in (
          ->  Seq Scan on b
 (5 rows)
 
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- join removal is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+                                        QUERY PLAN                                        
+------------------------------------------------------------------------------------------
+ Hash Left Join
+   Hash Cond: ((a.id)::double precision = ((((d.a + d.b))::double precision + random())))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: (((d.a + d.b))::double precision + random())
+               ->  Seq Scan on d
+(7 rows)
+
+-- check that join removal works for a left join when joining a subquery that
+-- is guaranteed to be unique on the join condition even if it contains a Const.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join. 
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when DISTINCT contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+                   QUERY PLAN                    
+-------------------------------------------------
+ Hash Left Join
+   Hash Cond: (a.id = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Subquery Scan on b
+               Filter: (b.r = random())
+               ->  HashAggregate
+                     Group Key: b_1.id, random()
+                     ->  Seq Scan on b b_1
+(9 rows)
+
+-- join removal is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
 rollback;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 1031f26..21e29d2 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -919,9 +919,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
@@ -936,6 +938,71 @@ select id from a where id in (
 	select b.id from b left join c on b.id = c.id
 );
 
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
+
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+
+-- join removal is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+
+-- check that join removal works for a left join when joining a subquery that
+-- is guaranteed to be unique on the join condition even if it contains a Const.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+
+-- join removal is not possible when DISTINCT contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+
+-- join removal is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+
+-- join removal is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+
 rollback;
 
 create temp table parent (k int primary key, pd int);

subquery_leftjoin_removal_v1.3_delta.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index ea4a9e0..34f41d7 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -34,7 +34,7 @@
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
-static bool	sortlist_is_unique_on_restrictinfo(Query *query,
+static bool	groupinglist_is_unique_on_restrictinfo(Query *query,
 					  List *clause_list, List *sortlist);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
@@ -363,7 +363,7 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 		 * have matching items in the join condition.
 		 */
 		if (subquery->groupClause != NIL &&
-			sortlist_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
+			groupinglist_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
 			return true;
 
 		/*
@@ -371,7 +371,7 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 		 * items in the join condition.
 		 */
 		if (subquery->distinctClause != NIL &&
-			sortlist_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
+			groupinglist_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
 			return true;
 	}
 
@@ -383,28 +383,28 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 }
 
 /*
- * sortlist_is_unique_on_restrictinfo
+ * groupinglist_is_unique_on_restrictinfo
  *
- * Checks to see if all items in sortlist also exist in clause_list.
- * The function will return true if clause_list is the same as or a superset
- * of sortlist. If the sortlist has Vars that don't exist in the clause_list
- * then the query can't be guaranteed unique on the clause_list columns.
+ * Checks to see if all items in groupinglist also exist in rinfolist.
+ * The function will return true if rinfolist is the same as or a superset
+ * of groupinglist. If the groupinglist has Vars that don't exist in the rinfolist
+ * then the query can't be guaranteed unique on the rinfolist columns.
  *
- * Note: The calling function must ensure that sortlist is not NIL.
+ * Note: The calling function must ensure that groupinglist is not NIL.
  */
 static bool
-sortlist_is_unique_on_restrictinfo(Query *query, List *clause_list, List *sortlist)
+groupinglist_is_unique_on_restrictinfo(Query *query, List *rinfolist, List *groupinglist)
 {
 	ListCell *l;
 
-	Assert(sortlist != NIL);
+	Assert(groupinglist != NIL);
 
 	/*
-	 * Loop over each sortlist item to ensure that we have an item in the
-	 * clause_list that matches it. Note that it does not matter if we have
-	 * more items in the clause_list than we have in the sortlist.
+	 * Loop over each groupinglist item to ensure that we have an item in the
+	 * rinfolist that matches it. Note that it does not matter if we have
+	 * more items in the rinfolist than we have in the groupinglist.
 	 */
-	foreach(l, sortlist)
+	foreach(l, groupinglist)
 	{
 		ListCell		*ri;
 		SortGroupClause *scl = (SortGroupClause *) lfirst(l);
@@ -421,7 +421,7 @@ sortlist_is_unique_on_restrictinfo(Query *query, List *clause_list, List *sortli
 		if (IsA(sortTarget->expr, Const))
 			continue;
 
-		foreach(ri, clause_list)
+		foreach(ri, rinfolist)
 		{
 			RestrictInfo *rinfo = (RestrictInfo *) lfirst(ri);
 			Node	   *rexpr;

#43

Simon Riggs

simon@2ndQuadrant.com

over 11 years ago

In reply to: David Rowley (#42)

Re: Allowing join removals for more join types

On 26 June 2014 10:01, David Rowley <dgrowleyml@gmail.com> wrote:

Did you comment on the transitive closure question? Should we add a
test for that, whether or not it works yet?

In my previous email.

I could change the following to use c.id in the targetlist and group by
clause, but I'm not really sure it's testing anything new or different.

EXPLAIN (COSTS OFF)
SELECT a.id FROM a
LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP
BY b.id) b ON a.id = b.id AND b.dummy = 1;

OK, agreed, no need to include.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#44

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#42)

Re: Allowing join removals for more join types

David Rowley <dgrowleyml@gmail.com> writes:

Attached is a delta patch between version 1.2 and 1.3, and also a
completely updated patch.

Just to note that I've started looking at this, and I've detected a rather
significant omission: there's no check that the join operator has anything
to do with the subquery's grouping operator. I think we need to verify
that they are members of the same opclass, as
relation_has_unique_index_for does.

regards, tom lane

#45

David Rowley

dgrowley@gmail.com

over 11 years ago

In reply to: Tom Lane (#44)

2 attachment(s)

Re: Allowing join removals for more join types

On 6 July 2014 03:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

Attached is a delta patch between version 1.2 and 1.3, and also a
completely updated patch.

Just to note that I've started looking at this, and I've detected a rather
significant omission: there's no check that the join operator has anything
to do with the subquery's grouping operator. I think we need to verify
that they are members of the same opclass, as
relation_has_unique_index_for does.

hmm, good point. If I understand this correctly we can just ensure that the
same operator is used for both the grouping and the join condition.
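
A minimal sketch of that idea, assuming simplified stand-in structs rather
than PostgreSQL's SortGroupClause and RestrictInfo: a grouping column only
counts as matched when some join clause references the same subquery output
column and uses the same equality operator. All type and function names below
are hypothetical, the operator ids are arbitrary numbers, and requiring the
identical operator is a narrower test than the opclass-membership check Tom
describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a grouping item: output column + equality operator id. */
typedef struct GroupItem
{
    int         resno;
    unsigned    eqop;
} GroupItem;

/* Hypothetical stand-in for a join clause: inner-side column + operator id. */
typedef struct JoinClause
{
    int         attno;
    unsigned    opno;
} JoinClause;

/*
 * Returns true only if every grouping item is matched by a join clause on
 * the same column that also uses the same equality operator.
 */
static bool
grouping_unique_for_join(const GroupItem *groups, size_t ngroups,
                         const JoinClause *clauses, size_t nclauses)
{
    size_t g;
    size_t c;

    assert(ngroups > 0);    /* caller must pass a non-empty grouping list */

    for (g = 0; g < ngroups; g++)
    {
        bool matched = false;

        for (c = 0; c < nclauses; c++)
        {
            /* Column match alone is not enough; the operator must agree too. */
            if (clauses[c].attno == groups[g].resno &&
                clauses[c].opno == groups[g].eqop)
            {
                matched = true;
                break;
            }
        }

        if (!matched)
            return false;
    }
    return true;
}
```

With this shape, a join clause on the right column but with a different
notion of equality than the GROUP BY or DISTINCT used no longer proves
uniqueness, which is the omission pointed out above.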

I've attached a small delta patch which fixes this, and also attached the
full updated patch.

Regards

David Rowley

Attachments:

subquery_leftjoin_removal_v1.4.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 129fc3d..1cdb311 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -27,9 +27,15 @@
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
 #include "optimizer/var.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "optimizer/tlist.h"
+#include "nodes/nodeFuncs.h"
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
+static bool	groupinglist_is_unique_on_restrictinfo(Query *query,
+					  List *clause_list, List *sortlist);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
@@ -147,19 +153,33 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 {
 	int			innerrelid;
 	RelOptInfo *innerrel;
+	Query	   *subquery;
 	Relids		joinrelids;
 	List	   *clause_list = NIL;
 	ListCell   *l;
 	int			attroff;
 
 	/*
-	 * Currently, we only know how to remove left joins to a baserel with
-	 * unique indexes.  We can check most of these criteria pretty trivially
-	 * to avoid doing useless extra work.  But checking whether any of the
-	 * indexes are unique would require iterating over the indexlist, so for
-	 * now we just make sure there are indexes of some sort or other.  If none
-	 * of them are unique, join removal will still fail, just slightly later.
+	 * Assuming none of the variables from the join are needed by the query,
+	 * it is possible here to remove a left join providing we can determine
+	 * that the join will never produce more than 1 row that matches the join
+	 * condition.
+	 *
+	 * There are a few ways that we can do this:
+	 *
+	 * 1. When joining to a baserel we can check if a unique index exists
+	 *    where all of the columns of the index are seen in the join condition
+	 *    with equality operators.
+	 *
+	 * 2. When joining to a subquery we can check if the subquery contains a
+	 *    GROUP BY or DISTINCT clause where all of the columns of the clause
+	 *    appear in the join condition with equality operators.
+	 *
+	 * The code below is written with the assumption that join removal is more
+	 * likely not to happen, for this reason there are fast paths for both of
+	 * the cases above to try to save on unnecessary processing.
 	 */
+
 	if (sjinfo->jointype != JOIN_LEFT ||
 		sjinfo->delay_upper_joins ||
 		bms_membership(sjinfo->min_righthand) != BMS_SINGLETON)
@@ -168,11 +188,34 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	innerrelid = bms_singleton_member(sjinfo->min_righthand);
 	innerrel = find_base_rel(root, innerrelid);
 
-	if (innerrel->reloptkind != RELOPT_BASEREL ||
-		innerrel->rtekind != RTE_RELATION ||
-		innerrel->indexlist == NIL)
+	if (innerrel->reloptkind != RELOPT_BASEREL)
 		return false;
 
+	if (innerrel->rtekind == RTE_RELATION)
+	{
+		/*
+		 * If there are no indexes then there's certainly no unique indexes
+		 * so there's no need to go any further.
+		 */
+		if (innerrel->indexlist == NIL)
+			return false;
+	}
+	else if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		subquery = root->simple_rte_array[innerrelid]->subquery;
+
+		/*
+		 * The only means we currently use to check if the subquery is unique
+		 * are the GROUP BY and DISTINCT clause. If both of these are empty
+		 * then there's no point in going any further.
+		 */
+		if (subquery->groupClause == NIL &&
+			subquery->distinctClause == NIL)
+			return false;
+	}
+	else
+		return false; /* unsupported rtekind */
+
 	/* Compute the relid set for the join we are considering */
 	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
 
@@ -276,16 +319,140 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 */
 
 	/* Now examine the indexes to see if we have a matching unique index */
-	if (relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
+	if (innerrel->rtekind == RTE_RELATION &&
+		relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
 		return true;
 
 	/*
+	 * We can be certain that the sub query contains no duplicate values for
+	 * the join clause if item in the sub query's GROUP BY clause is also used
+	 * in the join clause using equality. This works the same way for the
+	 * DISTINCT clause. We need not pay any attention to WHERE or HAVING
+	 * clauses as these just restrict the results more and could not be the
+	 * cause of duplication in the result set. However there are a number of
+	 * pre-checks we must perform which could cause duplicate values even if
+	 * all the required columns are in the GROUP BY or DISTINCT clause.
+	 *
+	 * NB: We must also not remove the join in the subquery contains a
+	 * FOR UDPATE clause, but we can actually skip this check as GROUP BY and
+	 * DISTINCT cannot be used at the same time as FOR UPDATE.
+	 */
+	if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		Assert(subquery == root->simple_rte_array[innerrelid]->subquery);
+
+		/*
+		 * We cannot remove the subquery if the target list contains any set
+		 * returning functions as these may cause the query not to be unique
+		 * on the grouping columns, as per the following example:
+		 * "SELECT a.a,generate_series(1,2) FROM (VALUES(1)) a(a) GROUP BY a"
+		 */
+		if (expression_returns_set((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * Don't remove the join if the target list contains any volatile
+		 * functions. Doing so may remove desired side affects that calls
+		 * to the function may cause.
+		 */
+		if (contain_volatile_functions((Node *) subquery->targetList))
+			return false;
+
+		/*
+		 * It should be safe to remove the join if all the GROUP BY expressions
+		 * have matching items in the join condition.
+		 */
+		if (subquery->groupClause != NIL &&
+			groupinglist_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
+			return true;
+
+		/*
+		 * It should be safe to remove the join if all the DISTINCT column list have matching
+		 * items in the join condition.
+		 */
+		if (subquery->distinctClause != NIL &&
+			groupinglist_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
+			return true;
+	}
+
+	/*
 	 * Some day it would be nice to check for other methods of establishing
 	 * distinctness.
 	 */
 	return false;
 }
 
+/*
+ * groupinglist_is_unique_on_restrictinfo
+ *
+ * Checks to see if all items in groupinglist also exist in rinfolist.
+ * The function will return true if rinfolist is the same as or a superset
+ * of groupinglist. If the groupinglist has Vars that don't exist in the rinfolist
+ * then the query can't be guaranteed unique on the rinfolist columns.
+ *
+ * Note: The calling function must ensure that groupinglist is not NIL.
+ */
+static bool
+groupinglist_is_unique_on_restrictinfo(Query *query, List *rinfolist, List *groupinglist)
+{
+	ListCell *l;
+
+	Assert(groupinglist != NIL);
+
+	/*
+	 * Loop over each groupinglist item to ensure that we have restrictinfo
+	 * item to match. We also need to ensure that the operators used in the
+	 * groupinglist matches that of the one in the restrict info.
+	 * Note that it does not matter if we have more items in the rinfolist than
+	 * we have in the groupinglist.
+	 */
+	foreach(l, groupinglist)
+	{
+		ListCell		*ri;
+		SortGroupClause *scl = (SortGroupClause *) lfirst(l);
+		TargetEntry		*sortTarget;
+		bool			 matched = false;
+
+		/* lookup the target list entry for the current sort sort group ref */
+		sortTarget = get_sortgroupref_tle(scl->tleSortGroupRef, query->targetList);
+
+		/*
+		 * We can ignore constants since they have only one value and don't
+		 * affect uniqueness of results.
+		 */
+		if (IsA(sortTarget->expr, Const))
+			continue;
+
+		foreach(ri, rinfolist)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(ri);
+			Node	   *rexpr;
+
+			if (rinfo->outer_is_left)
+				rexpr = get_rightop(rinfo->clause);
+			else
+				rexpr = get_leftop(rinfo->clause);
+
+			if (IsA(rexpr, Var))
+			{
+				Var *var = (Var *)rexpr;
+
+				if (var->varattno == sortTarget->resno &&
+					scl->eqop == rinfo->hashjoinoperator)
+				{
+					matched = true;
+					break; /* match found */
+				}
+			}
+			else
+				return false;
+		}
+
+		if (!matched)
+			return false;
+	}
+	return true;
+}
 
 /*
  * Remove the target relid from the planner's data structures, having
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index c62a63f..4959e5f 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3131,9 +3131,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
   QUERY PLAN   
@@ -3169,6 +3171,161 @@ select id from a where id in (
          ->  Seq Scan on b
 (5 rows)
 
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- join removal is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+                                        QUERY PLAN                                        
+------------------------------------------------------------------------------------------
+ Hash Left Join
+   Hash Cond: ((a.id)::double precision = ((((d.a + d.b))::double precision + random())))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: (((d.a + d.b))::double precision + random())
+               ->  Seq Scan on d
+(7 rows)
+
+-- check that join removal works for a left join when joining a subquery that
+-- is guaranteed to be unique on the join condition even if it contains a Const.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join. 
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when DISTINCT contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+                   QUERY PLAN                    
+-------------------------------------------------
+ Hash Left Join
+   Hash Cond: (a.id = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Subquery Scan on b
+               Filter: (b.r = random())
+               ->  HashAggregate
+                     Group Key: b_1.id, random()
+                     ->  Seq Scan on b b_1
+(9 rows)
+
+-- join removal is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
 rollback;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 1031f26..21e29d2 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -919,9 +919,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
@@ -936,6 +938,71 @@ select id from a where id in (
 	select b.id from b left join c on b.id = c.id
 );
 
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
+
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+
+-- join removal is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+
+-- check that join removal works for a left join when joining a subquery that
+-- is guaranteed to be unique on the join condition even if it contains a Const.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+
+-- join removal is not possible when DISTINCT contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+
+-- join removal is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+
+-- join removal is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+
 rollback;
 
 create temp table parent (k int primary key, pd int);

subquery_leftjoin_removal_v1.4_delta.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 34f41d7..1cdb311 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -400,9 +400,11 @@ groupinglist_is_unique_on_restrictinfo(Query *query, List *rinfolist, List *grou
 	Assert(groupinglist != NIL);
 
 	/*
-	 * Loop over each groupinglist item to ensure that we have an item in the
-	 * rinfolist that matches it. Note that it does not matter if we have
-	 * more items in the rinfolist than we have in the groupinglist.
+	 * Loop over each groupinglist item to ensure that we have restrictinfo
+	 * item to match. We also need to ensure that the operators used in the
+	 * groupinglist matches that of the one in the restrict info.
+	 * Note that it does not matter if we have more items in the rinfolist than
+	 * we have in the groupinglist.
 	 */
 	foreach(l, groupinglist)
 	{
@@ -435,7 +437,8 @@ groupinglist_is_unique_on_restrictinfo(Query *query, List *rinfolist, List *grou
 			{
 				Var *var = (Var *)rexpr;
 
-				if (var->varattno == sortTarget->resno)
+				if (var->varattno == sortTarget->resno &&
+					scl->eqop == rinfo->hashjoinoperator)
 				{
 					matched = true;
 					break; /* match found */

#46

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#45)

Re: Allowing join removals for more join types

David Rowley <dgrowley@gmail.com> writes:

On 6 July 2014 03:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Just to note that I've started looking at this, and I've detected a rather
significant omission: there's no check that the join operator has anything
to do with the subquery's grouping operator.

hmm, good point. If I understand this correctly we can just ensure that the
same operator is used for both the grouping and the join condition.

Well, that's sort of the zero-order solution, but it doesn't work if the
join operators are cross-type.

I poked around to see if we didn't have some code already for that, and
soon found that not only do we have such code (equality_ops_are_compatible)
but actually almost this entire patch duplicates logic that already exists
in optimizer/util/pathnode.c, to wit create_unique_path's subroutines
query_is_distinct_for et al. So I'm thinking what this needs to turn into
is an exercise in refactoring to allow that logic to be used for both
purposes.
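
To illustrate the cross-type problem, here is a small self-contained sketch. It is a toy model only: the operator names, the family bitmasks and `toy_ops_compatible` are invented for this example and are not PostgreSQL's catalog data or the real equality_ops_are_compatible. The assumption it encodes is that two equality operators can be treated as agreeing on "equal" when they are the same operator or share an operator family.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical equality operators; real ones are identified by OID. */
typedef enum ToyOp
{
	TOY_INT4_EQ_INT4,
	TOY_INT8_EQ_INT8,
	TOY_INT4_EQ_INT8,		/* cross-type operator */
	TOY_TEXT_EQ_TEXT,
	TOY_NUM_OPS
} ToyOp;

/* Hypothetical operator families, one bit each. */
#define TOY_FAM_INTEGER	(1u << 0)
#define TOY_FAM_TEXT	(1u << 1)

/* Families each toy operator is assumed to belong to. */
static const unsigned toy_op_families[TOY_NUM_OPS] = {
	TOY_FAM_INTEGER,		/* TOY_INT4_EQ_INT4 */
	TOY_FAM_INTEGER,		/* TOY_INT8_EQ_INT8 */
	TOY_FAM_INTEGER,		/* TOY_INT4_EQ_INT8 */
	TOY_FAM_TEXT			/* TOY_TEXT_EQ_TEXT */
};

/*
 * Two operators are compatible if they are identical or if they have at
 * least one operator family in common.
 */
bool
toy_ops_compatible(ToyOp a, ToyOp b)
{
	assert(a < TOY_NUM_OPS && b < TOY_NUM_OPS);

	if (a == b)
		return true;
	return (toy_op_families[a] & toy_op_families[b]) != 0;
}
```

In this model a subquery grouped with `TOY_INT4_EQ_INT4` and joined with `TOY_INT4_EQ_INT8` fails a plain "same operator" comparison but passes the family-based one, which is the case the zero-order solution misses.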

I notice that create_unique_path is not paying attention to the question
of whether the subselect's tlist contains SRFs or volatile functions.
It's possible that that's a pre-existing bug.

regards, tom lane

#47

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Tom Lane (#46)

Re: Allowing join removals for more join types

On Mon, Jul 7, 2014 at 4:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowley@gmail.com> writes:

On 6 July 2014 03:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Just to note that I've started looking at this, and I've detected a rather
significant omission: there's no check that the join operator has anything
to do with the subquery's grouping operator.

hmm, good point. If I understand this correctly we can just ensure that the
same operator is used for both the grouping and the join condition.

Well, that's sort of the zero-order solution, but it doesn't work if the
join operators are cross-type.

I poked around to see if we didn't have some code already for that, and
soon found that not only do we have such code (equality_ops_are_compatible)
but actually almost this entire patch duplicates logic that already exists
in optimizer/util/pathnode.c, to wit create_unique_path's subroutines
query_is_distinct_for et al. So I'm thinking what this needs to turn into
is an exercise in refactoring to allow that logic to be used for both
purposes.

Well, it seems that might just reduce the patch size a little!
I currently have this half hacked up to use query_is_distinct_for, but I
see there's no code that allows Consts to exist in the join condition. I
had allowed for this in groupinglist_is_unique_on_restrictinfo() and I
tested that it worked in a regression test (which now fails). I think to fix
this, all it would take would be to modify query_is_distinct_for to take a
list of Nodes rather than a list of column numbers, then just add some
logic that skips the item if it's a Const and checks it as it does now if
it's a Var. Would you see a change of this kind as a valid refactor for
this patch?

I notice that create_unique_path is not paying attention to the question
of whether the subselect's tlist contains SRFs or volatile functions.
It's possible that that's a pre-existing bug.

*shrug*, perhaps the logic for that is best moved into
query_is_distinct_for then? It might save a bug in the future too that way.

Regards

David Rowley

#48

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#47)

Re: Allowing join removals for more join types

David Rowley <dgrowleyml@gmail.com> writes:

On Mon, Jul 7, 2014 at 4:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I poked around to see if we didn't have some code already for that, and
soon found that not only do we have such code (equality_ops_are_compatible)
but actually almost this entire patch duplicates logic that already exists
in optimizer/util/pathnode.c, to wit create_unique_path's subroutines
query_is_distinct_for et al. So I'm thinking what this needs to turn into
is an exercise in refactoring to allow that logic to be used for both
purposes.

Well, it seems that might just reduce the patch size a little!
I currently have this half hacked up to use query_is_distinct_for, but I
see there's no code that allows Const's to exist in the join condition. I
had allowed for this in groupinglist_is_unique_on_restrictinfo() and I
tested it worked in a regression test (which now fails). I think to fix
this, all it would take would be to modify query_is_distinct_for to take a
list of Node's rather than a list of column numbers then just add some
logic that skips if it's a Const and checks it as it does now if it's a Var
Would you see a change of this kind a valid refactor for this patch?

I'm a bit skeptical as to whether testing for that case is actually worth
any extra complexity. Do you have a compelling use-case? But anyway,
if we do want to allow it, why does it take any more than adding a check
for Consts to the loops in query_is_distinct_for? It's the targetlist
entries where we'd want to allow Consts, not the join conditions.
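
A minimal sketch of that suggestion, written as a standalone toy rather than as a change to query_is_distinct_for: the `ToyTargetEntry` struct and the function name are hypothetical, and the real function also compares operators, which is left out here. Each DISTINCT or GROUP BY column of the subquery must appear in the list of joined column numbers, unless its targetlist entry is a constant, in which case it is skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a subquery targetlist entry. */
typedef struct ToyTargetEntry
{
	int			resno;		/* output column number */
	bool		is_const;	/* true if the expression is a plain constant */
} ToyTargetEntry;

/*
 * Return true if every distinct/grouping column is either a constant or is
 * present in colnos, the output columns constrained by the join condition.
 */
bool
toy_distinct_cols_covered(const ToyTargetEntry *distinct_cols, size_t ndistinct,
						  const int *colnos, size_t ncolnos)
{
	size_t		d;
	size_t		c;

	assert(distinct_cols != NULL && ndistinct > 0);

	for (d = 0; d < ndistinct; d++)
	{
		bool		found = false;

		/* A constant has a single value, so it cannot add duplicates. */
		if (distinct_cols[d].is_const)
			continue;

		for (c = 0; c < ncolnos; c++)
		{
			if (colnos[c] == distinct_cols[d].resno)
			{
				found = true;
				break;
			}
		}

		if (!found)
			return false;
	}
	return true;
}
```

With this rule, a subquery such as `SELECT DISTINCT b.c_id, 1 AS dummy FROM b` joined only on `c_id` is still treated as unique on the join columns, matching the Const regression test in the v1.4 patch.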

regards, tom lane

#49

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Tom Lane (#48)

2 attachment(s)

Re: Allowing join removals for more join types

On Tue, Jul 8, 2014 at 4:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

On Mon, Jul 7, 2014 at 4:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I poked around to see if we didn't have some code already for that, and
soon found that not only do we have such code (equality_ops_are_compatible)
but actually almost this entire patch duplicates logic that already exists
in optimizer/util/pathnode.c, to wit create_unique_path's subroutines
query_is_distinct_for et al. So I'm thinking what this needs to turn into
is an exercise in refactoring to allow that logic to be used for both
purposes.

Well, it seems that might just reduce the patch size a little!
I currently have this half hacked up to use query_is_distinct_for, but I
see there's no code that allows Const's to exist in the join condition. I
had allowed for this in groupinglist_is_unique_on_restrictinfo() and I
tested it worked in a regression test (which now fails). I think to fix
this, all it would take would be to modify query_is_distinct_for to take a
list of Node's rather than a list of column numbers then just add some
logic that skips if it's a Const and checks it as it does now if it's a Var
Would you see a change of this kind a valid refactor for this patch?

I'm a bit skeptical as to whether testing for that case is actually worth
any extra complexity. Do you have a compelling use-case? But anyway,
if we do want to allow it, why does it take any more than adding a check
for Consts to the loops in query_is_distinct_for? It's the targetlist
entries where we'd want to allow Consts, not the join conditions.

I don't really have a compelling use-case, but you're right: it's just a
Const check in query_is_distinct_for(). It seemed simple enough, so I've
included that in my refactor of the patch to use query_is_distinct_for().
This allows all the regression tests to pass again.

I've included an updated patch and a delta patch.

Now a couple of things to note:

1. The fast path code that existed in join_is_removable() for subqueries
when the subquery had no GROUP BY or DISTINCT clause is now gone. I wasn't too
sure that I wanted to assume too much about what query_is_distinct_for may
do in the future, and I thought that if I included some logic in
join_is_removable() to fast path, then one day it may fast path wrongly.
Perhaps we could protect against this with a small note in
query_is_distinct_for().

2. If the patch I submitted here
/messages/by-id/CAApHDvrfVkH0P3FAooGcckBy7feCJ9QFanKLkX7MWsBcxY2Vcg@mail.gmail.com
gets accepted, then it makes the check for set returning functions
in join_is_removable() redundant.

Regards

David Rowley

Attachments:

subquery_leftjoin_removal_v1.5.delta.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 1cdb311..bc1929c 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,20 +22,16 @@
  */
 #include "postgres.h"
 
+#include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
 #include "optimizer/var.h"
-#include "optimizer/clauses.h"
-#include "parser/parsetree.h"
-#include "optimizer/tlist.h"
-#include "nodes/nodeFuncs.h"
 
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
-static bool	groupinglist_is_unique_on_restrictinfo(Query *query,
-					  List *clause_list, List *sortlist);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
 					  Relids joinrelids);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
@@ -153,7 +149,6 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 {
 	int			innerrelid;
 	RelOptInfo *innerrel;
-	Query	   *subquery;
 	Relids		joinrelids;
 	List	   *clause_list = NIL;
 	ListCell   *l;
@@ -176,8 +171,8 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 *    appear in the join condition with equality operators.
 	 *
 	 * The code below is written with the assumption that join removal is more
-	 * likely not to happen, for this reason there are fast paths for both of
-	 * the cases above to try to save on unnecessary processing.
+	 * likely not to happen, for this reason we try to fast path out of this
+	 * function early when possible.
 	 */
 
 	if (sjinfo->jointype != JOIN_LEFT ||
@@ -200,20 +195,7 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 		if (innerrel->indexlist == NIL)
 			return false;
 	}
-	else if (innerrel->rtekind == RTE_SUBQUERY)
-	{
-		subquery = root->simple_rte_array[innerrelid]->subquery;
-
-		/*
-		 * The only means we currently use to check if the subquery is unique
-		 * are the GROUP BY and DISTINCT clause. If both of these are empty
-		 * then there's no point in going any further.
-		 */
-		if (subquery->groupClause == NIL &&
-			subquery->distinctClause == NIL)
-			return false;
-	}
-	else
+	else if (innerrel->rtekind != RTE_SUBQUERY)
 		return false; /* unsupported rtekind */
 
 	/* Compute the relid set for the join we are considering */
@@ -324,134 +306,76 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 		return true;
 
 	/*
-	 * We can be certain that the sub query contains no duplicate values for
-	 * the join clause if item in the sub query's GROUP BY clause is also used
-	 * in the join clause using equality. This works the same way for the
-	 * DISTINCT clause. We need not pay any attention to WHERE or HAVING
-	 * clauses as these just restrict the results more and could not be the
-	 * cause of duplication in the result set. However there are a number of
-	 * pre-checks we must perform which could cause duplicate values even if
-	 * all the required columns are in the GROUP BY or DISTINCT clause.
+	 * For subqueries we should be able to remove the join if the subquery
+	 * can't produce more than 1 record which matches the outer query on the
+	 * join condition. However, there are a few preconditions that the
+	 * subquery must meet for it to be safe to remove:
+	 *
+	 * 1. The subquery mustn't contain a FOR UPDATE clause. Removing such a
+	 *    join would have the undesired side effect of not locking the rows.
+	 *
+	 * 2. The subquery mustn't contain any volatile functions. Removing such
+	 *    a join would prevent any side effects of the volatile functions
+	 *    from occurring.
 	 *
-	 * NB: We must also not remove the join in the subquery contains a
-	 * FOR UDPATE clause, but we can actually skip this check as GROUP BY and
-	 * DISTINCT cannot be used at the same time as FOR UPDATE.
+	 * 3. The subquery mustn't contain any set returning functions. These can
+	 *    cause duplicate records despite the existence of a DISTINCT or
+	 *    GROUP BY clause which could otherwise make the subquery unique.
 	 */
 	if (innerrel->rtekind == RTE_SUBQUERY)
 	{
-		Assert(subquery == root->simple_rte_array[innerrelid]->subquery);
+		List	*colnos;
+		List	*opids;
+		Query	*subquery = root->simple_rte_array[innerrelid]->subquery;
 
-		/*
-		 * We cannot remove the subquery if the target list contains any set
-		 * returning functions as these may cause the query not to be unique
-		 * on the grouping columns, as per the following example:
-		 * "SELECT a.a,generate_series(1,2) FROM (VALUES(1)) a(a) GROUP BY a"
-		 */
-		if (expression_returns_set((Node *) subquery->targetList))
+		/* check point 1 */
+		if (subquery->hasForUpdate)
 			return false;
 
-		/*
-		 * Don't remove the join if the target list contains any volatile
-		 * functions. Doing so may remove desired side affects that calls
-		 * to the function may cause.
-		 */
+		/* check point 2 */
 		if (contain_volatile_functions((Node *) subquery->targetList))
 			return false;
 
-		/*
-		 * It should be safe to remove the join if all the GROUP BY expressions
-		 * have matching items in the join condition.
-		 */
-		if (subquery->groupClause != NIL &&
-			groupinglist_is_unique_on_restrictinfo(subquery, clause_list, subquery->groupClause))
-			return true;
-
-		/*
-		 * It should be safe to remove the join if all the DISTINCT column list have matching
-		 * items in the join condition.
-		 */
-		if (subquery->distinctClause != NIL &&
-			groupinglist_is_unique_on_restrictinfo(subquery, clause_list, subquery->distinctClause))
-			return true;
-	}
-
-	/*
-	 * Some day it would be nice to check for other methods of establishing
-	 * distinctness.
-	 */
-	return false;
-}
-
-/*
- * groupinglist_is_unique_on_restrictinfo
- *
- * Checks to see if all items in groupinglist also exist in rinfolist.
- * The function will return true if rinfolist is the same as or a superset
- * of groupinglist. If the groupinglist has Vars that don't exist in the rinfolist
- * then the query can't be guaranteed unique on the rinfolist columns.
- *
- * Note: The calling function must ensure that groupinglist is not NIL.
- */
-static bool
-groupinglist_is_unique_on_restrictinfo(Query *query, List *rinfolist, List *groupinglist)
-{
-	ListCell *l;
-
-	Assert(groupinglist != NIL);
-
-	/*
-	 * Loop over each groupinglist item to ensure that we have restrictinfo
-	 * item to match. We also need to ensure that the operators used in the
-	 * groupinglist matches that of the one in the restrict info.
-	 * Note that it does not matter if we have more items in the rinfolist than
-	 * we have in the groupinglist.
-	 */
-	foreach(l, groupinglist)
-	{
-		ListCell		*ri;
-		SortGroupClause *scl = (SortGroupClause *) lfirst(l);
-		TargetEntry		*sortTarget;
-		bool			 matched = false;
+		/* check point 3 */
+		if (expression_returns_set((Node *) subquery->targetList))
+			return false;
 
-		/* lookup the target list entry for the current sort sort group ref */
-		sortTarget = get_sortgroupref_tle(scl->tleSortGroupRef, query->targetList);
+		colnos = NULL;
+		opids = NULL;
 
 		/*
-		 * We can ignore constants since they have only one value and don't
-		 * affect uniqueness of results.
+		 * Build a list of varattnos that we require the subquery to be unique over.
+		 * We also build a list of the operators that are used with these vars in the
+		 * join condition so that query_is_distinct_for can check that these
+		 * operators are compatible with the GROUP BY or DISTINCT clause in the
+		 * subquery.
 		 */
-		if (IsA(sortTarget->expr, Const))
-			continue;
-
-		foreach(ri, rinfolist)
+		foreach(l, clause_list)
 		{
-			RestrictInfo *rinfo = (RestrictInfo *) lfirst(ri);
-			Node	   *rexpr;
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Var			 *var;
 
 			if (rinfo->outer_is_left)
-				rexpr = get_rightop(rinfo->clause);
-			else
-				rexpr = get_leftop(rinfo->clause);
-
-			if (IsA(rexpr, Var))
-			{
-				Var *var = (Var *)rexpr;
-
-				if (var->varattno == sortTarget->resno &&
-					scl->eqop == rinfo->hashjoinoperator)
-				{
-					matched = true;
-					break; /* match found */
-				}
-			}
+				var = (Var *) get_rightop(rinfo->clause);
 			else
-				return false;
+				var = (Var *) get_leftop(rinfo->clause);
+
+			if (!var || !IsA(var, Var) ||
+				var->varno != innerrelid)
+				continue;
+
+			colnos = lappend_int(colnos, var->varattno);
+			opids = lappend_oid(opids, rinfo->hashjoinoperator);
 		}
 
-		if (!matched)
-			return false;
+		return query_is_distinct_for(subquery, colnos, opids);
 	}
-	return true;
+
+	/*
+	 * Some day it would be nice to check for other methods of establishing
+	 * distinctness.
+	 */
+	return false;
 }
 
 /*
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 4e05dcd..f701954 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -38,7 +38,6 @@ typedef enum
 } PathCostComparison;
 
 static List *translate_sub_tlist(List *tlist, int relid);
-static bool query_is_distinct_for(Query *query, List *colnos, List *opids);
 static Oid	distinct_col_search(int colno, List *colnos, List *opids);
 
 
@@ -1465,7 +1464,7 @@ translate_sub_tlist(List *tlist, int relid)
  * should give trustworthy answers for all operators that we might need
  * to deal with here.)
  */
-static bool
+bool
 query_is_distinct_for(Query *query, List *colnos, List *opids)
 {
 	ListCell   *l;
@@ -1486,6 +1485,13 @@ query_is_distinct_for(Query *query, List *colnos, List *opids)
 			TargetEntry *tle = get_sortgroupclause_tle(sgc,
 													   query->targetList);
 
+			/*
+			 * We can ignore constants since they have only one value and don't
+			 * affect uniqueness of results.
+			 */
+			if (IsA(tle->expr, Const))
+				continue;
+
 			opid = distinct_col_search(tle->resno, colnos, opids);
 			if (!OidIsValid(opid) ||
 				!equality_ops_are_compatible(opid, sgc->eqop))
@@ -1507,6 +1513,13 @@ query_is_distinct_for(Query *query, List *colnos, List *opids)
 			TargetEntry *tle = get_sortgroupclause_tle(sgc,
 													   query->targetList);
 
+			/*
+			 * We can ignore constants since they have only one value and don't
+			 * affect uniqueness of results.
+			 */
+			if (IsA(tle->expr, Const))
+				continue;
+
 			opid = distinct_col_search(tle->resno, colnos, opids);
 			if (!OidIsValid(opid) ||
 				!equality_ops_are_compatible(opid, sgc->eqop))
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a0bcc82..2f571f5 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,6 +67,7 @@ extern ResultPath *create_result_path(List *quals);
 extern MaterialPath *create_material_path(RelOptInfo *rel, Path *subpath);
 extern UniquePath *create_unique_path(PlannerInfo *root, RelOptInfo *rel,
 				   Path *subpath, SpecialJoinInfo *sjinfo);
+extern bool query_is_distinct_for(Query *query, List *colnos, List *opids);
 extern Path *create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel,
 						 List *pathkeys, Relids required_outer);
 extern Path *create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,

Attachment: subquery_leftjoin_removal_v1.5.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 129fc3d..bc1929c 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,6 +22,8 @@
  */
 #include "postgres.h"
 
+#include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -153,13 +155,26 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	int			attroff;
 
 	/*
-	 * Currently, we only know how to remove left joins to a baserel with
-	 * unique indexes.  We can check most of these criteria pretty trivially
-	 * to avoid doing useless extra work.  But checking whether any of the
-	 * indexes are unique would require iterating over the indexlist, so for
-	 * now we just make sure there are indexes of some sort or other.  If none
-	 * of them are unique, join removal will still fail, just slightly later.
+	 * Assuming none of the variables from the join are needed by the query,
+	 * it is possible here to remove a left join providing we can determine
+	 * that the join will never produce more than 1 row that matches the join
+	 * condition.
+	 *
+	 * There are a few ways that we can do this:
+	 *
+	 * 1. When joining to a baserel we can check if a unique index exists
+	 *    where all of the columns of the index are seen in the join condition
+	 *    with equality operators.
+	 *
+	 * 2. When joining to a subquery we can check if the subquery contains a
+	 *    GROUP BY or DISTINCT clause where all of the columns of the clause
+	 *    appear in the join condition with equality operators.
+	 *
+	 * The code below is written with the assumption that join removal is more
+	 * likely not to happen; for this reason we try to fast path out of this
+	 * function early when possible.
 	 */
+
 	if (sjinfo->jointype != JOIN_LEFT ||
 		sjinfo->delay_upper_joins ||
 		bms_membership(sjinfo->min_righthand) != BMS_SINGLETON)
@@ -168,11 +183,21 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	innerrelid = bms_singleton_member(sjinfo->min_righthand);
 	innerrel = find_base_rel(root, innerrelid);
 
-	if (innerrel->reloptkind != RELOPT_BASEREL ||
-		innerrel->rtekind != RTE_RELATION ||
-		innerrel->indexlist == NIL)
+	if (innerrel->reloptkind != RELOPT_BASEREL)
 		return false;
 
+	if (innerrel->rtekind == RTE_RELATION)
+	{
+		/*
+		 * If there are no indexes then there are certainly no unique indexes,
+		 * so there's no need to go any further.
+		 */
+		if (innerrel->indexlist == NIL)
+			return false;
+	}
+	else if (innerrel->rtekind != RTE_SUBQUERY)
+		return false; /* unsupported rtekind */
+
 	/* Compute the relid set for the join we are considering */
 	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
 
@@ -276,17 +301,83 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 */
 
 	/* Now examine the indexes to see if we have a matching unique index */
-	if (relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
+	if (innerrel->rtekind == RTE_RELATION &&
+		relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
 		return true;
 
 	/*
+	 * For subqueries we should be able to remove the join if the subquery
+	 * can't produce more than 1 record which matches the outer query on the
+	 * join condition. However, there are a few preconditions that the
+	 * subquery must meet for it to be safe to remove:
+	 *
+	 * 1. The subquery mustn't contain a FOR UPDATE clause. Removing such a
+	 *    join would have the undesired side effect of not locking the rows.
+	 *
+	 * 2. The subquery mustn't contain any volatile functions. Removing such
+	 *    a join would prevent any side effects of the volatile functions
+	 *    from occurring.
+	 *
+	 * 3. The subquery mustn't contain any set returning functions. These can
+	 *    cause duplicate records despite the existence of a DISTINCT or
+	 *    GROUP BY clause which could otherwise make the subquery unique.
+	 */
+	if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		List	*colnos;
+		List	*opids;
+		Query	*subquery = root->simple_rte_array[innerrelid]->subquery;
+
+		/* check point 1 */
+		if (subquery->hasForUpdate)
+			return false;
+
+		/* check point 2 */
+		if (contain_volatile_functions((Node *) subquery->targetList))
+			return false;
+
+		/* check point 3 */
+		if (expression_returns_set((Node *) subquery->targetList))
+			return false;
+
+		colnos = NULL;
+		opids = NULL;
+
+		/*
+		 * Build a list of varattnos that we require the subquery to be unique over.
+		 * We also build a list of the operators that are used with these vars in the
+		 * join condition so that query_is_distinct_for can check that these
+		 * operators are compatible with the GROUP BY or DISTINCT clause in the
+		 * subquery.
+		 */
+		foreach(l, clause_list)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Var			 *var;
+
+			if (rinfo->outer_is_left)
+				var = (Var *) get_rightop(rinfo->clause);
+			else
+				var = (Var *) get_leftop(rinfo->clause);
+
+			if (!var || !IsA(var, Var) ||
+				var->varno != innerrelid)
+				continue;
+
+			colnos = lappend_int(colnos, var->varattno);
+			opids = lappend_oid(opids, rinfo->hashjoinoperator);
+		}
+
+		return query_is_distinct_for(subquery, colnos, opids);
+	}
+
+	/*
 	 * Some day it would be nice to check for other methods of establishing
 	 * distinctness.
 	 */
 	return false;
 }
 
-
 /*
  * Remove the target relid from the planner's data structures, having
  * determined that there is no need to include it in the query.
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 4e05dcd..f701954 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -38,7 +38,6 @@ typedef enum
 } PathCostComparison;
 
 static List *translate_sub_tlist(List *tlist, int relid);
-static bool query_is_distinct_for(Query *query, List *colnos, List *opids);
 static Oid	distinct_col_search(int colno, List *colnos, List *opids);
 
 
@@ -1465,7 +1464,7 @@ translate_sub_tlist(List *tlist, int relid)
  * should give trustworthy answers for all operators that we might need
  * to deal with here.)
  */
-static bool
+bool
 query_is_distinct_for(Query *query, List *colnos, List *opids)
 {
 	ListCell   *l;
@@ -1486,6 +1485,13 @@ query_is_distinct_for(Query *query, List *colnos, List *opids)
 			TargetEntry *tle = get_sortgroupclause_tle(sgc,
 													   query->targetList);
 
+			/*
+			 * We can ignore constants since they have only one value and don't
+			 * affect uniqueness of results.
+			 */
+			if (IsA(tle->expr, Const))
+				continue;
+
 			opid = distinct_col_search(tle->resno, colnos, opids);
 			if (!OidIsValid(opid) ||
 				!equality_ops_are_compatible(opid, sgc->eqop))
@@ -1507,6 +1513,13 @@ query_is_distinct_for(Query *query, List *colnos, List *opids)
 			TargetEntry *tle = get_sortgroupclause_tle(sgc,
 													   query->targetList);
 
+			/*
+			 * We can ignore constants since they have only one value and don't
+			 * affect uniqueness of results.
+			 */
+			if (IsA(tle->expr, Const))
+				continue;
+
 			opid = distinct_col_search(tle->resno, colnos, opids);
 			if (!OidIsValid(opid) ||
 				!equality_ops_are_compatible(opid, sgc->eqop))
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a0bcc82..2f571f5 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,6 +67,7 @@ extern ResultPath *create_result_path(List *quals);
 extern MaterialPath *create_material_path(RelOptInfo *rel, Path *subpath);
 extern UniquePath *create_unique_path(PlannerInfo *root, RelOptInfo *rel,
 				   Path *subpath, SpecialJoinInfo *sjinfo);
+extern bool query_is_distinct_for(Query *query, List *colnos, List *opids);
 extern Path *create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel,
 						 List *pathkeys, Relids required_outer);
 extern Path *create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index c62a63f..4959e5f 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3131,9 +3131,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
   QUERY PLAN   
@@ -3169,6 +3171,161 @@ select id from a where id in (
          ->  Seq Scan on b
 (5 rows)
 
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- join removal is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+                                        QUERY PLAN                                        
+------------------------------------------------------------------------------------------
+ Hash Left Join
+   Hash Cond: ((a.id)::double precision = ((((d.a + d.b))::double precision + random())))
+   ->  Seq Scan on a
+   ->  Hash
+         ->  HashAggregate
+               Group Key: (((d.a + d.b))::double precision + random())
+               ->  Seq Scan on d
+(7 rows)
+
+-- check that join removal works for a left join when joining a subquery that
+-- is guaranteed to be unique on the join condition even if it contains a Const.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when DISTINCT contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+                   QUERY PLAN                    
+-------------------------------------------------
+ Hash Left Join
+   Hash Cond: (a.id = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Subquery Scan on b
+               Filter: (b.r = random())
+               ->  HashAggregate
+                     Group Key: b_1.id, random()
+                     ->  Seq Scan on b b_1
+(9 rows)
+
+-- join removal is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
 rollback;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 1031f26..21e29d2 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -919,9 +919,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
@@ -936,6 +938,71 @@ select id from a where id in (
 	select b.id from b left join c on b.id = c.id
 );
 
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
+
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+
+-- join removal is not possible when distinct contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b+random() AS abr FROM d) d ON a.id = d.abr;
+
+-- check that join removal works for a left join when joining a subquery that
+-- is guaranteed to be unique on the join condition even if it contains a Const.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.c_id,1 AS dummy FROM b) b ON a.id = b.c_id;
+
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+
+-- join removal is not possible when DISTINCT contains a volatile function
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,random() AS r FROM b) b ON a.id = b.id AND r = random();
+
+-- join removal is not possible when there are any volatile functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,AVG(c_id),SUM(random()) FROM b GROUP BY id) b ON a.id = b.id;
+
+-- join removal is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+
 rollback;
 
 create temp table parent (k int primary key, pd int);

#50

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#49)

Re: Allowing join removals for more join types

David Rowley <dgrowleyml@gmail.com> writes:

On Tue, Jul 8, 2014 at 4:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm a bit skeptical as to whether testing for that case is actually worth
any extra complexity. Do you have a compelling use-case? But anyway,
if we do want to allow it, why does it take any more than adding a check
for Consts to the loops in query_is_distinct_for? It's the targetlist
entries where we'd want to allow Consts, not the join conditions.

I don't really have a compelling use-case, but you're right, it's just a
Const check in query_is_distinct_for(). It seems simple enough, so I've
included that in my refactor of the patch to use query_is_distinct_for().
This allows the regression tests all to pass again.
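
For readers following the thread, the shape of the check being discussed can be modelled roughly as below. This is only a toy sketch under stated assumptions: ToyGroupCol, toy_col_in_join and toy_is_distinct_for are hypothetical names, plain arrays stand in for the planner's List structures, and the operator-compatibility test that the real query_is_distinct_for() performs is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy grouping column: a hypothetical stand-in for a GROUP BY or DISTINCT
 * item together with the target list entry it refers to.
 */
typedef struct ToyGroupCol
{
	int			resno;		/* output column number in the subquery */
	bool		is_const;	/* true if the target expression is a bare constant */
} ToyGroupCol;

/* Return true if colno appears among the columns used in the join clauses. */
static bool
toy_col_in_join(int colno, const int *join_cols, size_t n_join_cols)
{
	for (size_t i = 0; i < n_join_cols; i++)
	{
		if (join_cols[i] == colno)
			return true;
	}
	return false;
}

/*
 * The subquery is distinct on the join columns only if every grouping column
 * is matched by a join column; extra join columns do no harm.  When
 * skip_consts is set, constant grouping columns are ignored, which is the
 * behaviour the Const check adds.
 */
static bool
toy_is_distinct_for(const ToyGroupCol *group_cols, size_t n_group_cols,
					const int *join_cols, size_t n_join_cols,
					bool skip_consts)
{
	assert(n_group_cols > 0);

	for (size_t i = 0; i < n_group_cols; i++)
	{
		if (skip_consts && group_cols[i].is_const)
			continue;
		if (!toy_col_in_join(group_cols[i].resno, join_cols, n_join_cols))
			return false;
	}
	return true;
}
```

In this model the regression test that uses "SELECT DISTINCT b.c_id,1 AS dummy" joined on c_id alone is only accepted when skip_consts is true, which is why that test depends on the extra check.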

Meh. "I wrote a regression test that expects it" is a pretty poor
rationale for adding logic. If you can't point to a real-world case
where this is important, I'm inclined to take it out.

If we were actually serious about exploiting such cases, looking for
bare Consts would be a poor implementation anyhow, not least because
const-folding has not yet been applied to the sub-select. I think we'd
want to do it for any pseudoconstant expression (no Vars, no volatile
functions); which is a substantially more expensive test.
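
To make the difference between the two tests concrete, here is a minimal sketch over a toy expression tree. Everything in it is an assumption for illustration: ToyExpr, toy_is_bare_const and toy_is_pseudoconstant are hypothetical and are not the planner's node types or functions. It only shows why an unfolded expression such as 1 + 1 fails a bare-Const test while still containing no Vars and no volatile functions, and why the broader test has to walk the whole tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy expression kinds; hypothetical, not PostgreSQL node tags. */
typedef enum ToyExprKind
{
	TOY_CONST,
	TOY_VAR,
	TOY_FUNC
} ToyExprKind;

typedef struct ToyExpr
{
	ToyExprKind kind;
	bool		is_volatile;		/* only meaningful for TOY_FUNC */
	const struct ToyExpr *args[2];	/* unused argument slots are NULL */
} ToyExpr;

/* The cheap test: is the top-level node literally a constant? */
static bool
toy_is_bare_const(const ToyExpr *expr)
{
	assert(expr != NULL);
	return expr->kind == TOY_CONST;
}

/*
 * The broader test: no Vars and no volatile functions anywhere in the tree.
 * It must visit every node, so it costs more than the bare-Const test.
 */
static bool
toy_is_pseudoconstant(const ToyExpr *expr)
{
	if (expr == NULL)
		return true;			/* an empty argument slot contributes nothing */
	if (expr->kind == TOY_VAR)
		return false;
	if (expr->kind == TOY_FUNC && expr->is_volatile)
		return false;

	for (size_t i = 0; i < 2; i++)
	{
		if (!toy_is_pseudoconstant(expr->args[i]))
			return false;
	}
	return true;
}
```

With these definitions, a function node over two constants is not a bare constant but does pass the pseudoconstant test, whereas anything containing a Var or a volatile call fails both.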

1. The fast path code in join_is_removable() that exited early for
subqueries with no GROUP BY or DISTINCT clause is now gone. I didn't want
to assume too much about what query_is_distinct_for may do in the future,
and I thought that if I included some fast-path logic in
join_is_removable(), one day it may fast path wrongly.

Or put a quick-check subroutine next to query_is_distinct_for(). The code
we're skipping here is not so cheap that I want to blow off skipping it.
On review it looks like analyzejoins.c would possibly benefit from an
earlier fast-path check as well.
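
One possible shape for such a quick-check subroutine is sketched below. It is an illustration only: ToyQuery, toy_distinctness_possible and toy_join_is_removable are hypothetical names, the three conditions are an assumption about what the full test accepts, and the expensive part is reduced to a counter so the effect of the fast path is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the handful of Query fields involved (hypothetical). */
typedef struct ToyQuery
{
	size_t		n_distinct_cols;	/* 0 if there is no DISTINCT clause */
	size_t		n_group_cols;		/* 0 if there is no GROUP BY clause */
	bool		has_aggs;			/* plain aggregate without grouping */
} ToyQuery;

/*
 * Cheap pre-check: could the full distinctness test possibly succeed?
 * Assumption: this has to be kept in sync with whatever the full test
 * accepts, which is the argument for keeping the two side by side.
 */
static bool
toy_distinctness_possible(const ToyQuery *query)
{
	assert(query != NULL);

	return query->n_distinct_cols > 0 ||
		query->n_group_cols > 0 ||
		query->has_aggs;
}

/* Caller-side fast path: bail out before doing any expensive work. */
static bool
toy_join_is_removable(const ToyQuery *subquery, int *expensive_checks_run)
{
	if (!toy_distinctness_possible(subquery))
		return false;			/* nothing could prove uniqueness */

	/*
	 * Stand-in for collecting the join clauses and running the full test.
	 * Returning true is a placeholder; the real answer would come from the
	 * full test.
	 */
	(*expensive_checks_run)++;
	return true;
}
```

The point of keeping the quick check beside the full test is that a later change to one is hard to make without seeing the other, which addresses the worry about the fast path going stale.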

2. If the patch I submitted here
/messages/by-id/CAApHDvrfVkH0P3FAooGcckBy7feCJ9QFanKLkX7MWsBcxY2Vcg@mail.gmail.com
gets accepted, then it makes the check for set returning functions
in join_is_removable void.

Right (and done, if you didn't notice already).

TBH I find the checks for FOR UPDATE and volatile functions to be
questionable as well. We have never considered those things to prevent
pushdown of quals into a subquery (cf subquery_is_pushdown_safe). I think
what we're talking about here is pretty much equivalent to pushing an
always-false qual into the subquery; if people haven't complained about
that, why should they complain about this? Or to put it in slightly more
principled terms, we've attempted to prevent subquery optimization from
causing volatile expressions to be evaluated *more* times than the naive
reading of the query would suggest, but we have generally not felt that
we needed to prevent them from happening *fewer* times.

regards, tom lane

#51

David Rowley

dgrowley@gmail.com

over 11 years ago

In reply to: Tom Lane (#50)

Re: Allowing join removals for more join types

On 9 July 2014 09:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

On Tue, Jul 8, 2014 at 4:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm a bit skeptical as to whether testing for that case is actually worth
any extra complexity. Do you have a compelling use-case? But anyway,
if we do want to allow it, why does it take any more than adding a check
for Consts to the loops in query_is_distinct_for? It's the targetlist
entries where we'd want to allow Consts, not the join conditions.

I don't really have a compelling use-case, but you're right, it's just a
Const check in query_is_distinct_for(). It seems simple enough, so I've
included that in my refactor of the patch to use query_is_distinct_for().
This allows the regression tests all to pass again.

Meh. "I wrote a regression test that expects it" is a pretty poor
rationale for adding logic. If you can't point to a real-world case
where this is important, I'm inclined to take it out.

Ok, I'll pull that logic back out when I get home tonight.

If we were actually serious about exploiting such cases, looking for
bare Consts would be a poor implementation anyhow, not least because
const-folding has not yet been applied to the sub-select. I think we'd
want to do it for any pseudoconstant expression (no Vars, no volatile
functions); which is a substantially more expensive test.

1. The fast path code in join_is_removable() that exited early for
subqueries with no GROUP BY or DISTINCT clause is now gone. I didn't want
to assume too much about what query_is_distinct_for may do in the future,
and I thought that if I included some fast-path logic in
join_is_removable(), one day it may fast path wrongly.

Or put a quick-check subroutine next to query_is_distinct_for(). The code
we're skipping here is not so cheap that I want to blow off skipping it.

Ok, good idea. I'll craft something up tonight along those lines.

On review it looks like analyzejoins.c would possibly benefit from an
earlier fast-path check as well.

Do you mean for non-subqueries? There already is a check to see if the
relation has no indexes.

2. The patch I submitted here

/messages/by-id/CAApHDvrfVkH0P3FAooGcckBy7feCJ9QFanKLkX7MWsBcxY2Vcg@mail.gmail.com

if that gets accepted then it makes the check for set returning functions
in join_is_removable void.

Right (and done, if you didn't notice already).

Thanks, I noticed that this morning. I'll remove the (now) duplicate check
from the patch.

TBH I find the checks for FOR UPDATE and volatile functions to be
questionable as well. We have never considered those things to prevent
pushdown of quals into a subquery (cf subquery_is_pushdown_safe). I think
what we're talking about here is pretty much equivalent to pushing an
always-false qual into the subquery; if people haven't complained about
that, why should they complain about this? Or to put it in slightly more
principled terms, we've attempted to prevent subquery optimization from
causing volatile expressions to be evaluated *more* times than the naive
reading of the query would suggest, but we have generally not felt that
we needed to prevent them from happening *fewer* times.

Well, that's a real tough one for me as I only added that based on what you
told me here:

On 20 May 2014 23:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I doubt you should drop a subquery containing FOR UPDATE, either.
That's a side effect, just as much as a volatile function would be.

regards, tom lane

As far as I know the FOR UPDATE check is pretty much void as of now anyway,
since the current state of query_is_distinct_for() demands that there's
either a DISTINCT, GROUP BY or just a plain old aggregate without any
grouping, which will just return a single row, neither of these will allow
FOR UPDATE anyway. I really just added the check just to protect the code
from possible future additions to query_is_distinct_for() which may add
logic to determine uniqueness by some other means.

So the effort here should probably be more focused on if we should allow
the join removal when the subquery contains volatile functions. We should
probably think fairly hard on this now as I'm still planning on working on
INNER JOIN removals at some point and I'm thinking we should likely be
consistent between the 2 types of join for when it comes to FOR UPDATE and
volatile functions, and I'm thinking right now that for INNER JOINs that,
since they're INNER that we could remove either side of the join. In that
case maybe it would be harder for the user to understand why their volatile
function didn't get executed.

Saying that... off the top of my head I can't remember what we'd do in a
case like:

create view v_a as select a,volatilefunc(a) AS funcresult from a;

select a from v_a;

Since we didn't select funcresult, do we execute the function?

Perhaps we should base this on whatever that does?

I can't give much more input on that right now. I'll have a look at the
docs later to see if they mention anything about any guarantees about
calling volatile functions.

Regards

David Rowley

#52

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#51)

Re: Allowing join removals for more join types

David Rowley <dgrowley@gmail.com> writes:

On 9 July 2014 09:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:

On review it looks like analyzejoins.c would possibly benefit from an
earlier fast-path check as well.

Do you mean for non-subqueries? There already is a check to see if the
relation has no indexes.

Oh, sorry, that was a typo: I meant to write pathnode.c. Specifically,
we could skip the translate_sub_tlist step. Admittedly that's not
hugely expensive, but as long as we have the infrastructure for a quick
check it might be worth doing.

TBH I find the checks for FOR UPDATE and volatile functions to be
questionable as well.

Well, that's a real tough one for me as I only added that based on what you
told me here:

I doubt you should drop a subquery containing FOR UPDATE, either.
That's a side effect, just as much as a volatile function would be.

Hah ;-). But the analogy to qual pushdown hadn't occurred to me at the
time.

As far as I know the FOR UPDATE check is pretty much void as of now anyway,
since the current state of query_is_distinct_for() demands that there's
either a DISTINCT, GROUP BY or just a plain old aggregate without any
grouping, which will just return a single row, neither of these will allow
FOR UPDATE anyway.

True.

So the effort here should probably be more focused on if we should allow
the join removal when the subquery contains volatile functions. We should
probably think fairly hard on this now as I'm still planning on working on
INNER JOIN removals at some point and I'm thinking we should likely be
consistent between the 2 types of join for when it comes to FOR UPDATE and
volatile functions, and I'm thinking right now that for INNER JOINs that,
since they're INNER that we could remove either side of the join. In that
case maybe it would be harder for the user to understand why their volatile
function didn't get executed.

Color me dubious. In exactly what circumstances would it be valid to
suppress an inner join involving a sub-select?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Tom Lane (#52)

1 attachment(s)

Re: Allowing join removals for more join types

On Wed, Jul 9, 2014 at 12:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowley@gmail.com> writes:

On 9 July 2014 09:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:

On review it looks like analyzejoins.c would possibly benefit from an
earlier fast-path check as well.

Do you mean for non-subqueries? There already is a check to see if the
relation has no indexes.

Oh, sorry, that was a typo: I meant to write pathnode.c. Specifically,
we could skip the translate_sub_tlist step. Admittedly that's not
hugely expensive, but as long as we have the infrastructure for a quick
check it might be worth doing.

TBH I find the checks for FOR UPDATE and volatile functions to be
questionable as well.

Well, that's a real tough one for me as I only added that based on what you
told me here:

I doubt you should drop a subquery containing FOR UPDATE, either.
That's a side effect, just as much as a volatile function would be.

Hah ;-). But the analogy to qual pushdown hadn't occurred to me at the
time.

Ok, I've removed the check for volatile functions and FOR UPDATE.

As far as I know the FOR UPDATE check is pretty much void as of now anyway,
since the current state of query_is_distinct_for() demands that there's
either a DISTINCT, GROUP BY or just a plain old aggregate without any
grouping, which will just return a single row, neither of these will allow
FOR UPDATE anyway.

True.

So the effort here should probably be more focused on if we should allow
the join removal when the subquery contains volatile functions. We should
probably think fairly hard on this now as I'm still planning on working on
INNER JOIN removals at some point and I'm thinking we should likely be
consistent between the 2 types of join for when it comes to FOR UPDATE and
volatile functions, and I'm thinking right now that for INNER JOINs that,
since they're INNER that we could remove either side of the join. In that
case maybe it would be harder for the user to understand why their volatile
function didn't get executed.

Color me dubious. In exactly what circumstances would it be valid to
suppress an inner join involving a sub-select?

Hmm, probably I didn't think this through enough before commenting as I
don't actually have any plans for subselects with INNER JOINs. Though
saying that I guess there are cases that can be removed... Anything that
queries a single table that has a foreign key matching the join condition,
where the subquery does not filter or group the results. Obviously
something about the query would have to exist that caused it not to be
flattened, perhaps some windowing functions...

I've attached an updated patch which puts in some fast path code for
subquery type joins. I'm really not too sure on a good name for this
function. I've ended up with query_supports_distinctness() which I'm not
that keen on, but I didn't manage to come up with anything better.
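
As a rough illustration of the fast-path idea, the following standalone C sketch shows a cheap pre-check gating a more expensive step. It is not the patch code: MiniQuery, mini_query_supports_distinctness() and mini_join_is_removable() are hypothetical stand-ins, and the proof_result argument merely stands in for whatever the full distinctness proof would return. The counter exists only to show that the expensive step is skipped when the pre-check fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few Query fields the pre-check looks at. */
typedef struct MiniQuery
{
    bool        has_distinct;
    bool        has_group_by;
    bool        has_aggs;
    bool        has_having;
    bool        has_set_ops;
} MiniQuery;

/*
 * Cheap pre-check: if none of these features is present, no distinctness
 * proof can succeed, so callers can give up immediately.
 */
bool
mini_query_supports_distinctness(const MiniQuery *q)
{
    assert(q != NULL);
    return q->has_distinct || q->has_group_by || q->has_aggs ||
        q->has_having || q->has_set_ops;
}

/*
 * Caller pattern: run the cheap test first, and only then pay for the
 * expensive preparation and the full proof.  proof_result stands in for
 * the outcome of that full proof; expensive_steps counts how often the
 * costly path was actually taken.
 */
bool
mini_join_is_removable(const MiniQuery *q, bool proof_result,
                       int *expensive_steps)
{
    assert(expensive_steps != NULL);

    if (!mini_query_supports_distinctness(q))
        return false;           /* fast path: nothing could prove uniqueness */

    (*expensive_steps)++;       /* clause-list building etc. would go here */
    return proof_result;
}
```

The design point is the one-way contract: a false pre-check guarantees a false proof, while a true pre-check promises nothing about the final answer.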

Regards

David Rowley

Attachments:

subquery_leftjoin_removal_v1.6.patch (application/octet-stream)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 129fc3d..2d9fe96 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,6 +22,8 @@
  */
 #include "postgres.h"
 
+#include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -147,19 +149,33 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 {
 	int			innerrelid;
 	RelOptInfo *innerrel;
+	Query	   *subquery;
 	Relids		joinrelids;
 	List	   *clause_list = NIL;
 	ListCell   *l;
 	int			attroff;
 
 	/*
-	 * Currently, we only know how to remove left joins to a baserel with
-	 * unique indexes.  We can check most of these criteria pretty trivially
-	 * to avoid doing useless extra work.  But checking whether any of the
-	 * indexes are unique would require iterating over the indexlist, so for
-	 * now we just make sure there are indexes of some sort or other.  If none
-	 * of them are unique, join removal will still fail, just slightly later.
+	 * Assuming none of the variables from the join are needed by the query,
+	 * it is possible here to remove a left join providing we can determine
+	 * that the join will never produce more than 1 row that matches the join
+	 * condition.
+	 *
+	 * There are a few ways that we can do this:
+	 *
+	 * 1. When joining to a baserel we can check if a unique index exists
+	 *    where all of the columns of the index are seen in the join condition
+	 *    using suitable operators.
+	 *
+	 * 2. When joining to a subquery we can check if the subquery contains a
+	 *    GROUP BY clause, DISTINCT clause or set operators which will make it
+	 *    unique on the join condition.
+	 *
+	 * The code below is written with the assumption that join removal is more
+	 * than likely not to happen, for this reason there are fast paths for both of
+	 * the cases above to try to save on any unnecessary processing.
 	 */
+
 	if (sjinfo->jointype != JOIN_LEFT ||
 		sjinfo->delay_upper_joins ||
 		bms_membership(sjinfo->min_righthand) != BMS_SINGLETON)
@@ -168,11 +184,32 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	innerrelid = bms_singleton_member(sjinfo->min_righthand);
 	innerrel = find_base_rel(root, innerrelid);
 
-	if (innerrel->reloptkind != RELOPT_BASEREL ||
-		innerrel->rtekind != RTE_RELATION ||
-		innerrel->indexlist == NIL)
+	if (innerrel->reloptkind != RELOPT_BASEREL)
 		return false;
 
+	if (innerrel->rtekind == RTE_RELATION)
+	{
+		/*
+		 * If there are no indexes then there's certainly no unique indexes
+		 * so there's no need to go any further.
+		 */
+		if (innerrel->indexlist == NIL)
+			return false;
+	}
+	else if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		subquery = root->simple_rte_array[innerrelid]->subquery;
+
+		/*
+		 * If the subquery has no qualities that support distinctness of any
+		 * kind then we can be certain that we cannot remove the join.
+		 */
+		if (!query_supports_distinctness(subquery))
+			return false;
+	}
+	else
+		return false; /* unsupported rtekind */
+
 	/* Compute the relid set for the join we are considering */
 	joinrelids = bms_union(sjinfo->min_lefthand, sjinfo->min_righthand);
 
@@ -276,17 +313,61 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
 	 */
 
 	/* Now examine the indexes to see if we have a matching unique index */
-	if (relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
+	if (innerrel->rtekind == RTE_RELATION &&
+		relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
 		return true;
 
 	/*
+	 * For subqueries we should be able to remove the join if the subquery
+	 * can't produce more than 1 record that matches the outer query on the
+	 * join condition.
+	 */
+	if (innerrel->rtekind == RTE_SUBQUERY)
+	{
+		List	*colnos = NULL;
+		List	*opids = NULL;
+
+		Assert(subquery == root->simple_rte_array[innerrelid]->subquery);
+
+		/*
+		 * Build a list of varattnos that we require the subquery to be unique
+		 * over. We also build a list of the operators that are used with these
+		 * vars in the join condition so that query_is_distinct_for can check
+		 * that these operators are compatible with any GROUP BY, DISTINCT or
+		 * UNIQUE clauses in the subquery.
+		 */
+		foreach(l, clause_list)
+		{
+			RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+			Var			 *var;
+
+			if (rinfo->outer_is_left)
+				var = (Var *) get_rightop(rinfo->clause);
+			else
+				var = (Var *) get_leftop(rinfo->clause);
+
+			/*
+			 * query_is_distinct_for only supports Vars, so anything that's not
+			 * a var will mean the join cannot be removed.
+			 */
+			if (!var || !IsA(var, Var) ||
+				var->varno != innerrelid)
+				return false;
+
+			colnos = lappend_int(colnos, var->varattno);
+			opids = lappend_oid(opids, rinfo->hashjoinoperator);
+		}
+
+		return query_is_distinct_for(subquery, colnos, opids);
+	}
+
+	/*
 	 * Some day it would be nice to check for other methods of establishing
 	 * distinctness.
 	 */
 	return false;
 }
 
-
 /*
  * Remove the target relid from the planner's data structures, having
  * determined that there is no need to include it in the query.
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d129f8d..119fb34 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -38,7 +38,6 @@ typedef enum
 } PathCostComparison;
 
 static List *translate_sub_tlist(List *tlist, int relid);
-static bool query_is_distinct_for(Query *query, List *colnos, List *opids);
 static Oid	distinct_col_search(int colno, List *colnos, List *opids);
 
 
@@ -1451,6 +1450,32 @@ translate_sub_tlist(List *tlist, int relid)
 }
 
 /*
+ * query_supports_distinctness - True if the query can be seen to be distinct
+ *		on some set of columns.
+ *
+ * This is effectively a pre-checking function for query_is_distinct_for,
+ * and can be used by any code that needs to make a call to
+ * query_is_distinct_for, but has to perform some possibly expensive processing
+ * beforehand. If this function returns False then a call to
+ * query_is_distinct_for will also return False, though the reverse is not true
+ * as this would depend on the columns and operators passed to
+ * query_is_distinct_for.
+ */
+bool
+query_supports_distinctness(Query *query)
+{
+	if (query->distinctClause != NIL ||
+		query->groupClause != NIL ||
+		query->hasAggs ||
+		query->havingQual ||
+		query->setOperations)
+		return true;
+
+	return false;
+}
+
+
+/*
  * query_is_distinct_for - does query never return duplicates of the
  *		specified columns?
  *
@@ -1465,12 +1490,16 @@ translate_sub_tlist(List *tlist, int relid)
  * should give trustworthy answers for all operators that we might need
  * to deal with here.)
  */
-static bool
+bool
 query_is_distinct_for(Query *query, List *colnos, List *opids)
 {
 	ListCell   *l;
 	Oid			opid;
 
+	/* XXX the logic used to test for distinctness here must be followed by
+	 * query_supports_distinctness. If this function returns True, then
+	 * query_supports_distinctness must also return True.
+	 */
 	Assert(list_length(colnos) == list_length(opids));
 
 	/*
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a0bcc82..1df6e50 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -67,6 +67,8 @@ extern ResultPath *create_result_path(List *quals);
 extern MaterialPath *create_material_path(RelOptInfo *rel, Path *subpath);
 extern UniquePath *create_unique_path(PlannerInfo *root, RelOptInfo *rel,
 				   Path *subpath, SpecialJoinInfo *sjinfo);
+extern bool query_supports_distinctness(Query *query);
+extern bool query_is_distinct_for(Query *query, List *colnos, List *opids);
 extern Path *create_subqueryscan_path(PlannerInfo *root, RelOptInfo *rel,
 						 List *pathkeys, Relids required_outer);
 extern Path *create_functionscan_path(PlannerInfo *root, RelOptInfo *rel,
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index c62a63f..9f7bff2 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3131,9 +3131,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
   QUERY PLAN   
@@ -3169,6 +3171,117 @@ select id from a where id in (
          ->  Seq Scan on b
 (5 rows)
 
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+           QUERY PLAN            
+---------------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id, b.c_id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- join removal is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+         QUERY PLAN         
+----------------------------
+ Hash Right Join
+   Hash Cond: (b.id = a.id)
+   ->  HashAggregate
+         Group Key: b.id
+         ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+(7 rows)
+
+-- check join removal works when uniqueness of the join condition is enforced by 
+-- a UNION 
+EXPLAIN (COSTS OFF)
+SELECT a.* FROM A LEFT OUTER JOIN (SELECT id FROM b UNION SELECT id from b) b on a.id=b.id;
+  QUERY PLAN   
+---------------
+ Seq Scan on a
+(1 row)
+
 rollback;
 create temp table parent (k int primary key, pd int);
 create temp table child (k int unique, cd int);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 1031f26..a9ec9a8 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -919,9 +919,11 @@ begin;
 CREATE TEMP TABLE a (id int PRIMARY KEY, b_id int);
 CREATE TEMP TABLE b (id int PRIMARY KEY, c_id int);
 CREATE TEMP TABLE c (id int PRIMARY KEY);
+CREATE TEMP TABLE d (a INT, b INT);
 INSERT INTO a VALUES (0, 0), (1, NULL);
 INSERT INTO b VALUES (0, 0), (1, NULL);
 INSERT INTO c VALUES (0), (1);
+INSERT INTO d VALUES (1,3),(2,2),(3,1);
 
 -- all three cases should be optimizable into a simple seqscan
 explain (costs off) SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id;
@@ -936,6 +938,59 @@ select id from a where id in (
 	select b.id from b left join c on b.id = c.id
 );
 
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id) b ON a.b_id = b.id;
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the GROUP BY clause
+-- which contains more than 1 column.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id,b.c_id FROM b GROUP BY b.id,b.c_id) b ON a.b_id = b.id AND a.id = b.c_id;
+
+-- check that join removal works for a left join when joining a subquery
+-- where the join condition is a superset of the columns in the GROUP BY.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,c_id FROM b GROUP BY b.id) b ON a.id = b.id AND b.c_id = 1;
+
+-- check that join removal works for a left join when joining a subquery
+-- that is guaranteed to be unique on the join condition by the DISTINCT clause
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT a+b AS ab FROM d) d ON a.id = d.ab;
+
+-- check join removal works when joining to a subquery that is guaranteed to be
+-- unique on the join condition even when the subquery itself involves a join.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id + 10 = b.id;
+
+-- check join removal works with a left join when joining a unique subquery which
+-- contains 2 tables where the uniqueness enforced by the GROUP BY clause is a
+-- subset of the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a
+LEFT JOIN (SELECT b.id,1 as dummy FROM b INNER JOIN c ON b.id = c.id GROUP BY b.id) b ON a.id = b.id AND b.dummy = 1;
+
+-- join removal is not possible when the GROUP BY contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT b.id FROM b GROUP BY b.id,b.c_id) b ON a.id = b.id;
+
+-- join removal is not possible when DISTINCT clause contains a column which is
+-- not in the join condition.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT DISTINCT b.id,c_id FROM b) b ON a.id = b.id;
+
+-- join removal is not possible when there are set returning functions in the target list.
+EXPLAIN (COSTS OFF)
+SELECT a.id FROM a LEFT JOIN (SELECT id,generate_series(1,2) FROM b GROUP BY id) b ON a.id = b.id;
+
+-- check join removal works when uniqueness of the join condition is enforced by
+-- a UNION
+EXPLAIN (COSTS OFF)
+SELECT a.* FROM A LEFT OUTER JOIN (SELECT id FROM b UNION SELECT id from b) b on a.id=b.id;
+
 rollback;
 
 create temp table parent (k int primary key, pd int);
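
The regression tests in the patch above turn on one rule: a subquery's GROUP BY or DISTINCT proves uniqueness for a join only if every grouping column also appears in the join condition. Extra join columns are harmless, but an extra grouping column breaks the proof. The standalone C sketch below restates that rule over plain column numbers. grouping_proves_distinct() and column_in_list() are hypothetical names, and the sketch deliberately ignores the operator compatibility checks and the plain-aggregate case that the real query_is_distinct_for() handles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if col occurs in the cols array. */
static bool
column_in_list(int col, const int *cols, size_t ncols)
{
    for (size_t i = 0; i < ncols; i++)
    {
        if (cols[i] == col)
            return true;
    }
    return false;
}

/*
 * Return true if every grouping (GROUP BY or DISTINCT) column is among the
 * columns equated in the join condition.  In that case at most one
 * subquery row can match any given outer row.
 *
 * Simplification for this sketch: with no grouping columns we return
 * false rather than model the single-row aggregate case.
 */
bool
grouping_proves_distinct(const int *group_cols, size_t ngroup,
                         const int *join_cols, size_t njoin)
{
    assert(ngroup == 0 || group_cols != NULL);
    assert(njoin == 0 || join_cols != NULL);

    if (ngroup == 0)
        return false;

    for (size_t i = 0; i < ngroup; i++)
    {
        if (!column_in_list(group_cols[i], join_cols, njoin))
            return false;       /* grouping column not fixed by the join */
    }
    return true;
}
```

With column 1 standing for b.id and column 2 for b.c_id, grouping by (1, 2) while joining only on (1) fails, matching the "join removal is not possible" test, whereas grouping by (1) while joining on (1, 2) succeeds.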

#54

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#53)

Re: Allowing join removals for more join types

David Rowley <dgrowleyml@gmail.com> writes:

I've attached an updated patch which puts in some fast path code for
subquery type joins. I'm really not too sure on a good name for this
function. I've ended up with query_supports_distinctness() which I'm not
that keen on, but I didn't manage to come up with anything better.

I've committed this with some mostly but not entirely cosmetic changes.
Notably, I felt that pathnode.c was a pretty questionable place to be
exporting distinctness-proof logic from, and after some reflection decided
to move those functions to analyzejoins.c; that's certainly a better place
for them than pathnode.c, and I don't see any superior third alternative.

regards, tom lane


#55

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Tom Lane (#54)

Re: Allowing join removals for more join types

On Wed, Jul 16, 2014 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

I've attached an updated patch which puts in some fast path code for
subquery type joins. I'm really not too sure on a good name for this
function. I've ended up with query_supports_distinctness() which I'm not
that keen on, but I didn't manage to come up with anything better.

I've committed this with some mostly but not entirely cosmetic changes.

Great! Thanks for taking the time to give me guidance on this and commit it
too.

Simon, thank you for taking the time to review the code.

Notably, I felt that pathnode.c was a pretty questionable place to be
exporting distinctness-proof logic from, and after some reflection decided
to move those functions to analyzejoins.c; that's certainly a better place
for them than pathnode.c, and I don't see any superior third alternative.

That seems like a good change. Also makes me wonder a bit
why clause_sides_match_join is duplicated in joinpath.c and analyzejoins.c,
is this just so that it can be inlined?

Thanks also for making the change to create_unique_path to make use of the
new query_supports_distinctness function.

Regards

David Rowley

#56

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David Rowley (#55)

Re: Allowing join removals for more join types

David Rowley <dgrowleyml@gmail.com> writes:

On Wed, Jul 16, 2014 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Notably, I felt that pathnode.c was a pretty questionable place to be
exporting distinctness-proof logic from, and after some reflection decided
to move those functions to analyzejoins.c; that's certainly a better place
for them than pathnode.c, and I don't see any superior third alternative.

That seems like a good change. Also makes me wonder a bit
why clause_sides_match_join is duplicated in joinpath.c and analyzejoins.c,
is this just so that it can be inlined?

Hm ... probably just didn't seem worth the trouble to try to share the
code. It's not really something that either module should be exporting.
I guess some case could be made for exporting it from util/restrictinfo.c,
but it'd still seem like a bit of a wart.

regards, tom lane
