Allowing NOT IN to use ANTI joins

Started by David Rowleyover 11 years ago47 messages
#1David Rowley
dgrowleyml@gmail.com
1 attachment(s)

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT
EXISTS queries and leaves NOT IN alone. The reason for this is because the
values returned by a subquery in the IN clause could have NULLs.

A simple example of this (without a subquery) is:

select 1 where 3 not in (1, 2, null); returns 0 rows because 3 <> NULL is
unknown.

The attached patch allows an ANTI-join plan to be generated in cases like:

CREATE TABLE a (id INT PRIMARY KEY, b_id INT NOT NULL);
CREATE TABLE b (id INT NOT NULL);

SELECT * FROM a WHERE b_id NOT IN(SELECT id FROM b);

To generate a plan like:
QUERY PLAN
-----------------------------------------------------------------
Hash Anti Join (cost=64.00..137.13 rows=1070 width=8)
Hash Cond: (a.b_id = b.id)
-> Seq Scan on a (cost=0.00..31.40 rows=2140 width=8)
-> Hash (cost=34.00..34.00 rows=2400 width=4)
-> Seq Scan on b (cost=0.00..34.00 rows=2400 width=4)

But if we then do:
ALTER TABLE b ALTER COLUMN id DROP NOT NULL;

The plan will go back to the current behaviour of:

QUERY PLAN
-------------------------------------------------------------
Seq Scan on a (cost=40.00..76.75 rows=1070 width=8)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on b (cost=0.00..34.00 rows=2400 width=4)

Comments are welcome

Regards

David Rowley

Attachments:

not_in_anti_join_v0.4.patchapplication/octet-stream; name=not_in_anti_join_v0.4.patchDownload
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index be92049..45b4104 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -27,6 +27,7 @@
 #include "optimizer/prep.h"
 #include "optimizer/subselect.h"
 #include "optimizer/var.h"
+#include "parser/parse_clause.h"
 #include "parser/parse_relation.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/builtins.h"
@@ -1176,7 +1177,7 @@ SS_process_ctes(PlannerInfo *root)
  */
 JoinExpr *
 convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
-							Relids available_rels)
+							bool under_not, Relids available_rels)
 {
 	JoinExpr   *result;
 	Query	   *parse = root->parse;
@@ -1191,6 +1192,16 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	Assert(sublink->subLinkType == ANY_SUBLINK);
 
 	/*
+	 * For the case of NOT IN, we can only perform this optimization if we are
+	 * guaranteed that we'll never get NULL values from the subquery results.
+	 * For example "SELECT 1 WHERE 3 NOT IN(1,2,NULL)", the query will produce
+	 * 0 rows due to the fact that 3 <> NULL is unknown.
+	 */
+	if (under_not &&
+		!targetListIsGuaranteedNotToHaveNulls(subselect->targetList))
+		return NULL;
+
+	/*
 	 * The sub-select must not refer to any Vars of the parent query. (Vars of
 	 * higher levels should be okay, though.)
 	 */
@@ -1256,7 +1267,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	 * And finally, build the JoinExpr node.
 	 */
 	result = makeNode(JoinExpr);
-	result->jointype = JOIN_SEMI;
+	result->jointype = under_not ? JOIN_ANTI : JOIN_SEMI;
 	result->isNatural = false;
 	result->larg = NULL;		/* caller must fill this in */
 	result->rarg = (Node *) rtr;
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 776fe42..7ea165d 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -334,7 +334,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 		/* Is it a convertible ANY or EXISTS clause? */
 		if (sublink->subLinkType == ANY_SUBLINK)
 		{
-			if ((j = convert_ANY_sublink_to_join(root, sublink,
+			if ((j = convert_ANY_sublink_to_join(root, sublink, false,
 												 available_rels1)) != NULL)
 			{
 				/* Yes; insert the new join node into the join tree */
@@ -360,7 +360,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 				return NULL;
 			}
 			if (available_rels2 != NULL &&
-				(j = convert_ANY_sublink_to_join(root, sublink,
+				(j = convert_ANY_sublink_to_join(root, sublink, false,
 												 available_rels2)) != NULL)
 			{
 				/* Yes; insert the new join node into the join tree */
@@ -452,7 +452,61 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 
 		if (sublink && IsA(sublink, SubLink))
 		{
-			if (sublink->subLinkType == EXISTS_SUBLINK)
+			if (sublink->subLinkType == ANY_SUBLINK)
+			{
+				if ((j = convert_ANY_sublink_to_join(root, sublink, true,
+												   available_rels1)) != NULL)
+				{
+					/* Yes; insert the new join node into the join tree */
+					j->larg = *jtlink1;
+					*jtlink1 = (Node *) j;
+					/* Recursively process pulled-up jointree nodes */
+					j->rarg = pull_up_sublinks_jointree_recurse(root,
+																j->rarg,
+																&child_rels);
+
+					/*
+					 * Now recursively process the pulled-up quals.  Because
+					 * we are underneath a NOT, we can't pull up sublinks that
+					 * reference the left-hand stuff, but it's still okay to
+					 * pull up sublinks referencing j->rarg.
+					 */
+					j->quals = pull_up_sublinks_qual_recurse(root,
+															 j->quals,
+															 &j->rarg,
+															 child_rels,
+															 NULL, NULL);
+					/* Return NULL representing constant TRUE */
+					return NULL;
+				}
+				if (available_rels2 != NULL &&
+					(j = convert_ANY_sublink_to_join(root, sublink, true,
+												   available_rels2)) != NULL)
+				{
+					/* Yes; insert the new join node into the join tree */
+					j->larg = *jtlink2;
+					*jtlink2 = (Node *) j;
+					/* Recursively process pulled-up jointree nodes */
+					j->rarg = pull_up_sublinks_jointree_recurse(root,
+																j->rarg,
+																&child_rels);
+
+					/*
+					 * Now recursively process the pulled-up quals.  Because
+					 * we are underneath a NOT, we can't pull up sublinks that
+					 * reference the left-hand stuff, but it's still okay to
+					 * pull up sublinks referencing j->rarg.
+					 */
+					j->quals = pull_up_sublinks_qual_recurse(root,
+															 j->quals,
+															 &j->rarg,
+															 child_rels,
+															 NULL, NULL);
+					/* Return NULL representing constant TRUE */
+					return NULL;
+				}
+			}
+			else if (sublink->subLinkType == EXISTS_SUBLINK)
 			{
 				if ((j = convert_EXISTS_sublink_to_join(root, sublink, true,
 												   available_rels1)) != NULL)
@@ -506,6 +560,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 					return NULL;
 				}
 			}
+
 		}
 		/* Else return it unmodified */
 		return node;
diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c
index fcee137..062b6b1 100644
--- a/src/backend/parser/parse_clause.c
+++ b/src/backend/parser/parse_clause.c
@@ -2406,6 +2406,61 @@ assignSortGroupRef(TargetEntry *tle, List *tlist)
 	return tle->ressortgroupref;
 }
 
+
+/*
+ * targetListIsGuaranteedNotToHaveNulls
+ *		true if the given target list only contains columns that have
+ *		NOT NULL constraints to protect them from getting NULL values
+ *		and Constants that are not null.
+ *
+ * Note: if this function returns false then it does not mean that the
+ *       targetList will have NULLs, it just means we can't guarantee
+ *       that it does not.
+ *
+ * Note: resjunk targets are ignored here.
+ */
+bool
+targetListIsGuaranteedNotToHaveNulls(List *targetList)
+{
+	ListCell *tl;
+
+	foreach(tl, targetList)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tl);
+
+		if (tle->resjunk)
+			continue;
+
+		/*
+		 * For Constant expressions we can tell directly if they're NULL or
+		 * not. If we know the Const expression is not null then we can
+		 * continue checking, but if we find the Const to be NULL then must
+		 * return false
+		 */
+		if (IsA(tle->expr, Const))
+		{
+			Const *expr = (Const *) tle->expr;
+
+			if (expr->constisnull)
+				return false;
+			else
+				continue; /* It's okay */
+		}
+
+		/*
+		 * We can't know if the target cannot contain NULL values if the
+		 * target does not belong to a table.
+		 */
+		if (!OidIsValid(tle->resorigtbl))
+			return false;
+
+
+		if (!get_attnotnull(tle->resorigtbl, tle->resorigcol))
+			return false;
+	}
+	return true;
+}
+
 /*
  * targetIsInSortList
  *		Is the given target item already in the sortlist?
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 4b5ef99..c98adc7 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -816,6 +816,37 @@ get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype, int16 procnum)
 /*				---------- ATTRIBUTE CACHES ----------					 */
 
 /*
+ * get_attnotnull
+ *		Returns true if pg_attribute.attnotnull is true, otherwise returns
+ *		false. An error is raised if no record is found for the relid/attnum.
+ *
+ * Note: Calling functions should be careful and test relid for InvalidOid
+ * before calling this function.
+ */
+bool
+get_attnotnull(Oid relid, AttrNumber attnum)
+{
+	HeapTuple	tp;
+
+	tp = SearchSysCache2(ATTNUM,
+						 ObjectIdGetDatum(relid),
+						 Int16GetDatum(attnum));
+	if (HeapTupleIsValid(tp))
+	{
+		Form_pg_attribute att_tup = (Form_pg_attribute) GETSTRUCT(tp);
+		bool result = att_tup->attnotnull;
+		ReleaseSysCache(tp);
+		return result;
+	}
+	else
+	{
+		elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+			attnum, relid);
+		return false; /* keep compiler quiet */
+	}
+}
+
+/*
  * get_attname
  *		Given the relation id and the attribute number,
  *		return the "attname" field from the attribute relation.
diff --git a/src/include/optimizer/subselect.h b/src/include/optimizer/subselect.h
index 5607e98..3e8bfe7 100644
--- a/src/include/optimizer/subselect.h
+++ b/src/include/optimizer/subselect.h
@@ -18,6 +18,7 @@
 extern void SS_process_ctes(PlannerInfo *root);
 extern JoinExpr *convert_ANY_sublink_to_join(PlannerInfo *root,
 							SubLink *sublink,
+							bool under_not,
 							Relids available_rels);
 extern JoinExpr *convert_EXISTS_sublink_to_join(PlannerInfo *root,
 							   SubLink *sublink,
diff --git a/src/include/parser/parse_clause.h b/src/include/parser/parse_clause.h
index e9e7cdc..9aab0cb 100644
--- a/src/include/parser/parse_clause.h
+++ b/src/include/parser/parse_clause.h
@@ -47,5 +47,7 @@ extern List *addTargetToSortList(ParseState *pstate, TargetEntry *tle,
 					bool resolveUnknown);
 extern Index assignSortGroupRef(TargetEntry *tle, List *tlist);
 extern bool targetIsInSortList(TargetEntry *tle, Oid sortop, List *sortList);
+extern bool targetListIsGuaranteedNotToHaveNulls(List *targetList);
+
 
 #endif   /* PARSE_CLAUSE_H */
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index f46460a..5e7d946 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -63,6 +63,7 @@ extern List *get_op_btree_interpretation(Oid opno);
 extern bool equality_ops_are_compatible(Oid opno1, Oid opno2);
 extern Oid get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype,
 				  int16 procnum);
+extern bool get_attnotnull(Oid relid, AttrNumber attnum);
 extern char *get_attname(Oid relid, AttrNumber attnum);
 extern char *get_relid_attribute_name(Oid relid, AttrNumber attnum);
 extern AttrNumber get_attnum(Oid relid, const char *attname);
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index ce15af7..c97571b 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -739,3 +739,65 @@ select * from int4_tbl where
   0
 (1 row)
 
+--
+-- Check NOT IN performs ANTI JOIN when subquery columns are NOT NULL
+-- and does not when subquery columns can contain NULLs.
+--
+BEGIN;
+CREATE TEMP TABLE a (id INT PRIMARY KEY, b_id INT NULL);
+CREATE TEMP TABLE b (id INT NULL);
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE b_id NOT IN (SELECT id FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+ALTER TABLE b ALTER COLUMN id SET NOT NULL;
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE b_id NOT IN (SELECT id FROM b);
+          QUERY PLAN          
+------------------------------
+ Hash Anti Join
+   Hash Cond: (a.b_id = b.id)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE b_id NOT IN (SELECT id+1 FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE (id,b_id) NOT IN (SELECT 1,id FROM b);
+          QUERY PLAN          
+------------------------------
+ Hash Anti Join
+   Hash Cond: (a.b_id = b.id)
+   Join Filter: (a.id = 1)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+(6 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE (id,b_id) NOT IN (SELECT NULL::int,id FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+ROLLBACK;
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index 326fd70..96b2c44 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -422,3 +422,32 @@ select * from int4_tbl where
 select * from int4_tbl where
   (case when f1 in (select unique1 from tenk1 a) then f1 else null end) in
   (select ten from tenk1 b);
+
+--
+-- Check NOT IN performs ANTI JOIN when subquery columns are NOT NULL
+-- and does not when subquery columns can contain NULLs.
+--
+
+BEGIN;
+
+CREATE TEMP TABLE a (id INT PRIMARY KEY, b_id INT NULL);
+CREATE TEMP TABLE b (id INT NULL);
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE b_id NOT IN (SELECT id FROM b);
+
+ALTER TABLE b ALTER COLUMN id SET NOT NULL;
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE b_id NOT IN (SELECT id FROM b);
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE b_id NOT IN (SELECT id+1 FROM b);
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE (id,b_id) NOT IN (SELECT 1,id FROM b);
+
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE (id,b_id) NOT IN (SELECT NULL::int,id FROM b);
+
+ROLLBACK;
#2Martijn van Oosterhout
kleptog@svana.org
In reply to: David Rowley (#1)
Re: Allowing NOT IN to use ANTI joins

On Mon, Jun 09, 2014 at 12:36:30AM +1200, David Rowley wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT
EXISTS queries and leaves NOT IN alone. The reason for this is because the
values returned by a subquery in the IN clause could have NULLs.

Awesome. I've had a brief look at the patch and other than a line of
extraneous whitespace it looks sane.

Since it is only testing on NOT IN queries I don't think there are any
issues with it slowing down simple queries.

I also note you can't prove "id+1" not null. At first I thought you
might be able to prove this not null if the operator/function was
strict, but then I realised that strict only means "null if input is
null" not "output is only null if inputs are null". Pity.

Nice work.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#3Marti Raudsepp
marti@juffo.org
In reply to: David Rowley (#1)
Re: Allowing NOT IN to use ANTI joins

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT EXISTS
queries and leaves NOT IN alone. The reason for this is because the values
returned by a subquery in the IN clause could have NULLs.

I believe the reason why this hasn't been done yet, is that the plan
becomes invalid when another backend modifies the nullability of the
column. To get it to replan, you'd have to introduce a dependency on
the "NOT NULL" constraint, but it's impossible for now because there's
no pg_constraint entry for NOT NULLs.

The only way to consistently guarantee nullability is through primary
key constraints. Fortunately that addresses most of the use cases of
NOT IN(), in my experience.

See the comment in check_functional_grouping:

* Currently we only check to see if the rel has a primary key that is a
* subset of the grouping_columns. We could also use plain unique constraints
* if all their columns are known not null, but there's a problem: we need
* to be able to represent the not-null-ness as part of the constraints added
* to *constraintDeps. FIXME whenever not-null constraints get represented
* in pg_constraint.

The behavior you want seems somewhat similar to
check_functional_grouping; maybe you could unify it with your
targetListIsGuaranteedNotToHaveNulls at some level. (PS: that's one
ugly function name :)

Regards,
Marti

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Vik Fearing
vik.fearing@dalibo.com
In reply to: David Rowley (#1)
Re: Allowing NOT IN to use ANTI joins

On 06/08/2014 02:36 PM, David Rowley wrote:

+		if (!get_attnotnull(tle->resorigtbl, tle->resorigcol))
+			return false;

As Marti says, you can't do this because NOT NULL doesn't have an oid to
attach a dependency to. You'll have to restrict this test to primary
keys only for now.
--
Vik

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marti Raudsepp (#3)
Re: Allowing NOT IN to use ANTI joins

Marti Raudsepp <marti@juffo.org> writes:

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT EXISTS
queries and leaves NOT IN alone. The reason for this is because the values
returned by a subquery in the IN clause could have NULLs.

I believe the reason why this hasn't been done yet, is that the plan
becomes invalid when another backend modifies the nullability of the
column. To get it to replan, you'd have to introduce a dependency on
the "NOT NULL" constraint, but it's impossible for now because there's
no pg_constraint entry for NOT NULLs.

I don't believe this is an issue, because we are only talking about a
*plan* depending on the NOT NULL condition. ALTER TABLE DROP NOT NULL
would result in a relcache inval event against the table, which would
result in invalidating all cached plans mentioning the table.

I forget exactly what context we were discussing needing a NOT NULL
constraint's OID for, but it would have to be something where the
dependency was longer-lived than a plan; perhaps semantics of a view?
The existing comparable case is that a view containing ungrouped
variable references is allowed if the GROUP BY includes a primary key,
which means the semantic validity of the view depends on the pkey.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Jeff Janes
jeff.janes@gmail.com
In reply to: David Rowley (#1)
Re: Allowing NOT IN to use ANTI joins

On Sun, Jun 8, 2014 at 5:36 AM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT
EXISTS queries and leaves NOT IN alone. The reason for this is because the
values returned by a subquery in the IN clause could have NULLs.

A simple example of this (without a subquery) is:

select 1 where 3 not in (1, 2, null); returns 0 rows because 3 <> NULL is
unknown.

The attached patch allows an ANTI-join plan to be generated in cases like:

CREATE TABLE a (id INT PRIMARY KEY, b_id INT NOT NULL);
CREATE TABLE b (id INT NOT NULL);

SELECT * FROM a WHERE b_id NOT IN(SELECT id FROM b);

To generate a plan like:
QUERY PLAN
-----------------------------------------------------------------
Hash Anti Join (cost=64.00..137.13 rows=1070 width=8)
Hash Cond: (a.b_id = b.id)
-> Seq Scan on a (cost=0.00..31.40 rows=2140 width=8)
-> Hash (cost=34.00..34.00 rows=2400 width=4)
-> Seq Scan on b (cost=0.00..34.00 rows=2400 width=4)

I think this will be great, I've run into this problem often from
applications I have no control over. I thought a more complete, but
probably much harder, solution would be to add some metadata to the hash
anti-join infrastructure that tells it "If you find any nulls in the outer
scan, stop running without returning any rows". I think that should work
because the outer rel already has to run completely before any rows can be
returned.

But what I can't figure out is, would that change obviate the need for your
change? Once we can correctly deal with nulls in a NOT IN list through a
hash anti join, is there a cost estimation advantage to being able to prove
that the that null can't occur? (And of course if you have code that
works, while I have vague notions of what might be, then my notion probably
does not block your code.)

Cheers,

Jeff

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#6)
Re: Allowing NOT IN to use ANTI joins

Jeff Janes <jeff.janes@gmail.com> writes:

On Sun, Jun 8, 2014 at 5:36 AM, David Rowley <dgrowleyml@gmail.com> wrote:

The attached patch allows an ANTI-join plan to be generated in cases like:
CREATE TABLE a (id INT PRIMARY KEY, b_id INT NOT NULL);
CREATE TABLE b (id INT NOT NULL);
SELECT * FROM a WHERE b_id NOT IN(SELECT id FROM b);

I think this will be great, I've run into this problem often from
applications I have no control over. I thought a more complete, but
probably much harder, solution would be to add some metadata to the hash
anti-join infrastructure that tells it "If you find any nulls in the outer
scan, stop running without returning any rows". I think that should work
because the outer rel already has to run completely before any rows can be
returned.

Huh? The point of an antijoin (or indeed most join methods) is that we
*don't* have to examine the whole inner input to make a decision.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Jeff Janes
jeff.janes@gmail.com
In reply to: Tom Lane (#7)
Re: Allowing NOT IN to use ANTI joins

On Monday, June 9, 2014, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.janes@gmail.com <javascript:;>> writes:

On Sun, Jun 8, 2014 at 5:36 AM, David Rowley <dgrowleyml@gmail.com

<javascript:;>> wrote:

The attached patch allows an ANTI-join plan to be generated in cases

like:

CREATE TABLE a (id INT PRIMARY KEY, b_id INT NOT NULL);
CREATE TABLE b (id INT NOT NULL);
SELECT * FROM a WHERE b_id NOT IN(SELECT id FROM b);

I think this will be great, I've run into this problem often from
applications I have no control over. I thought a more complete, but
probably much harder, solution would be to add some metadata to the hash
anti-join infrastructure that tells it "If you find any nulls in the

outer

scan, stop running without returning any rows". I think that should work
because the outer rel already has to run completely before any rows can

be

returned.

Huh? The point of an antijoin (or indeed most join methods) is that we
*don't* have to examine the whole inner input to make a decision.

But all hash join methods needs to examine the entire *outer* input, no?
Have I screwed up my terminology here?

If you are using NOT IN, then once you find a NULL in the outer input (if
the outer input is the in-list: clearly you can't reverse the two inputs in
this case), you don't even need to finish reading the outer input, nor
start reading the inner input, because all rows are automatically excluded
by the weird semantics of NOT IN.

Cheers,

Jeff

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#8)
Re: Allowing NOT IN to use ANTI joins

Jeff Janes <jeff.janes@gmail.com> writes:

On Monday, June 9, 2014, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Huh? The point of an antijoin (or indeed most join methods) is that we
*don't* have to examine the whole inner input to make a decision.

But all hash join methods needs to examine the entire *outer* input, no?
Have I screwed up my terminology here?

I think you're confusing inner and outer inputs --- the sub-select inside
the NOT IN is the inner input according to the way I think about it.
But I had assumed that was what you meant already.

If you are using NOT IN, then once you find a NULL in the outer input (if
the outer input is the in-list: clearly you can't reverse the two inputs in
this case), you don't even need to finish reading the outer input, nor
start reading the inner input, because all rows are automatically excluded
by the weird semantics of NOT IN.

The point I'm trying to make is that the goal of most join types is to
give an answer without having necessarily read all of either input.
For instance, if we tried to do this with a mergejoin it wouldn't work
reliably: it might suppose that it could accept an outer row on the basis
of no match in a higher-order sort column before it'd reached any nulls
appearing in lower-order sort columns.

You might be right that we could hot-wire the hash join case in
particular, but I'm failing to see the point of expending lots of extra
effort in order to deliver a useless answer faster. If there are NULLs
in the inner input, then NOT IN is simply the wrong query to make, and
giving an empty output in a relatively short amount of time isn't going
to help clue the user in on that. (If the SQL standard would let us do
so, I'd be arguing for throwing an error if we found a NULL.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10David Rowley
dgrowleyml@gmail.com
In reply to: Marti Raudsepp (#3)
Re: Allowing NOT IN to use ANTI joins

On Mon, Jun 9, 2014 at 11:20 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT

EXISTS

queries and leaves NOT IN alone. The reason for this is because the

values

returned by a subquery in the IN clause could have NULLs.

I believe the reason why this hasn't been done yet, is that the plan
becomes invalid when another backend modifies the nullability of the
column. To get it to replan, you'd have to introduce a dependency on
the "NOT NULL" constraint, but it's impossible for now because there's
no pg_constraint entry for NOT NULLs.

The only way to consistently guarantee nullability is through primary
key constraints. Fortunately that addresses most of the use cases of
NOT IN(), in my experience.

I tried to break this by putting a break point
in convert_ANY_sublink_to_join in session 1. Not that it really had to be
in that function, I just wanted it to stop during planning and before the
plan is executed.

-- session 1
select * from n1 where id not in(select id from n1); -- hits breakpoint in
convert_ANY_sublink_to_join

-- session 2
alter table n2 alter column id drop not null;

insert into n2 values(null);

I see that session 2 blocks in the alter table until session 1 completes.

I've not really checked out the code in detail around when the snapshot is
taken and the transaction ID is generated, but as long as the transaction
id is taken before we start planning in session 1 then it should not matter
if another session drops the constraint and inserts a NULL value as we
won't see that NULL value in our transaction... I'd assume that the
transaction has to start before it grabs the table defs that are required
for planning. Or have I got something wrong?

See the comment in check_functional_grouping:

* Currently we only check to see if the rel has a primary key that is a
* subset of the grouping_columns. We could also use plain unique
constraints
* if all their columns are known not null, but there's a problem: we need
* to be able to represent the not-null-ness as part of the constraints
added
* to *constraintDeps. FIXME whenever not-null constraints get represented
* in pg_constraint.

I saw that, but I have to say I've not fully got my head around why that's
needed just yet.

The behavior you want seems somewhat similar to
check_functional_grouping; maybe you could unify it with your
targetListIsGuaranteedNotToHaveNulls at some level. (PS: that's one
ugly function name :)

Agreed :) Originally I had put the code that does that
in convert_ANY_sublink_to_join, but at the last minute before posting the
patch I decided that it might be useful and reusable so moved it out to
that function. I'll try and think of something better, but I'm open to
ideas.

Regards

David Rowley

#11David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#9)
Re: Allowing NOT IN to use ANTI joins

On Tue, Jun 10, 2014 at 2:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.janes@gmail.com> writes:

If you are using NOT IN, then once you find a NULL in the outer input (if
the outer input is the in-list: clearly you can't reverse the two inputs

in

this case), you don't even need to finish reading the outer input, nor
start reading the inner input, because all rows are automatically

excluded

by the weird semantics of NOT IN.

The point I'm trying to make is that the goal of most join types is to
give an answer without having necessarily read all of either input.
For instance, if we tried to do this with a mergejoin it wouldn't work
reliably: it might suppose that it could accept an outer row on the basis
of no match in a higher-order sort column before it'd reached any nulls
appearing in lower-order sort columns.

You might be right that we could hot-wire the hash join case in
particular, but I'm failing to see the point of expending lots of extra
effort in order to deliver a useless answer faster. If there are NULLs
in the inner input, then NOT IN is simply the wrong query to make, and
giving an empty output in a relatively short amount of time isn't going
to help clue the user in on that. (If the SQL standard would let us do
so, I'd be arguing for throwing an error if we found a NULL.)

This got me thinking. It is probably a bit useless and wrong to perform a
NOT IN when the subquery in the IN() clause can have NULL values, so I
guess in any realistic useful case, the user will either have a NOT NULL
constraint on the columns, or they'll do a WHERE col IS NOT NULL, so I
should likely also allow a query such as:

SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col IS NOT NULL);

to also perform an ANTI JOIN. I think it's just a matter of
changing targetListIsGuaranteedNotToHaveNulls so that if it does not find
the NOT NULL constraint, to check the WHERE clause of the query to see if
there's any not null quals.

I'm about to put this to the test, but if it works then I think it should
cover many more cases for using NOT IN(), I guess we're only leaving out
function calls and calculations in the target list.

Regards

David Rowley

#12Marti Raudsepp
marti@juffo.org
In reply to: David Rowley (#10)
Re: Allowing NOT IN to use ANTI joins

On Wed, Jun 11, 2014 at 11:53 AM, David Rowley <dgrowleyml@gmail.com> wrote:

The only way to consistently guarantee nullability is through primary
key constraints. Fortunately that addresses most of the use cases of
NOT IN(), in my experience.

See the comment in check_functional_grouping:

I saw that, but I have to say I've not fully got my head around why that's
needed just yet.

I was wrong, see Tom's reply to my email. It's OK to rely on
attnotnull for optimization decisions. The plan will be invalidated
automatically when the nullability of a referenced column changes.

check_functional_grouping needs special treatment because it decides
whether to accept/reject views; and if it has allowed creating a view,
it needs to guarantee that the dependent constraint isn't dropped for
a longer term.

as long as the transaction id
is taken before we start planning in session 1 then it should not matter if
another session drops the constraint and inserts a NULL value as we won't
see that NULL value in our transaction... I'd assume that the transaction
has to start before it grabs the table defs that are required for planning.
Or have I got something wrong?

1. You're assuming that query plans can only survive for the length of
a transaction. That's not true, prepared query plans can span many
transactions.

2. Also a FOR UPDATE clause can return values "from the future", if
another transaction has modified the value and already committed.

But this whole issue is moot anyway, the plan will get invalidated
when the nullability changes.

Regards,
Marti

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Marti Raudsepp
marti@juffo.org
In reply to: David Rowley (#1)
Re: Allowing NOT IN to use ANTI joins

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT EXISTS
queries and leaves NOT IN alone. The reason for this is because the values
returned by a subquery in the IN clause could have NULLs.

There's a bug in targetListIsGuaranteedNotToHaveNulls, you have to
drill deeper into the query to guarantee the nullability of a result
column. If a table is OUTER JOINed, it can return NULLs even if the
original column specification has NOT NULL.

This test case produces incorrect results with your patch:

create table a (x int not null);
create table b (x int not null, y int not null);
insert into a values(1);
select * from a where x not in (select y from a left join b using (x));

Unpatched version correctly returns 0 rows since "y" will be NULL.
Your patch returns the value 1 from a.

Regards,
Marti

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14David Rowley
dgrowleyml@gmail.com
In reply to: Marti Raudsepp (#13)
Re: Allowing NOT IN to use ANTI joins

On Wed, Jun 11, 2014 at 9:32 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT

EXISTS

queries and leaves NOT IN alone. The reason for this is because the

values

returned by a subquery in the IN clause could have NULLs.

There's a bug in targetListIsGuaranteedNotToHaveNulls, you have to
drill deeper into the query to guarantee the nullability of a result
column. If a table is OUTER JOINed, it can return NULLs even if the
original column specification has NOT NULL.

This test case produces incorrect results with your patch:

create table a (x int not null);
create table b (x int not null, y int not null);
insert into a values(1);
select * from a where x not in (select y from a left join b using (x));

Unpatched version correctly returns 0 rows since "y" will be NULL.
Your patch returns the value 1 from a.

Thanks, I actually was just looking at that. I guess I'll just need to make
sure that nothing in the targetlist comes from an outer join.

Regards

David Rowley

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marti Raudsepp (#12)
Re: Allowing NOT IN to use ANTI joins

Marti Raudsepp <marti@juffo.org> writes:

On Wed, Jun 11, 2014 at 11:53 AM, David Rowley <dgrowleyml@gmail.com> wrote:

as long as the transaction id
is taken before we start planning in session 1 then it should not matter if
another session drops the constraint and inserts a NULL value as we won't
see that NULL value in our transaction... I'd assume that the transaction
has to start before it grabs the table defs that are required for planning.
Or have I got something wrong?

1. You're assuming that query plans can only survive for the length of
a transaction. That's not true, prepared query plans can span many
transactions.

2. Also a FOR UPDATE clause can return values "from the future", if
another transaction has modified the value and already committed.

But this whole issue is moot anyway, the plan will get invalidated
when the nullability changes.

Right. The key point for David's concern is that we always hold (at
least) AccessShareLock on every relation used in a query, continuously
from rewrite through to the end of execution. This will block any attempt
by other transactions to make schema changes in the relation(s).
In the case of re-using a prepared plan, we re-acquire all these locks
and then check to see if we received any invalidation messages that
render the plan invalid; if not, we can proceed to execution with the same
safety guarantees as originally. (If we did, we replan starting from the
raw parse tree.)

If we didn't have mechanisms like this, we'd have far worse hazards from
ALTER TABLE than whether the planner made an incorrect join optimization.
Consider ALTER COLUMN TYPE for instance.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Greg Stark
stark@mit.edu
In reply to: Tom Lane (#15)
Re: Allowing NOT IN to use ANTI joins

On Wed, Jun 11, 2014 at 3:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If we didn't have mechanisms like this, we'd have far worse hazards from
ALTER TABLE than whether the planner made an incorrect join optimization.
Consider ALTER COLUMN TYPE for instance.

Obviously not general cases of ALTER COLUMN TYPE but dropping a NULL
constraint seems like the kind of change targeted by Simon's "reduce
lock strength" patch that I'm sure he's still interested in. I think
that patch, while full of dragons to steer around, is something that
will keep coming up again and again in the future. It's a huge
operational risk that even these short exclusive locks can cause a
huge production outage if they happen to get queued up behind a
reporting query.

I don't think it changes anything for this patch -- right now the
world is arranged the way Tom described -- but it's something to keep
in mind when we talk about lock strength reduction and the impact on
existing queries. For example if there's an UPDATE query in repeatable
read mode that has an IN clause like this and was optimized
accordingly then any lock strength reduction patch would have to
beware that an ALTER TABLE that dropped the NULL clause might impact
the update query.

Incidentally, Oracle has a feature for online schema changes that we
might end up having to implement something similar. The good news is
we have the infrastructure to maybe do it. The idea is to start
capturing all the changes to the table using something like our
logical changeset extraction. Then do the equivalent of "create
newtable as select ... from oldtable" to create the new schema, then
start replaying the accumulated changes to the new table. Eventually
when the change queue drains then get an exclusive lock, drain any new
changes, and swap in the new table with the new schema.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17David Rowley
dgrowleyml@gmail.com
In reply to: Marti Raudsepp (#13)
1 attachment(s)
Re: Allowing NOT IN to use ANTI joins

On Wed, Jun 11, 2014 at 9:32 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT

EXISTS

queries and leaves NOT IN alone. The reason for this is because the

values

returned by a subquery in the IN clause could have NULLs.

There's a bug in targetListIsGuaranteedNotToHaveNulls, you have to
drill deeper into the query to guarantee the nullability of a result
column. If a table is OUTER JOINed, it can return NULLs even if the
original column specification has NOT NULL.

This test case produces incorrect results with your patch:

create table a (x int not null);
create table b (x int not null, y int not null);
insert into a values(1);
select * from a where x not in (select y from a left join b using (x));

Unpatched version correctly returns 0 rows since "y" will be NULL.
Your patch returns the value 1 from a.

I'm a bit stuck on fixing this and I can't quite figure out how I should
tell if the TargetEntry is coming from an outer join.

My first attempt does not work as it seems that I'm looking up the wrong
RangeTblEntry with the following:

rte = rt_fetch(tlevar->varno, query->rtable);

if (IS_OUTER_JOIN(rte->jointype))
return true; /* Var from an outer join */

The jointype returns JOIN_INNER when loooking up the RangeTblEntry from the
TargetEntry's varno. It seems that the RangeTblEntry that I need is stored
in query->rtable, but I've just no idea how to tell which item in the list
it is. So if anyone can point me in the right direction then that would be
really useful.

On a more positive or even slightly exciting note I think I've managed to
devise a way that ANTI JOINS can be used for NOT IN much more often. It
seems that find_nonnullable_vars will analyse a quals list to find
expressions that mean that the var cannot be NULL. This means we can
perform ANTI JOINS for NOT IN with queries like:

SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col = 1);
or
SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col IS NOT NULL);

(The attached patch implements this)

the nullable_col =1 will mean that nullable_col cannot be NULL, so the ANTI
JOIN can be performed safely. I think this combined with the NOT NULL check
will cover probably just about all valid uses of NOT IN with a subquery...
unless of course I've assumed something wrongly about
find_nonnullable_vars. I just need the correct RangeTblEntry in order to
determine if the TargetEntry is from an out join.

The attached patch is a broken implemention that still needs the lookup
code fixed to reference the correct RTE. The failing regression tests show
where the problems lie.

Any help on this would be really appreciated.

Regards

David Rowley

Attachments:

not_in_anti_join_v0.5_broken.patchapplication/octet-stream; name=not_in_anti_join_v0.5_broken.patchDownload
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index be92049..1d8614e 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -27,6 +27,7 @@
 #include "optimizer/prep.h"
 #include "optimizer/subselect.h"
 #include "optimizer/var.h"
+#include "parser/parse_clause.h"
 #include "parser/parse_relation.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/builtins.h"
@@ -1176,7 +1177,7 @@ SS_process_ctes(PlannerInfo *root)
  */
 JoinExpr *
 convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
-							Relids available_rels)
+							bool under_not, Relids available_rels)
 {
 	JoinExpr   *result;
 	Query	   *parse = root->parse;
@@ -1191,6 +1192,15 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	Assert(sublink->subLinkType == ANY_SUBLINK);
 
 	/*
+	 * For the case of NOT IN, we can only perform this optimization if we are
+	 * guaranteed that we'll never get NULL values from the subquery results.
+	 * For example "SELECT 1 WHERE 3 NOT IN(1,2,NULL)", the query will produce
+	 * 0 rows due to the fact that 3 <> NULL is unknown.
+	 */
+	if (under_not && queryTargetListCanHaveNulls(root, subselect))
+		return NULL;
+
+	/*
 	 * The sub-select must not refer to any Vars of the parent query. (Vars of
 	 * higher levels should be okay, though.)
 	 */
@@ -1256,7 +1266,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	 * And finally, build the JoinExpr node.
 	 */
 	result = makeNode(JoinExpr);
-	result->jointype = JOIN_SEMI;
+	result->jointype = under_not ? JOIN_ANTI : JOIN_SEMI;
 	result->isNatural = false;
 	result->larg = NULL;		/* caller must fill this in */
 	result->rarg = (Node *) rtr;
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 776fe42..7ea165d 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -334,7 +334,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 		/* Is it a convertible ANY or EXISTS clause? */
 		if (sublink->subLinkType == ANY_SUBLINK)
 		{
-			if ((j = convert_ANY_sublink_to_join(root, sublink,
+			if ((j = convert_ANY_sublink_to_join(root, sublink, false,
 												 available_rels1)) != NULL)
 			{
 				/* Yes; insert the new join node into the join tree */
@@ -360,7 +360,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 				return NULL;
 			}
 			if (available_rels2 != NULL &&
-				(j = convert_ANY_sublink_to_join(root, sublink,
+				(j = convert_ANY_sublink_to_join(root, sublink, false,
 												 available_rels2)) != NULL)
 			{
 				/* Yes; insert the new join node into the join tree */
@@ -452,7 +452,61 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 
 		if (sublink && IsA(sublink, SubLink))
 		{
-			if (sublink->subLinkType == EXISTS_SUBLINK)
+			if (sublink->subLinkType == ANY_SUBLINK)
+			{
+				if ((j = convert_ANY_sublink_to_join(root, sublink, true,
+												   available_rels1)) != NULL)
+				{
+					/* Yes; insert the new join node into the join tree */
+					j->larg = *jtlink1;
+					*jtlink1 = (Node *) j;
+					/* Recursively process pulled-up jointree nodes */
+					j->rarg = pull_up_sublinks_jointree_recurse(root,
+																j->rarg,
+																&child_rels);
+
+					/*
+					 * Now recursively process the pulled-up quals.  Because
+					 * we are underneath a NOT, we can't pull up sublinks that
+					 * reference the left-hand stuff, but it's still okay to
+					 * pull up sublinks referencing j->rarg.
+					 */
+					j->quals = pull_up_sublinks_qual_recurse(root,
+															 j->quals,
+															 &j->rarg,
+															 child_rels,
+															 NULL, NULL);
+					/* Return NULL representing constant TRUE */
+					return NULL;
+				}
+				if (available_rels2 != NULL &&
+					(j = convert_ANY_sublink_to_join(root, sublink, true,
+												   available_rels2)) != NULL)
+				{
+					/* Yes; insert the new join node into the join tree */
+					j->larg = *jtlink2;
+					*jtlink2 = (Node *) j;
+					/* Recursively process pulled-up jointree nodes */
+					j->rarg = pull_up_sublinks_jointree_recurse(root,
+																j->rarg,
+																&child_rels);
+
+					/*
+					 * Now recursively process the pulled-up quals.  Because
+					 * we are underneath a NOT, we can't pull up sublinks that
+					 * reference the left-hand stuff, but it's still okay to
+					 * pull up sublinks referencing j->rarg.
+					 */
+					j->quals = pull_up_sublinks_qual_recurse(root,
+															 j->quals,
+															 &j->rarg,
+															 child_rels,
+															 NULL, NULL);
+					/* Return NULL representing constant TRUE */
+					return NULL;
+				}
+			}
+			else if (sublink->subLinkType == EXISTS_SUBLINK)
 			{
 				if ((j = convert_EXISTS_sublink_to_join(root, sublink, true,
 												   available_rels1)) != NULL)
@@ -506,6 +560,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 					return NULL;
 				}
 			}
+
 		}
 		/* Else return it unmodified */
 		return node;
diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c
index fcee137..7de2a52 100644
--- a/src/backend/parser/parse_clause.c
+++ b/src/backend/parser/parse_clause.c
@@ -21,6 +21,7 @@
 #include "commands/defrem.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
 #include "optimizer/tlist.h"
 #include "parser/analyze.h"
 #include "parser/parsetree.h"
@@ -2406,6 +2407,115 @@ assignSortGroupRef(TargetEntry *tle, List *tlist)
 	return tle->ressortgroupref;
 }
 
+
+/*
+ * queryTargetListListCanHaveNulls
+ *		Checks if a query's targetList could produce NULL values. We only
+ *		return false if we can prove without question that no NULLs can exist
+ *		in all target list entries in the query.
+ *
+ * Note: resjunk targets are ignored here.
+ */
+bool
+queryTargetListCanHaveNulls(PlannerInfo *root, Query *query)
+{
+	List	 *local_nonnullable_vars;
+	bool	  computed_nonnullable_vars = false;
+	ListCell *tl;
+	Node	 *node;
+
+	/*
+	 * It should also be possible to determine if no NULLs can exist in the
+	 * results even when set operators are present in the query, but for now
+	 * we'll just report that NULLs are possible. It may be worth fixing this
+	 * up in the future, but at the time of writing this function, no call
+	 * sites existed which would call the function if the query contained set
+	 * operators.
+	 */
+	if (query->setOperations)
+		return true;
+
+
+	foreach(tl, query->targetList)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tl);
+
+		/* ignore columns which won't be in the final results */
+		if (tle->resjunk)
+			continue;
+
+		node = (Node *) tle->expr;
+
+		/* for Constant expressions we can tell directly if they're NULL or not. */
+		if (IsA(node, Const))
+		{
+			if (((Const *) node)->constisnull)
+				return true;
+		}
+
+		else if (IsA(node, Var))
+		{
+			ListCell	  *lc;
+			RangeTblEntry *rte;
+			bool		   matched;
+			Var			  *tlevar = (Var *) node;
+			//Relids		   joinrelids;
+			/*
+			 * Now let's look at the WHERE clause items to see if there's anything in
+			 * there that will disallow NULLs from the results.
+			 */
+			if (!computed_nonnullable_vars)
+			{
+				local_nonnullable_vars = find_nonnullable_vars(query->jointree->quals);
+				computed_nonnullable_vars = true;
+			}
+
+			matched = false;
+			foreach(lc, local_nonnullable_vars)
+			{
+				Var *var = (Var *) lfirst(lc);
+
+				if (var->varno == tlevar->varno && var->varattno == tlevar->varattno)
+				{
+					matched = true;
+					break;
+				}
+			}
+
+			/* Did we find the var non-nullable? */
+			if (matched)
+				continue; /* ok onto the next one */
+
+			/*
+			 * If we make it down here then we've been unable to find any quals
+			 * then ensure that the TargetEntry won't be NULL. All we have left
+			 * to go on is a NOT NULL constraint on the column, though this is
+			 * only going to prove that no nulls can be present if the relation
+			 * that the column belongs to is NOT from an OUTER JOIN.
+			 *
+			 * First we'll ensure we're dealing with a Var that's directly from
+			 * a table, then we'll ensure it's from an INNER JOIN type, finally
+			 * we'll look for a NOT NULL constraint on the column.
+			 */
+
+			if (!OidIsValid(tle->resorigtbl))
+				return true; /* Var is not from a table */
+
+			/* WRONG RangeTblEntry!!! */
+			rte =  rt_fetch(tlevar->varno, query->rtable);
+
+			if (IS_OUTER_JOIN(rte->jointype))
+				return true; /* Var from an outer join */
+
+			if (!get_attnotnull(tle->resorigtbl, tle->resorigcol))
+				return true; /* No NOT NULL constraint */
+		}
+		else
+			return true; /* not a Const or a Var */
+	}
+	return false; /* Cannot have NULLs */
+}
+
 /*
  * targetIsInSortList
  *		Is the given target item already in the sortlist?
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 4b5ef99..c98adc7 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -816,6 +816,37 @@ get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype, int16 procnum)
 /*				---------- ATTRIBUTE CACHES ----------					 */
 
 /*
+ * get_attnotnull
+ *		Returns true if pg_attribute.attnotnull is true, otherwise returns
+ *		false. An error is raised if no record is found for the relid/attnum.
+ *
+ * Note: Calling functions should be careful and test relid for InvalidOid
+ * before calling this function.
+ */
+bool
+get_attnotnull(Oid relid, AttrNumber attnum)
+{
+	HeapTuple	tp;
+
+	tp = SearchSysCache2(ATTNUM,
+						 ObjectIdGetDatum(relid),
+						 Int16GetDatum(attnum));
+	if (HeapTupleIsValid(tp))
+	{
+		Form_pg_attribute att_tup = (Form_pg_attribute) GETSTRUCT(tp);
+		bool result = att_tup->attnotnull;
+		ReleaseSysCache(tp);
+		return result;
+	}
+	else
+	{
+		elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+			attnum, relid);
+		return false; /* keep compiler quiet */
+	}
+}
+
+/*
  * get_attname
  *		Given the relation id and the attribute number,
  *		return the "attname" field from the attribute relation.
diff --git a/src/include/optimizer/subselect.h b/src/include/optimizer/subselect.h
index 5607e98..3e8bfe7 100644
--- a/src/include/optimizer/subselect.h
+++ b/src/include/optimizer/subselect.h
@@ -18,6 +18,7 @@
 extern void SS_process_ctes(PlannerInfo *root);
 extern JoinExpr *convert_ANY_sublink_to_join(PlannerInfo *root,
 							SubLink *sublink,
+							bool under_not,
 							Relids available_rels);
 extern JoinExpr *convert_EXISTS_sublink_to_join(PlannerInfo *root,
 							   SubLink *sublink,
diff --git a/src/include/parser/parse_clause.h b/src/include/parser/parse_clause.h
index e9e7cdc..8201574 100644
--- a/src/include/parser/parse_clause.h
+++ b/src/include/parser/parse_clause.h
@@ -15,6 +15,7 @@
 #define PARSE_CLAUSE_H
 
 #include "parser/parse_node.h"
+#include "nodes/relation.h"
 
 extern void transformFromClause(ParseState *pstate, List *frmList);
 extern int setTargetTable(ParseState *pstate, RangeVar *relation,
@@ -47,5 +48,6 @@ extern List *addTargetToSortList(ParseState *pstate, TargetEntry *tle,
 					bool resolveUnknown);
 extern Index assignSortGroupRef(TargetEntry *tle, List *tlist);
 extern bool targetIsInSortList(TargetEntry *tle, Oid sortop, List *sortList);
+extern bool queryTargetListCanHaveNulls(PlannerInfo *root, Query *query);
 
 #endif   /* PARSE_CLAUSE_H */
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index f46460a..5e7d946 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -63,6 +63,7 @@ extern List *get_op_btree_interpretation(Oid opno);
 extern bool equality_ops_are_compatible(Oid opno1, Oid opno2);
 extern Oid get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype,
 				  int16 procnum);
+extern bool get_attnotnull(Oid relid, AttrNumber attnum);
 extern char *get_attname(Oid relid, AttrNumber attnum);
 extern char *get_relid_attribute_name(Oid relid, AttrNumber attnum);
 extern AttrNumber get_attnum(Oid relid, const char *attname);
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index ce15af7..970a224 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -739,3 +739,207 @@ select * from int4_tbl where
   0
 (1 row)
 
+--
+-- Check NOT IN performs ANTI JOIN when subquery columns are NOT NULL
+-- and does not when subquery columns can contain NULLs.
+--
+BEGIN;
+CREATE TEMP TABLE a (id INT PRIMARY KEY);
+CREATE TEMP TABLE b (x INT NOT NULL, y INT);
+CREATE TEMP TABLE c (z INT NOT NULL);
+-- ANTI JOIN. x is defined as NOT NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x FROM b);
+               QUERY PLAN                
+-----------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = b.x)
+   ->  Index Only Scan using a_pkey on a
+   ->  Sort
+         Sort Key: b.x
+         ->  Seq Scan on b
+(6 rows)
+
+-- No ANTI JOIN, y can be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- No ANTI JOIN, x is NOT NULL, but we don't know if + 1 will change that.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x+1 FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- ANTI JOIN 1 is a Const that is not null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT 1 FROM b);
+        QUERY PLAN         
+---------------------------
+ Nested Loop Anti Join
+   Join Filter: (a.id = 1)
+   ->  Seq Scan on a
+   ->  Materialize
+         ->  Seq Scan on b
+(5 rows)
+
+-- No ANTI JOIN, results contain a NULL Const
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT NULL::int FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- ANTI JOIN y = 1 means y can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1);
+          QUERY PLAN           
+-------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: (y = 1)
+(6 rows)
+
+-- No ANTI JOIN, OR condition does not ensure y = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1 OR x = 1);
+               QUERY PLAN               
+----------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+           Filter: ((y = 1) OR (x = 1))
+(5 rows)
+
+-- No ANTI JOIN, OR condition does not ensure y = 1 or y = 2
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND (y = 2 OR x = 2));
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+           Filter: (((y = 1) OR (x = 1)) AND ((y = 2) OR (x = 2)))
+(5 rows)
+
+-- ANTI JOIN y must be 2, so can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND y = 2);
+                        QUERY PLAN                        
+----------------------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: ((y = 2) AND ((y = 1) OR (x = 1)))
+(6 rows)
+
+-- ANTI JOIN y can be 1 or 2, but can't be null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR y = 2));
+                 QUERY PLAN                 
+--------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: ((y = 1) OR (y = 2))
+(6 rows)
+
+-- No ANTI JOIN c.z is from an outer join so it can have nulls.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Merge Left Join
+           Merge Cond: (b.x = c.z)
+           ->  Sort
+                 Sort Key: b.x
+                 ->  Seq Scan on b
+           ->  Sort
+                 Sort Key: c.z
+                 ->  Seq Scan on c
+(11 rows)
+
+-- ANTI JOIN, c.z is from an inner join and has a NOT NULL constraint.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b INNER JOIN c ON b.x = c.z);
+               QUERY PLAN                
+-----------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = c.z)
+   ->  Index Only Scan using a_pkey on a
+   ->  Materialize
+         ->  Merge Join
+               Merge Cond: (b.x = c.z)
+               ->  Sort
+                     Sort Key: b.x
+                     ->  Seq Scan on b
+               ->  Sort
+                     Sort Key: c.z
+                     ->  Seq Scan on c
+(12 rows)
+
+-- ANTI JOIN, c.z must be 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z = 1);
+                QUERY PLAN                 
+-------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = c.z)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Nested Loop
+               ->  Seq Scan on c
+                     Filter: (z = 1)
+               ->  Materialize
+                     ->  Seq Scan on b
+                           Filter: (x = 1)
+(10 rows)
+
+-- ANTI JOIN, c.z can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z IS NOT NULL);
+                    QUERY PLAN                     
+---------------------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = c.z)
+   ->  Index Only Scan using a_pkey on a
+   ->  Materialize
+         ->  Merge Join
+               Merge Cond: (b.x = c.z)
+               ->  Sort
+                     Sort Key: b.x
+                     ->  Seq Scan on b
+               ->  Sort
+                     Sort Key: c.z
+                     ->  Seq Scan on c
+                           Filter: (z IS NOT NULL)
+(13 rows)
+
+ROLLBACK;
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index 326fd70..53ba798 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -422,3 +422,72 @@ select * from int4_tbl where
 select * from int4_tbl where
   (case when f1 in (select unique1 from tenk1 a) then f1 else null end) in
   (select ten from tenk1 b);
+
+--
+-- Check NOT IN performs ANTI JOIN when subquery columns are NOT NULL
+-- and does not when subquery columns can contain NULLs.
+--
+
+BEGIN;
+
+CREATE TEMP TABLE a (id INT PRIMARY KEY);
+CREATE TEMP TABLE b (x INT NOT NULL, y INT);
+CREATE TEMP TABLE c (z INT NOT NULL);
+
+-- ANTI JOIN. x is defined as NOT NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x FROM b);
+
+-- No ANTI JOIN, y can be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b);
+
+-- No ANTI JOIN, x is NOT NULL, but we don't know if + 1 will change that.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x+1 FROM b);
+
+-- ANTI JOIN 1 is a Const that is not null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT 1 FROM b);
+
+-- No ANTI JOIN, results contain a NULL Const
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT NULL::int FROM b);
+
+-- ANTI JOIN y = 1 means y can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1);
+
+-- No ANTI JOIN, OR condition does not ensure y = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1 OR x = 1);
+
+-- No ANTI JOIN, OR condition does not ensure y = 1 or y = 2
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND (y = 2 OR x = 2));
+
+-- ANTI JOIN y must be 2, so can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND y = 2);
+
+-- ANTI JOIN y can be 1 or 2, but can't be null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR y = 2));
+
+-- No ANTI JOIN c.z is from an outer join so it can have nulls.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z);
+
+-- ANTI JOIN, c.z is from an inner join and has a NOT NULL constraint.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b INNER JOIN c ON b.x = c.z);
+
+-- ANTI JOIN, c.z must be 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z = 1);
+
+-- ANTI JOIN, c.z can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z IS NOT NULL);
+
+ROLLBACK;
#18Simon Riggs
simon@2ndQuadrant.com
In reply to: David Rowley (#17)
Re: Allowing NOT IN to use ANTI joins

On 24 June 2014 11:32, David Rowley <dgrowleyml@gmail.com> wrote:

So if anyone can point me in the right direction then that would be
really useful.

Many things can be added simply, but most things can't. It seems we
just don't have that information. If we did, Tom would have done this
already.

On a more positive or even slightly exciting note I think I've managed to
devise a way that ANTI JOINS can be used for NOT IN much more often. It
seems that find_nonnullable_vars will analyse a quals list to find
expressions that mean that the var cannot be NULL. This means we can perform
ANTI JOINS for NOT IN with queries like:

SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col = 1);
or
SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col IS NOT NULL);

(The attached patch implements this)

the nullable_col =1 will mean that nullable_col cannot be NULL, so the ANTI
JOIN can be performed safely. I think this combined with the NOT NULL check
will cover probably just about all valid uses of NOT IN with a subquery...
unless of course I've assumed something wrongly about find_nonnullable_vars.
I just need the correct RangeTblEntry in order to determine if the
TargetEntry is from an out join.

This is the better way to go. It's much better to have explicit proof
its not null than a possibly long chain of metadata that might be
buggy.

The attached patch is a broken implemention that still needs the lookup code
fixed to reference the correct RTE. The failing regression tests show where
the problems lie.

Any help on this would be really appreciated.

I'd suggest we just drop the targetlist approach completely.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Stark (#16)
Re: Allowing NOT IN to use ANTI joins

On 11 June 2014 17:52, Greg Stark <stark@mit.edu> wrote:

On Wed, Jun 11, 2014 at 3:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If we didn't have mechanisms like this, we'd have far worse hazards from
ALTER TABLE than whether the planner made an incorrect join optimization.
Consider ALTER COLUMN TYPE for instance.

Obviously not general cases of ALTER COLUMN TYPE but dropping a NULL
constraint seems like the kind of change targeted by Simon's "reduce
lock strength" patch that I'm sure he's still interested in. I think
that patch, while full of dragons to steer around, is something that
will keep coming up again and again in the future. It's a huge
operational risk that even these short exclusive locks can cause a
huge production outage if they happen to get queued up behind a
reporting query.

The focus of the lock strength reduction was around actions that lock
the table for extended periods. So it was mostly about adding things.
All the DROP actions are still AccessExclusiveLocks and will be for a
while.

Having said that, any join plan that relies upon a constraint will
still be valid even if we drop a constraint while the plan executes
because any new writes will not be visible to the executing join plan.
If we are relaxing a constraint, then a writable query that still
thinks a constraint exists won't cause a problem - it may error out
when it need not, but that's not so bad as to be worth worrying about.

So I think we can remove a NOT NULL constraint without too much problem.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#19)
Re: Allowing NOT IN to use ANTI joins

Simon Riggs <simon@2ndQuadrant.com> writes:

Having said that, any join plan that relies upon a constraint will
still be valid even if we drop a constraint while the plan executes
because any new writes will not be visible to the executing join plan.

mumble ... EvalPlanQual ?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#20)
Re: Allowing NOT IN to use ANTI joins

On 24 June 2014 23:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Having said that, any join plan that relies upon a constraint will
still be valid even if we drop a constraint while the plan executes
because any new writes will not be visible to the executing join plan.

mumble ... EvalPlanQual ?

As long as we are relaxing a constraint, we are OK if an earlier
snapshot thinks its dealing with a tighter constraint whereas the new
reality is a relaxed constraint.

The worst that could happen is we hit an ERROR from a constraint that
was in force at the start of the query, so for consistency we really
should be enforcing the same constraint throughout the lifetime of the
query.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#21)
Re: Allowing NOT IN to use ANTI joins

Simon Riggs <simon@2ndQuadrant.com> writes:

On 24 June 2014 23:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Having said that, any join plan that relies upon a constraint will
still be valid even if we drop a constraint while the plan executes
because any new writes will not be visible to the executing join plan.

mumble ... EvalPlanQual ?

As long as we are relaxing a constraint, we are OK if an earlier
snapshot thinks its dealing with a tighter constraint whereas the new
reality is a relaxed constraint.

I guess I should have been more explicit: EvalPlanQual processing could
see newer versions of tuples that might not satisfy the constraints the
plan was designed against. Now, this is true only for the tuple that's
the target of the UPDATE/DELETE, so it's possible you could prove that
there's no problem --- but it would take careful analysis of the specific
semantics of the constraints in question. I don't believe the argument
you've made here holds up.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#22)
Re: Allowing NOT IN to use ANTI joins

On 24 June 2014 23:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On 24 June 2014 23:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Having said that, any join plan that relies upon a constraint will
still be valid even if we drop a constraint while the plan executes
because any new writes will not be visible to the executing join plan.

mumble ... EvalPlanQual ?

As long as we are relaxing a constraint, we are OK if an earlier
snapshot thinks its dealing with a tighter constraint whereas the new
reality is a relaxed constraint.

I guess I should have been more explicit: EvalPlanQual processing could
see newer versions of tuples that might not satisfy the constraints the
plan was designed against. Now, this is true only for the tuple that's
the target of the UPDATE/DELETE, so it's possible you could prove that
there's no problem --- but it would take careful analysis of the specific
semantics of the constraints in question. I don't believe the argument
you've made here holds up.

OK, thanks for raising that. You're better at seeing these things than I.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#18)
Re: Allowing NOT IN to use ANTI joins

On 24 June 2014 23:22, Simon Riggs <simon@2ndquadrant.com> wrote:

On a more positive or even slightly exciting note I think I've managed to
devise a way that ANTI JOINS can be used for NOT IN much more often. It
seems that find_nonnullable_vars will analyse a quals list to find
expressions that mean that the var cannot be NULL. This means we can perform
ANTI JOINS for NOT IN with queries like:

SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col = 1);
or
SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col IS NOT NULL);

(The attached patch implements this)

the nullable_col =1 will mean that nullable_col cannot be NULL, so the ANTI
JOIN can be performed safely. I think this combined with the NOT NULL check
will cover probably just about all valid uses of NOT IN with a subquery...
unless of course I've assumed something wrongly about find_nonnullable_vars.
I just need the correct RangeTblEntry in order to determine if the
TargetEntry is from an out join.

This is the better way to go. It's much better to have explicit proof
its not null than a possibly long chain of metadata that might be
buggy.

The attached patch is a broken implemention that still needs the lookup code
fixed to reference the correct RTE. The failing regression tests show where
the problems lie.

Any help on this would be really appreciated.

I'd suggest we just drop the targetlist approach completely.

To be clearer, what I mean is we use only the direct proof approach,
for queries like this

SELECT * FROM a WHERE id NOT IN(SELECT unknown_col FROM b WHERE
unknown_col IS NOT NULL);

and we don't try to do it for queries like this

SELECT * FROM a WHERE id NOT IN(SELECT not_null_column FROM b);

because we don't know the full provenance of "not_null_column" in all
cases and that info is (currently) unreliable.

Once we have the transform working for one case, we can try to extend
the cases covered.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#24)
Re: Allowing NOT IN to use ANTI joins

Simon Riggs <simon@2ndQuadrant.com> writes:

To be clearer, what I mean is we use only the direct proof approach,
for queries like this

SELECT * FROM a WHERE id NOT IN(SELECT unknown_col FROM b WHERE
unknown_col IS NOT NULL);

and we don't try to do it for queries like this

SELECT * FROM a WHERE id NOT IN(SELECT not_null_column FROM b);

because we don't know the full provenance of "not_null_column" in all
cases and that info is (currently) unreliable.

FWIW, I think that would largely cripple the usefulness of the patch.
If you can tell people to rewrite their queries, you might as well
tell them to rewrite into NOT EXISTS. The real-world queries that
we're trying to improve invariably look like the latter case not the
former, because people who are running into this problem usually
aren't even thinking about the possibility of NULLs.

I would actually say that if we only have the bandwidth to get one of
these cases done, it should be the second one not the first. It's
unclear to me that checking for the first case would even be worth
the planner cycles it'd cost.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#25)
Re: Allowing NOT IN to use ANTI joins

On Thu, Jun 26, 2014 at 3:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

To be clearer, what I mean is we use only the direct proof approach,
for queries like this

SELECT * FROM a WHERE id NOT IN(SELECT unknown_col FROM b WHERE
unknown_col IS NOT NULL);

and we don't try to do it for queries like this

SELECT * FROM a WHERE id NOT IN(SELECT not_null_column FROM b);

because we don't know the full provenance of "not_null_column" in all
cases and that info is (currently) unreliable.

FWIW, I think that would largely cripple the usefulness of the patch.
If you can tell people to rewrite their queries, you might as well
tell them to rewrite into NOT EXISTS. The real-world queries that
we're trying to improve invariably look like the latter case not the
former, because people who are running into this problem usually
aren't even thinking about the possibility of NULLs.

At first I didn't quite agree with this, but after a bit more thought I'm
coming around to it.

I had been thinking that, quite often the subquery in the NOT IN would have
a WHERE clause and not just be accessing all rows of the table, but then
it's probably very likely that when the subquery *does* contain a WHERE
clause that it's for some completely different column.. It seems a bit
weird to write NOT IN(SELECT somecol FROM table WHERE somecol = 5) it's
probably more likely to be something like NOT IN(SELECT somecol FROM table
WHERE category = 5), i.e some column that's not in the target list, and in
this case we wouldn't know that somecol couldn't be NULL.

If there's no way to tell that the target entry comes from a left join,
then would it be a bit too quirky to just do the NOT NULL checking when
list_length(query->rtable) == 1 ? or maybe even loop over query->rtable and
abort if we find an outer join type? it seems a bit hackish, but perhaps it
would please more people than it would surprise.

Regards

David Rowley

Show quoted text

I would actually say that if we only have the bandwidth to get one of
these cases done, it should be the second one not the first. It's
unclear to me that checking for the first case would even be worth
the planner cycles it'd cost.

regards, tom lane

#27Simon Riggs
simon@2ndQuadrant.com
In reply to: David Rowley (#26)
Re: Allowing NOT IN to use ANTI joins

On 26 June 2014 10:31, David Rowley <dgrowleyml@gmail.com> wrote:

If there's no way to tell that the target entry comes from a left join, then
would it be a bit too quirky to just do the NOT NULL checking when
list_length(query->rtable) == 1 ? or maybe even loop over query->rtable and
abort if we find an outer join type? it seems a bit hackish, but perhaps it
would please more people than it would surprise.

We don't know enough about joins at present, so we only allow it when
there are no joins (i.e. length == 1). That's just a statement of
reality, not a hack.

I would agree with Tom that the common usage is to do NOT IN against a
table with no where clause, so this would hit the most frequent use
case.

Maybe Tom will have a flash of insight before commit, or maybe we
figure out a way to extend it later.

Let's document it and move on.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#26)
Re: Allowing NOT IN to use ANTI joins

David Rowley <dgrowleyml@gmail.com> writes:

If there's no way to tell that the target entry comes from a left join,
then would it be a bit too quirky to just do the NOT NULL checking when
list_length(query->rtable) == 1 ? or maybe even loop over query->rtable and
abort if we find an outer join type? it seems a bit hackish, but perhaps it
would please more people than it would surprise.

Why do you think you can't tell if the column is coming from the wrong
side of a left join?

I don't recall right now if there is already machinery in the planner for
this, but we could certainly deconstruct the from-clause enough to tell
that.

It's not hard to imagine a little function that recursively descends the
join tree, not bothering to descend into the nullable sides of outer
joins, and returns "true" if it finds a RangeTableRef for the desired RTE.
If it doesn't find the RTE in the not-nullable parts of the FROM clause,
then presumably it's nullable.

The only thing wrong with that approach is if you have to call the
function many times, in which case discovering the information from
scratch each time could get expensive.

I believe deconstruct_jointree already keeps track of whether it's
underneath an outer join, so if we were doing this later than that,
I'd advocate making sure it saves the needed information. But I suppose
you're trying to do this before that. It might be that you could
easily save aside similar information during the earlier steps in
prepjointree.c. (Sorry for not having examined the patch yet, else
I'd probably have a more concrete suggestion.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#28)
1 attachment(s)
Re: Allowing NOT IN to use ANTI joins

On Fri, Jun 27, 2014 at 6:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

If there's no way to tell that the target entry comes from a left join,
then would it be a bit too quirky to just do the NOT NULL checking when
list_length(query->rtable) == 1 ? or maybe even loop over query->rtable

and

abort if we find an outer join type? it seems a bit hackish, but perhaps

it

would please more people than it would surprise.

Why do you think you can't tell if the column is coming from the wrong
side of a left join?

It was more of that I couldn't figure out how to do it, rather than

thinking it was not possible.

I don't recall right now if there is already machinery in the planner for
this, but we could certainly deconstruct the from-clause enough to tell
that.

It's not hard to imagine a little function that recursively descends the
join tree, not bothering to descend into the nullable sides of outer
joins, and returns "true" if it finds a RangeTableRef for the desired RTE.
If it doesn't find the RTE in the not-nullable parts of the FROM clause,
then presumably it's nullable.

Ok, I've implemented that in the attached. Thanks for the tip.

The only thing wrong with that approach is if you have to call the
function many times, in which case discovering the information from
scratch each time could get expensive.

I ended up just having the function gather up all RelIds that are on the on
the inner side. So this will just need to be called once per NOT IN clause.

I believe deconstruct_jointree already keeps track of whether it's
underneath an outer join, so if we were doing this later than that,
I'd advocate making sure it saves the needed information. But I suppose
you're trying to do this before that. It might be that you could
easily save aside similar information during the earlier steps in
prepjointree.c. (Sorry for not having examined the patch yet, else
I'd probably have a more concrete suggestion.)

This is being done from within pull_up_sublinks, so it's before
deconstruct_jointree.

I've attached an updated version of the patch which seems to now work
properly with outer joins.

I think I'm finally ready for a review again, so I'll update the commitfest
app.

Regards

David Rowley

Attachments:

not_in_anti_join_v0.6.patchapplication/octet-stream; name=not_in_anti_join_v0.6.patchDownload
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3e7dc85..6a6adb2 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -27,6 +27,7 @@
 #include "optimizer/prep.h"
 #include "optimizer/subselect.h"
 #include "optimizer/var.h"
+#include "parser/parse_clause.h"
 #include "parser/parse_relation.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/builtins.h"
@@ -1222,7 +1223,7 @@ SS_process_ctes(PlannerInfo *root)
  */
 JoinExpr *
 convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
-							Relids available_rels)
+							bool under_not, Relids available_rels)
 {
 	JoinExpr   *result;
 	Query	   *parse = root->parse;
@@ -1237,6 +1238,16 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	Assert(sublink->subLinkType == ANY_SUBLINK);
 
 	/*
+	 * The SQL standard's requirements for handling of NULL values in a
+	 * NOT IN() condition requires that if a NULL appears within the NOT IN
+	 * condition that the whole condition is UNKNOWN, therefore FALSE. Here,
+	 * if we can be sure that the NOT IN condition will never produce any NULL
+	 * values, then we can allow this to become an ANTI JOIN.
+	 */
+	if (under_not && queryTargetListCanHaveNulls(subselect))
+		return NULL;
+
+	/*
 	 * The sub-select must not refer to any Vars of the parent query. (Vars of
 	 * higher levels should be okay, though.)
 	 */
@@ -1302,7 +1313,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	 * And finally, build the JoinExpr node.
 	 */
 	result = makeNode(JoinExpr);
-	result->jointype = JOIN_SEMI;
+	result->jointype = under_not ? JOIN_ANTI : JOIN_SEMI;
 	result->isNatural = false;
 	result->larg = NULL;		/* caller must fill this in */
 	result->rarg = (Node *) rtr;
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 9cb1378..c1d7091 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -334,7 +334,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 		/* Is it a convertible ANY or EXISTS clause? */
 		if (sublink->subLinkType == ANY_SUBLINK)
 		{
-			if ((j = convert_ANY_sublink_to_join(root, sublink,
+			if ((j = convert_ANY_sublink_to_join(root, sublink, false,
 												 available_rels1)) != NULL)
 			{
 				/* Yes; insert the new join node into the join tree */
@@ -360,7 +360,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 				return NULL;
 			}
 			if (available_rels2 != NULL &&
-				(j = convert_ANY_sublink_to_join(root, sublink,
+				(j = convert_ANY_sublink_to_join(root, sublink, false,
 												 available_rels2)) != NULL)
 			{
 				/* Yes; insert the new join node into the join tree */
@@ -452,7 +452,61 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 
 		if (sublink && IsA(sublink, SubLink))
 		{
-			if (sublink->subLinkType == EXISTS_SUBLINK)
+			if (sublink->subLinkType == ANY_SUBLINK)
+			{
+				if ((j = convert_ANY_sublink_to_join(root, sublink, true,
+												   available_rels1)) != NULL)
+				{
+					/* Yes; insert the new join node into the join tree */
+					j->larg = *jtlink1;
+					*jtlink1 = (Node *) j;
+					/* Recursively process pulled-up jointree nodes */
+					j->rarg = pull_up_sublinks_jointree_recurse(root,
+																j->rarg,
+																&child_rels);
+
+					/*
+					 * Now recursively process the pulled-up quals.  Because
+					 * we are underneath a NOT, we can't pull up sublinks that
+					 * reference the left-hand stuff, but it's still okay to
+					 * pull up sublinks referencing j->rarg.
+					 */
+					j->quals = pull_up_sublinks_qual_recurse(root,
+															 j->quals,
+															 &j->rarg,
+															 child_rels,
+															 NULL, NULL);
+					/* Return NULL representing constant TRUE */
+					return NULL;
+				}
+				if (available_rels2 != NULL &&
+					(j = convert_ANY_sublink_to_join(root, sublink, true,
+												   available_rels2)) != NULL)
+				{
+					/* Yes; insert the new join node into the join tree */
+					j->larg = *jtlink2;
+					*jtlink2 = (Node *) j;
+					/* Recursively process pulled-up jointree nodes */
+					j->rarg = pull_up_sublinks_jointree_recurse(root,
+																j->rarg,
+																&child_rels);
+
+					/*
+					 * Now recursively process the pulled-up quals.  Because
+					 * we are underneath a NOT, we can't pull up sublinks that
+					 * reference the left-hand stuff, but it's still okay to
+					 * pull up sublinks referencing j->rarg.
+					 */
+					j->quals = pull_up_sublinks_qual_recurse(root,
+															 j->quals,
+															 &j->rarg,
+															 child_rels,
+															 NULL, NULL);
+					/* Return NULL representing constant TRUE */
+					return NULL;
+				}
+			}
+			else if (sublink->subLinkType == EXISTS_SUBLINK)
 			{
 				if ((j = convert_EXISTS_sublink_to_join(root, sublink, true,
 												   available_rels1)) != NULL)
@@ -506,6 +560,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 					return NULL;
 				}
 			}
+
 		}
 		/* Else return it unmodified */
 		return node;
diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c
index 4931dca..db50e62 100644
--- a/src/backend/parser/parse_clause.c
+++ b/src/backend/parser/parse_clause.c
@@ -21,6 +21,7 @@
 #include "commands/defrem.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
 #include "optimizer/tlist.h"
 #include "parser/analyze.h"
 #include "parser/parsetree.h"
@@ -78,6 +79,8 @@ static int get_matching_location(int sortgroupref,
 static List *addTargetToGroupList(ParseState *pstate, TargetEntry *tle,
 					 List *grouplist, List *targetlist, int location,
 					 bool resolveUnknown);
+static Relids find_inner_rels(Query *query);
+static void find_inner_rels_walker(Node *jtnode, Relids *innerrels);
 static WindowClause *findWindowClause(List *wclist, const char *name);
 static Node *transformFrameOffset(ParseState *pstate, int frameOptions,
 					 Node *clause);
@@ -2406,6 +2409,180 @@ assignSortGroupRef(TargetEntry *tle, List *tlist)
 }
 
 /*
+ * find_inner_rels
+ *		Returns all relids in the query that are INNER JOIN rels.
+ *		Note that this function should only be used if this information
+ *		is required before deconstruct_jointree has been called.
+ */
+static Relids
+find_inner_rels(Query *query)
+{
+	Relids innerrels = NULL;
+
+	find_inner_rels_walker((Node *) query->jointree, &innerrels);
+
+	return innerrels;
+}
+
+/*
+ * find_inner_rels_walker
+ *		Worker function for find_inner_rels
+ */
+static void
+find_inner_rels_walker(Node *jtnode, Relids *innerrels)
+{
+	if (jtnode == NULL)
+	{
+		*innerrels = NULL;
+		return;
+	}
+	if (IsA(jtnode, RangeTblRef))
+	{
+		int			varno = ((RangeTblRef *) jtnode)->rtindex;
+		*innerrels = bms_add_member(*innerrels, varno);
+	}
+	else if (IsA(jtnode, FromExpr))
+	{
+		FromExpr   *f = (FromExpr *) jtnode;
+		ListCell   *l;
+
+		foreach(l, f->fromlist)
+		{
+			find_inner_rels_walker((Node *)lfirst(l), innerrels);
+		}
+	}
+	else if (IsA(jtnode, JoinExpr))
+	{
+		JoinExpr   *j = (JoinExpr *) jtnode;
+
+		if (j->jointype == JOIN_INNER)
+		{
+			find_inner_rels_walker(j->larg, innerrels);
+			find_inner_rels_walker(j->rarg, innerrels);
+		}
+	}
+}
+
+/*
+ * queryTargetListListCanHaveNulls
+ *		True if the logic in the function was unable to prove without doubt
+ *		that NULL values could not exist in the result set.
+ *
+ * Note: resjunk targets are ignored.
+ */
+bool
+queryTargetListCanHaveNulls(Query *query)
+{
+	List	 *local_nonnullable_vars;
+	bool	  computed_nonnullable_vars = false;
+	ListCell *tl;
+	Node	 *node;
+	Relids	  innerrels;
+
+	/*
+	 * It should also be possible to determine if no NULLs can exist in the
+	 * results even when set operators are present in the query, but for now
+	 * we'll just report that NULLs are possible. It may be worth fixing this
+	 * up in the future, but at the time of writing this function, no call
+	 * sites existed which would call the function if the query contained set
+	 * operators.
+	 */
+	if (query->setOperations)
+		return true;
+
+	/*
+	 * In the following loop we loop over each TargetEntry in the targetList
+	 * of the query with the aim to determine if a NULL value is impossible for
+	 * each TargetEntry. When doing this we must err on the side of caution,
+	 * it's ok for us to return True even if no NULL values do actually appear
+	 * in the final result set. We use the following methods to determine if
+	 * NULLs cannot exist:
+	 *
+	 * 1. If the TargetEntry is a Const, we can instantly tell if it's NULL
+	 *    or not.
+	 *
+	 * 2. If the Var comes from a relation and that relation has an INNER JOIN
+	 *    join type, we can lookup pg_attribute.attnotnull.
+	 *
+	 * 3. When not-nullness could not be proved by point 2 we may still be able
+	 *    to find a qual in the WHERE clause of the query that allows us to
+	 *    determine that a NULL will never be seen in the result set.
+	 *    For example the presense of; col IS NOT NULL, or col = 42 would allow
+	 *    us to determine that NULLs would not be possible in the result set.
+	 */
+
+	/* any rel not in this list must have an outer join type */
+	innerrels = find_inner_rels(query);
+
+	foreach(tl, query->targetList)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tl);
+
+		/* ignore columns which won't be in the final results */
+		if (tle->resjunk)
+			continue;
+
+		node = (Node *) tle->expr;
+
+		/* Check point 1: If the Const is NULL then report NULLs are possible. */
+		if (IsA(node, Const))
+		{
+			if (((Const *) node)->constisnull)
+				return true;
+		}
+
+		else if (IsA(node, Var))
+		{
+			ListCell	  *lc;
+			bool		   matched;
+			Var			  *tlevar = (Var *) node;
+
+			/* check point 2 */
+			if (OidIsValid(tle->resorigtbl) &&
+				bms_is_member(tlevar->varno, innerrels) &&
+				get_attnotnull(tle->resorigtbl, tle->resorigcol))
+				continue; /* cannot be NULL */
+
+			/* check point 3 */
+			if (!computed_nonnullable_vars)
+			{
+				/*
+				 * Analyzing the WHERE clause for not-nullable Vars likely is
+				 * a more expensive check, for this reason we do this last and
+				 * only do it once on the first time it is required.
+				 */
+				local_nonnullable_vars = find_nonnullable_vars(query->jointree->quals);
+				computed_nonnullable_vars = true;
+			}
+
+			matched = false;
+			foreach(lc, local_nonnullable_vars)
+			{
+				Var *var = (Var *) lfirst(lc);
+
+				if (var->varno == tlevar->varno &&
+					var->varattno == tlevar->varattno &&
+					var->varlevelsup == 0)
+				{
+					matched = true;
+					break;
+				}
+			}
+
+			/*
+			 * if check point 3 failed then we've run out of ways to determine
+			 * the nullability of the target entry, so we must return True.
+			 */
+			if (!matched)
+				return true;
+		}
+		else
+			return true; /* not a Const or a Var */
+	}
+	return false; /* Cannot have NULLs */
+}
+
+/*
  * targetIsInSortList
  *		Is the given target item already in the sortlist?
  *		If sortop is not InvalidOid, also test for a match to the sortop.
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 4b5ef99..c98adc7 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -816,6 +816,37 @@ get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype, int16 procnum)
 /*				---------- ATTRIBUTE CACHES ----------					 */
 
 /*
+ * get_attnotnull
+ *		Returns true if pg_attribute.attnotnull is true, otherwise returns
+ *		false. An error is raised if no record is found for the relid/attnum.
+ *
+ * Note: Calling functions should be careful and test relid for InvalidOid
+ * before calling this function.
+ */
+bool
+get_attnotnull(Oid relid, AttrNumber attnum)
+{
+	HeapTuple	tp;
+
+	tp = SearchSysCache2(ATTNUM,
+						 ObjectIdGetDatum(relid),
+						 Int16GetDatum(attnum));
+	if (HeapTupleIsValid(tp))
+	{
+		Form_pg_attribute att_tup = (Form_pg_attribute) GETSTRUCT(tp);
+		bool result = att_tup->attnotnull;
+		ReleaseSysCache(tp);
+		return result;
+	}
+	else
+	{
+		elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+			attnum, relid);
+		return false; /* keep compiler quiet */
+	}
+}
+
+/*
  * get_attname
  *		Given the relation id and the attribute number,
  *		return the "attname" field from the attribute relation.
diff --git a/src/include/optimizer/subselect.h b/src/include/optimizer/subselect.h
index 5607e98..3e8bfe7 100644
--- a/src/include/optimizer/subselect.h
+++ b/src/include/optimizer/subselect.h
@@ -18,6 +18,7 @@
 extern void SS_process_ctes(PlannerInfo *root);
 extern JoinExpr *convert_ANY_sublink_to_join(PlannerInfo *root,
 							SubLink *sublink,
+							bool under_not,
 							Relids available_rels);
 extern JoinExpr *convert_EXISTS_sublink_to_join(PlannerInfo *root,
 							   SubLink *sublink,
diff --git a/src/include/parser/parse_clause.h b/src/include/parser/parse_clause.h
index e9e7cdc..1bad04e 100644
--- a/src/include/parser/parse_clause.h
+++ b/src/include/parser/parse_clause.h
@@ -47,5 +47,6 @@ extern List *addTargetToSortList(ParseState *pstate, TargetEntry *tle,
 					bool resolveUnknown);
 extern Index assignSortGroupRef(TargetEntry *tle, List *tlist);
 extern bool targetIsInSortList(TargetEntry *tle, Oid sortop, List *sortList);
+extern bool queryTargetListCanHaveNulls(Query *query);
 
 #endif   /* PARSE_CLAUSE_H */
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index f46460a..5e7d946 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -63,6 +63,7 @@ extern List *get_op_btree_interpretation(Oid opno);
 extern bool equality_ops_are_compatible(Oid opno1, Oid opno2);
 extern Oid get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype,
 				  int16 procnum);
+extern bool get_attnotnull(Oid relid, AttrNumber attnum);
 extern char *get_attname(Oid relid, AttrNumber attnum);
 extern char *get_relid_attribute_name(Oid relid, AttrNumber attnum);
 extern AttrNumber get_attnum(Oid relid, const char *attname);
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 0f070ef..35608ad 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -768,3 +768,207 @@ select nextval('ts1');
       11
 (1 row)
 
+--
+-- Check NOT IN performs ANTI JOIN when subquery columns are NOT NULL
+-- and does not when subquery columns can contain NULLs.
+--
+BEGIN;
+CREATE TEMP TABLE a (id INT PRIMARY KEY);
+CREATE TEMP TABLE b (x INT NOT NULL, y INT);
+CREATE TEMP TABLE c (z INT NOT NULL);
+-- ANTI JOIN. x is defined as NOT NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x FROM b);
+               QUERY PLAN                
+-----------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = b.x)
+   ->  Index Only Scan using a_pkey on a
+   ->  Sort
+         Sort Key: b.x
+         ->  Seq Scan on b
+(6 rows)
+
+-- No ANTI JOIN, y can be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- No ANTI JOIN, x is NOT NULL, but we don't know if + 1 will change that.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x+1 FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- ANTI JOIN 1 is a Const that is not null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT 1 FROM b);
+        QUERY PLAN         
+---------------------------
+ Nested Loop Anti Join
+   Join Filter: (a.id = 1)
+   ->  Seq Scan on a
+   ->  Materialize
+         ->  Seq Scan on b
+(5 rows)
+
+-- No ANTI JOIN, results contain a NULL Const
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT NULL::int FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- ANTI JOIN y = 1 means y can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1);
+          QUERY PLAN           
+-------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: (y = 1)
+(6 rows)
+
+-- No ANTI JOIN, OR condition does not ensure y = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1 OR x = 1);
+               QUERY PLAN               
+----------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+           Filter: ((y = 1) OR (x = 1))
+(5 rows)
+
+-- No ANTI JOIN, OR condition does not ensure y = 1 or y = 2
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND (y = 2 OR x = 2));
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+           Filter: (((y = 1) OR (x = 1)) AND ((y = 2) OR (x = 2)))
+(5 rows)
+
+-- ANTI JOIN y must be 2, so can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND y = 2);
+                        QUERY PLAN                        
+----------------------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: ((y = 2) AND ((y = 1) OR (x = 1)))
+(6 rows)
+
+-- ANTI JOIN y can be 1 or 2, but can't be null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR y = 2));
+                 QUERY PLAN                 
+--------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: ((y = 1) OR (y = 2))
+(6 rows)
+
+-- No ANTI JOIN c.z is from an outer join so it can have nulls.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Merge Left Join
+           Merge Cond: (b.x = c.z)
+           ->  Sort
+                 Sort Key: b.x
+                 ->  Seq Scan on b
+           ->  Sort
+                 Sort Key: c.z
+                 ->  Seq Scan on c
+(11 rows)
+
+-- ANTI JOIN, c.z is from an inner join and has a NOT NULL constraint.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b INNER JOIN c ON b.x = c.z);
+               QUERY PLAN                
+-----------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = c.z)
+   ->  Index Only Scan using a_pkey on a
+   ->  Materialize
+         ->  Merge Join
+               Merge Cond: (b.x = c.z)
+               ->  Sort
+                     Sort Key: b.x
+                     ->  Seq Scan on b
+               ->  Sort
+                     Sort Key: c.z
+                     ->  Seq Scan on c
+(12 rows)
+
+-- ANTI JOIN, c.z must be 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z = 1);
+                QUERY PLAN                 
+-------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = c.z)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Nested Loop
+               ->  Seq Scan on c
+                     Filter: (z = 1)
+               ->  Materialize
+                     ->  Seq Scan on b
+                           Filter: (x = 1)
+(10 rows)
+
+-- ANTI JOIN, c.z can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z IS NOT NULL);
+                    QUERY PLAN                     
+---------------------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = c.z)
+   ->  Index Only Scan using a_pkey on a
+   ->  Materialize
+         ->  Merge Join
+               Merge Cond: (b.x = c.z)
+               ->  Sort
+                     Sort Key: b.x
+                     ->  Seq Scan on b
+               ->  Sort
+                     Sort Key: c.z
+                     ->  Seq Scan on c
+                           Filter: (z IS NOT NULL)
+(13 rows)
+
+ROLLBACK;
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index b3fb03c..30ec0ab 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -435,3 +435,72 @@ select * from
   order by 1;
 
 select nextval('ts1');
+
+--
+-- Check NOT IN performs ANTI JOIN when subquery columns are NOT NULL
+-- and does not when subquery columns can contain NULLs.
+--
+
+BEGIN;
+
+CREATE TEMP TABLE a (id INT PRIMARY KEY);
+CREATE TEMP TABLE b (x INT NOT NULL, y INT);
+CREATE TEMP TABLE c (z INT NOT NULL);
+
+-- ANTI JOIN. x is defined as NOT NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x FROM b);
+
+-- No ANTI JOIN, y can be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b);
+
+-- No ANTI JOIN, x is NOT NULL, but we don't know if + 1 will change that.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x+1 FROM b);
+
+-- ANTI JOIN 1 is a Const that is not null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT 1 FROM b);
+
+-- No ANTI JOIN, results contain a NULL Const
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT NULL::int FROM b);
+
+-- ANTI JOIN y = 1 means y can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1);
+
+-- No ANTI JOIN, OR condition does not ensure y = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1 OR x = 1);
+
+-- No ANTI JOIN, OR condition does not ensure y = 1 or y = 2
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND (y = 2 OR x = 2));
+
+-- ANTI JOIN y must be 2, so can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND y = 2);
+
+-- ANTI JOIN y can be 1 or 2, but can't be null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR y = 2));
+
+-- No ANTI JOIN c.z is from an outer join so it can have nulls.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z);
+
+-- ANTI JOIN, c.z is from an inner join and has a NOT NULL constraint.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b INNER JOIN c ON b.x = c.z);
+
+-- ANTI JOIN, c.z must be 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z = 1);
+
+-- ANTI JOIN, c.z can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z IS NOT NULL);
+
+ROLLBACK;
#30Jeevan Chalke
jeevan.chalke@enterprisedb.com
In reply to: David Rowley (#29)
Re: Allowing NOT IN to use ANTI joins

On Sun, Jun 29, 2014 at 4:18 PM, David Rowley <dgrowleyml@gmail.com> wrote:

I think I'm finally ready for a review again, so I'll update the
commitfest app.

I have reviewed this on code level.

1. Patch gets applied cleanly.
2. make/make install/make check all are fine

No issues found till now.

However some cosmetic points:

1.
* The API of this function is identical to convert_ANY_sublink_to_join's,
* except that we also support the case where the caller has found NOT
EXISTS,
* so we need an additional input parameter "under_not".

Since now, we do support NOT IN handling in convert_ANY_sublink_to_join() we
do have "under_not" parameter there too. So above comments near
convert_EXISTS_sublink_to_join() function is NO more valid.

2. Following is the unnecessary change. Can be removed:

@@ -506,6 +560,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node
*node,
return NULL;
}
}
+
}
/* Else return it unmodified */
return node;

3. Typos:

a.
+ * queryTargetListListCanHaveNulls
...
+queryTargetListCanHaveNulls(Query *query)

Function name is queryTargetListCanHaveNulls() not
queryTargetListListCanHaveNulls()

b.
* For example the presense of; col IS NOT NULL, or col = 42 would
allow

presense => presence

4. get_attnotnull() throws an error for invalid relid/attnum.
But I see other get_att* functions which do not throw an error rather
returning some invalid value, eg. NULL in case of attname, InvalidOid in
case of atttype and InvalidAttrNumber in case of attnum. I have observed
that we cannot return such invalid value for type boolean and I guess that's
the reason you are throwing an error. But somehow looks unusual. You had put
a note there which is good. But yes, caller of this function should be
careful enough and should not assume its working like other get_att*()
functions.
However for this patch, I would simply accept this solution. But I wonder if
someone thinks other way round.

Testing more on SQL level.

However, assigning it to author to think on above cosmetic issues.

Thanks

--
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

#31David Rowley
dgrowleyml@gmail.com
In reply to: Jeevan Chalke (#30)
Re: Allowing NOT IN to use ANTI joins

On Wed, Jul 2, 2014 at 9:25 PM, Jeevan Chalke <
jeevan.chalke@enterprisedb.com> wrote:

Testing more on SQL level.

I'm just looking into an issue I've found in the find_inner_rels()
function, where it does not properly find the rel in the from list in
certain cases, for example:

explain select * from a where id not in (select b.id from b left outer join
c on b.id=c.id);

fails to use an ANTI JOIN, but if you remove the left join to c, it works
perfectly.

Currently I'm just getting my head around how the jointree is structured
and reading over deconstruct_jointree to see how it handles this. I may
change the function to find_outer_rels and just look for outer joins in the
function.

However, assigning it to author to think on above cosmetic issues.

Thanks for the review. I'll fix the issues you listed soon, but I'll likely
delay posting the updated patch until I have the other fix in place.

Regards

David Rowley

Thanks

Show quoted text

--
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

#32David Rowley
dgrowleyml@gmail.com
In reply to: Jeevan Chalke (#30)
1 attachment(s)
Re: Allowing NOT IN to use ANTI joins

On Wed, Jul 2, 2014 at 9:25 PM, Jeevan Chalke <
jeevan.chalke@enterprisedb.com> wrote:

On Sun, Jun 29, 2014 at 4:18 PM, David Rowley <dgrowleyml@gmail.com>
wrote:

I think I'm finally ready for a review again, so I'll update the
commitfest app.

I have reviewed this on code level.

1. Patch gets applied cleanly.
2. make/make install/make check all are fine

No issues found till now.

However some cosmetic points:

1.
* The API of this function is identical to convert_ANY_sublink_to_join's,
* except that we also support the case where the caller has found NOT
EXISTS,
* so we need an additional input parameter "under_not".

Since now, we do support NOT IN handling in convert_ANY_sublink_to_join()
we
do have "under_not" parameter there too. So above comments near
convert_EXISTS_sublink_to_join() function is NO more valid.

Nice catch. I've removed the last 2 lines from that comment now.

2. Following is the unnecessary change. Can be removed:

@@ -506,6 +560,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node
*node,
return NULL;
}
}
+
}
/* Else return it unmodified */
return node;

Removed

3. Typos:

a.
+ * queryTargetListListCanHaveNulls
...
+queryTargetListCanHaveNulls(Query *query)

Function name is queryTargetListCanHaveNulls() not
queryTargetListListCanHaveNulls()

Fixed.

b.
* For example the presense of; col IS NOT NULL, or col = 42 would
allow

presense => presence

Fixed.

4. get_attnotnull() throws an error for invalid relid/attnum.
But I see other get_att* functions which do not throw an error rather
returning some invalid value, eg. NULL in case of attname, InvalidOid in
case of atttype and InvalidAttrNumber in case of attnum. I have observed
that we cannot return such invalid value for type boolean and I guess
that's
the reason you are throwing an error. But somehow looks unusual. You had
put
a note there which is good. But yes, caller of this function should be
careful enough and should not assume its working like other get_att*()
functions.
However for this patch, I would simply accept this solution. But I wonder
if
someone thinks other way round.

hmmm, I remember thinking about that at the time. It was a choice between
returning false or throwing an error. I decided that it probably should be
an error, as that's what get_relid_attribute_name() was doing. Just search
lsyscache.c for "cache lookup failed for attribute %d of relation %u".

Testing more on SQL level.

Thank you for reviewing this. I think in the attached I've managed to nail
down the logic in find_inner_rels(). It was actually more simple than I
thought. On working on this today I also noticed that RIGHT joins can still
exist at this stage of planning and also that full joins are not
transformed. e.g: a FULL JOIN b ON a.id=b.id WHERE a.is IS NOT NULL would
later become a LEFT JOIN, but at this stage in planning, it's still a FULL
JOIN. I've added some new regression tests which test some of these join
types.

With further testing I noticed that the patch was not allowing ANTI joins
in cases like this:

explain select * from a where id not in(select x from b natural join c);

even if b.x and b.c were NOT NULL columns. This is because the TargetEntry
for x has a varno which belongs to neither b or c, it's actually the varno
for the join itself. I've added some code to detect this in
find_inner_rels(), but I'm not quite convinced yet that it's in the correct
place... I'm wondering if a future use for find_inner_rels() would need to
only see relations rather than join varnos. The code I added does get the
above query using ANTI joins, but it does have a bit of a weird quirk, in
that for it to perform an ANTI JOIN, b.x must be NOT NULL. If c.x is NOT
NULL and b.x is nullable, then there will be no ANTI JOIN. There must be
some code somewhere that chooses which of the 2 vars that should go into
the target list in for naturals joins.

The above got me thinking that the join conditions can also be used to
prove not nullness too. Take the query above as an example, x can never
actually be NULL, the condition b.x = c.x ensures that. I've only gone as
far as adding some comments which explain that this is a possible future
optimisation. I've not had time to look at this yet and I'd rather see the
patch go in without it than risk delaying this until the next commitfest...
Unless of course someone thinks it's just too weird a quirk to have in the
code.

Attached is the updated version of the patch.

Regards

David Rowley

However, assigning it to author to think on above cosmetic issues.

Show quoted text

Thanks

--
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachments:

not_in_anti_join_5257082_2014-07-05.patchapplication/octet-stream; name=not_in_anti_join_5257082_2014-07-05.patchDownload
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3e7dc85..7aaca33 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -27,6 +27,7 @@
 #include "optimizer/prep.h"
 #include "optimizer/subselect.h"
 #include "optimizer/var.h"
+#include "parser/parse_clause.h"
 #include "parser/parse_relation.h"
 #include "rewrite/rewriteManip.h"
 #include "utils/builtins.h"
@@ -1222,7 +1223,7 @@ SS_process_ctes(PlannerInfo *root)
  */
 JoinExpr *
 convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
-							Relids available_rels)
+							bool under_not, Relids available_rels)
 {
 	JoinExpr   *result;
 	Query	   *parse = root->parse;
@@ -1237,6 +1238,16 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	Assert(sublink->subLinkType == ANY_SUBLINK);
 
 	/*
+	 * The SQL standard's requirements for handling of NULL values in a
+	 * NOT IN() condition requires that if a NULL appears within the NOT IN
+	 * condition that the whole condition is UNKNOWN, therefore FALSE. Here,
+	 * if we can be sure that the NOT IN condition will never produce any NULL
+	 * values, then we can allow this to become an ANTI JOIN.
+	 */
+	if (under_not && queryTargetListCanHaveNulls(subselect))
+		return NULL;
+
+	/*
 	 * The sub-select must not refer to any Vars of the parent query. (Vars of
 	 * higher levels should be okay, though.)
 	 */
@@ -1302,7 +1313,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	 * And finally, build the JoinExpr node.
 	 */
 	result = makeNode(JoinExpr);
-	result->jointype = JOIN_SEMI;
+	result->jointype = under_not ? JOIN_ANTI : JOIN_SEMI;
 	result->isNatural = false;
 	result->larg = NULL;		/* caller must fill this in */
 	result->rarg = (Node *) rtr;
@@ -1317,9 +1328,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 /*
  * convert_EXISTS_sublink_to_join: try to convert an EXISTS SubLink to a join
  *
- * The API of this function is identical to convert_ANY_sublink_to_join's,
- * except that we also support the case where the caller has found NOT EXISTS,
- * so we need an additional input parameter "under_not".
+ * The API of this function is identical to convert_ANY_sublink_to_join's.
  */
 JoinExpr *
 convert_EXISTS_sublink_to_join(PlannerInfo *root, SubLink *sublink,
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 9cb1378..055d741 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -334,7 +334,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 		/* Is it a convertible ANY or EXISTS clause? */
 		if (sublink->subLinkType == ANY_SUBLINK)
 		{
-			if ((j = convert_ANY_sublink_to_join(root, sublink,
+			if ((j = convert_ANY_sublink_to_join(root, sublink, false,
 												 available_rels1)) != NULL)
 			{
 				/* Yes; insert the new join node into the join tree */
@@ -360,7 +360,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 				return NULL;
 			}
 			if (available_rels2 != NULL &&
-				(j = convert_ANY_sublink_to_join(root, sublink,
+				(j = convert_ANY_sublink_to_join(root, sublink, false,
 												 available_rels2)) != NULL)
 			{
 				/* Yes; insert the new join node into the join tree */
@@ -452,7 +452,61 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 
 		if (sublink && IsA(sublink, SubLink))
 		{
-			if (sublink->subLinkType == EXISTS_SUBLINK)
+			if (sublink->subLinkType == ANY_SUBLINK)
+			{
+				if ((j = convert_ANY_sublink_to_join(root, sublink, true,
+												   available_rels1)) != NULL)
+				{
+					/* Yes; insert the new join node into the join tree */
+					j->larg = *jtlink1;
+					*jtlink1 = (Node *) j;
+					/* Recursively process pulled-up jointree nodes */
+					j->rarg = pull_up_sublinks_jointree_recurse(root,
+																j->rarg,
+																&child_rels);
+
+					/*
+					 * Now recursively process the pulled-up quals.  Because
+					 * we are underneath a NOT, we can't pull up sublinks that
+					 * reference the left-hand stuff, but it's still okay to
+					 * pull up sublinks referencing j->rarg.
+					 */
+					j->quals = pull_up_sublinks_qual_recurse(root,
+															 j->quals,
+															 &j->rarg,
+															 child_rels,
+															 NULL, NULL);
+					/* Return NULL representing constant TRUE */
+					return NULL;
+				}
+				if (available_rels2 != NULL &&
+					(j = convert_ANY_sublink_to_join(root, sublink, true,
+												   available_rels2)) != NULL)
+				{
+					/* Yes; insert the new join node into the join tree */
+					j->larg = *jtlink2;
+					*jtlink2 = (Node *) j;
+					/* Recursively process pulled-up jointree nodes */
+					j->rarg = pull_up_sublinks_jointree_recurse(root,
+																j->rarg,
+																&child_rels);
+
+					/*
+					 * Now recursively process the pulled-up quals.  Because
+					 * we are underneath a NOT, we can't pull up sublinks that
+					 * reference the left-hand stuff, but it's still okay to
+					 * pull up sublinks referencing j->rarg.
+					 */
+					j->quals = pull_up_sublinks_qual_recurse(root,
+															 j->quals,
+															 &j->rarg,
+															 child_rels,
+															 NULL, NULL);
+					/* Return NULL representing constant TRUE */
+					return NULL;
+				}
+			}
+			else if (sublink->subLinkType == EXISTS_SUBLINK)
 			{
 				if ((j = convert_EXISTS_sublink_to_join(root, sublink, true,
 												   available_rels1)) != NULL)
diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c
index 4931dca..a2ab6bf 100644
--- a/src/backend/parser/parse_clause.c
+++ b/src/backend/parser/parse_clause.c
@@ -21,6 +21,7 @@
 #include "commands/defrem.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "optimizer/clauses.h"
 #include "optimizer/tlist.h"
 #include "parser/analyze.h"
 #include "parser/parsetree.h"
@@ -78,6 +79,8 @@ static int get_matching_location(int sortgroupref,
 static List *addTargetToGroupList(ParseState *pstate, TargetEntry *tle,
 					 List *grouplist, List *targetlist, int location,
 					 bool resolveUnknown);
+static Relids find_inner_rels(Query *query);
+static void find_inner_rels_walker(Node *jtnode, Relids *innerrels);
 static WindowClause *findWindowClause(List *wclist, const char *name);
 static Node *transformFrameOffset(ParseState *pstate, int frameOptions,
 					 Node *clause);
@@ -2406,6 +2409,231 @@ assignSortGroupRef(TargetEntry *tle, List *tlist)
 }
 
 /*
+ * find_inner_rels
+ *		Recursively traverse the jointree to seek out all relations which are
+ *		INNER JOIN relations.
+ *
+ *		Note: This function should only be used if this information
+ *		is required before deconstruct_jointree has been called.
+ */
+static Relids
+find_inner_rels(Query *query)
+{
+	Relids innerrels = NULL;
+
+	find_inner_rels_walker((Node *) query->jointree, &innerrels);
+
+	return innerrels;
+}
+
+/*
+ * find_inner_rels_walker
+ *		Worker function for find_inner_rels
+ */
+static void
+find_inner_rels_walker(Node *jtnode, Relids *inner_join_rels)
+{
+	if (jtnode == NULL)
+	{
+		*inner_join_rels = NULL;
+		return;
+	}
+	if (IsA(jtnode, RangeTblRef))
+	{
+		int varno = ((RangeTblRef *) jtnode)->rtindex;
+		*inner_join_rels = bms_add_member(*inner_join_rels, varno);
+	}
+	else if (IsA(jtnode, FromExpr))
+	{
+		FromExpr   *f = (FromExpr *) jtnode;
+		ListCell   *l;
+
+		foreach(l, f->fromlist)
+			find_inner_rels_walker((Node *)lfirst(l), inner_join_rels);
+	}
+	else if (IsA(jtnode, JoinExpr))
+	{
+		JoinExpr   *j = (JoinExpr *) jtnode;
+
+		/*
+		 * As we're only interested in finding inner relations we need not
+		 * recursively traverse down outer branches of the jointree.
+		 * For INNER join types we must look for inner relations on both sides
+		 * of the join. For LEFT, ANTI and SEMI joins we need only look on the
+		 * left hand side of the join tree. Right joins still can exist at this
+		 * early stage of planning so here we naturally just need to find inner
+		 * rels from the right side of the join. For FULL OUTER join types both
+		 * sides are outer, so we need not look any further for these.
+		 */
+
+		switch (j->jointype)
+		{
+			case JOIN_INNER:
+				find_inner_rels_walker(j->larg, inner_join_rels);
+				find_inner_rels_walker(j->rarg, inner_join_rels);
+
+				/*
+				 * NATURAL joins are a bit of a special case as TargetEntrys
+				 * will have the varno set to the rtindex of the JoinExpr from
+				 * which they are from. The NATURAL join type is actually an
+				 * inner join so we can just add the rtindex of this join to
+				 * the known inner rels.
+				 */
+				if (j->isNatural)
+					*inner_join_rels = bms_add_member(*inner_join_rels, j->rtindex);
+				break;
+
+			case JOIN_LEFT:
+			case JOIN_ANTI:
+			case JOIN_SEMI:
+				find_inner_rels_walker(j->larg, inner_join_rels);
+				break;
+
+			case JOIN_RIGHT:
+				find_inner_rels_walker(j->rarg, inner_join_rels);
+				break;
+
+			case JOIN_FULL:
+				break;
+
+			default:
+				elog(ERROR, "unrecognized join type: %d",
+					(int) j->jointype);
+		}
+	}
+	else
+	{
+		elog(ERROR, "unrecognized node type: %d",
+			(int) nodeTag(jtnode));
+	}
+}
+
+/*
+ * queryTargetListCanHaveNulls
+ *		True if the logic in the function was unable to prove without doubt
+ *		that NULL values could not exist in the result set.
+ *
+ * Note: resjunk targets are ignored.
+ */
+bool
+queryTargetListCanHaveNulls(Query *query)
+{
+	List	 *local_nonnullable_vars;
+	bool	  computed_nonnullable_vars = false;
+	ListCell *tl;
+	Node	 *node;
+	Relids	  innerrels;
+
+	/*
+	 * It should also be possible to determine if no NULLs can exist in the
+	 * results even when set operators are present in the query, but for now
+	 * we'll just report that NULLs are possible. It may be worth fixing this
+	 * up in the future, but at the time of writing this function, no call
+	 * sites existed which would call the function if the query contained set
+	 * operators.
+	 */
+	if (query->setOperations)
+		return true;
+
+	/*
+	 * In the following loop we loop over each TargetEntry in the targetList
+	 * of the query with the aim to determine if a NULL value is impossible for
+	 * each TargetEntry. When doing this we must err on the side of caution,
+	 * it's ok for us to return True even if no NULL values do actually appear
+	 * in the final result set. We use the following methods to determine if
+	 * NULLs cannot exist:
+	 *
+	 * 1. If the TargetEntry is a Const, we can instantly tell if it's NULL
+	 *    or not.
+	 *
+	 * 2. If the Var comes from a relation and that relation has an INNER JOIN
+	 *    join type, we can lookup pg_attribute.attnotnull.
+	 *
+	 * 3. When not-nullness could not be proved by point 2 we may still be able
+	 *    to find a qual in the WHERE clause of the query that allows us to
+	 *    determine that a NULL will never be seen in the result set.
+	 *    For example the presence of; col IS NOT NULL, or col = 42 would allow
+	 *    us to determine that NULLs would not be possible in the result set.
+	 *
+	 * Another possible way that we could prove not nullness would be to look
+	 * at join conditions. A query such as:
+	 * "select a.x from a inner join b on a.x = b.x" could never produce any
+	 * NULL valued a.x's due to the join condition filtering out NULLs. It may
+	 * be worth looking into doing this in the future.
+	 */
+
+	/* any rel not in this list must have an outer join type */
+	innerrels = find_inner_rels(query);
+
+	foreach(tl, query->targetList)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tl);
+
+		/* ignore columns which won't be in the final results */
+		if (tle->resjunk)
+			continue;
+
+		node = (Node *) tle->expr;
+
+		/* Check point 1: If the Const is NULL then report NULLs are possible. */
+		if (IsA(node, Const))
+		{
+			if (((Const *) node)->constisnull)
+				return true;
+		}
+
+		else if (IsA(node, Var))
+		{
+			ListCell	  *lc;
+			bool		   matched;
+			Var			  *tlevar = (Var *) node;
+
+			/* check point 2 */
+			if (OidIsValid(tle->resorigtbl) &&
+				bms_is_member(tlevar->varno, innerrels) &&
+				get_attnotnull(tle->resorigtbl, tle->resorigcol))
+				continue; /* cannot be NULL */
+
+			/* check point 3 */
+			if (!computed_nonnullable_vars)
+			{
+				/*
+				 * Analyzing the WHERE clause for not-nullable Vars likely is
+				 * a more expensive check, for this reason we do this last and
+				 * only do it once on the first time it is required.
+				 */
+				local_nonnullable_vars = find_nonnullable_vars(query->jointree->quals);
+				computed_nonnullable_vars = true;
+			}
+
+			matched = false;
+			foreach(lc, local_nonnullable_vars)
+			{
+				Var *var = (Var *) lfirst(lc);
+
+				if (var->varno == tlevar->varno &&
+					var->varattno == tlevar->varattno &&
+					var->varlevelsup == 0)
+				{
+					matched = true;
+					break;
+				}
+			}
+
+			/*
+			 * if check point 3 failed then we've run out of ways to determine
+			 * the nullability of the target entry, so we must return True.
+			 */
+			if (!matched)
+				return true;
+		}
+		else
+			return true; /* not a Const or a Var */
+	}
+	return false; /* Cannot have NULLs */
+}
+
+/*
  * targetIsInSortList
  *		Is the given target item already in the sortlist?
  *		If sortop is not InvalidOid, also test for a match to the sortop.
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 4b5ef99..c98adc7 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -816,6 +816,37 @@ get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype, int16 procnum)
 /*				---------- ATTRIBUTE CACHES ----------					 */
 
 /*
+ * get_attnotnull
+ *		Returns true if pg_attribute.attnotnull is true, otherwise returns
+ *		false. An error is raised if no record is found for the relid/attnum.
+ *
+ * Note: Calling functions should be careful and test relid for InvalidOid
+ * before calling this function.
+ */
+bool
+get_attnotnull(Oid relid, AttrNumber attnum)
+{
+	HeapTuple	tp;
+
+	tp = SearchSysCache2(ATTNUM,
+						 ObjectIdGetDatum(relid),
+						 Int16GetDatum(attnum));
+	if (HeapTupleIsValid(tp))
+	{
+		Form_pg_attribute att_tup = (Form_pg_attribute) GETSTRUCT(tp);
+		bool result = att_tup->attnotnull;
+		ReleaseSysCache(tp);
+		return result;
+	}
+	else
+	{
+		elog(ERROR, "cache lookup failed for attribute %d of relation %u",
+			attnum, relid);
+		return false; /* keep compiler quiet */
+	}
+}
+
+/*
  * get_attname
  *		Given the relation id and the attribute number,
  *		return the "attname" field from the attribute relation.
diff --git a/src/include/optimizer/subselect.h b/src/include/optimizer/subselect.h
index 5607e98..3e8bfe7 100644
--- a/src/include/optimizer/subselect.h
+++ b/src/include/optimizer/subselect.h
@@ -18,6 +18,7 @@
 extern void SS_process_ctes(PlannerInfo *root);
 extern JoinExpr *convert_ANY_sublink_to_join(PlannerInfo *root,
 							SubLink *sublink,
+							bool under_not,
 							Relids available_rels);
 extern JoinExpr *convert_EXISTS_sublink_to_join(PlannerInfo *root,
 							   SubLink *sublink,
diff --git a/src/include/parser/parse_clause.h b/src/include/parser/parse_clause.h
index e9e7cdc..1bad04e 100644
--- a/src/include/parser/parse_clause.h
+++ b/src/include/parser/parse_clause.h
@@ -47,5 +47,6 @@ extern List *addTargetToSortList(ParseState *pstate, TargetEntry *tle,
 					bool resolveUnknown);
 extern Index assignSortGroupRef(TargetEntry *tle, List *tlist);
 extern bool targetIsInSortList(TargetEntry *tle, Oid sortop, List *sortList);
+extern bool queryTargetListCanHaveNulls(Query *query);
 
 #endif   /* PARSE_CLAUSE_H */
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index f46460a..5e7d946 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -63,6 +63,7 @@ extern List *get_op_btree_interpretation(Oid opno);
 extern bool equality_ops_are_compatible(Oid opno1, Oid opno2);
 extern Oid get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype,
 				  int16 procnum);
+extern bool get_attnotnull(Oid relid, AttrNumber attnum);
 extern char *get_attname(Oid relid, AttrNumber attnum);
 extern char *get_relid_attribute_name(Oid relid, AttrNumber attnum);
 extern AttrNumber get_attnum(Oid relid, const char *attname);
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 0f070ef..6699c84 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -768,3 +768,301 @@ select nextval('ts1');
       11
 (1 row)
 
+--
+-- Check NOT IN performs an ANTI JOIN when NULLs are not possible
+-- in the target list of the subquery.
+--
+BEGIN;
+CREATE TEMP TABLE a (id INT PRIMARY KEY);
+CREATE TEMP TABLE b (x INT NOT NULL, y INT);
+CREATE TEMP TABLE c (z INT NOT NULL);
+-- ANTI JOIN. x is defined as NOT NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x FROM b);
+               QUERY PLAN                
+-----------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = b.x)
+   ->  Index Only Scan using a_pkey on a
+   ->  Sort
+         Sort Key: b.x
+         ->  Seq Scan on b
+(6 rows)
+
+-- No ANTI JOIN, y can be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- No ANTI JOIN, x is NOT NULL, but we don't know if + 1 will change that.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x+1 FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- ANTI JOIN 1 is a Const that is not null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT 1 FROM b);
+        QUERY PLAN         
+---------------------------
+ Nested Loop Anti Join
+   Join Filter: (a.id = 1)
+   ->  Seq Scan on a
+   ->  Materialize
+         ->  Seq Scan on b
+(5 rows)
+
+-- No ANTI JOIN, results contain a NULL Const
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT NULL::int FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- ANTI JOIN y = 1 means y can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1);
+          QUERY PLAN           
+-------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: (y = 1)
+(6 rows)
+
+-- No ANTI JOIN, OR condition does not ensure y = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1 OR x = 1);
+               QUERY PLAN               
+----------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+           Filter: ((y = 1) OR (x = 1))
+(5 rows)
+
+-- No ANTI JOIN, OR condition does not ensure y = 1 or y = 2
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND (y = 2 OR x = 2));
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+           Filter: (((y = 1) OR (x = 1)) AND ((y = 2) OR (x = 2)))
+(5 rows)
+
+-- ANTI JOIN y must be 2, so can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND y = 2);
+                        QUERY PLAN                        
+----------------------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: ((y = 2) AND ((y = 1) OR (x = 1)))
+(6 rows)
+
+-- ANTI JOIN y can be 1 or 2, but can't be null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR y = 2));
+                 QUERY PLAN                 
+--------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: ((y = 1) OR (y = 2))
+(6 rows)
+
+-- No ANTI JOIN c.z is from a left outer join so it can have nulls.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Merge Left Join
+           Merge Cond: (b.x = c.z)
+           ->  Sort
+                 Sort Key: b.x
+                 ->  Seq Scan on b
+           ->  Sort
+                 Sort Key: c.z
+                 ->  Seq Scan on c
+(11 rows)
+
+-- ANTI JOIN, c.z is not from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b RIGHT JOIN c ON b.x = c.z);
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Merge Left Join
+   Merge Cond: (c.z = b.x)
+   ->  Sort
+         Sort Key: c.z
+         ->  Merge Anti Join
+               Merge Cond: (a.id = c.z)
+               ->  Index Only Scan using a_pkey on a
+               ->  Sort
+                     Sort Key: c.z
+                     ->  Seq Scan on c
+   ->  Sort
+         Sort Key: b.x
+         ->  Seq Scan on b
+(13 rows)
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b RIGHT JOIN c ON b.x = c.z);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Merge Right Join
+           Merge Cond: (b.x = c.z)
+           ->  Sort
+                 Sort Key: b.x
+                 ->  Seq Scan on b
+           ->  Sort
+                 Sort Key: c.z
+                 ->  Seq Scan on c
+(11 rows)
+
+-- No ANTI JOIN, c.z is not from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b FULL JOIN c ON b.x = c.z);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Merge Full Join
+           Merge Cond: (b.x = c.z)
+           ->  Sort
+                 Sort Key: b.x
+                 ->  Seq Scan on b
+           ->  Sort
+                 Sort Key: c.z
+                 ->  Seq Scan on c
+(11 rows)
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b FULL JOIN c ON b.x = c.z);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Merge Full Join
+           Merge Cond: (b.x = c.z)
+           ->  Sort
+                 Sort Key: b.x
+                 ->  Seq Scan on b
+           ->  Sort
+                 Sort Key: c.z
+                 ->  Seq Scan on c
+(11 rows)
+
+-- ANTI JOIN, c.z is from an inner join and has a NOT NULL constraint.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b INNER JOIN c ON b.x = c.z);
+               QUERY PLAN                
+-----------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = c.z)
+   ->  Index Only Scan using a_pkey on a
+   ->  Materialize
+         ->  Merge Join
+               Merge Cond: (b.x = c.z)
+               ->  Sort
+                     Sort Key: b.x
+                     ->  Seq Scan on b
+               ->  Sort
+                     Sort Key: c.z
+                     ->  Seq Scan on c
+(12 rows)
+
+-- ANTI JOIN, c.z must be 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z = 1);
+                QUERY PLAN                 
+-------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = c.z)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Nested Loop
+               ->  Seq Scan on c
+                     Filter: (z = 1)
+               ->  Materialize
+                     ->  Seq Scan on b
+                           Filter: (x = 1)
+(10 rows)
+
+-- ANTI JOIN, c.z can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z IS NOT NULL);
+                    QUERY PLAN                     
+---------------------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = c.z)
+   ->  Index Only Scan using a_pkey on a
+   ->  Materialize
+         ->  Merge Join
+               Merge Cond: (b.x = c.z)
+               ->  Sort
+                     Sort Key: b.x
+                     ->  Seq Scan on b
+               ->  Sort
+                     Sort Key: c.z
+                     ->  Seq Scan on c
+                           Filter: (z IS NOT NULL)
+(13 rows)
+
+ALTER TABLE c ADD COLUMN x INT;
+-- ANTI JOIN, x cannot be NULL as b.x has a NOT NULL constraint
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x FROM b NATURAL JOIN c);
+               QUERY PLAN                
+-----------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = b.x)
+   ->  Index Only Scan using a_pkey on a
+   ->  Materialize
+         ->  Merge Join
+               Merge Cond: (b.x = c.x)
+               ->  Sort
+                     Sort Key: b.x
+                     ->  Seq Scan on b
+               ->  Sort
+                     Sort Key: c.x
+                     ->  Seq Scan on c
+(12 rows)
+
+ROLLBACK;
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index b3fb03c..be748db 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -435,3 +435,95 @@ select * from
   order by 1;
 
 select nextval('ts1');
+
+--
+-- Check NOT IN performs an ANTI JOIN when NULLs are not possible
+-- in the target list of the subquery.
+--
+
+BEGIN;
+
+CREATE TEMP TABLE a (id INT PRIMARY KEY);
+CREATE TEMP TABLE b (x INT NOT NULL, y INT);
+CREATE TEMP TABLE c (z INT NOT NULL);
+
+-- ANTI JOIN. x is defined as NOT NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x FROM b);
+
+-- No ANTI JOIN, y can be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b);
+
+-- No ANTI JOIN, x is NOT NULL, but we don't know if + 1 will change that.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x+1 FROM b);
+
+-- ANTI JOIN 1 is a Const that is not null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT 1 FROM b);
+
+-- No ANTI JOIN, results contain a NULL Const
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT NULL::int FROM b);
+
+-- ANTI JOIN y = 1 means y can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1);
+
+-- No ANTI JOIN, OR condition does not ensure y = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1 OR x = 1);
+
+-- No ANTI JOIN, OR condition does not ensure y = 1 or y = 2
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND (y = 2 OR x = 2));
+
+-- ANTI JOIN y must be 2, so can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND y = 2);
+
+-- ANTI JOIN y can be 1 or 2, but can't be null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR y = 2));
+
+-- No ANTI JOIN c.z is from a left outer join so it can have nulls.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z);
+
+-- ANTI JOIN, c.z is not from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b RIGHT JOIN c ON b.x = c.z);
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b RIGHT JOIN c ON b.x = c.z);
+
+-- No ANTI JOIN, c.z is not from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b FULL JOIN c ON b.x = c.z);
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b FULL JOIN c ON b.x = c.z);
+
+-- ANTI JOIN, c.z is from an inner join and has a NOT NULL constraint.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b INNER JOIN c ON b.x = c.z);
+
+-- ANTI JOIN, c.z must be 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z = 1);
+
+-- ANTI JOIN, c.z can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z IS NOT NULL);
+
+ALTER TABLE c ADD COLUMN x INT;
+
+-- ANTI JOIN, x cannot be NULL as b.x has a NOT NULL constraint
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x FROM b NATURAL JOIN c);
+
+
+ROLLBACK;
#33Jeevan Chalke
jeevan.chalke@enterprisedb.com
In reply to: David Rowley (#32)
Re: Allowing NOT IN to use ANTI joins

Hi,

With further testing I noticed that the patch was not allowing ANTI joins
in cases like this:

explain select * from a where id not in(select x from b natural join c);

I too found this with natural joins and was about to report that. But its
good that you found that and fixed it as well.

I have reviewed this and didn't find any issues as such. So marking it
"Ready for Committer".

Thanks

--
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#32)
1 attachment(s)
Re: Allowing NOT IN to use ANTI joins

David Rowley <dgrowleyml@gmail.com> writes:

Attached is the updated version of the patch.

I spent some time fooling with this patch, cleaning up the join-alias
issue as well as more-cosmetic things. However, while testing it
I realized that the whole thing is based on a false premise: to equate
a NOT IN with an antijoin, we'd have to prove not only that the inner
side is non-nullable, but that the outer comparison values are too.
Here's an example:

regression=# create table zt1 (f1 int);
CREATE TABLE
regression=# insert into zt1 values(1);
INSERT 0 1
regression=# insert into zt1 values(2);
INSERT 0 1
regression=# insert into zt1 values(null);
INSERT 0 1
regression=# create table zt2 (f1 int not null);
CREATE TABLE
regression=# insert into zt2 values(1);
INSERT 0 1

With the patch in place, we get

regression=# explain select * from zt1 where f1 not in (select f1 from zt2);
QUERY PLAN
-------------------------------------------------------------------
Hash Anti Join (cost=64.00..144.80 rows=1200 width=4)
Hash Cond: (zt1.f1 = zt2.f1)
-> Seq Scan on zt1 (cost=0.00..34.00 rows=2400 width=4)
-> Hash (cost=34.00..34.00 rows=2400 width=4)
-> Seq Scan on zt2 (cost=0.00..34.00 rows=2400 width=4)
Planning time: 0.646 ms
(6 rows)

regression=# select * from zt1 where f1 not in (select f1 from zt2);
f1
----
2

(2 rows)

which is the wrong answer, as demonstrated by comparison to the result
without optimization:

regression=# alter table zt2 alter column f1 drop not null;
ALTER TABLE
regression=# explain select * from zt1 where f1 not in (select f1 from zt2);
QUERY PLAN
---------------------------------------------------------------
Seq Scan on zt1 (cost=40.00..80.00 rows=1200 width=4)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on zt2 (cost=0.00..34.00 rows=2400 width=4)
Planning time: 0.163 ms
(5 rows)

regression=# select * from zt1 where f1 not in (select f1 from zt2);
f1
----
2
(1 row)

We could no doubt fix this by also insisting that the left-side vars
be provably not null, but that's going to make the patch even slower
and even less often applicable. I'm feeling discouraged about whether
this is worth doing in this form.

For the archives' sake, I attach an updated version with the fixes
I'd applied so far.

regards, tom lane

Attachments:

not_in_anti_join_v0.7.patchtext/x-diff; charset=us-ascii; name=not_in_anti_join_v0.7.patchDownload
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3e7dc85..4b44662 100644
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
*************** SS_process_ctes(PlannerInfo *root)
*** 1195,1205 ****
   * If so, form a JoinExpr and return it.  Return NULL if the SubLink cannot
   * be converted to a join.
   *
!  * The only non-obvious input parameter is available_rels: this is the set
!  * of query rels that can safely be referenced in the sublink expression.
!  * (We must restrict this to avoid changing the semantics when a sublink
!  * is present in an outer join's ON qual.)  The conversion must fail if
!  * the converted qual would reference any but these parent-query relids.
   *
   * On success, the returned JoinExpr has larg = NULL and rarg = the jointree
   * item representing the pulled-up subquery.  The caller must set larg to
--- 1195,1208 ----
   * If so, form a JoinExpr and return it.  Return NULL if the SubLink cannot
   * be converted to a join.
   *
!  * If under_not is true, the caller actually found NOT (ANY SubLink),
!  * so that what we must try to build is an ANTI not SEMI join.
!  *
!  * available_rels is the set of query rels that can safely be referenced
!  * in the sublink expression.  (We must restrict this to avoid changing the
!  * semantics when a sublink is present in an outer join's ON qual.)
!  * The conversion must fail if the converted qual would reference any but
!  * these parent-query relids.
   *
   * On success, the returned JoinExpr has larg = NULL and rarg = the jointree
   * item representing the pulled-up subquery.  The caller must set larg to
*************** SS_process_ctes(PlannerInfo *root)
*** 1222,1228 ****
   */
  JoinExpr *
  convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
! 							Relids available_rels)
  {
  	JoinExpr   *result;
  	Query	   *parse = root->parse;
--- 1225,1231 ----
   */
  JoinExpr *
  convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
! 							bool under_not, Relids available_rels)
  {
  	JoinExpr   *result;
  	Query	   *parse = root->parse;
*************** convert_ANY_sublink_to_join(PlannerInfo 
*** 1237,1242 ****
--- 1240,1254 ----
  	Assert(sublink->subLinkType == ANY_SUBLINK);
  
  	/*
+ 	 * Per SQL spec, NOT IN is not ordinarily equivalent to an anti-join, so
+ 	 * that by default we have to fail when under_not.  However, if we can
+ 	 * prove that the sub-select's output columns are all certainly not NULL,
+ 	 * then it's safe to convert NOT IN to an anti-join.
+ 	 */
+ 	if (under_not && !query_outputs_are_not_nullable(subselect))
+ 		return NULL;
+ 
+ 	/*
  	 * The sub-select must not refer to any Vars of the parent query. (Vars of
  	 * higher levels should be okay, though.)
  	 */
*************** convert_ANY_sublink_to_join(PlannerInfo 
*** 1302,1308 ****
  	 * And finally, build the JoinExpr node.
  	 */
  	result = makeNode(JoinExpr);
! 	result->jointype = JOIN_SEMI;
  	result->isNatural = false;
  	result->larg = NULL;		/* caller must fill this in */
  	result->rarg = (Node *) rtr;
--- 1314,1320 ----
  	 * And finally, build the JoinExpr node.
  	 */
  	result = makeNode(JoinExpr);
! 	result->jointype = under_not ? JOIN_ANTI : JOIN_SEMI;
  	result->isNatural = false;
  	result->larg = NULL;		/* caller must fill this in */
  	result->rarg = (Node *) rtr;
*************** convert_ANY_sublink_to_join(PlannerInfo 
*** 1317,1325 ****
  /*
   * convert_EXISTS_sublink_to_join: try to convert an EXISTS SubLink to a join
   *
!  * The API of this function is identical to convert_ANY_sublink_to_join's,
!  * except that we also support the case where the caller has found NOT EXISTS,
!  * so we need an additional input parameter "under_not".
   */
  JoinExpr *
  convert_EXISTS_sublink_to_join(PlannerInfo *root, SubLink *sublink,
--- 1329,1335 ----
  /*
   * convert_EXISTS_sublink_to_join: try to convert an EXISTS SubLink to a join
   *
!  * The API of this function is identical to convert_ANY_sublink_to_join's.
   */
  JoinExpr *
  convert_EXISTS_sublink_to_join(PlannerInfo *root, SubLink *sublink,
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 9cb1378..3a116c7 100644
*** a/src/backend/optimizer/prep/prepjointree.c
--- b/src/backend/optimizer/prep/prepjointree.c
*************** pull_up_sublinks_qual_recurse(PlannerInf
*** 334,340 ****
  		/* Is it a convertible ANY or EXISTS clause? */
  		if (sublink->subLinkType == ANY_SUBLINK)
  		{
! 			if ((j = convert_ANY_sublink_to_join(root, sublink,
  												 available_rels1)) != NULL)
  			{
  				/* Yes; insert the new join node into the join tree */
--- 334,340 ----
  		/* Is it a convertible ANY or EXISTS clause? */
  		if (sublink->subLinkType == ANY_SUBLINK)
  		{
! 			if ((j = convert_ANY_sublink_to_join(root, sublink, false,
  												 available_rels1)) != NULL)
  			{
  				/* Yes; insert the new join node into the join tree */
*************** pull_up_sublinks_qual_recurse(PlannerInf
*** 360,366 ****
  				return NULL;
  			}
  			if (available_rels2 != NULL &&
! 				(j = convert_ANY_sublink_to_join(root, sublink,
  												 available_rels2)) != NULL)
  			{
  				/* Yes; insert the new join node into the join tree */
--- 360,366 ----
  				return NULL;
  			}
  			if (available_rels2 != NULL &&
! 				(j = convert_ANY_sublink_to_join(root, sublink, false,
  												 available_rels2)) != NULL)
  			{
  				/* Yes; insert the new join node into the join tree */
*************** pull_up_sublinks_qual_recurse(PlannerInf
*** 445,458 ****
  	}
  	if (not_clause(node))
  	{
! 		/* If the immediate argument of NOT is EXISTS, try to convert */
  		SubLink    *sublink = (SubLink *) get_notclausearg((Expr *) node);
  		JoinExpr   *j;
  		Relids		child_rels;
  
  		if (sublink && IsA(sublink, SubLink))
  		{
! 			if (sublink->subLinkType == EXISTS_SUBLINK)
  			{
  				if ((j = convert_EXISTS_sublink_to_join(root, sublink, true,
  												   available_rels1)) != NULL)
--- 445,512 ----
  	}
  	if (not_clause(node))
  	{
! 		/* If the immediate argument of NOT is ANY or EXISTS, try to convert */
  		SubLink    *sublink = (SubLink *) get_notclausearg((Expr *) node);
  		JoinExpr   *j;
  		Relids		child_rels;
  
  		if (sublink && IsA(sublink, SubLink))
  		{
! 			if (sublink->subLinkType == ANY_SUBLINK)
! 			{
! 				if ((j = convert_ANY_sublink_to_join(root, sublink, true,
! 												   available_rels1)) != NULL)
! 				{
! 					/* Yes; insert the new join node into the join tree */
! 					j->larg = *jtlink1;
! 					*jtlink1 = (Node *) j;
! 					/* Recursively process pulled-up jointree nodes */
! 					j->rarg = pull_up_sublinks_jointree_recurse(root,
! 																j->rarg,
! 																&child_rels);
! 
! 					/*
! 					 * Now recursively process the pulled-up quals.  Because
! 					 * we are underneath a NOT, we can't pull up sublinks that
! 					 * reference the left-hand stuff, but it's still okay to
! 					 * pull up sublinks referencing j->rarg.
! 					 */
! 					j->quals = pull_up_sublinks_qual_recurse(root,
! 															 j->quals,
! 															 &j->rarg,
! 															 child_rels,
! 															 NULL, NULL);
! 					/* Return NULL representing constant TRUE */
! 					return NULL;
! 				}
! 				if (available_rels2 != NULL &&
! 					(j = convert_ANY_sublink_to_join(root, sublink, true,
! 												   available_rels2)) != NULL)
! 				{
! 					/* Yes; insert the new join node into the join tree */
! 					j->larg = *jtlink2;
! 					*jtlink2 = (Node *) j;
! 					/* Recursively process pulled-up jointree nodes */
! 					j->rarg = pull_up_sublinks_jointree_recurse(root,
! 																j->rarg,
! 																&child_rels);
! 
! 					/*
! 					 * Now recursively process the pulled-up quals.  Because
! 					 * we are underneath a NOT, we can't pull up sublinks that
! 					 * reference the left-hand stuff, but it's still okay to
! 					 * pull up sublinks referencing j->rarg.
! 					 */
! 					j->quals = pull_up_sublinks_qual_recurse(root,
! 															 j->quals,
! 															 &j->rarg,
! 															 child_rels,
! 															 NULL, NULL);
! 					/* Return NULL representing constant TRUE */
! 					return NULL;
! 				}
! 			}
! 			else if (sublink->subLinkType == EXISTS_SUBLINK)
  			{
  				if ((j = convert_EXISTS_sublink_to_join(root, sublink, true,
  												   available_rels1)) != NULL)
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index 19b5cf7..f8e3eaa 100644
*** a/src/backend/optimizer/util/clauses.c
--- b/src/backend/optimizer/util/clauses.c
***************
*** 40,45 ****
--- 40,46 ----
  #include "parser/parse_agg.h"
  #include "parser/parse_coerce.h"
  #include "parser/parse_func.h"
+ #include "parser/parsetree.h"
  #include "rewrite/rewriteManip.h"
  #include "tcop/tcopprot.h"
  #include "utils/acl.h"
*************** static bool contain_nonstrict_functions_
*** 100,105 ****
--- 101,108 ----
  static bool contain_leaky_functions_walker(Node *node, void *context);
  static Relids find_nonnullable_rels_walker(Node *node, bool top_level);
  static List *find_nonnullable_vars_walker(Node *node, bool top_level);
+ static void find_innerjoined_rels(Node *jtnode,
+ 					  Relids *innerjoined_rels, List **usable_quals);
  static bool is_strict_saop(ScalarArrayOpExpr *expr, bool falseOK);
  static Node *eval_const_expressions_mutator(Node *node,
  							   eval_const_expressions_context *context);
*************** contain_leaky_functions_walker(Node *nod
*** 1459,1464 ****
--- 1462,1471 ----
  								  context);
  }
  
+ /*****************************************************************************
+  *		  Nullability analysis
+  *****************************************************************************/
+ 
  /*
   * find_nonnullable_rels
   *		Determine which base rels are forced nonnullable by given clause.
*************** find_nonnullable_rels_walker(Node *node,
*** 1685,1691 ****
   * but here we assume that the input is a Boolean expression, and wish to
   * see if NULL inputs will provably cause a FALSE-or-NULL result.  We expect
   * the expression to have been AND/OR flattened and converted to implicit-AND
!  * format.
   *
   * The result is a palloc'd List, but we have not copied the member Var nodes.
   * Also, we don't bother trying to eliminate duplicate entries.
--- 1692,1698 ----
   * but here we assume that the input is a Boolean expression, and wish to
   * see if NULL inputs will provably cause a FALSE-or-NULL result.  We expect
   * the expression to have been AND/OR flattened and converted to implicit-AND
!  * format (but the results are still good if it wasn't AND/OR flattened).
   *
   * The result is a palloc'd List, but we have not copied the member Var nodes.
   * Also, we don't bother trying to eliminate duplicate entries.
*************** find_forced_null_var(Node *node)
*** 1987,1992 ****
--- 1994,2241 ----
  }
  
  /*
+  * query_outputs_are_not_nullable
+  *		Returns TRUE if the output values of the Query are certainly not NULL.
+  *		All output columns must return non-NULL to answer TRUE.
+  *
+  * The reason this takes a Query, and not just an individual tlist expression,
+  * is so that we can make use of the query's WHERE/ON clauses to prove it does
+  * not return nulls.
+  *
+  * In current usage, the passed sub-Query hasn't yet been through any planner
+  * processing.  This means that applying find_nonnullable_vars() to its WHERE
+  * clauses isn't really ideal: for lack of const-simplification, we might be
+  * unable to prove not-nullness in some cases where we could have proved it
+  * afterwards.  However, we should not get any false positive results.
+  *
+  * Like the other forms of nullability analysis above, we can err on the
+  * side of conservatism: if we're not sure, it's okay to return FALSE.
+  */
+ bool
+ query_outputs_are_not_nullable(Query *query)
+ {
+ 	PlannerInfo subroot;
+ 	Relids		innerjoined_rels = NULL;
+ 	bool		computed_innerjoined_rels = false;
+ 	List	   *usable_quals = NIL;
+ 	List	   *nonnullable_vars = NIL;
+ 	bool		computed_nonnullable_vars = false;
+ 	ListCell   *tl;
+ 
+ 	/*
+ 	 * If the query contains set operations, punt.  The set ops themselves
+ 	 * couldn't introduce nulls that weren't in their inputs, but the tlist
+ 	 * present in the top-level query is just dummy and won't give us useful
+ 	 * info.  We could get an answer by recursing to examine each leaf query,
+ 	 * but for the moment it doesn't seem worth the extra complication.
+ 	 *
+ 	 * Note that we needn't consider other top-level operators such as
+ 	 * DISTINCT, GROUP BY, etc, as those will not introduce nulls either.
+ 	 */
+ 	if (query->setOperations)
+ 		return false;
+ 
+ 	/*
+ 	 * We need a PlannerInfo to pass to flatten_join_alias_vars.  Fortunately,
+ 	 * we can cons up an entirely dummy one, because only the "parse" link in
+ 	 * the struct is used by flatten_join_alias_vars.
+ 	 */
+ 	MemSet(&subroot, 0, sizeof(subroot));
+ 	subroot.parse = query;
+ 
+ 	/*
+ 	 * Examine each targetlist entry to prove that it can't produce NULL.
+ 	 */
+ 	foreach(tl, query->targetList)
+ 	{
+ 		TargetEntry *tle = (TargetEntry *) lfirst(tl);
+ 		Expr	   *expr = tle->expr;
+ 
+ 		/* Resjunk columns can be ignored: they don't produce output values */
+ 		if (tle->resjunk)
+ 			continue;
+ 
+ 		/*
+ 		 * For the most part we don't try to deal with anything more complex
+ 		 * than Consts and Vars; but it seems worthwhile to look through
+ 		 * binary relabelings, since we know those don't introduce nulls.
+ 		 */
+ 		while (expr && IsA(expr, RelabelType))
+ 			expr = ((RelabelType *) expr)->arg;
+ 
+ 		if (expr == NULL)		/* paranoia */
+ 			return false;
+ 
+ 		if (IsA(expr, Const))
+ 		{
+ 			/* Consts are easy: they're either null or not. */
+ 			if (((Const *) expr)->constisnull)
+ 				return false;
+ 		}
+ 		else if (IsA(expr, Var))
+ 		{
+ 			Var		   *var = (Var *) expr;
+ 
+ 			/* Currently, we punt for any nonlocal Vars */
+ 			if (var->varlevelsup != 0)
+ 				return false;
+ 
+ 			/*
+ 			 * Since the subquery hasn't yet been through expression
+ 			 * preprocessing, we must apply flatten_join_alias_vars to the
+ 			 * given Var, and to any Vars found by find_nonnullable_vars, to
+ 			 * avoid being fooled by join aliases.  If we get something other
+ 			 * than a plain Var out of the substitution, punt.
+ 			 */
+ 			var = (Var *) flatten_join_alias_vars(&subroot, (Node *) var);
+ 
+ 			if (!IsA(var, Var))
+ 				return false;
+ 			Assert(var->varlevelsup == 0);
+ 
+ 			/*
+ 			 * We don't bother to compute innerjoined_rels and usable_quals
+ 			 * until we've found a Var we must analyze.
+ 			 */
+ 			if (!computed_innerjoined_rels)
+ 			{
+ 				find_innerjoined_rels((Node *) query->jointree,
+ 									  &innerjoined_rels, &usable_quals);
+ 				computed_innerjoined_rels = true;
+ 			}
+ 
+ 			/*
+ 			 * If Var is from a plain relation, and that relation is not on
+ 			 * the nullable side of any outer join, and its column is marked
+ 			 * NOT NULL according to the catalogs, it can't produce NULL.
+ 			 */
+ 			if (bms_is_member(var->varno, innerjoined_rels))
+ 			{
+ 				RangeTblEntry *rte = rt_fetch(var->varno, query->rtable);
+ 
+ 				if (rte->rtekind == RTE_RELATION &&
+ 					get_attnotnull(rte->relid, var->varattno))
+ 					continue;	/* cannot produce NULL */
+ 			}
+ 
+ 			/*
+ 			 * Even if that didn't work, we can conclude that the Var is not
+ 			 * nullable if find_nonnullable_vars can find a "var IS NOT NULL"
+ 			 * or similarly strict condition among the usable_quals.  Compute
+ 			 * the list of Vars having such quals if we didn't already.
+ 			 */
+ 			if (!computed_nonnullable_vars)
+ 			{
+ 				nonnullable_vars = find_nonnullable_vars((Node *) usable_quals);
+ 				nonnullable_vars = (List *)
+ 					flatten_join_alias_vars(&subroot,
+ 											(Node *) nonnullable_vars);
+ 				/* We don't bother removing any non-Vars from the result */
+ 				computed_nonnullable_vars = true;
+ 			}
+ 
+ 			if (!list_member(nonnullable_vars, var))
+ 				return false;	/* we failed to prove the Var non-null */
+ 		}
+ 		else
+ 		{
+ 			/* Not a Const or Var; punt */
+ 			return false;
+ 		}
+ 	}
+ 
+ 	return true;				/* query cannot emit NULLs */
+ }
+ 
+ /*
+  * find_innerjoined_rels
+  *		Traverse jointree to locate non-outerjoined-rels and quals above them
+  *
+  * We fill innerjoined_rels with the relids of all rels that are not below
+  * the nullable side of any outer join (which would cause their Vars to be
+  * possibly NULL regardless of what's in the catalogs).  In the same scan,
+  * we locate all WHERE and JOIN/ON quals that constrain these rels, and add
+  * them to the usable_quals list (forming a list with implicit-AND semantics).
+  *
+  * Top-level caller must initialize innerjoined_rels/usable_quals to NULL/NIL.
+  */
+ static void
+ find_innerjoined_rels(Node *jtnode,
+ 					  Relids *innerjoined_rels, List **usable_quals)
+ {
+ 	if (jtnode == NULL)
+ 		return;
+ 	if (IsA(jtnode, RangeTblRef))
+ 	{
+ 		int			varno = ((RangeTblRef *) jtnode)->rtindex;
+ 
+ 		*innerjoined_rels = bms_add_member(*innerjoined_rels, varno);
+ 	}
+ 	else if (IsA(jtnode, FromExpr))
+ 	{
+ 		FromExpr   *f = (FromExpr *) jtnode;
+ 		ListCell   *lc;
+ 
+ 		/* All elements of the FROM list are allowable */
+ 		foreach(lc, f->fromlist)
+ 			find_innerjoined_rels((Node *) lfirst(lc),
+ 								  innerjoined_rels, usable_quals);
+ 		/* ... and its WHERE quals are too */
+ 		if (f->quals)
+ 			*usable_quals = lappend(*usable_quals, f->quals);
+ 	}
+ 	else if (IsA(jtnode, JoinExpr))
+ 	{
+ 		JoinExpr   *j = (JoinExpr *) jtnode;
+ 
+ 		switch (j->jointype)
+ 		{
+ 			case JOIN_INNER:
+ 				/* visit both children */
+ 				find_innerjoined_rels(j->larg,
+ 									  innerjoined_rels, usable_quals);
+ 				find_innerjoined_rels(j->rarg,
+ 									  innerjoined_rels, usable_quals);
+ 				/* and grab the ON quals too */
+ 				if (j->quals)
+ 					*usable_quals = lappend(*usable_quals, j->quals);
+ 				break;
+ 
+ 			case JOIN_LEFT:
+ 			case JOIN_SEMI:
+ 			case JOIN_ANTI:
+ 
+ 				/*
+ 				 * Only the left input is possibly non-nullable; furthermore,
+ 				 * the quals of this join don't constrain the left input.
+ 				 * Note: we probably can't see SEMI or ANTI joins at this
+ 				 * point, but if we do, we can treat them like LEFT joins.
+ 				 */
+ 				find_innerjoined_rels(j->larg,
+ 									  innerjoined_rels, usable_quals);
+ 				break;
+ 
+ 			case JOIN_RIGHT:
+ 				/* Reverse of the above case */
+ 				find_innerjoined_rels(j->rarg,
+ 									  innerjoined_rels, usable_quals);
+ 				break;
+ 
+ 			case JOIN_FULL:
+ 				/* Neither side is non-nullable, so stop descending */
+ 				break;
+ 
+ 			default:
+ 				elog(ERROR, "unrecognized join type: %d",
+ 					 (int) j->jointype);
+ 		}
+ 	}
+ 	else
+ 		elog(ERROR, "unrecognized node type: %d",
+ 			 (int) nodeTag(jtnode));
+ }
+ 
+ /*
   * Can we treat a ScalarArrayOpExpr as strict?
   *
   * If "falseOK" is true, then a "false" result can be considered strict,
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 4b5ef99..1d581a8 100644
*** a/src/backend/utils/cache/lsyscache.c
--- b/src/backend/utils/cache/lsyscache.c
*************** get_atttypetypmodcoll(Oid relid, AttrNum
*** 972,977 ****
--- 972,1004 ----
  	ReleaseSysCache(tp);
  }
  
+ /*
+  * get_attnotnull
+  *
+  *		Given the relation id and the attribute number,
+  *		return the "attnotnull" field from the attribute relation.
+  */
+ bool
+ get_attnotnull(Oid relid, AttrNumber attnum)
+ {
+ 	HeapTuple	tp;
+ 
+ 	tp = SearchSysCache2(ATTNUM,
+ 						 ObjectIdGetDatum(relid),
+ 						 Int16GetDatum(attnum));
+ 	if (HeapTupleIsValid(tp))
+ 	{
+ 		Form_pg_attribute att_tup = (Form_pg_attribute) GETSTRUCT(tp);
+ 		bool		result;
+ 
+ 		result = att_tup->attnotnull;
+ 		ReleaseSysCache(tp);
+ 		return result;
+ 	}
+ 	else
+ 		return false;
+ }
+ 
  /*				---------- COLLATION CACHE ----------					 */
  
  /*
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index dd991b1..5b25d01 100644
*** a/src/include/optimizer/clauses.h
--- b/src/include/optimizer/clauses.h
*************** extern Relids find_nonnullable_rels(Node
*** 69,74 ****
--- 69,75 ----
  extern List *find_nonnullable_vars(Node *clause);
  extern List *find_forced_null_vars(Node *clause);
  extern Var *find_forced_null_var(Node *clause);
+ extern bool query_outputs_are_not_nullable(Query *query);
  
  extern bool is_pseudo_constant_clause(Node *clause);
  extern bool is_pseudo_constant_clause_relids(Node *clause, Relids relids);
diff --git a/src/include/optimizer/subselect.h b/src/include/optimizer/subselect.h
index 5607e98..3e8bfe7 100644
*** a/src/include/optimizer/subselect.h
--- b/src/include/optimizer/subselect.h
***************
*** 18,23 ****
--- 18,24 ----
  extern void SS_process_ctes(PlannerInfo *root);
  extern JoinExpr *convert_ANY_sublink_to_join(PlannerInfo *root,
  							SubLink *sublink,
+ 							bool under_not,
  							Relids available_rels);
  extern JoinExpr *convert_EXISTS_sublink_to_join(PlannerInfo *root,
  							   SubLink *sublink,
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index f46460a..3ec200a 100644
*** a/src/include/utils/lsyscache.h
--- b/src/include/utils/lsyscache.h
*************** extern Oid	get_atttype(Oid relid, AttrNu
*** 70,75 ****
--- 70,76 ----
  extern int32 get_atttypmod(Oid relid, AttrNumber attnum);
  extern void get_atttypetypmodcoll(Oid relid, AttrNumber attnum,
  					  Oid *typid, int32 *typmod, Oid *collid);
+ extern bool get_attnotnull(Oid relid, AttrNumber attnum);
  extern char *get_collation_name(Oid colloid);
  extern char *get_constraint_name(Oid conoid);
  extern Oid	get_opclass_family(Oid opclass);
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index d85a717..b63f7ac 100644
*** a/src/test/regress/expected/subselect.out
--- b/src/test/regress/expected/subselect.out
*************** select nextval('ts1');
*** 803,805 ****
--- 803,1103 ----
        11
  (1 row)
  
+ --
+ -- Check NOT IN performs an ANTI JOIN when NULLs are not possible
+ -- in the target list of the subquery.
+ --
+ BEGIN;
+ CREATE TEMP TABLE a (id INT PRIMARY KEY);
+ CREATE TEMP TABLE b (x INT NOT NULL, y INT);
+ CREATE TEMP TABLE c (z INT NOT NULL);
+ -- ANTI JOIN. x is defined as NOT NULL
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT x FROM b);
+                QUERY PLAN                
+ -----------------------------------------
+  Merge Anti Join
+    Merge Cond: (a.id = b.x)
+    ->  Index Only Scan using a_pkey on a
+    ->  Sort
+          Sort Key: b.x
+          ->  Seq Scan on b
+ (6 rows)
+ 
+ -- No ANTI JOIN, y can be NULL
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b);
+              QUERY PLAN             
+ ------------------------------------
+  Seq Scan on a
+    Filter: (NOT (hashed SubPlan 1))
+    SubPlan 1
+      ->  Seq Scan on b
+ (4 rows)
+ 
+ -- No ANTI JOIN, x is NOT NULL, but we don't know if + 1 will change that.
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT x+1 FROM b);
+              QUERY PLAN             
+ ------------------------------------
+  Seq Scan on a
+    Filter: (NOT (hashed SubPlan 1))
+    SubPlan 1
+      ->  Seq Scan on b
+ (4 rows)
+ 
+ -- ANTI JOIN 1 is a Const that is not null.
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT 1 FROM b);
+         QUERY PLAN         
+ ---------------------------
+  Nested Loop Anti Join
+    Join Filter: (a.id = 1)
+    ->  Seq Scan on a
+    ->  Materialize
+          ->  Seq Scan on b
+ (5 rows)
+ 
+ -- No ANTI JOIN, results contain a NULL Const
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT NULL::int FROM b);
+              QUERY PLAN             
+ ------------------------------------
+  Seq Scan on a
+    Filter: (NOT (hashed SubPlan 1))
+    SubPlan 1
+      ->  Seq Scan on b
+ (4 rows)
+ 
+ -- ANTI JOIN y = 1 means y can't be NULL
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1);
+           QUERY PLAN           
+ -------------------------------
+  Hash Anti Join
+    Hash Cond: (a.id = b.y)
+    ->  Seq Scan on a
+    ->  Hash
+          ->  Seq Scan on b
+                Filter: (y = 1)
+ (6 rows)
+ 
+ -- No ANTI JOIN, OR condition does not ensure y = 1
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1 OR x = 1);
+                QUERY PLAN               
+ ----------------------------------------
+  Seq Scan on a
+    Filter: (NOT (hashed SubPlan 1))
+    SubPlan 1
+      ->  Seq Scan on b
+            Filter: ((y = 1) OR (x = 1))
+ (5 rows)
+ 
+ -- No ANTI JOIN, OR condition does not ensure y = 1 or y = 2
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND (y = 2 OR x = 2));
+                             QUERY PLAN                             
+ -------------------------------------------------------------------
+  Seq Scan on a
+    Filter: (NOT (hashed SubPlan 1))
+    SubPlan 1
+      ->  Seq Scan on b
+            Filter: (((y = 1) OR (x = 1)) AND ((y = 2) OR (x = 2)))
+ (5 rows)
+ 
+ -- ANTI JOIN y must be 2, so can't be NULL
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND y = 2);
+                         QUERY PLAN                        
+ ----------------------------------------------------------
+  Hash Anti Join
+    Hash Cond: (a.id = b.y)
+    ->  Seq Scan on a
+    ->  Hash
+          ->  Seq Scan on b
+                Filter: ((y = 2) AND ((y = 1) OR (x = 1)))
+ (6 rows)
+ 
+ -- ANTI JOIN y can be 1 or 2, but can't be null.
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR y = 2));
+                  QUERY PLAN                 
+ --------------------------------------------
+  Hash Anti Join
+    Hash Cond: (a.id = b.y)
+    ->  Seq Scan on a
+    ->  Hash
+          ->  Seq Scan on b
+                Filter: ((y = 1) OR (y = 2))
+ (6 rows)
+ 
+ -- No ANTI JOIN c.z is from a left outer join so it can have nulls.
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z);
+              QUERY PLAN             
+ ------------------------------------
+  Seq Scan on a
+    Filter: (NOT (hashed SubPlan 1))
+    SubPlan 1
+      ->  Merge Left Join
+            Merge Cond: (b.x = c.z)
+            ->  Sort
+                  Sort Key: b.x
+                  ->  Seq Scan on b
+            ->  Sort
+                  Sort Key: c.z
+                  ->  Seq Scan on c
+ (11 rows)
+ 
+ -- ANTI JOIN, c.z is not from an outer join
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b RIGHT JOIN c ON b.x = c.z);
+                      QUERY PLAN                      
+ -----------------------------------------------------
+  Merge Left Join
+    Merge Cond: (c.z = b.x)
+    ->  Sort
+          Sort Key: c.z
+          ->  Merge Anti Join
+                Merge Cond: (a.id = c.z)
+                ->  Index Only Scan using a_pkey on a
+                ->  Sort
+                      Sort Key: c.z
+                      ->  Seq Scan on c
+    ->  Sort
+          Sort Key: b.x
+          ->  Seq Scan on b
+ (13 rows)
+ 
+ -- No ANTI JOIN, b.x is from an outer join
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b RIGHT JOIN c ON b.x = c.z);
+              QUERY PLAN             
+ ------------------------------------
+  Seq Scan on a
+    Filter: (NOT (hashed SubPlan 1))
+    SubPlan 1
+      ->  Merge Right Join
+            Merge Cond: (b.x = c.z)
+            ->  Sort
+                  Sort Key: b.x
+                  ->  Seq Scan on b
+            ->  Sort
+                  Sort Key: c.z
+                  ->  Seq Scan on c
+ (11 rows)
+ 
+ -- No ANTI JOIN, c.z is not from an outer join
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b FULL JOIN c ON b.x = c.z);
+              QUERY PLAN             
+ ------------------------------------
+  Seq Scan on a
+    Filter: (NOT (hashed SubPlan 1))
+    SubPlan 1
+      ->  Merge Full Join
+            Merge Cond: (b.x = c.z)
+            ->  Sort
+                  Sort Key: b.x
+                  ->  Seq Scan on b
+            ->  Sort
+                  Sort Key: c.z
+                  ->  Seq Scan on c
+ (11 rows)
+ 
+ -- No ANTI JOIN, b.x is from an outer join
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b FULL JOIN c ON b.x = c.z);
+              QUERY PLAN             
+ ------------------------------------
+  Seq Scan on a
+    Filter: (NOT (hashed SubPlan 1))
+    SubPlan 1
+      ->  Merge Full Join
+            Merge Cond: (b.x = c.z)
+            ->  Sort
+                  Sort Key: b.x
+                  ->  Seq Scan on b
+            ->  Sort
+                  Sort Key: c.z
+                  ->  Seq Scan on c
+ (11 rows)
+ 
+ -- ANTI JOIN, c.z is from an inner join and has a NOT NULL constraint.
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b INNER JOIN c ON b.x = c.z);
+                QUERY PLAN                
+ -----------------------------------------
+  Merge Anti Join
+    Merge Cond: (a.id = c.z)
+    ->  Index Only Scan using a_pkey on a
+    ->  Materialize
+          ->  Merge Join
+                Merge Cond: (b.x = c.z)
+                ->  Sort
+                      Sort Key: b.x
+                      ->  Seq Scan on b
+                ->  Sort
+                      Sort Key: c.z
+                      ->  Seq Scan on c
+ (12 rows)
+ 
+ -- ANTI JOIN, c.z must be 1
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z = 1);
+                 QUERY PLAN                 
+ -------------------------------------------
+  Hash Anti Join
+    Hash Cond: (a.id = c.z)
+    ->  Seq Scan on a
+    ->  Hash
+          ->  Nested Loop
+                ->  Seq Scan on c
+                      Filter: (z = 1)
+                ->  Materialize
+                      ->  Seq Scan on b
+                            Filter: (x = 1)
+ (10 rows)
+ 
+ -- ANTI JOIN, c.z can't be NULL
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z IS NOT NULL);
+                     QUERY PLAN                     
+ ---------------------------------------------------
+  Merge Anti Join
+    Merge Cond: (a.id = c.z)
+    ->  Index Only Scan using a_pkey on a
+    ->  Materialize
+          ->  Merge Join
+                Merge Cond: (b.x = c.z)
+                ->  Sort
+                      Sort Key: b.x
+                      ->  Seq Scan on b
+                ->  Sort
+                      Sort Key: c.z
+                      ->  Seq Scan on c
+                            Filter: (z IS NOT NULL)
+ (13 rows)
+ 
+ ALTER TABLE c ADD COLUMN x INT;
+ -- ANTI JOIN, x cannot be NULL as b.x has a NOT NULL constraint
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT x FROM b NATURAL JOIN c);
+                QUERY PLAN                
+ -----------------------------------------
+  Merge Anti Join
+    Merge Cond: (a.id = b.x)
+    ->  Index Only Scan using a_pkey on a
+    ->  Materialize
+          ->  Merge Join
+                Merge Cond: (b.x = c.x)
+                ->  Sort
+                      Sort Key: b.x
+                      ->  Seq Scan on b
+                ->  Sort
+                      Sort Key: c.x
+                      ->  Seq Scan on c
+ (12 rows)
+ 
+ ROLLBACK;
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index c3b4773..c3ca67f 100644
*** a/src/test/regress/sql/subselect.sql
--- b/src/test/regress/sql/subselect.sql
*************** select * from
*** 444,446 ****
--- 444,538 ----
    order by 1;
  
  select nextval('ts1');
+ 
+ --
+ -- Check NOT IN performs an ANTI JOIN when NULLs are not possible
+ -- in the target list of the subquery.
+ --
+ 
+ BEGIN;
+ 
+ CREATE TEMP TABLE a (id INT PRIMARY KEY);
+ CREATE TEMP TABLE b (x INT NOT NULL, y INT);
+ CREATE TEMP TABLE c (z INT NOT NULL);
+ 
+ -- ANTI JOIN. x is defined as NOT NULL
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT x FROM b);
+ 
+ -- No ANTI JOIN, y can be NULL
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b);
+ 
+ -- No ANTI JOIN, x is NOT NULL, but we don't know if + 1 will change that.
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT x+1 FROM b);
+ 
+ -- ANTI JOIN 1 is a Const that is not null.
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT 1 FROM b);
+ 
+ -- No ANTI JOIN, results contain a NULL Const
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT NULL::int FROM b);
+ 
+ -- ANTI JOIN y = 1 means y can't be NULL
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1);
+ 
+ -- No ANTI JOIN, OR condition does not ensure y = 1
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1 OR x = 1);
+ 
+ -- No ANTI JOIN, OR condition does not ensure y = 1 or y = 2
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND (y = 2 OR x = 2));
+ 
+ -- ANTI JOIN y must be 2, so can't be NULL
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND y = 2);
+ 
+ -- ANTI JOIN y can be 1 or 2, but can't be null.
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR y = 2));
+ 
+ -- No ANTI JOIN c.z is from a left outer join so it can have nulls.
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z);
+ 
+ -- ANTI JOIN, c.z is not from an outer join
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b RIGHT JOIN c ON b.x = c.z);
+ 
+ -- No ANTI JOIN, b.x is from an outer join
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b RIGHT JOIN c ON b.x = c.z);
+ 
+ -- No ANTI JOIN, c.z is not from an outer join
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b FULL JOIN c ON b.x = c.z);
+ 
+ -- No ANTI JOIN, b.x is from an outer join
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b FULL JOIN c ON b.x = c.z);
+ 
+ -- ANTI JOIN, c.z is from an inner join and has a NOT NULL constraint.
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b INNER JOIN c ON b.x = c.z);
+ 
+ -- ANTI JOIN, c.z must be 1
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z = 1);
+ 
+ -- ANTI JOIN, c.z can't be NULL
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z IS NOT NULL);
+ 
+ ALTER TABLE c ADD COLUMN x INT;
+ 
+ -- ANTI JOIN, x cannot be NULL as b.x has a NOT NULL constraint
+ EXPLAIN (COSTS OFF)
+ SELECT * FROM a WHERE id NOT IN (SELECT x FROM b NATURAL JOIN c);
+ 
+ 
+ ROLLBACK;
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#34)
Re: Allowing NOT IN to use ANTI joins

I wrote:

We could no doubt fix this by also insisting that the left-side vars
be provably not null, but that's going to make the patch even slower
and even less often applicable. I'm feeling discouraged about whether
this is worth doing in this form.

Hm ... actually, there might be a better answer: what about transforming

WHERE (x,y) NOT IN (SELECT provably-not-null-values FROM ...)

to

WHERE <antijoin condition> AND x IS NOT NULL AND y IS NOT NULL

?

Of course this would require x/y not being volatile, but if they are,
we're not going to get far with optimizing the query anyhow.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#35)
Re: Allowing NOT IN to use ANTI joins

On Fri, Jul 11, 2014 at 1:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

We could no doubt fix this by also insisting that the left-side vars
be provably not null, but that's going to make the patch even slower
and even less often applicable. I'm feeling discouraged about whether
this is worth doing in this form.

:-( seems I didn't do my analysis very well on that one.

Hm ... actually, there might be a better answer: what about transforming

WHERE (x,y) NOT IN (SELECT provably-not-null-values FROM ...)

to

WHERE <antijoin condition> AND x IS NOT NULL AND y IS NOT NULL

?

I think this is the way to go.
It's basically what I had to do with the WIP patch I have here for SEMI
JOIN removal, where when a IN() or EXISTS type join could be removed due to
the existence of a foreign key, the NULL values still need to be filtered
out.

Perhaps it would be possible for a future patch to check get_attnotnull and
remove these again in eval_const_expressions, if the column can't be null.

Thanks for taking the time to fix up the weirdness with the NATURAL joins
and also making use of the join condition to prove not null-ability.

I'll try and get some time soon to look into adding the IS NOT NULL quals,
unless you were thinking of looking again?

Regards

David Rowley

Show quoted text

Of course this would require x/y not being volatile, but if they are,
we're not going to get far with optimizing the query anyhow.

regards, tom lane

#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#36)
Re: Allowing NOT IN to use ANTI joins

David Rowley <dgrowleyml@gmail.com> writes:

On Fri, Jul 11, 2014 at 1:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hm ... actually, there might be a better answer: what about transforming
WHERE (x,y) NOT IN (SELECT provably-not-null-values FROM ...)
to
WHERE <antijoin condition> AND x IS NOT NULL AND y IS NOT NULL

I think this is the way to go.

I'll try and get some time soon to look into adding the IS NOT NULL quals,
unless you were thinking of looking again?

Go for it, I've got a lot of other stuff on my plate.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#35)
Re: Allowing NOT IN to use ANTI joins

On Fri, Jul 11, 2014 at 1:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

We could no doubt fix this by also insisting that the left-side vars
be provably not null, but that's going to make the patch even slower
and even less often applicable. I'm feeling discouraged about whether
this is worth doing in this form.

Hm ... actually, there might be a better answer: what about transforming

WHERE (x,y) NOT IN (SELECT provably-not-null-values FROM ...)

to

WHERE <antijoin condition> AND x IS NOT NULL AND y IS NOT NULL

?

I had another look at this and it appears you were right the first time, we
need to ensure there's no NULLs on both sides of the join condition.

The reason for this is that there's a special case with "WHERE col NOT
IN(SELECT id from empty_relation)", this is effectively the same as "WHERE
true", so we should see *all* rows, even ones where col is null. Adding a
col IS NOT NULL cannot be done as it would filter out the NULLs in this
special case.

The only other way I could imagine fixing this would be to have some other
sort of join type that always met the join condition if the right side of
the join had no tuples... Of course I'm not suggesting it gets implemented
this way, I'm just otherwise out of ideas.

Regards

David Rowley

#39Andres Freund
andres@2ndquadrant.com
In reply to: David Rowley (#38)
Re: Allowing NOT IN to use ANTI joins

On 2014-07-13 23:06:15 +1200, David Rowley wrote:

I had another look at this and it appears you were right the first time, we
need to ensure there's no NULLs on both sides of the join condition.

The patch is currently marked as 'ready for committer' - that doesn't
seem to correspond to the discussion. Marked as 'Waiting for Author'. Ok?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#38)
Re: Allowing NOT IN to use ANTI joins

David Rowley <dgrowleyml@gmail.com> writes:

I had another look at this and it appears you were right the first time, we
need to ensure there's no NULLs on both sides of the join condition.

Ugh. I'm back to being discouraged about the usefulness of the
optimization.

The only other way I could imagine fixing this would be to have some other
sort of join type that always met the join condition if the right side of
the join had no tuples... Of course I'm not suggesting it gets implemented
this way, I'm just otherwise out of ideas.

IIRC, we looked into implementing a true NOT IN join operator years ago.
Not only is it messy as can be, but there are patents in the area :-(.
So anything more than the most brain-dead-simple approach would be
risky.

I could see implementing a variant join operator in the hash join code,
since there you get to look at the entire inner relation before you have
to give any answers. You could easily detect both empty-inner and
inner-contains-nulls and modify the results of matching appropriately.
However, it's not apparent how that could be made to work for either
mergejoin or nestloop-with-inner-index-scan, which greatly limits the
usefulness of the approach. Worse yet, I think this'd be at best a
marginal improvement on the existing hashed-subplan code path.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#40)
Re: Allowing NOT IN to use ANTI joins

On Mon, Jul 14, 2014 at 3:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

I had another look at this and it appears you were right the first time,

we

need to ensure there's no NULLs on both sides of the join condition.

Ugh. I'm back to being discouraged about the usefulness of the
optimization.

Are you worried about the planning overhead of the not null checks, or is
it more that you think there's a much smaller chance of a real world
situation that the optimisation will succeed? At least the planning
overhead is limited to query's that have NOT IN clauses.

I'm still quite positive about the patch. I think that it would just be a
matter of modifying query_outputs_are_not_nullable() giving it a nice new
name and changing the parameter list to accept not only a Query, but also a
List of Expr. Likely this would be quite a nice reusable function that
likely could be used in a handful of other places in the planner to
optimise various other cases.

When I first decided to work on this I was more interested in getting some
planner knowledge about NOT NULL constraints than I was interested in
speeding up NOT IN, but it seemed like a perfect target or even "excuse" to
draft up some code that checks if an expr can never be NULL.

Since the patch has not been marked as rejected I was thinking that I'd
take a bash at fixing it up, but if you think this is a waste of time,
please let me know.

Regards

David Rowley

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#41)
Re: Allowing NOT IN to use ANTI joins

David Rowley <dgrowleyml@gmail.com> writes:

On Mon, Jul 14, 2014 at 3:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Ugh. I'm back to being discouraged about the usefulness of the
optimization.

Are you worried about the planning overhead of the not null checks, or is
it more that you think there's a much smaller chance of a real world
situation that the optimisation will succeed?

Both. We need to look at how much it costs the planner to run these
checks, and think about how many real queries it will help for. The
first is quantifiable, the second probably not so much :-( but we still
need to ask the question.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#41)
4 attachment(s)
Re: Allowing NOT IN to use ANTI joins

On Mon, Jul 14, 2014 at 8:55 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Since the patch has not been marked as rejected I was thinking that I'd
take a bash at fixing it up, but if you think this is a waste of time,
please let me know.

I've made some changes to the patch so that it only allows the conversion
to ANTI JOIN to take place if both the outer query's expressions AND the
subquery's target list can be proved not to have NULLs.

I've attached a delta, which is the changes I've made on top of Tom's
cleaned up version of my patch, and also a full patch.

I've also performed some benchmarks to try to determine how much time it
takes to execute this null checking code. I ended up hacking the code a
little for the benchmarks and just put the null checking function in a
tight loop that performed 100000 iterations.

Like:
if (under_not)
{
int x;
bool result;
for (x = 0; x < 100000; x++)
{
result = is_NOTANY_compatible_with_antijoin(parse, sublink);
}
if (!result)
return NULL;
}

I then ran 6 queries, 3 times each through the planner and grabbed the
"Planning Time" from the explain analyze result.
I then removed the extra looping code (seen above) and compiled the code as
it is with the attached patch.
I then ran each of the 6 queries again 3 times each and noted down the
"Planning Time from the explain analyze result.

In my results I assumed that the first set of times divided by 100000
would be the time taken to perform the NULL checks... This is not quite
accurate, but all the other planning work was quite well drowned out by the
100k loop.

I found that the call to is_NOTANY_compatible_with_antijoin adds about 0.2%
and 2.3% to total planning time. Though the 2.3% was quite an extreme case,
and the 0.2% was the most simple case I could think of.

I've attached the complete results in html format. I've also attached the
schema that I used and all 6 queries tested.

Here's 2 points which I think are important to note about the planning time
overhead of this patch:
1. There is no additional overhead if the query has no NOT IN clause.
2. The test queries 3 and 6 were to benchmark the overhead of when the NOT
NULL test fails. The slowest of these was test 3 which added just under
0.5% to the planning time. The query that added a 2.3% overhead performed
an ANTI JOIN, so likely the reduction in execution time more than made up
for the extra planning time.

Regards

David Rowley

Attachments:

not_in_benchmark_schema.sqltext/plain; charset=US-ASCII; name=not_in_benchmark_schema.sqlDownload
NOTIN_Planner_Benchmark.htmtext/html; charset=US-ASCII; name=NOTIN_Planner_Benchmark.htmDownload
not_in_anti_join_v0.9.delta.patchapplication/octet-stream; name=not_in_anti_join_v0.9.delta.patchDownload
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 4b44662..6ebab4d 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -70,6 +70,8 @@ static Node *convert_testexpr_mutator(Node *node,
 static bool subplan_is_hashable(Plan *plan);
 static bool testexpr_is_hashable(Node *testexpr);
 static bool hash_ok_operator(OpExpr *expr);
+static bool is_NOTANY_compatible_with_antijoin(Query *outerquery,
+				 SubLink *sublink);
 static bool simplify_EXISTS_query(Query *query);
 static Query *convert_EXISTS_to_ANY(PlannerInfo *root, Query *subselect,
 					  Node **testexpr, List **paramIds);
@@ -1187,6 +1189,117 @@ SS_process_ctes(PlannerInfo *root)
 }
 
 /*
+ * is_NOTANY_compatible_with_antijoin
+ *		True if the NOT IN sublink can be safely converted into an ANTI JOIN.
+ *		Per SQL spec, NOT IN is not ordinarily equivalent to an anti-join,
+ *		however, if we can prove that all of the expressions on both sides of
+ *		the would be join condition are all certainly not NULL, then it's safe
+ *		to convert NOT IN to an anti-join.
+ *
+ * Note: This function is quite locked into the NOT IN syntax. Certain
+ *       assumptions are made about the structure of the join conditions:
+ *
+ * 1. We assume that when more than 1 join condition exists that these are AND
+ *    type conditions, i.e not OR conditions.
+ *
+ * 2. We assume that each join qual has 2 arguments and the first arguments in
+ *    each join qual is the one that belongs to the outer side of the NOT IN
+ *    clause.
+ */
+static bool
+is_NOTANY_compatible_with_antijoin(Query *outerquery, SubLink *sublink)
+{
+	ListCell *lc;
+	List	 *outerexpr;
+	List	 *innerexpr;
+	Node	 *testexpr = sublink->testexpr;
+	Query	 *subselect = (Query *) sublink->subselect;
+
+	/*
+	 * We must build 2 lists of expressions, one for the outer query side and
+	 * one for the subquery/inner side of the NOT IN clause. It is these lists
+	 * that we pass to to expressions_are_not_nullable to allow it to determine
+	 * if each expression cannot contain NULL values or not. We'll try and be
+	 * as lazy about this as possible and we won't bother generating the 2nd
+	 * list if the first list can't be proved to be free from null-able
+	 * expressions. The order that we checks these lists does not really matter
+	 * as neither one seems to be more likely to allow us to exit from this
+	 * function more quickly.
+	 */
+
+	/* if it's a single expression */
+	if (IsA(testexpr, OpExpr))
+	{
+		OpExpr *opexpr = (OpExpr *) testexpr;
+
+		Assert(list_length(opexpr->args) == 2);
+
+		outerexpr = lappend(NIL, linitial(opexpr->args));
+	}
+
+	/* multiple expressions ANDed together */
+	else if (IsA(testexpr, BoolExpr))
+	{
+		List	 *list = ((BoolExpr *) testexpr)->args;
+
+		outerexpr = NIL;
+
+		 /* loop through each expression appending to the list each iteration. */
+		foreach(lc, list)
+		{
+			OpExpr *opexpr = (OpExpr *) lfirst(lc);
+
+			Assert(list_length(opexpr->args) == 2);
+
+			outerexpr = lappend(outerexpr, linitial(opexpr->args));
+		}
+	}
+	else
+		elog(ERROR, "unrecognized node type: %d",
+					(int) nodeTag(testexpr));
+
+	if (outerexpr == NIL)
+		elog(ERROR, "unexpected empty clause");
+
+	/*
+	 * Check the expressions being used in the outer query side of the NOT IN
+	 * clause to ensure NULLs are not possible.
+	 */
+	if (!expressions_are_not_nullable(outerquery, outerexpr))
+		return false;
+
+	/*
+	 * Now we check to ensure each TargetEntry in the subquery can be proved to
+	 * never be NULL.
+	 */
+
+	innerexpr = NIL;
+
+	foreach(lc, subselect->targetList)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(lc);
+
+		/* Resjunk columns can be ignored: they don't produce output values */
+		if (tle->resjunk)
+			continue;
+
+		innerexpr = lappend(innerexpr, tle->expr);
+	}
+
+	if (innerexpr == NIL)
+		elog(ERROR, "unexpected empty clause");
+
+	/*
+	 * Check if all the subquery's targetlist columns can be proved to be not
+	 * null.
+	 */
+	if (!expressions_are_not_nullable(subselect, innerexpr))
+		return false;
+
+	return true; /* supports ANTI JOIN */
+}
+
+/*
  * convert_ANY_sublink_to_join: try to convert an ANY SubLink to a join
  *
  * The caller has found an ANY SubLink at the top level of one of the query's
@@ -1242,10 +1355,11 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	/*
 	 * Per SQL spec, NOT IN is not ordinarily equivalent to an anti-join, so
 	 * that by default we have to fail when under_not.  However, if we can
-	 * prove that the sub-select's output columns are all certainly not NULL,
-	 * then it's safe to convert NOT IN to an anti-join.
+	 * prove that all of the expressions on both sides of the, would be, join
+	 * condition are all certainly not NULL, then it's safe to convert NOT IN
+	 * to an anti-join.
 	 */
-	if (under_not && !query_outputs_are_not_nullable(subselect))
+	if (under_not && !is_NOTANY_compatible_with_antijoin(parse, sublink))
 		return NULL;
 
 	/*
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index f8e3eaa..dd02327 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -1994,9 +1994,9 @@ find_forced_null_var(Node *node)
 }
 
 /*
- * query_outputs_are_not_nullable
- *		Returns TRUE if the output values of the Query are certainly not NULL.
- *		All output columns must return non-NULL to answer TRUE.
+ * expressions_are_not_nullable
+ *		Returns TRUE if each Expr in the expression list is certainly not NULL.
+ *		All Exprs must return non-NULL to answer TRUE.
  *
  * The reason this takes a Query, and not just an individual tlist expression,
  * is so that we can make use of the query's WHERE/ON clauses to prove it does
@@ -2012,7 +2012,7 @@ find_forced_null_var(Node *node)
  * side of conservatism: if we're not sure, it's okay to return FALSE.
  */
 bool
-query_outputs_are_not_nullable(Query *query)
+expressions_are_not_nullable(Query *query, List *exprlist)
 {
 	PlannerInfo subroot;
 	Relids		innerjoined_rels = NULL;
@@ -2020,7 +2020,7 @@ query_outputs_are_not_nullable(Query *query)
 	List	   *usable_quals = NIL;
 	List	   *nonnullable_vars = NIL;
 	bool		computed_nonnullable_vars = false;
-	ListCell   *tl;
+	ListCell   *lc;
 
 	/*
 	 * If the query contains set operations, punt.  The set ops themselves
@@ -2044,16 +2044,11 @@ query_outputs_are_not_nullable(Query *query)
 	subroot.parse = query;
 
 	/*
-	 * Examine each targetlist entry to prove that it can't produce NULL.
+	 * Examine each Expr to prove that it can't produce NULL.
 	 */
-	foreach(tl, query->targetList)
+	foreach(lc, exprlist)
 	{
-		TargetEntry *tle = (TargetEntry *) lfirst(tl);
-		Expr	   *expr = tle->expr;
-
-		/* Resjunk columns can be ignored: they don't produce output values */
-		if (tle->resjunk)
-			continue;
+		Expr *expr = (Expr *) lfirst(lc);
 
 		/*
 		 * For the most part we don't try to deal with anything more complex
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 5b25d01..90f745c 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -69,7 +69,7 @@ extern Relids find_nonnullable_rels(Node *clause);
 extern List *find_nonnullable_vars(Node *clause);
 extern List *find_forced_null_vars(Node *clause);
 extern Var *find_forced_null_var(Node *clause);
-extern bool query_outputs_are_not_nullable(Query *query);
+extern bool expressions_are_not_nullable(Query *query, List *exprlist);
 
 extern bool is_pseudo_constant_clause(Node *clause);
 extern bool is_pseudo_constant_clause_relids(Node *clause, Relids relids);
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index b63f7ac..610c175 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -805,24 +805,96 @@ select nextval('ts1');
 
 --
 -- Check NOT IN performs an ANTI JOIN when NULLs are not possible
--- in the target list of the subquery.
+-- on either side of the, would be, join condition.
 --
 BEGIN;
 CREATE TEMP TABLE a (id INT PRIMARY KEY);
 CREATE TEMP TABLE b (x INT NOT NULL, y INT);
 CREATE TEMP TABLE c (z INT NOT NULL);
--- ANTI JOIN. x is defined as NOT NULL
+INSERT INTO b VALUES(1,1),(2,2),(3,NULL);
+-- No ANTI JOIN, b.x is from an outer join
 EXPLAIN (COSTS OFF)
-SELECT * FROM a WHERE id NOT IN (SELECT x FROM b);
-               QUERY PLAN                
------------------------------------------
- Merge Anti Join
-   Merge Cond: (a.id = b.x)
-   ->  Index Only Scan using a_pkey on a
-   ->  Sort
-         Sort Key: b.x
+SELECT * FROM a 
+LEFT OUTER JOIN b ON a.id = b.x
+WHERE b.x NOT IN(SELECT z FROM c);
+             QUERY PLAN             
+------------------------------------
+ Hash Right Join
+   Hash Cond: (b.x = a.id)
+   Filter: (NOT (hashed SubPlan 1))
+   ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+   SubPlan 1
+     ->  Seq Scan on c
+(8 rows)
+
+-- ANTI JOIN, b.x is from an outer join but b.x > 100
+-- forces the join not to produce NULL on the righthand
+-- side.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a 
+LEFT OUTER JOIN b ON a.id = b.x 
+WHERE b.x NOT IN(SELECT z FROM c)
+  AND b.x > 100;
+           QUERY PLAN            
+---------------------------------
+ Hash Join
+   Hash Cond: (b.x = a.id)
+   ->  Hash Anti Join
+         Hash Cond: (b.x = c.z)
          ->  Seq Scan on b
-(6 rows)
+               Filter: (x > 100)
+         ->  Hash
+               ->  Seq Scan on c
+   ->  Hash
+         ->  Seq Scan on a
+(10 rows)
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a 
+FULL OUTER JOIN b ON a.id = b.x 
+WHERE b.x NOT IN(SELECT y FROM c);
+         QUERY PLAN          
+-----------------------------
+ Hash Full Join
+   Hash Cond: (b.x = a.id)
+   Filter: (NOT (SubPlan 1))
+   ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+   SubPlan 1
+     ->  Seq Scan on c
+(8 rows)
+
+-- No ANTI JOIN. y can have NULLs
+EXPLAIN (COSTS OFF)
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on b
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on c
+(4 rows)
+
+-- c is an empty relation so should cause no filtering on b
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
+ x | y 
+---+---
+ 1 | 1
+ 2 | 2
+ 3 |  
+(3 rows)
+
+INSERT INTO c VALUES(1);
+-- Records where y is NULL should be filtered out.
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
+ x | y 
+---+---
+ 2 | 2
+(1 row)
 
 -- No ANTI JOIN, y can be NULL
 EXPLAIN (COSTS OFF)
@@ -988,7 +1060,7 @@ SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b RIGHT JOIN c ON b.x = c.z);
                  ->  Seq Scan on c
 (11 rows)
 
--- No ANTI JOIN, c.z is not from an outer join
+-- No ANTI JOIN, c.z is from an outer join
 EXPLAIN (COSTS OFF)
 SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b FULL JOIN c ON b.x = c.z);
              QUERY PLAN             
@@ -1080,23 +1152,22 @@ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHER
                            Filter: (z IS NOT NULL)
 (13 rows)
 
-ALTER TABLE c ADD COLUMN x INT;
--- ANTI JOIN, x cannot be NULL as b.x has a NOT NULL constraint
+-- ANTI JOIN, b.y cannot be NULL due to the join condition b.y = c.z
 EXPLAIN (COSTS OFF)
-SELECT * FROM a WHERE id NOT IN (SELECT x FROM b NATURAL JOIN c);
+SELECT * FROM a WHERE id NOT IN (SELECT b.y FROM b INNER JOIN c ON b.y = c.z);
                QUERY PLAN                
 -----------------------------------------
  Merge Anti Join
-   Merge Cond: (a.id = b.x)
+   Merge Cond: (a.id = b.y)
    ->  Index Only Scan using a_pkey on a
    ->  Materialize
          ->  Merge Join
-               Merge Cond: (b.x = c.x)
+               Merge Cond: (b.y = c.z)
                ->  Sort
-                     Sort Key: b.x
+                     Sort Key: b.y
                      ->  Seq Scan on b
                ->  Sort
-                     Sort Key: c.x
+                     Sort Key: c.z
                      ->  Seq Scan on c
 (12 rows)
 
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index c3ca67f..c4f9102 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -447,7 +447,7 @@ select nextval('ts1');
 
 --
 -- Check NOT IN performs an ANTI JOIN when NULLs are not possible
--- in the target list of the subquery.
+-- on either side of the, would be, join condition.
 --
 
 BEGIN;
@@ -456,9 +456,40 @@ CREATE TEMP TABLE a (id INT PRIMARY KEY);
 CREATE TEMP TABLE b (x INT NOT NULL, y INT);
 CREATE TEMP TABLE c (z INT NOT NULL);
 
--- ANTI JOIN. x is defined as NOT NULL
+INSERT INTO b VALUES(1,1),(2,2),(3,NULL);
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a
+LEFT OUTER JOIN b ON a.id = b.x
+WHERE b.x NOT IN(SELECT z FROM c);
+
+-- ANTI JOIN, b.x is from an outer join but b.x > 100
+-- forces the join not to produce NULL on the righthand
+-- side.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a
+LEFT OUTER JOIN b ON a.id = b.x
+WHERE b.x NOT IN(SELECT z FROM c)
+  AND b.x > 100;
+
+-- No ANTI JOIN, b.x is from an outer join
 EXPLAIN (COSTS OFF)
-SELECT * FROM a WHERE id NOT IN (SELECT x FROM b);
+SELECT * FROM a
+FULL OUTER JOIN b ON a.id = b.x
+WHERE b.x NOT IN(SELECT y FROM c);
+
+-- No ANTI JOIN. y can have NULLs
+EXPLAIN (COSTS OFF)
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
+
+-- c is an empty relation so should cause no filtering on b
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
+
+INSERT INTO c VALUES(1);
+
+-- Records where y is NULL should be filtered out.
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
 
 -- No ANTI JOIN, y can be NULL
 EXPLAIN (COSTS OFF)
@@ -508,7 +539,7 @@ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b RIGHT JOIN c ON b.x = c.z);
 EXPLAIN (COSTS OFF)
 SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b RIGHT JOIN c ON b.x = c.z);
 
--- No ANTI JOIN, c.z is not from an outer join
+-- No ANTI JOIN, c.z is from an outer join
 EXPLAIN (COSTS OFF)
 SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b FULL JOIN c ON b.x = c.z);
 
@@ -528,11 +559,8 @@ SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHER
 EXPLAIN (COSTS OFF)
 SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z IS NOT NULL);
 
-ALTER TABLE c ADD COLUMN x INT;
-
--- ANTI JOIN, x cannot be NULL as b.x has a NOT NULL constraint
+-- ANTI JOIN, b.y cannot be NULL due to the join condition b.y = c.z
 EXPLAIN (COSTS OFF)
-SELECT * FROM a WHERE id NOT IN (SELECT x FROM b NATURAL JOIN c);
-
+SELECT * FROM a WHERE id NOT IN (SELECT b.y FROM b INNER JOIN c ON b.y = c.z);
 
 ROLLBACK;
not_in_anti_join_v0.9.patchapplication/octet-stream; name=not_in_anti_join_v0.9.patchDownload
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3e7dc85..6ebab4d 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -70,6 +70,8 @@ static Node *convert_testexpr_mutator(Node *node,
 static bool subplan_is_hashable(Plan *plan);
 static bool testexpr_is_hashable(Node *testexpr);
 static bool hash_ok_operator(OpExpr *expr);
+static bool is_NOTANY_compatible_with_antijoin(Query *outerquery,
+				 SubLink *sublink);
 static bool simplify_EXISTS_query(Query *query);
 static Query *convert_EXISTS_to_ANY(PlannerInfo *root, Query *subselect,
 					  Node **testexpr, List **paramIds);
@@ -1187,6 +1189,117 @@ SS_process_ctes(PlannerInfo *root)
 }
 
 /*
+ * is_NOTANY_compatible_with_antijoin
+ *		True if the NOT IN sublink can be safely converted into an ANTI JOIN.
+ *		Per SQL spec, NOT IN is not ordinarily equivalent to an anti-join,
+ *		however, if we can prove that all of the expressions on both sides of
+ *		the would be join condition are all certainly not NULL, then it's safe
+ *		to convert NOT IN to an anti-join.
+ *
+ * Note: This function is quite locked into the NOT IN syntax. Certain
+ *       assumptions are made about the structure of the join conditions:
+ *
+ * 1. We assume that when more than 1 join condition exists that these are AND
+ *    type conditions, i.e not OR conditions.
+ *
+ * 2. We assume that each join qual has 2 arguments and the first arguments in
+ *    each join qual is the one that belongs to the outer side of the NOT IN
+ *    clause.
+ */
+static bool
+is_NOTANY_compatible_with_antijoin(Query *outerquery, SubLink *sublink)
+{
+	ListCell *lc;
+	List	 *outerexpr;
+	List	 *innerexpr;
+	Node	 *testexpr = sublink->testexpr;
+	Query	 *subselect = (Query *) sublink->subselect;
+
+	/*
+	 * We must build 2 lists of expressions, one for the outer query side and
+	 * one for the subquery/inner side of the NOT IN clause. It is these lists
+	 * that we pass to to expressions_are_not_nullable to allow it to determine
+	 * if each expression cannot contain NULL values or not. We'll try and be
+	 * as lazy about this as possible and we won't bother generating the 2nd
+	 * list if the first list can't be proved to be free from null-able
+	 * expressions. The order that we checks these lists does not really matter
+	 * as neither one seems to be more likely to allow us to exit from this
+	 * function more quickly.
+	 */
+
+	/* if it's a single expression */
+	if (IsA(testexpr, OpExpr))
+	{
+		OpExpr *opexpr = (OpExpr *) testexpr;
+
+		Assert(list_length(opexpr->args) == 2);
+
+		outerexpr = lappend(NIL, linitial(opexpr->args));
+	}
+
+	/* multiple expressions ANDed together */
+	else if (IsA(testexpr, BoolExpr))
+	{
+		List	 *list = ((BoolExpr *) testexpr)->args;
+
+		outerexpr = NIL;
+
+		 /* loop through each expression appending to the list each iteration. */
+		foreach(lc, list)
+		{
+			OpExpr *opexpr = (OpExpr *) lfirst(lc);
+
+			Assert(list_length(opexpr->args) == 2);
+
+			outerexpr = lappend(outerexpr, linitial(opexpr->args));
+		}
+	}
+	else
+		elog(ERROR, "unrecognized node type: %d",
+					(int) nodeTag(testexpr));
+
+	if (outerexpr == NIL)
+		elog(ERROR, "unexpected empty clause");
+
+	/*
+	 * Check the expressions being used in the outer query side of the NOT IN
+	 * clause to ensure NULLs are not possible.
+	 */
+	if (!expressions_are_not_nullable(outerquery, outerexpr))
+		return false;
+
+	/*
+	 * Now we check to ensure each TargetEntry in the subquery can be proved to
+	 * never be NULL.
+	 */
+
+	innerexpr = NIL;
+
+	foreach(lc, subselect->targetList)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(lc);
+
+		/* Resjunk columns can be ignored: they don't produce output values */
+		if (tle->resjunk)
+			continue;
+
+		innerexpr = lappend(innerexpr, tle->expr);
+	}
+
+	if (innerexpr == NIL)
+		elog(ERROR, "unexpected empty clause");
+
+	/*
+	 * Check if all the subquery's targetlist columns can be proved to be not
+	 * null.
+	 */
+	if (!expressions_are_not_nullable(subselect, innerexpr))
+		return false;
+
+	return true; /* supports ANTI JOIN */
+}
+
+/*
  * convert_ANY_sublink_to_join: try to convert an ANY SubLink to a join
  *
  * The caller has found an ANY SubLink at the top level of one of the query's
@@ -1195,11 +1308,14 @@ SS_process_ctes(PlannerInfo *root)
  * If so, form a JoinExpr and return it.  Return NULL if the SubLink cannot
  * be converted to a join.
  *
- * The only non-obvious input parameter is available_rels: this is the set
- * of query rels that can safely be referenced in the sublink expression.
- * (We must restrict this to avoid changing the semantics when a sublink
- * is present in an outer join's ON qual.)  The conversion must fail if
- * the converted qual would reference any but these parent-query relids.
+ * If under_not is true, the caller actually found NOT (ANY SubLink),
+ * so that what we must try to build is an ANTI not SEMI join.
+ *
+ * available_rels is the set of query rels that can safely be referenced
+ * in the sublink expression.  (We must restrict this to avoid changing the
+ * semantics when a sublink is present in an outer join's ON qual.)
+ * The conversion must fail if the converted qual would reference any but
+ * these parent-query relids.
  *
  * On success, the returned JoinExpr has larg = NULL and rarg = the jointree
  * item representing the pulled-up subquery.  The caller must set larg to
@@ -1222,7 +1338,7 @@ SS_process_ctes(PlannerInfo *root)
  */
 JoinExpr *
 convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
-							Relids available_rels)
+							bool under_not, Relids available_rels)
 {
 	JoinExpr   *result;
 	Query	   *parse = root->parse;
@@ -1237,6 +1353,16 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	Assert(sublink->subLinkType == ANY_SUBLINK);
 
 	/*
+	 * Per SQL spec, NOT IN is not ordinarily equivalent to an anti-join, so
+	 * that by default we have to fail when under_not.  However, if we can
+	 * prove that all of the expressions on both sides of the, would be, join
+	 * condition are all certainly not NULL, then it's safe to convert NOT IN
+	 * to an anti-join.
+	 */
+	if (under_not && !is_NOTANY_compatible_with_antijoin(parse, sublink))
+		return NULL;
+
+	/*
 	 * The sub-select must not refer to any Vars of the parent query. (Vars of
 	 * higher levels should be okay, though.)
 	 */
@@ -1302,7 +1428,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	 * And finally, build the JoinExpr node.
 	 */
 	result = makeNode(JoinExpr);
-	result->jointype = JOIN_SEMI;
+	result->jointype = under_not ? JOIN_ANTI : JOIN_SEMI;
 	result->isNatural = false;
 	result->larg = NULL;		/* caller must fill this in */
 	result->rarg = (Node *) rtr;
@@ -1317,9 +1443,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 /*
  * convert_EXISTS_sublink_to_join: try to convert an EXISTS SubLink to a join
  *
- * The API of this function is identical to convert_ANY_sublink_to_join's,
- * except that we also support the case where the caller has found NOT EXISTS,
- * so we need an additional input parameter "under_not".
+ * The API of this function is identical to convert_ANY_sublink_to_join's.
  */
 JoinExpr *
 convert_EXISTS_sublink_to_join(PlannerInfo *root, SubLink *sublink,
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 9cb1378..3a116c7 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -334,7 +334,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 		/* Is it a convertible ANY or EXISTS clause? */
 		if (sublink->subLinkType == ANY_SUBLINK)
 		{
-			if ((j = convert_ANY_sublink_to_join(root, sublink,
+			if ((j = convert_ANY_sublink_to_join(root, sublink, false,
 												 available_rels1)) != NULL)
 			{
 				/* Yes; insert the new join node into the join tree */
@@ -360,7 +360,7 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 				return NULL;
 			}
 			if (available_rels2 != NULL &&
-				(j = convert_ANY_sublink_to_join(root, sublink,
+				(j = convert_ANY_sublink_to_join(root, sublink, false,
 												 available_rels2)) != NULL)
 			{
 				/* Yes; insert the new join node into the join tree */
@@ -445,14 +445,68 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
 	}
 	if (not_clause(node))
 	{
-		/* If the immediate argument of NOT is EXISTS, try to convert */
+		/* If the immediate argument of NOT is ANY or EXISTS, try to convert */
 		SubLink    *sublink = (SubLink *) get_notclausearg((Expr *) node);
 		JoinExpr   *j;
 		Relids		child_rels;
 
 		if (sublink && IsA(sublink, SubLink))
 		{
-			if (sublink->subLinkType == EXISTS_SUBLINK)
+			if (sublink->subLinkType == ANY_SUBLINK)
+			{
+				if ((j = convert_ANY_sublink_to_join(root, sublink, true,
+												   available_rels1)) != NULL)
+				{
+					/* Yes; insert the new join node into the join tree */
+					j->larg = *jtlink1;
+					*jtlink1 = (Node *) j;
+					/* Recursively process pulled-up jointree nodes */
+					j->rarg = pull_up_sublinks_jointree_recurse(root,
+																j->rarg,
+																&child_rels);
+
+					/*
+					 * Now recursively process the pulled-up quals.  Because
+					 * we are underneath a NOT, we can't pull up sublinks that
+					 * reference the left-hand stuff, but it's still okay to
+					 * pull up sublinks referencing j->rarg.
+					 */
+					j->quals = pull_up_sublinks_qual_recurse(root,
+															 j->quals,
+															 &j->rarg,
+															 child_rels,
+															 NULL, NULL);
+					/* Return NULL representing constant TRUE */
+					return NULL;
+				}
+				if (available_rels2 != NULL &&
+					(j = convert_ANY_sublink_to_join(root, sublink, true,
+												   available_rels2)) != NULL)
+				{
+					/* Yes; insert the new join node into the join tree */
+					j->larg = *jtlink2;
+					*jtlink2 = (Node *) j;
+					/* Recursively process pulled-up jointree nodes */
+					j->rarg = pull_up_sublinks_jointree_recurse(root,
+																j->rarg,
+																&child_rels);
+
+					/*
+					 * Now recursively process the pulled-up quals.  Because
+					 * we are underneath a NOT, we can't pull up sublinks that
+					 * reference the left-hand stuff, but it's still okay to
+					 * pull up sublinks referencing j->rarg.
+					 */
+					j->quals = pull_up_sublinks_qual_recurse(root,
+															 j->quals,
+															 &j->rarg,
+															 child_rels,
+															 NULL, NULL);
+					/* Return NULL representing constant TRUE */
+					return NULL;
+				}
+			}
+			else if (sublink->subLinkType == EXISTS_SUBLINK)
 			{
 				if ((j = convert_EXISTS_sublink_to_join(root, sublink, true,
 												   available_rels1)) != NULL)
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index 19b5cf7..dd02327 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -40,6 +40,7 @@
 #include "parser/parse_agg.h"
 #include "parser/parse_coerce.h"
 #include "parser/parse_func.h"
+#include "parser/parsetree.h"
 #include "rewrite/rewriteManip.h"
 #include "tcop/tcopprot.h"
 #include "utils/acl.h"
@@ -100,6 +101,8 @@ static bool contain_nonstrict_functions_walker(Node *node, void *context);
 static bool contain_leaky_functions_walker(Node *node, void *context);
 static Relids find_nonnullable_rels_walker(Node *node, bool top_level);
 static List *find_nonnullable_vars_walker(Node *node, bool top_level);
+static void find_innerjoined_rels(Node *jtnode,
+					  Relids *innerjoined_rels, List **usable_quals);
 static bool is_strict_saop(ScalarArrayOpExpr *expr, bool falseOK);
 static Node *eval_const_expressions_mutator(Node *node,
 							   eval_const_expressions_context *context);
@@ -1459,6 +1462,10 @@ contain_leaky_functions_walker(Node *node, void *context)
 								  context);
 }
 
+/*****************************************************************************
+ *		  Nullability analysis
+ *****************************************************************************/
+
 /*
  * find_nonnullable_rels
  *		Determine which base rels are forced nonnullable by given clause.
@@ -1685,7 +1692,7 @@ find_nonnullable_rels_walker(Node *node, bool top_level)
  * but here we assume that the input is a Boolean expression, and wish to
  * see if NULL inputs will provably cause a FALSE-or-NULL result.  We expect
  * the expression to have been AND/OR flattened and converted to implicit-AND
- * format.
+ * format (but the results are still good if it wasn't AND/OR flattened).
  *
  * The result is a palloc'd List, but we have not copied the member Var nodes.
  * Also, we don't bother trying to eliminate duplicate entries.
@@ -1987,6 +1994,243 @@ find_forced_null_var(Node *node)
 }
 
 /*
+ * expressions_are_not_nullable
+ *		Returns TRUE if each Expr in the expression list is certainly not NULL.
+ *		All Exprs must return non-NULL to answer TRUE.
+ *
+ * The reason this takes a Query, and not just an individual tlist expression,
+ * is so that we can make use of the query's WHERE/ON clauses to prove it does
+ * not return nulls.
+ *
+ * In current usage, the passed sub-Query hasn't yet been through any planner
+ * processing.  This means that applying find_nonnullable_vars() to its WHERE
+ * clauses isn't really ideal: for lack of const-simplification, we might be
+ * unable to prove not-nullness in some cases where we could have proved it
+ * afterwards.  However, we should not get any false positive results.
+ *
+ * Like the other forms of nullability analysis above, we can err on the
+ * side of conservatism: if we're not sure, it's okay to return FALSE.
+ */
+bool
+expressions_are_not_nullable(Query *query, List *exprlist)
+{
+	PlannerInfo subroot;
+	Relids		innerjoined_rels = NULL;
+	bool		computed_innerjoined_rels = false;
+	List	   *usable_quals = NIL;
+	List	   *nonnullable_vars = NIL;
+	bool		computed_nonnullable_vars = false;
+	ListCell   *lc;
+
+	/*
+	 * If the query contains set operations, punt.  The set ops themselves
+	 * couldn't introduce nulls that weren't in their inputs, but the tlist
+	 * present in the top-level query is just dummy and won't give us useful
+	 * info.  We could get an answer by recursing to examine each leaf query,
+	 * but for the moment it doesn't seem worth the extra complication.
+	 *
+	 * Note that we needn't consider other top-level operators such as
+	 * DISTINCT, GROUP BY, etc, as those will not introduce nulls either.
+	 */
+	if (query->setOperations)
+		return false;
+
+	/*
+	 * We need a PlannerInfo to pass to flatten_join_alias_vars.  Fortunately,
+	 * we can cons up an entirely dummy one, because only the "parse" link in
+	 * the struct is used by flatten_join_alias_vars.
+	 */
+	MemSet(&subroot, 0, sizeof(subroot));
+	subroot.parse = query;
+
+	/*
+	 * Examine each Expr to prove that it can't produce NULL.
+	 */
+	foreach(lc, exprlist)
+	{
+		Expr *expr = (Expr *) lfirst(lc);
+
+		/*
+		 * For the most part we don't try to deal with anything more complex
+		 * than Consts and Vars; but it seems worthwhile to look through
+		 * binary relabelings, since we know those don't introduce nulls.
+		 */
+		while (expr && IsA(expr, RelabelType))
+			expr = ((RelabelType *) expr)->arg;
+
+		if (expr == NULL)		/* paranoia */
+			return false;
+
+		if (IsA(expr, Const))
+		{
+			/* Consts are easy: they're either null or not. */
+			if (((Const *) expr)->constisnull)
+				return false;
+		}
+		else if (IsA(expr, Var))
+		{
+			Var		   *var = (Var *) expr;
+
+			/* Currently, we punt for any nonlocal Vars */
+			if (var->varlevelsup != 0)
+				return false;
+
+			/*
+			 * Since the subquery hasn't yet been through expression
+			 * preprocessing, we must apply flatten_join_alias_vars to the
+			 * given Var, and to any Vars found by find_nonnullable_vars, to
+			 * avoid being fooled by join aliases.  If we get something other
+			 * than a plain Var out of the substitution, punt.
+			 */
+			var = (Var *) flatten_join_alias_vars(&subroot, (Node *) var);
+
+			if (!IsA(var, Var))
+				return false;
+			Assert(var->varlevelsup == 0);
+
+			/*
+			 * We don't bother to compute innerjoined_rels and usable_quals
+			 * until we've found a Var we must analyze.
+			 */
+			if (!computed_innerjoined_rels)
+			{
+				find_innerjoined_rels((Node *) query->jointree,
+									  &innerjoined_rels, &usable_quals);
+				computed_innerjoined_rels = true;
+			}
+
+			/*
+			 * If Var is from a plain relation, and that relation is not on
+			 * the nullable side of any outer join, and its column is marked
+			 * NOT NULL according to the catalogs, it can't produce NULL.
+			 */
+			if (bms_is_member(var->varno, innerjoined_rels))
+			{
+				RangeTblEntry *rte = rt_fetch(var->varno, query->rtable);
+
+				if (rte->rtekind == RTE_RELATION &&
+					get_attnotnull(rte->relid, var->varattno))
+					continue;	/* cannot produce NULL */
+			}
+
+			/*
+			 * Even if that didn't work, we can conclude that the Var is not
+			 * nullable if find_nonnullable_vars can find a "var IS NOT NULL"
+			 * or similarly strict condition among the usable_quals.  Compute
+			 * the list of Vars having such quals if we didn't already.
+			 */
+			if (!computed_nonnullable_vars)
+			{
+				nonnullable_vars = find_nonnullable_vars((Node *) usable_quals);
+				nonnullable_vars = (List *)
+					flatten_join_alias_vars(&subroot,
+											(Node *) nonnullable_vars);
+				/* We don't bother removing any non-Vars from the result */
+				computed_nonnullable_vars = true;
+			}
+
+			if (!list_member(nonnullable_vars, var))
+				return false;	/* we failed to prove the Var non-null */
+		}
+		else
+		{
+			/* Not a Const or Var; punt */
+			return false;
+		}
+	}
+
+	return true;				/* query cannot emit NULLs */
+}
+
+/*
+ * find_innerjoined_rels
+ *		Traverse jointree to locate non-outerjoined-rels and quals above them
+ *
+ * We fill innerjoined_rels with the relids of all rels that are not below
+ * the nullable side of any outer join (which would cause their Vars to be
+ * possibly NULL regardless of what's in the catalogs).  In the same scan,
+ * we locate all WHERE and JOIN/ON quals that constrain these rels, and add
+ * them to the usable_quals list (forming a list with implicit-AND semantics).
+ *
+ * Top-level caller must initialize innerjoined_rels/usable_quals to NULL/NIL.
+ */
+static void
+find_innerjoined_rels(Node *jtnode,
+					  Relids *innerjoined_rels, List **usable_quals)
+{
+	if (jtnode == NULL)
+		return;
+	if (IsA(jtnode, RangeTblRef))
+	{
+		int			varno = ((RangeTblRef *) jtnode)->rtindex;
+
+		*innerjoined_rels = bms_add_member(*innerjoined_rels, varno);
+	}
+	else if (IsA(jtnode, FromExpr))
+	{
+		FromExpr   *f = (FromExpr *) jtnode;
+		ListCell   *lc;
+
+		/* All elements of the FROM list are allowable */
+		foreach(lc, f->fromlist)
+			find_innerjoined_rels((Node *) lfirst(lc),
+								  innerjoined_rels, usable_quals);
+		/* ... and its WHERE quals are too */
+		if (f->quals)
+			*usable_quals = lappend(*usable_quals, f->quals);
+	}
+	else if (IsA(jtnode, JoinExpr))
+	{
+		JoinExpr   *j = (JoinExpr *) jtnode;
+
+		switch (j->jointype)
+		{
+			case JOIN_INNER:
+				/* visit both children */
+				find_innerjoined_rels(j->larg,
+									  innerjoined_rels, usable_quals);
+				find_innerjoined_rels(j->rarg,
+									  innerjoined_rels, usable_quals);
+				/* and grab the ON quals too */
+				if (j->quals)
+					*usable_quals = lappend(*usable_quals, j->quals);
+				break;
+
+			case JOIN_LEFT:
+			case JOIN_SEMI:
+			case JOIN_ANTI:
+
+				/*
+				 * Only the left input is possibly non-nullable; furthermore,
+				 * the quals of this join don't constrain the left input.
+				 * Note: we probably can't see SEMI or ANTI joins at this
+				 * point, but if we do, we can treat them like LEFT joins.
+				 */
+				find_innerjoined_rels(j->larg,
+									  innerjoined_rels, usable_quals);
+				break;
+
+			case JOIN_RIGHT:
+				/* Reverse of the above case */
+				find_innerjoined_rels(j->rarg,
+									  innerjoined_rels, usable_quals);
+				break;
+
+			case JOIN_FULL:
+				/* Neither side is non-nullable, so stop descending */
+				break;
+
+			default:
+				elog(ERROR, "unrecognized join type: %d",
+					 (int) j->jointype);
+		}
+	}
+	else
+		elog(ERROR, "unrecognized node type: %d",
+			 (int) nodeTag(jtnode));
+}
+
+/*
  * Can we treat a ScalarArrayOpExpr as strict?
  *
  * If "falseOK" is true, then a "false" result can be considered strict,
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 4b5ef99..1d581a8 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -972,6 +972,33 @@ get_atttypetypmodcoll(Oid relid, AttrNumber attnum,
 	ReleaseSysCache(tp);
 }
 
+/*
+ * get_attnotnull
+ *
+ *		Given the relation id and the attribute number,
+ *		return the "attnotnull" field from the attribute relation.
+ */
+bool
+get_attnotnull(Oid relid, AttrNumber attnum)
+{
+	HeapTuple	tp;
+
+	tp = SearchSysCache2(ATTNUM,
+						 ObjectIdGetDatum(relid),
+						 Int16GetDatum(attnum));
+	if (HeapTupleIsValid(tp))
+	{
+		Form_pg_attribute att_tup = (Form_pg_attribute) GETSTRUCT(tp);
+		bool		result;
+
+		result = att_tup->attnotnull;
+		ReleaseSysCache(tp);
+		return result;
+	}
+	else
+		return false;
+}
+
 /*				---------- COLLATION CACHE ----------					 */
 
 /*
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index dd991b1..90f745c 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -69,6 +69,7 @@ extern Relids find_nonnullable_rels(Node *clause);
 extern List *find_nonnullable_vars(Node *clause);
 extern List *find_forced_null_vars(Node *clause);
 extern Var *find_forced_null_var(Node *clause);
+extern bool expressions_are_not_nullable(Query *query, List *exprlist);
 
 extern bool is_pseudo_constant_clause(Node *clause);
 extern bool is_pseudo_constant_clause_relids(Node *clause, Relids relids);
diff --git a/src/include/optimizer/subselect.h b/src/include/optimizer/subselect.h
index 5607e98..3e8bfe7 100644
--- a/src/include/optimizer/subselect.h
+++ b/src/include/optimizer/subselect.h
@@ -18,6 +18,7 @@
 extern void SS_process_ctes(PlannerInfo *root);
 extern JoinExpr *convert_ANY_sublink_to_join(PlannerInfo *root,
 							SubLink *sublink,
+							bool under_not,
 							Relids available_rels);
 extern JoinExpr *convert_EXISTS_sublink_to_join(PlannerInfo *root,
 							   SubLink *sublink,
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index f46460a..3ec200a 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -70,6 +70,7 @@ extern Oid	get_atttype(Oid relid, AttrNumber attnum);
 extern int32 get_atttypmod(Oid relid, AttrNumber attnum);
 extern void get_atttypetypmodcoll(Oid relid, AttrNumber attnum,
 					  Oid *typid, int32 *typmod, Oid *collid);
+extern bool get_attnotnull(Oid relid, AttrNumber attnum);
 extern char *get_collation_name(Oid colloid);
 extern char *get_constraint_name(Oid conoid);
 extern Oid	get_opclass_family(Oid opclass);
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index d85a717..610c175 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -803,3 +803,372 @@ select nextval('ts1');
       11
 (1 row)
 
+--
+-- Check NOT IN performs an ANTI JOIN when NULLs are not possible
+-- on either side of the, would be, join condition.
+--
+BEGIN;
+CREATE TEMP TABLE a (id INT PRIMARY KEY);
+CREATE TEMP TABLE b (x INT NOT NULL, y INT);
+CREATE TEMP TABLE c (z INT NOT NULL);
+INSERT INTO b VALUES(1,1),(2,2),(3,NULL);
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a 
+LEFT OUTER JOIN b ON a.id = b.x
+WHERE b.x NOT IN(SELECT z FROM c);
+             QUERY PLAN             
+------------------------------------
+ Hash Right Join
+   Hash Cond: (b.x = a.id)
+   Filter: (NOT (hashed SubPlan 1))
+   ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+   SubPlan 1
+     ->  Seq Scan on c
+(8 rows)
+
+-- ANTI JOIN, b.x is from an outer join but b.x > 100
+-- forces the join not to produce NULL on the righthand
+-- side.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a 
+LEFT OUTER JOIN b ON a.id = b.x 
+WHERE b.x NOT IN(SELECT z FROM c)
+  AND b.x > 100;
+           QUERY PLAN            
+---------------------------------
+ Hash Join
+   Hash Cond: (b.x = a.id)
+   ->  Hash Anti Join
+         Hash Cond: (b.x = c.z)
+         ->  Seq Scan on b
+               Filter: (x > 100)
+         ->  Hash
+               ->  Seq Scan on c
+   ->  Hash
+         ->  Seq Scan on a
+(10 rows)
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a 
+FULL OUTER JOIN b ON a.id = b.x 
+WHERE b.x NOT IN(SELECT y FROM c);
+         QUERY PLAN          
+-----------------------------
+ Hash Full Join
+   Hash Cond: (b.x = a.id)
+   Filter: (NOT (SubPlan 1))
+   ->  Seq Scan on b
+   ->  Hash
+         ->  Seq Scan on a
+   SubPlan 1
+     ->  Seq Scan on c
+(8 rows)
+
+-- No ANTI JOIN. y can have NULLs
+EXPLAIN (COSTS OFF)
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on b
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on c
+(4 rows)
+
+-- c is an empty relation so should cause no filtering on b
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
+ x | y 
+---+---
+ 1 | 1
+ 2 | 2
+ 3 |  
+(3 rows)
+
+INSERT INTO c VALUES(1);
+-- Records where y is NULL should be filtered out.
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
+ x | y 
+---+---
+ 2 | 2
+(1 row)
+
+-- No ANTI JOIN, y can be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- No ANTI JOIN, x is NOT NULL, but we don't know if + 1 will change that.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x+1 FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- ANTI JOIN 1 is a Const that is not null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT 1 FROM b);
+        QUERY PLAN         
+---------------------------
+ Nested Loop Anti Join
+   Join Filter: (a.id = 1)
+   ->  Seq Scan on a
+   ->  Materialize
+         ->  Seq Scan on b
+(5 rows)
+
+-- No ANTI JOIN, results contain a NULL Const
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT NULL::int FROM b);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+(4 rows)
+
+-- ANTI JOIN y = 1 means y can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1);
+          QUERY PLAN           
+-------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: (y = 1)
+(6 rows)
+
+-- No ANTI JOIN, OR condition does not ensure y = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1 OR x = 1);
+               QUERY PLAN               
+----------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+           Filter: ((y = 1) OR (x = 1))
+(5 rows)
+
+-- No ANTI JOIN, OR condition does not ensure y = 1 or y = 2
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND (y = 2 OR x = 2));
+                            QUERY PLAN                             
+-------------------------------------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Seq Scan on b
+           Filter: (((y = 1) OR (x = 1)) AND ((y = 2) OR (x = 2)))
+(5 rows)
+
+-- ANTI JOIN y must be 2, so can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND y = 2);
+                        QUERY PLAN                        
+----------------------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: ((y = 2) AND ((y = 1) OR (x = 1)))
+(6 rows)
+
+-- ANTI JOIN y can be 1 or 2, but can't be null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR y = 2));
+                 QUERY PLAN                 
+--------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = b.y)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Seq Scan on b
+               Filter: ((y = 1) OR (y = 2))
+(6 rows)
+
+-- No ANTI JOIN c.z is from a left outer join so it can have nulls.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Merge Left Join
+           Merge Cond: (b.x = c.z)
+           ->  Sort
+                 Sort Key: b.x
+                 ->  Seq Scan on b
+           ->  Sort
+                 Sort Key: c.z
+                 ->  Seq Scan on c
+(11 rows)
+
+-- ANTI JOIN, c.z is not from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b RIGHT JOIN c ON b.x = c.z);
+                     QUERY PLAN                      
+-----------------------------------------------------
+ Merge Left Join
+   Merge Cond: (c.z = b.x)
+   ->  Sort
+         Sort Key: c.z
+         ->  Merge Anti Join
+               Merge Cond: (a.id = c.z)
+               ->  Index Only Scan using a_pkey on a
+               ->  Sort
+                     Sort Key: c.z
+                     ->  Seq Scan on c
+   ->  Sort
+         Sort Key: b.x
+         ->  Seq Scan on b
+(13 rows)
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b RIGHT JOIN c ON b.x = c.z);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Merge Right Join
+           Merge Cond: (b.x = c.z)
+           ->  Sort
+                 Sort Key: b.x
+                 ->  Seq Scan on b
+           ->  Sort
+                 Sort Key: c.z
+                 ->  Seq Scan on c
+(11 rows)
+
+-- No ANTI JOIN, c.z is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b FULL JOIN c ON b.x = c.z);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Merge Full Join
+           Merge Cond: (b.x = c.z)
+           ->  Sort
+                 Sort Key: b.x
+                 ->  Seq Scan on b
+           ->  Sort
+                 Sort Key: c.z
+                 ->  Seq Scan on c
+(11 rows)
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b FULL JOIN c ON b.x = c.z);
+             QUERY PLAN             
+------------------------------------
+ Seq Scan on a
+   Filter: (NOT (hashed SubPlan 1))
+   SubPlan 1
+     ->  Merge Full Join
+           Merge Cond: (b.x = c.z)
+           ->  Sort
+                 Sort Key: b.x
+                 ->  Seq Scan on b
+           ->  Sort
+                 Sort Key: c.z
+                 ->  Seq Scan on c
+(11 rows)
+
+-- ANTI JOIN, c.z is from an inner join and has a NOT NULL constraint.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b INNER JOIN c ON b.x = c.z);
+               QUERY PLAN                
+-----------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = c.z)
+   ->  Index Only Scan using a_pkey on a
+   ->  Materialize
+         ->  Merge Join
+               Merge Cond: (b.x = c.z)
+               ->  Sort
+                     Sort Key: b.x
+                     ->  Seq Scan on b
+               ->  Sort
+                     Sort Key: c.z
+                     ->  Seq Scan on c
+(12 rows)
+
+-- ANTI JOIN, c.z must be 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z = 1);
+                QUERY PLAN                 
+-------------------------------------------
+ Hash Anti Join
+   Hash Cond: (a.id = c.z)
+   ->  Seq Scan on a
+   ->  Hash
+         ->  Nested Loop
+               ->  Seq Scan on c
+                     Filter: (z = 1)
+               ->  Materialize
+                     ->  Seq Scan on b
+                           Filter: (x = 1)
+(10 rows)
+
+-- ANTI JOIN, c.z can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z IS NOT NULL);
+                    QUERY PLAN                     
+---------------------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = c.z)
+   ->  Index Only Scan using a_pkey on a
+   ->  Materialize
+         ->  Merge Join
+               Merge Cond: (b.x = c.z)
+               ->  Sort
+                     Sort Key: b.x
+                     ->  Seq Scan on b
+               ->  Sort
+                     Sort Key: c.z
+                     ->  Seq Scan on c
+                           Filter: (z IS NOT NULL)
+(13 rows)
+
+-- ANTI JOIN, b.y cannot be NULL due to the join condition b.y = c.z
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT b.y FROM b INNER JOIN c ON b.y = c.z);
+               QUERY PLAN                
+-----------------------------------------
+ Merge Anti Join
+   Merge Cond: (a.id = b.y)
+   ->  Index Only Scan using a_pkey on a
+   ->  Materialize
+         ->  Merge Join
+               Merge Cond: (b.y = c.z)
+               ->  Sort
+                     Sort Key: b.y
+                     ->  Seq Scan on b
+               ->  Sort
+                     Sort Key: c.z
+                     ->  Seq Scan on c
+(12 rows)
+
+ROLLBACK;
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index c3b4773..c4f9102 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -444,3 +444,123 @@ select * from
   order by 1;
 
 select nextval('ts1');
+
+--
+-- Check NOT IN performs an ANTI JOIN when NULLs are not possible
+-- on either side of the, would be, join condition.
+--
+
+BEGIN;
+
+CREATE TEMP TABLE a (id INT PRIMARY KEY);
+CREATE TEMP TABLE b (x INT NOT NULL, y INT);
+CREATE TEMP TABLE c (z INT NOT NULL);
+
+INSERT INTO b VALUES(1,1),(2,2),(3,NULL);
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a
+LEFT OUTER JOIN b ON a.id = b.x
+WHERE b.x NOT IN(SELECT z FROM c);
+
+-- ANTI JOIN, b.x is from an outer join but b.x > 100
+-- forces the join not to produce NULL on the righthand
+-- side.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a
+LEFT OUTER JOIN b ON a.id = b.x
+WHERE b.x NOT IN(SELECT z FROM c)
+  AND b.x > 100;
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a
+FULL OUTER JOIN b ON a.id = b.x
+WHERE b.x NOT IN(SELECT y FROM c);
+
+-- No ANTI JOIN. y can have NULLs
+EXPLAIN (COSTS OFF)
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
+
+-- c is an empty relation so should cause no filtering on b
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
+
+INSERT INTO c VALUES(1);
+
+-- Records where y is NULL should be filtered out.
+SELECT * FROM b WHERE y NOT IN (SELECT z FROM c);
+
+-- No ANTI JOIN, y can be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b);
+
+-- No ANTI JOIN, x is NOT NULL, but we don't know if + 1 will change that.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT x+1 FROM b);
+
+-- ANTI JOIN 1 is a Const that is not null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT 1 FROM b);
+
+-- No ANTI JOIN, results contain a NULL Const
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT NULL::int FROM b);
+
+-- ANTI JOIN y = 1 means y can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1);
+
+-- No ANTI JOIN, OR condition does not ensure y = 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE y = 1 OR x = 1);
+
+-- No ANTI JOIN, OR condition does not ensure y = 1 or y = 2
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND (y = 2 OR x = 2));
+
+-- ANTI JOIN y must be 2, so can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR x = 1) AND y = 2);
+
+-- ANTI JOIN y can be 1 or 2, but can't be null.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT y FROM b WHERE (y = 1 OR y = 2));
+
+-- No ANTI JOIN c.z is from a left outer join so it can have nulls.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z);
+
+-- ANTI JOIN, c.z is not from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b RIGHT JOIN c ON b.x = c.z);
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b RIGHT JOIN c ON b.x = c.z);
+
+-- No ANTI JOIN, c.z is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b FULL JOIN c ON b.x = c.z);
+
+-- No ANTI JOIN, b.x is from an outer join
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT b.x FROM b FULL JOIN c ON b.x = c.z);
+
+-- ANTI JOIN, c.z is from an inner join and has a NOT NULL constraint.
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b INNER JOIN c ON b.x = c.z);
+
+-- ANTI JOIN, c.z must be 1
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z = 1);
+
+-- ANTI JOIN, c.z can't be NULL
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT c.z FROM b LEFT JOIN c ON b.x = c.z WHERE c.z IS NOT NULL);
+
+-- ANTI JOIN, b.y cannot be NULL due to the join condition b.y = c.z
+EXPLAIN (COSTS OFF)
+SELECT * FROM a WHERE id NOT IN (SELECT b.y FROM b INNER JOIN c ON b.y = c.z);
+
+ROLLBACK;
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#43)
Re: Allowing NOT IN to use ANTI joins

David Rowley <dgrowleyml@gmail.com> writes:

I've made some changes to the patch so that it only allows the conversion
to ANTI JOIN to take place if both the outer query's expressions AND the
subquery's target list can be proved not to have NULLs.

This coding doesn't fill me with warm fuzzy feelings.
query_outputs_are_not_nullable, as originally constituted, knew that it
was attempting to prove the query's tlist non-nullable; that's the reason
for the setop restriction, and it also justifies looking at all the
available quals. If you're trying to make a similar proof for expressions
occurring in a random qual clause, I don't think you can safely look at
quals coming from higher syntactic nesting levels. And on the other side
of the coin, outer joins occurring above the syntactic level of the NOT IN
aren't reason to dismiss using an antijoin, because they don't null
variables appearing in it.

It might be possible to fix that by passing in the jointree node at which
the NOT IN is to be evaluated, and doing the find_innerjoined_rels search
for the outer-query exprs from there rather than always from the jointree
root. I've not thought carefully about this though.

I found that the call to is_NOTANY_compatible_with_antijoin adds about 0.2%
and 2.3% to total planning time. Though the 2.3% was quite an extreme case,
and the 0.2% was the most simple case I could think of.

Hm. Since, as you say, the cost is 0 unless there's a NOT IN, that seems
to indicate that we can afford this test ... as long as it does something
often enough to be useful. I'm still a bit doubtful about that. However,
it does give us the option of telling people that they can fix their
queries by adding "WHERE x IS NOT NULL", so maybe that's helpful enough
even if it doesn't fix real-world queries right out of the gate.

Since we're at the end of the June commitfest, I'm going to mark this
patch Returned With Feedback in the commitfest list.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#45Simon Riggs
simon@2ndQuadrant.com
In reply to: David Rowley (#43)
Re: Allowing NOT IN to use ANTI joins

On 15 July 2014 12:58, David Rowley <dgrowleyml@gmail.com> wrote:

I found that the call to is_NOTANY_compatible_with_antijoin adds about 0.2%
and 2.3% to total planning time. Though the 2.3% was quite an extreme case,
and the 0.2% was the most simple case I could think of.

I think its quite important that we don't apply every single
optimization we can think of in all cases. Fast planning of short
requests is as important as good planning of longer requests.

Is there a way we can only run this extra test when we have reasonable
idea that there is potential to save significant costs? Or put another
way, can we look at ways to skip the test if its not likely to add
value. Obviously, if we have a good feeling that we might save lots of
execution time then any additional planning time is easier to justify.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#45)
Re: Allowing NOT IN to use ANTI joins

Simon Riggs <simon@2ndQuadrant.com> writes:

On 15 July 2014 12:58, David Rowley <dgrowleyml@gmail.com> wrote:

I found that the call to is_NOTANY_compatible_with_antijoin adds about 0.2%
and 2.3% to total planning time. Though the 2.3% was quite an extreme case,
and the 0.2% was the most simple case I could think of.

Is there a way we can only run this extra test when we have reasonable
idea that there is potential to save significant costs?

Well, all of this cost occurs only when we find a NOT IN clause in a place
where we could conceivably turn it into an antijoin. I think it's
unquestionably a win to make that transformation if possible. My concern
about it is mainly that the whole thing is a band-aid for naively written
queries, and it seems likely to me that naively written queries aren't
going to be amenable to making the proof we need. But we can't tell that
for sure without doing exactly the work the patch does.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#47David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#46)
Re: Allowing NOT IN to use ANTI joins

On Wed, Jul 16, 2014 at 9:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On 15 July 2014 12:58, David Rowley <dgrowleyml@gmail.com> wrote:

I found that the call to is_NOTANY_compatible_with_antijoin adds about

0.2%

and 2.3% to total planning time. Though the 2.3% was quite an extreme

case,

and the 0.2% was the most simple case I could think of.

Is there a way we can only run this extra test when we have reasonable
idea that there is potential to save significant costs?

Well, all of this cost occurs only when we find a NOT IN clause in a place
where we could conceivably turn it into an antijoin. I think it's
unquestionably a win to make that transformation if possible. My concern
about it is mainly that the whole thing is a band-aid for naively written
queries, and it seems likely to me that naively written queries aren't
going to be amenable to making the proof we need. But we can't tell that
for sure without doing exactly the work the patch does.

I do think Simon has a good point, maybe it's not something for this patch,
but perhaps other planner patches that potentially optimise queries that
could have been executed more efficiently if they had been written in
another way.

Since the planning time has been added to EXPLAIN ANALYZE I have noticed
that in many very simple queries that quite often planning time is longer
than execution time, so I really do understand the concern that we don't
want to slow the planner down for these super fast queries. But on the
other hand, if this extra 1-2 microseconds that the NOT IN optimisation was
being frowned upon and the patch had been rejected based on that, at the
other end of the scale, I wouldn't think it was too impossible for spending
that extra 2 microseconds to translate into shaving a few hours off of the
execution time of a query being run in an OLAP type workload. If that was
the case then having not spent the 2 extra microseconds seems quite funny,
but there's no real way to tell I guess unless we invented something to
skip the known costly optimisations on first pass then re-plan the whole
query with the planning strength knob turned to the maximum if the final
query cost more than N.

Off course such a idea would make bad plans even harder to debug, so it's
far from perfect, but I'm not seeing other options for the best of both
worlds.

Regards

David Rowley