Allowing NOT IN to use ANTI joins

Started by David Rowleyalmost 12 years ago47 messages

dgrowleyml@gmail.com

almost 12 years ago

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT
EXISTS queries and leaves NOT IN alone. The reason for this is because the
values returned by a subquery in the IN clause could have NULLs.

A simple example of this (without a subquery) is:

select 1 where 3 not in (1, 2, null); returns 0 rows because 3 <> NULL is
unknown.

The attached patch allows an ANTI-join plan to be generated in cases like:

CREATE TABLE a (id INT PRIMARY KEY, b_id INT NOT NULL);
CREATE TABLE b (id INT NOT NULL);

SELECT * FROM a WHERE b_id NOT IN(SELECT id FROM b);

To generate a plan like:
QUERY PLAN
-----------------------------------------------------------------
Hash Anti Join (cost=64.00..137.13 rows=1070 width=8)
Hash Cond: (a.b_id = b.id)
-> Seq Scan on a (cost=0.00..31.40 rows=2140 width=8)
-> Hash (cost=34.00..34.00 rows=2400 width=4)
-> Seq Scan on b (cost=0.00..34.00 rows=2400 width=4)

But if we then do:
ALTER TABLE b ALTER COLUMN id DROP NOT NULL;

The plan will go back to the current behaviour of:

QUERY PLAN
-------------------------------------------------------------
Seq Scan on a (cost=40.00..76.75 rows=1070 width=8)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on b (cost=0.00..34.00 rows=2400 width=4)

Comments are welcome

Regards

David Rowley

Martijn van Oosterhout

kleptog@svana.org

almost 12 years ago

In reply to: David Rowley (#1)

Re: Allowing NOT IN to use ANTI joins

On Mon, Jun 09, 2014 at 12:36:30AM +1200, David Rowley wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT
EXISTS queries and leaves NOT IN alone. The reason for this is because the
values returned by a subquery in the IN clause could have NULLs.

Awesome. I've had a brief look at the patch and other than a line of
extraneous whitespace it looks sane.

Since it is only testing on NOT IN queries I don't think there are any
issues with it slowing down simple queries.

I also note you can't prove "id+1" not null. At first I thought you
might be able to prove this not null if the operator/function was
strict, but then I realised that strict only means "null if input is
null" not "output is only null if inputs are null". Pity.

Nice work.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

Marti Raudsepp

marti@juffo.org

almost 12 years ago

In reply to: David Rowley (#1)

Re: Allowing NOT IN to use ANTI joins

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT EXISTS
queries and leaves NOT IN alone. The reason for this is because the values
returned by a subquery in the IN clause could have NULLs.

I believe the reason why this hasn't been done yet, is that the plan
becomes invalid when another backend modifies the nullability of the
column. To get it to replan, you'd have to introduce a dependency on
the "NOT NULL" constraint, but it's impossible for now because there's
no pg_constraint entry for NOT NULLs.

The only way to consistently guarantee nullability is through primary
key constraints. Fortunately that addresses most of the use cases of
NOT IN(), in my experience.

See the comment in check_functional_grouping:

* Currently we only check to see if the rel has a primary key that is a
* subset of the grouping_columns. We could also use plain unique constraints
* if all their columns are known not null, but there's a problem: we need
* to be able to represent the not-null-ness as part of the constraints added
* to *constraintDeps. FIXME whenever not-null constraints get represented
* in pg_constraint.

The behavior you want seems somewhat similar to
check_functional_grouping; maybe you could unify it with your
targetListIsGuaranteedNotToHaveNulls at some level. (PS: that's one
ugly function name :)

Regards,
Marti

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Vik Fearing

vik@postgresfriends.org

almost 12 years ago

In reply to: David Rowley (#1)

Re: Allowing NOT IN to use ANTI joins

On 06/08/2014 02:36 PM, David Rowley wrote:

+		if (!get_attnotnull(tle->resorigtbl, tle->resorigcol))
+			return false;

As Marti says, you can't do this because NOT NULL doesn't have an oid to
attach a dependency to. You'll have to restrict this test to primary
keys only for now.
--
Vik

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

almost 12 years ago

In reply to: Marti Raudsepp (#3)

Re: Allowing NOT IN to use ANTI joins

Marti Raudsepp <marti@juffo.org> writes:

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT EXISTS
queries and leaves NOT IN alone. The reason for this is because the values
returned by a subquery in the IN clause could have NULLs.

I believe the reason why this hasn't been done yet, is that the plan
becomes invalid when another backend modifies the nullability of the
column. To get it to replan, you'd have to introduce a dependency on
the "NOT NULL" constraint, but it's impossible for now because there's
no pg_constraint entry for NOT NULLs.

I don't believe this is an issue, because we are only talking about a
*plan* depending on the NOT NULL condition. ALTER TABLE DROP NOT NULL
would result in a relcache inval event against the table, which would
result in invalidating all cached plans mentioning the table.

I forget exactly what context we were discussing needing a NOT NULL
constraint's OID for, but it would have to be something where the
dependency was longer-lived than a plan; perhaps semantics of a view?
The existing comparable case is that a view containing ungrouped
variable references is allowed if the GROUP BY includes a primary key,
which means the semantic validity of the view depends on the pkey.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Jeff Janes

jeff.janes@gmail.com

over 11 years ago

In reply to: David Rowley (#1)

Re: Allowing NOT IN to use ANTI joins

On Sun, Jun 8, 2014 at 5:36 AM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT
EXISTS queries and leaves NOT IN alone. The reason for this is because the
values returned by a subquery in the IN clause could have NULLs.

A simple example of this (without a subquery) is:

select 1 where 3 not in (1, 2, null); returns 0 rows because 3 <> NULL is
unknown.

The attached patch allows an ANTI-join plan to be generated in cases like:

CREATE TABLE a (id INT PRIMARY KEY, b_id INT NOT NULL);
CREATE TABLE b (id INT NOT NULL);

SELECT * FROM a WHERE b_id NOT IN(SELECT id FROM b);

To generate a plan like:
QUERY PLAN
-----------------------------------------------------------------
Hash Anti Join (cost=64.00..137.13 rows=1070 width=8)
Hash Cond: (a.b_id = b.id)
-> Seq Scan on a (cost=0.00..31.40 rows=2140 width=8)
-> Hash (cost=34.00..34.00 rows=2400 width=4)
-> Seq Scan on b (cost=0.00..34.00 rows=2400 width=4)

I think this will be great, I've run into this problem often from
applications I have no control over. I thought a more complete, but
probably much harder, solution would be to add some metadata to the hash
anti-join infrastructure that tells it "If you find any nulls in the outer
scan, stop running without returning any rows". I think that should work
because the outer rel already has to run completely before any rows can be
returned.

But what I can't figure out is, would that change obviate the need for your
change? Once we can correctly deal with nulls in a NOT IN list through a
hash anti join, is there a cost estimation advantage to being able to prove
that the that null can't occur? (And of course if you have code that
works, while I have vague notions of what might be, then my notion probably
does not block your code.)

Cheers,

Jeff

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Jeff Janes (#6)

Re: Allowing NOT IN to use ANTI joins

Jeff Janes <jeff.janes@gmail.com> writes:

On Sun, Jun 8, 2014 at 5:36 AM, David Rowley <dgrowleyml@gmail.com> wrote:

The attached patch allows an ANTI-join plan to be generated in cases like:
CREATE TABLE a (id INT PRIMARY KEY, b_id INT NOT NULL);
CREATE TABLE b (id INT NOT NULL);
SELECT * FROM a WHERE b_id NOT IN(SELECT id FROM b);

I think this will be great, I've run into this problem often from
applications I have no control over. I thought a more complete, but
probably much harder, solution would be to add some metadata to the hash
anti-join infrastructure that tells it "If you find any nulls in the outer
scan, stop running without returning any rows". I think that should work
because the outer rel already has to run completely before any rows can be
returned.

Huh? The point of an antijoin (or indeed most join methods) is that we
*don't* have to examine the whole inner input to make a decision.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Jeff Janes

jeff.janes@gmail.com

over 11 years ago

In reply to: Tom Lane (#7)

Re: Allowing NOT IN to use ANTI joins

On Monday, June 9, 2014, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.janes@gmail.com <javascript:;>> writes:

On Sun, Jun 8, 2014 at 5:36 AM, David Rowley <dgrowleyml@gmail.com

<javascript:;>> wrote:

The attached patch allows an ANTI-join plan to be generated in cases

like:

CREATE TABLE a (id INT PRIMARY KEY, b_id INT NOT NULL);
CREATE TABLE b (id INT NOT NULL);
SELECT * FROM a WHERE b_id NOT IN(SELECT id FROM b);

I think this will be great, I've run into this problem often from
applications I have no control over. I thought a more complete, but
probably much harder, solution would be to add some metadata to the hash
anti-join infrastructure that tells it "If you find any nulls in the

outer

scan, stop running without returning any rows". I think that should work
because the outer rel already has to run completely before any rows can

be

returned.

Huh? The point of an antijoin (or indeed most join methods) is that we
*don't* have to examine the whole inner input to make a decision.

But all hash join methods needs to examine the entire *outer* input, no?
Have I screwed up my terminology here?

If you are using NOT IN, then once you find a NULL in the outer input (if
the outer input is the in-list: clearly you can't reverse the two inputs in
this case), you don't even need to finish reading the outer input, nor
start reading the inner input, because all rows are automatically excluded
by the weird semantics of NOT IN.

Cheers,

Jeff

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Jeff Janes (#8)

Re: Allowing NOT IN to use ANTI joins

Jeff Janes <jeff.janes@gmail.com> writes:

On Monday, June 9, 2014, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Huh? The point of an antijoin (or indeed most join methods) is that we
*don't* have to examine the whole inner input to make a decision.

But all hash join methods needs to examine the entire *outer* input, no?
Have I screwed up my terminology here?

I think you're confusing inner and outer inputs --- the sub-select inside
the NOT IN is the inner input according to the way I think about it.
But I had assumed that was what you meant already.

If you are using NOT IN, then once you find a NULL in the outer input (if
the outer input is the in-list: clearly you can't reverse the two inputs in
this case), you don't even need to finish reading the outer input, nor
start reading the inner input, because all rows are automatically excluded
by the weird semantics of NOT IN.

The point I'm trying to make is that the goal of most join types is to
give an answer without having necessarily read all of either input.
For instance, if we tried to do this with a mergejoin it wouldn't work
reliably: it might suppose that it could accept an outer row on the basis
of no match in a higher-order sort column before it'd reached any nulls
appearing in lower-order sort columns.

You might be right that we could hot-wire the hash join case in
particular, but I'm failing to see the point of expending lots of extra
effort in order to deliver a useless answer faster. If there are NULLs
in the inner input, then NOT IN is simply the wrong query to make, and
giving an empty output in a relatively short amount of time isn't going
to help clue the user in on that. (If the SQL standard would let us do
so, I'd be arguing for throwing an error if we found a NULL.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Marti Raudsepp (#3)

Re: Allowing NOT IN to use ANTI joins

On Mon, Jun 9, 2014 at 11:20 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT

EXISTS

queries and leaves NOT IN alone. The reason for this is because the

values

returned by a subquery in the IN clause could have NULLs.

I believe the reason why this hasn't been done yet, is that the plan
becomes invalid when another backend modifies the nullability of the
column. To get it to replan, you'd have to introduce a dependency on
the "NOT NULL" constraint, but it's impossible for now because there's
no pg_constraint entry for NOT NULLs.

The only way to consistently guarantee nullability is through primary
key constraints. Fortunately that addresses most of the use cases of
NOT IN(), in my experience.

I tried to break this by putting a break point
in convert_ANY_sublink_to_join in session 1. Not that it really had to be
in that function, I just wanted it to stop during planning and before the
plan is executed.

-- session 1
select * from n1 where id not in(select id from n1); -- hits breakpoint in
convert_ANY_sublink_to_join

-- session 2
alter table n2 alter column id drop not null;

insert into n2 values(null);

I see that session 2 blocks in the alter table until session 1 completes.

I've not really checked out the code in detail around when the snapshot is
taken and the transaction ID is generated, but as long as the transaction
id is taken before we start planning in session 1 then it should not matter
if another session drops the constraint and inserts a NULL value as we
won't see that NULL value in our transaction... I'd assume that the
transaction has to start before it grabs the table defs that are required
for planning. Or have I got something wrong?

See the comment in check_functional_grouping:

* Currently we only check to see if the rel has a primary key that is a
* subset of the grouping_columns. We could also use plain unique
constraints
* if all their columns are known not null, but there's a problem: we need
* to be able to represent the not-null-ness as part of the constraints
added
* to *constraintDeps. FIXME whenever not-null constraints get represented
* in pg_constraint.

I saw that, but I have to say I've not fully got my head around why that's
needed just yet.

The behavior you want seems somewhat similar to
check_functional_grouping; maybe you could unify it with your
targetListIsGuaranteedNotToHaveNulls at some level. (PS: that's one
ugly function name :)

Agreed :) Originally I had put the code that does that
in convert_ANY_sublink_to_join, but at the last minute before posting the
patch I decided that it might be useful and reusable so moved it out to
that function. I'll try and think of something better, but I'm open to
ideas.

Regards

David Rowley

#11

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Tom Lane (#9)

Re: Allowing NOT IN to use ANTI joins

On Tue, Jun 10, 2014 at 2:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.janes@gmail.com> writes:

If you are using NOT IN, then once you find a NULL in the outer input (if
the outer input is the in-list: clearly you can't reverse the two inputs

in

this case), you don't even need to finish reading the outer input, nor
start reading the inner input, because all rows are automatically

excluded

by the weird semantics of NOT IN.

The point I'm trying to make is that the goal of most join types is to
give an answer without having necessarily read all of either input.
For instance, if we tried to do this with a mergejoin it wouldn't work
reliably: it might suppose that it could accept an outer row on the basis
of no match in a higher-order sort column before it'd reached any nulls
appearing in lower-order sort columns.

You might be right that we could hot-wire the hash join case in
particular, but I'm failing to see the point of expending lots of extra
effort in order to deliver a useless answer faster. If there are NULLs
in the inner input, then NOT IN is simply the wrong query to make, and
giving an empty output in a relatively short amount of time isn't going
to help clue the user in on that. (If the SQL standard would let us do
so, I'd be arguing for throwing an error if we found a NULL.)

This got me thinking. It is probably a bit useless and wrong to perform a
NOT IN when the subquery in the IN() clause can have NULL values, so I
guess in any realistic useful case, the user will either have a NOT NULL
constraint on the columns, or they'll do a WHERE col IS NOT NULL, so I
should likely also allow a query such as:

SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col IS NOT NULL);

to also perform an ANTI JOIN. I think it's just a matter of
changing targetListIsGuaranteedNotToHaveNulls so that if it does not find
the NOT NULL constraint, to check the WHERE clause of the query to see if
there's any not null quals.

I'm about to put this to the test, but if it works then I think it should
cover many more cases for using NOT IN(), I guess we're only leaving out
function calls and calculations in the target list.

Regards

David Rowley

#12

Marti Raudsepp

marti@juffo.org

over 11 years ago

In reply to: David Rowley (#10)

Re: Allowing NOT IN to use ANTI joins

On Wed, Jun 11, 2014 at 11:53 AM, David Rowley <dgrowleyml@gmail.com> wrote:

The only way to consistently guarantee nullability is through primary
key constraints. Fortunately that addresses most of the use cases of
NOT IN(), in my experience.

See the comment in check_functional_grouping:

I saw that, but I have to say I've not fully got my head around why that's
needed just yet.

I was wrong, see Tom's reply to my email. It's OK to rely on
attnotnull for optimization decisions. The plan will be invalidated
automatically when the nullability of a referenced column changes.

check_functional_grouping needs special treatment because it decides
whether to accept/reject views; and if it has allowed creating a view,
it needs to guarantee that the dependent constraint isn't dropped for
a longer term.

as long as the transaction id
is taken before we start planning in session 1 then it should not matter if
another session drops the constraint and inserts a NULL value as we won't
see that NULL value in our transaction... I'd assume that the transaction
has to start before it grabs the table defs that are required for planning.
Or have I got something wrong?

1. You're assuming that query plans can only survive for the length of
a transaction. That's not true, prepared query plans can span many
transactions.

2. Also a FOR UPDATE clause can return values "from the future", if
another transaction has modified the value and already committed.

But this whole issue is moot anyway, the plan will get invalidated
when the nullability changes.

Regards,
Marti

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Marti Raudsepp

marti@juffo.org

over 11 years ago

In reply to: David Rowley (#1)

Re: Allowing NOT IN to use ANTI joins

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT EXISTS
queries and leaves NOT IN alone. The reason for this is because the values
returned by a subquery in the IN clause could have NULLs.

There's a bug in targetListIsGuaranteedNotToHaveNulls, you have to
drill deeper into the query to guarantee the nullability of a result
column. If a table is OUTER JOINed, it can return NULLs even if the
original column specification has NOT NULL.

This test case produces incorrect results with your patch:

create table a (x int not null);
create table b (x int not null, y int not null);
insert into a values(1);
select * from a where x not in (select y from a left join b using (x));

Unpatched version correctly returns 0 rows since "y" will be NULL.
Your patch returns the value 1 from a.

Regards,
Marti

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Marti Raudsepp (#13)

Re: Allowing NOT IN to use ANTI joins

On Wed, Jun 11, 2014 at 9:32 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT

EXISTS

queries and leaves NOT IN alone. The reason for this is because the

values

returned by a subquery in the IN clause could have NULLs.

There's a bug in targetListIsGuaranteedNotToHaveNulls, you have to
drill deeper into the query to guarantee the nullability of a result
column. If a table is OUTER JOINed, it can return NULLs even if the
original column specification has NOT NULL.

This test case produces incorrect results with your patch:

create table a (x int not null);
create table b (x int not null, y int not null);
insert into a values(1);
select * from a where x not in (select y from a left join b using (x));

Unpatched version correctly returns 0 rows since "y" will be NULL.
Your patch returns the value 1 from a.

Thanks, I actually was just looking at that. I guess I'll just need to make
sure that nothing in the targetlist comes from an outer join.

Regards

David Rowley

#15

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Marti Raudsepp (#12)

Re: Allowing NOT IN to use ANTI joins

Marti Raudsepp <marti@juffo.org> writes:

On Wed, Jun 11, 2014 at 11:53 AM, David Rowley <dgrowleyml@gmail.com> wrote:

as long as the transaction id
is taken before we start planning in session 1 then it should not matter if
another session drops the constraint and inserts a NULL value as we won't
see that NULL value in our transaction... I'd assume that the transaction
has to start before it grabs the table defs that are required for planning.
Or have I got something wrong?

1. You're assuming that query plans can only survive for the length of
a transaction. That's not true, prepared query plans can span many
transactions.

2. Also a FOR UPDATE clause can return values "from the future", if
another transaction has modified the value and already committed.

But this whole issue is moot anyway, the plan will get invalidated
when the nullability changes.

Right. The key point for David's concern is that we always hold (at
least) AccessShareLock on every relation used in a query, continuously
from rewrite through to the end of execution. This will block any attempt
by other transactions to make schema changes in the relation(s).
In the case of re-using a prepared plan, we re-acquire all these locks
and then check to see if we received any invalidation messages that
render the plan invalid; if not, we can proceed to execution with the same
safety guarantees as originally. (If we did, we replan starting from the
raw parse tree.)

If we didn't have mechanisms like this, we'd have far worse hazards from
ALTER TABLE than whether the planner made an incorrect join optimization.
Consider ALTER COLUMN TYPE for instance.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

Bruce Momjian

bruce@momjian.us

over 11 years ago

In reply to: Tom Lane (#15)

Re: Allowing NOT IN to use ANTI joins

On Wed, Jun 11, 2014 at 3:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If we didn't have mechanisms like this, we'd have far worse hazards from
ALTER TABLE than whether the planner made an incorrect join optimization.
Consider ALTER COLUMN TYPE for instance.

Obviously not general cases of ALTER COLUMN TYPE but dropping a NULL
constraint seems like the kind of change targeted by Simon's "reduce
lock strength" patch that I'm sure he's still interested in. I think
that patch, while full of dragons to steer around, is something that
will keep coming up again and again in the future. It's a huge
operational risk that even these short exclusive locks can cause a
huge production outage if they happen to get queued up behind a
reporting query.

I don't think it changes anything for this patch -- right now the
world is arranged the way Tom described -- but it's something to keep
in mind when we talk about lock strength reduction and the impact on
existing queries. For example if there's an UPDATE query in repeatable
read mode that has an IN clause like this and was optimized
accordingly then any lock strength reduction patch would have to
beware that an ALTER TABLE that dropped the NULL clause might impact
the update query.

Incidentally, Oracle has a feature for online schema changes that we
might end up having to implement something similar. The good news is
we have the infrastructure to maybe do it. The idea is to start
capturing all the changes to the table using something like our
logical changeset extraction. Then do the equivalent of "create
newtable as select ... from oldtable" to create the new schema, then
start replaying the accumulated changes to the new table. Eventually
when the change queue drains then get an exclusive lock, drain any new
changes, and swap in the new table with the new schema.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

David Rowley

dgrowleyml@gmail.com

over 11 years ago

In reply to: Marti Raudsepp (#13)

Re: Allowing NOT IN to use ANTI joins

On Wed, Jun 11, 2014 at 9:32 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Sun, Jun 8, 2014 at 3:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:

Currently pull_up_sublinks_qual_recurse only changes the plan for NOT

EXISTS

queries and leaves NOT IN alone. The reason for this is because the

values

returned by a subquery in the IN clause could have NULLs.

There's a bug in targetListIsGuaranteedNotToHaveNulls, you have to
drill deeper into the query to guarantee the nullability of a result
column. If a table is OUTER JOINed, it can return NULLs even if the
original column specification has NOT NULL.

This test case produces incorrect results with your patch:

create table a (x int not null);
create table b (x int not null, y int not null);
insert into a values(1);
select * from a where x not in (select y from a left join b using (x));

Unpatched version correctly returns 0 rows since "y" will be NULL.
Your patch returns the value 1 from a.

I'm a bit stuck on fixing this and I can't quite figure out how I should
tell if the TargetEntry is coming from an outer join.

My first attempt does not work as it seems that I'm looking up the wrong
RangeTblEntry with the following:

rte = rt_fetch(tlevar->varno, query->rtable);

if (IS_OUTER_JOIN(rte->jointype))
return true; /* Var from an outer join */

The jointype returns JOIN_INNER when loooking up the RangeTblEntry from the
TargetEntry's varno. It seems that the RangeTblEntry that I need is stored
in query->rtable, but I've just no idea how to tell which item in the list
it is. So if anyone can point me in the right direction then that would be
really useful.

On a more positive or even slightly exciting note I think I've managed to
devise a way that ANTI JOINS can be used for NOT IN much more often. It
seems that find_nonnullable_vars will analyse a quals list to find
expressions that mean that the var cannot be NULL. This means we can
perform ANTI JOINS for NOT IN with queries like:

SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col = 1);
or
SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col IS NOT NULL);

(The attached patch implements this)

the nullable_col =1 will mean that nullable_col cannot be NULL, so the ANTI
JOIN can be performed safely. I think this combined with the NOT NULL check
will cover probably just about all valid uses of NOT IN with a subquery...
unless of course I've assumed something wrongly about
find_nonnullable_vars. I just need the correct RangeTblEntry in order to
determine if the TargetEntry is from an out join.

The attached patch is a broken implemention that still needs the lookup
code fixed to reference the correct RTE. The failing regression tests show
where the problems lie.

Any help on this would be really appreciated.

Regards

David Rowley

#18

Simon Riggs

simon@2ndQuadrant.com

over 11 years ago

In reply to: David Rowley (#17)

Re: Allowing NOT IN to use ANTI joins

On 24 June 2014 11:32, David Rowley <dgrowleyml@gmail.com> wrote:

So if anyone can point me in the right direction then that would be
really useful.

Many things can be added simply, but most things can't. It seems we
just don't have that information. If we did, Tom would have done this
already.

On a more positive or even slightly exciting note I think I've managed to
devise a way that ANTI JOINS can be used for NOT IN much more often. It
seems that find_nonnullable_vars will analyse a quals list to find
expressions that mean that the var cannot be NULL. This means we can perform
ANTI JOINS for NOT IN with queries like:

SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col = 1);
or
SELECT * FROM a WHERE id NOT IN(SELECT nullable_col FROM b WHERE
nullable_col IS NOT NULL);

(The attached patch implements this)

the nullable_col =1 will mean that nullable_col cannot be NULL, so the ANTI
JOIN can be performed safely. I think this combined with the NOT NULL check
will cover probably just about all valid uses of NOT IN with a subquery...
unless of course I've assumed something wrongly about find_nonnullable_vars.
I just need the correct RangeTblEntry in order to determine if the
TargetEntry is from an out join.

This is the better way to go. It's much better to have explicit proof
its not null than a possibly long chain of metadata that might be
buggy.

The attached patch is a broken implemention that still needs the lookup code
fixed to reference the correct RTE. The failing regression tests show where
the problems lie.

Any help on this would be really appreciated.

I'd suggest we just drop the targetlist approach completely.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Simon Riggs

simon@2ndQuadrant.com

over 11 years ago

In reply to: Bruce Momjian (#16)

Re: Allowing NOT IN to use ANTI joins

On 11 June 2014 17:52, Greg Stark <stark@mit.edu> wrote:

On Wed, Jun 11, 2014 at 3:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If we didn't have mechanisms like this, we'd have far worse hazards from
ALTER TABLE than whether the planner made an incorrect join optimization.
Consider ALTER COLUMN TYPE for instance.

Obviously not general cases of ALTER COLUMN TYPE but dropping a NULL
constraint seems like the kind of change targeted by Simon's "reduce
lock strength" patch that I'm sure he's still interested in. I think
that patch, while full of dragons to steer around, is something that
will keep coming up again and again in the future. It's a huge
operational risk that even these short exclusive locks can cause a
huge production outage if they happen to get queued up behind a
reporting query.

The focus of the lock strength reduction was around actions that lock
the table for extended periods. So it was mostly about adding things.
All the DROP actions are still AccessExclusiveLocks and will be for a
while.

Having said that, any join plan that relies upon a constraint will
still be valid even if we drop a constraint while the plan executes
because any new writes will not be visible to the executing join plan.
If we are relaxing a constraint, then a writable query that still
thinks a constraint exists won't cause a problem - it may error out
when it need not, but that's not so bad as to be worth worrying about.

So I think we can remove a NOT NULL constraint without too much problem.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Simon Riggs (#19)

Re: Allowing NOT IN to use ANTI joins

Simon Riggs <simon@2ndQuadrant.com> writes:

Having said that, any join plan that relies upon a constraint will
still be valid even if we drop a constraint while the plan executes
because any new writes will not be visible to the executing join plan.

mumble ... EvalPlanQual ?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

Simon Riggs

simon@2ndQuadrant.com

over 11 years ago

In reply to: Tom Lane (#20)

#22

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Simon Riggs (#21)

#23

Simon Riggs