Curious why planner can't handle NOT IN

Started by Yang Zhangalmost 13 years ago3 messagesgeneral
Jump to latest
#1Yang Zhang
yanghatespam@gmail.com

I have an `account` table with 5.3M rows, with primary key `id` of
type `text` (and 600+ columns if that matters).

I'm trying to create a `newaccount` table with the same schema but
600k newly imported rows, then insert all the old rows for which `id`
isn't already in the set of newly updated rows.

=> create table newaccount as select * from account limit 0;

=> \copy newaccount from stdin with (format csv)
[...copy 600k rows in...]

=> analyze newaccount;

The most obvious query doesn't work so hot due to repeated execution
of the subplan:

=> explain insert into newaccount select * from account where id not
in (select id from account);

QUERY PLAN
-----------------------------------------------------------------------------------------
Insert on newaccount (cost=0.00..458800400224.89 rows=2985976 width=5366)
-> Seq Scan on account (cost=0.00..458800400224.89 rows=2985976 width=5366)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..153276.00 rows=150000 width=32)
-> Seq Scan on newaccount (cost=0.00..151500.00
rows=150000 width=32)
(6 rows)

This works fine, though:

=> explain insert into newaccount select * from account where id in
(select id from account except select id from newaccount);

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Insert on newaccount (cost=3749026.24..14034547.77 rows=2985976 width=5366)
-> Hash Join (cost=3749026.24..14034547.77 rows=2985976 width=5366)
Hash Cond: (account.id = "ANY_subquery".id)
-> Seq Scan on account (cost=0.00..2263744.51 rows=5971951
width=5366)
-> Hash (cost=3633552.85..3633552.85 rows=5971951 width=32)
-> Subquery Scan on "ANY_subquery"
(cost=3543223.59..3633552.85 rows=5971951 width=32)
-> SetOp Except (cost=3543223.59..3573833.34
rows=5971951 width=19)
-> Sort (cost=3543223.59..3558528.47
rows=6121951 width=19)
Sort Key: "*SELECT* 1".id
-> Append (cost=0.00..2476464.02
rows=6121951 width=19)
-> Subquery Scan on "*SELECT*
1" (cost=0.00..2323464.02 rows=5971951 width=19)
-> Seq Scan on account
(cost=0.00..2263744.51 rows=5971951 width=19)
-> Subquery Scan on "*SELECT*
2" (cost=0.00..153000.00 rows=150000 width=32)
-> Seq Scan on
newaccount (cost=0.00..151500.00 rows=150000 width=32)
(14 rows)

This is all in PG 9.1.

This isn't a big deal as there's a straightforward workaround, but I
am curious what happened here. Googling turns up various mentions of
"NOT IN" with poor plans that involve subplans. Then again I have
read mention of hash anti-join which seems appropriate here(?), but
that wasn't used here (nor was the latter join used, though for
whatever reason it looks like a more complex/deeper plan tree than I
had expected, so maybe it was out of the plan generator's reach?).

E.g. the following mentions cranking up work_mem, but I probably can't
crank up work_mem to meet the requirements of this example, and even
if it doesn't fit in memory, it'd be nice for the planner to not
degenerate to a pathological plan and still execute this join
efficiently while spilling to and from disk.

http://stackoverflow.com/questions/7125291/postgresql-not-in-versus-except-performance-difference-edited-2

Thanks!

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Yang Zhang (#1)
Re: Curious why planner can't handle NOT IN

Yang Zhang <yanghatespam@gmail.com> writes:

The most obvious query doesn't work so hot due to repeated execution
of the subplan:

=> explain insert into newaccount select * from account where id not
in (select id from account);

Yeah. Try using a NOT EXISTS instead.

This isn't a big deal as there's a straightforward workaround, but I
am curious what happened here. Googling turns up various mentions of
"NOT IN" with poor plans that involve subplans. Then again I have
read mention of hash anti-join which seems appropriate here(?), but
that wasn't used here

The trouble with NOT IN is that it's not exactly the same as an
antijoin, because of the spec-mandated bizarre behavior for NULLs.
It's very difficult to optimize it to any extent without producing
wrong answers. NOT EXISTS avoids that problem.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3pradeep singh
pradeepsingh1988@gmail.com
In reply to: Yang Zhang (#1)
Re: Curious why planner can't handle NOT IN

On May 3, 2013 12:15 PM, "Yang Zhang" <yanghatespam@gmail.com> wrote:

Show quoted text

I have an `account` table with 5.3M rows, with primary key `id` of
type `text` (and 600+ columns if that matters).

I'm trying to create a `newaccount` table with the same schema but
600k newly imported rows, then insert all the old rows for which `id`
isn't already in the set of newly updated rows.

=> create table newaccount as select * from account limit 0;

=> \copy newaccount from stdin with (format csv)
[...copy 600k rows in...]

=> analyze newaccount;

The most obvious query doesn't work so hot due to repeated execution
of the subplan:

=> explain insert into newaccount select * from account where id not
in (select id from account);

QUERY PLAN

-----------------------------------------------------------------------------------------
Insert on newaccount (cost=0.00..458800400224.89 rows=2985976 width=5366)
-> Seq Scan on account (cost=0.00..458800400224.89 rows=2985976
width=5366)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..153276.00 rows=150000 width=32)
-> Seq Scan on newaccount (cost=0.00..151500.00
rows=150000 width=32)
(6 rows)

This works fine, though:

=> explain insert into newaccount select * from account where id in
(select id from account except select id from newaccount);

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
Insert on newaccount (cost=3749026.24..14034547.77 rows=2985976
width=5366)
-> Hash Join (cost=3749026.24..14034547.77 rows=2985976 width=5366)
Hash Cond: (account.id = "ANY_subquery".id)
-> Seq Scan on account (cost=0.00..2263744.51 rows=5971951
width=5366)
-> Hash (cost=3633552.85..3633552.85 rows=5971951 width=32)
-> Subquery Scan on "ANY_subquery"
(cost=3543223.59..3633552.85 rows=5971951 width=32)
-> SetOp Except (cost=3543223.59..3573833.34
rows=5971951 width=19)
-> Sort (cost=3543223.59..3558528.47
rows=6121951 width=19)
Sort Key: "*SELECT* 1".id
-> Append (cost=0.00..2476464.02
rows=6121951 width=19)
-> Subquery Scan on "*SELECT*
1" (cost=0.00..2323464.02 rows=5971951 width=19)
-> Seq Scan on account
(cost=0.00..2263744.51 rows=5971951 width=19)
-> Subquery Scan on "*SELECT*
2" (cost=0.00..153000.00 rows=150000 width=32)
-> Seq Scan on
newaccount (cost=0.00..151500.00 rows=150000 width=32)
(14 rows)

This is all in PG 9.1.

This isn't a big deal as there's a straightforward workaround, but I
am curious what happened here. Googling turns up various mentions of
"NOT IN" with poor plans that involve subplans. Then again I have
read mention of hash anti-join which seems appropriate here(?), but
that wasn't used here (nor was the latter join used, though for
whatever reason it looks like a more complex/deeper plan tree than I
had expected, so maybe it was out of the plan generator's reach?).

E.g. the following mentions cranking up work_mem, but I probably can't
crank up work_mem to meet the requirements of this example, and even
if it doesn't fit in memory, it'd be nice for the planner to not
degenerate to a pathological plan and still execute this join
efficiently while spilling to and from disk.

http://stackoverflow.com/questions/7125291/postgresql-not-in-versus-except-performance-difference-edited-2

Thanks!

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general