SQL 'in' vs join.

Started by mlwover 25 years ago10 messageshackers

Jump to latest

mlw

markw@mohawksoft.com

over 25 years ago

Why is a "select * from table1 where field in (select field from table2
where condition )"

is so dramatically bad compared to:

"select * from table1, table2 where table1.field = table2.field and
condition"

I can't understand why the first query isn't optimized better than the
second one. The 'in' query forces a full table scan (it shouldn't) and
the second one uses the indexes. Does anyone know why?

I know I am no SQL guru, but my gut tells me that the 'in' operator
should be far more efficient than a join.

Here are the actual queries:

cdinfo=# explain select trackid from zsong where muzenbr in (select
muzenbr from ztitles where title = 'Mulan') ;
NOTICE: QUERY PLAN:

Seq Scan on zsong (cost=100000000.00..219321449380756.66 rows=2193213
width=4)
SubPlan
-> Materialize (cost=100000022.50..100000022.50 rows=10 width=4)
-> Seq Scan on ztitles (cost=100000000.00..100000022.50
rows=10 width=4)

cdinfo=# explain select trackid from zsong, ztitles where
ztitles.muzenbr = zsong.muzenbr and title = 'Mulan' ;
NOTICE: QUERY PLAN:

Merge Join (cost=0.00..183664.10 rows=219321 width=12)
-> Index Scan using zsong_muznbr on zsong (cost=0.00..156187.31
rows=2193213 width=8)
-> Index Scan using ztitles_pkey on ztitles (cost=0.00..61.50
rows=10 width=4)

cdinfo=# \d ztitles
              Table "ztitles"
 Attribute  |       Type        | Modifier
------------+-------------------+----------
 muzenbr    | integer           | not null
 artistid   | integer           |
 cat2       | character varying |
 cat3       | character varying |
 cat4       | character varying |
 performer  | character varying |
 performer2 | character varying |
 title      | character varying |
 artist1    | character varying |
 engineer   | character varying |
 producer   | character varying |
 labelname  | character varying |
 catalog    | character varying |
 distribut  | character varying |
 released   | character varying |
 origrel    | character varying |
 nbrdiscs   | character varying |
 spar       | character varying |
 minutes    | character varying |
 seconds    | character varying |
 monostereo | character varying |
 studiolive | character varying |
 available  | character(1)      |
 previews   | character varying |
 pnotes     | character varying |
 acd        | character varying |
Index: ztitles_pkey

--
http://www.mohawksoft.com

mlw

markw@mohawksoft.com

over 25 years ago

In reply to: mlw (#1)

Re: SQL 'in' vs join.

Hannu Krosing wrote:

mlw wrote:

Why is a "select * from table1 where field in (select field from table2
where condition )"

is so dramatically bad compared to:

"select * from table1, table2 where table1.field = table2.field and
condition"

I can't understand why the first query isn't optimized better than the
second one. The 'in' query forces a full table scan (it shouldn't) and
the second one uses the indexes. Does anyone know why?

Its not done yet, and probably hsomewhat difficult to do in a general
fashion

I know I am no SQL guru, but my gut tells me that the 'in' operator
should be far more efficient than a join.

Here are the actual queries:

cdinfo=# explain select trackid from zsong where muzenbr in (select
muzenbr from ztitles where title = 'Mulan') ;

try

explain
select trackid
from zsong
where muzenbr in (
select muzenbr
from ztitles
where title = 'Mulan'
and ztitles.muzenbr=zsong.muzenbr
);

this should hint the current optimizer to do the right thing;

-----------------
Hannu

Nope:

cdinfo=# explain
cdinfo-# select trackid
cdinfo-# from zsong
cdinfo-# where muzenbr in (
cdinfo(# select muzenbr
cdinfo(# from ztitles
cdinfo(# where title = 'Mulan'
cdinfo(# and ztitles.muzenbr=zsong.muzenbr
cdinfo(# );
NOTICE: QUERY PLAN:

Seq Scan on zsong (cost=100000000.00..104474515.18 rows=2193213
width=4)
SubPlan
-> Index Scan using ztitles_pkey on ztitles (cost=0.00..4.05
rows=1 width=4)

But what I also find odd is, look at the components:

cdinfo=# explain select muzenbr from ztitles where title = 'Mulan' ;
NOTICE: QUERY PLAN:

Index Scan using ztitles_title_ndx on ztitles (cost=0.00..7.08 rows=1
width=4)

cdinfo=# explain select trackid from zsong where muzenbr in ( 1,2,3,4,5)
;
NOTICE: QUERY PLAN:

Index Scan using zsong_muzenbr_ndx, zsong_muzenbr_ndx,
zsong_muzenbr_ndx, zsong_muzenbr_ndx, zsong_muzenbr_ndx on zsong
(cost=0.00..392.66 rows=102 width=4)

Now, given the two components, each with very low costs, it chooses to
do a sequential scan on the table. I don't get it. I have have been
having no end of problems with Postgres' optimizer. It just seems to be
brain dead at times. It is a huge point of frustration to me. I am tied
to postgres in my current project, and I fear that I will not be able to
implement certain features because of this sort of behavior.

--
http://www.mohawksoft.com

Andrew Snow

andrew@modulus.org

over 25 years ago

In reply to: mlw (#2)

RE: SQL 'in' vs join.

Now, given the two components, each with very low costs, it chooses to
do a sequential scan on the table. I don't get it.

Read the FAQ?

http://www.postgresql.org/docs/faq-english.html#4.23
"4.23) Why are my subqueries using IN so slow?")

- Andrew

Don Baccus

dhogaza@pacifier.com

over 25 years ago

In reply to: mlw (#2)

Re: SQL 'in' vs join.

At 08:37 AM 11/30/00 -0500, mlw wrote:

mlw wrote:

Why is a "select * from table1 where field in (select field from table2
where condition )"

is so dramatically bad compared to:

"select * from table1, table2 where table1.field = table2.field and
condition"

Now, given the two components, each with very low costs, it chooses to
do a sequential scan on the table. I don't get it. I have have been
having no end of problems with Postgres' optimizer. It just seems to be
brain dead at times. It is a huge point of frustration to me. I am tied
to postgres in my current project, and I fear that I will not be able to
implement certain features because of this sort of behavior.

But but but ...

Not only is the join faster, but it is more readable and cleaner SQL as
well. I would never write the query in its first form. I'd change the
second one slightly to "select table1.* from ...", though, since those
are apparently the only fields you want.

The optimizer should do a better job on your first query, sure, but why
don't you like writing joins?

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

Hannu Krosing

hannu@tm.ee

over 25 years ago

In reply to: mlw (#1)

Re: SQL 'in' vs join.

mlw wrote:

Why is a "select * from table1 where field in (select field from table2
where condition )"

is so dramatically bad compared to:

"select * from table1, table2 where table1.field = table2.field and
condition"

I can't understand why the first query isn't optimized better than the
second one. The 'in' query forces a full table scan (it shouldn't) and
the second one uses the indexes. Does anyone know why?

Its not done yet, and probably hsomewhat difficult to do in a general
fashion

I know I am no SQL guru, but my gut tells me that the 'in' operator
should be far more efficient than a join.

Here are the actual queries:

cdinfo=# explain select trackid from zsong where muzenbr in (select
muzenbr from ztitles where title = 'Mulan') ;

try

explain
select trackid
from zsong
where muzenbr in (
select muzenbr
from ztitles
where title = 'Mulan'
and ztitles.muzenbr=zsong.muzenbr
);

this should hint the current optimizer to do the right thing;

-----------------
Hannu

Hannu Krosing

hannu@tm.ee

over 25 years ago

In reply to: mlw (#1)

Re: SQL 'in' vs join.

mlw wrote:

Hannu Krosing wrote:

mlw wrote:

Why is a "select * from table1 where field in (select field from table2
where condition )"

is so dramatically bad compared to:

"select * from table1, table2 where table1.field = table2.field and
condition"

I can't understand why the first query isn't optimized better than the
second one. The 'in' query forces a full table scan (it shouldn't) and
the second one uses the indexes. Does anyone know why?

Its not done yet, and probably hsomewhat difficult to do in a general
fashion

I know I am no SQL guru, but my gut tells me that the 'in' operator
should be far more efficient than a join.

Here are the actual queries:

cdinfo=# explain select trackid from zsong where muzenbr in (select
muzenbr from ztitles where title = 'Mulan') ;

try

explain
select trackid
from zsong
where muzenbr in (
select muzenbr
from ztitles
where title = 'Mulan'
and ztitles.muzenbr=zsong.muzenbr
);

this should hint the current optimizer to do the right thing;

-----------------
Hannu

did you have indexes on both ztitles.muzenbr and zsong.muzenbr ?

--------------
Hannu

Tom Lane

tgl@sss.pgh.pa.us

over 25 years ago

In reply to: Don Baccus (#4)

Re: SQL 'in' vs join.

Don Baccus <dhogaza@pacifier.com> writes:

The optimizer should do a better job on your first query, sure, but why
don't you like writing joins?

The join wouldn't give quite the same answers. If there are multiple
rows in table2 matching a particular table1 row, then a join would give
multiple copies of the table1 row, whereas the WHERE foo IN (sub-select)
way would give only one copy. SELECT DISTINCT can't be used to fix
this, because that would eliminate legitimate duplicates from identical
table1 rows.

Now that the executor understands about multiple join rules (for
OUTER JOIN support), I've been thinking about inventing a new join rule
that says "at most one output row per left-hand row" --- this'd be sort
of the opposite of the LEFT OUTER JOIN rule, "at least one output row
per left-hand row" --- and then transforming IN (sub-select) clauses
that appear at the top level of WHERE into this kind of join. Won't
happen for 7.1, though.

regards, tom lane

Don Baccus

dhogaza@pacifier.com

over 25 years ago

In reply to: Tom Lane (#7)

Re: SQL 'in' vs join.

At 10:52 AM 11/30/00 -0500, Tom Lane wrote:

Don Baccus <dhogaza@pacifier.com> writes:

The optimizer should do a better job on your first query, sure, but why
don't you like writing joins?

The join wouldn't give quite the same answers. If there are multiple
rows in table2 matching a particular table1 row, then a join would give
multiple copies of the table1 row, whereas the WHERE foo IN (sub-select)
way would give only one copy. SELECT DISTINCT can't be used to fix
this, because that would eliminate legitimate duplicates from identical
table1 rows.

Hmmm...I was presuming that "field" was a primary key of table1, so
such duplicates wouldn't exist (and SELECT DISTINCT would weed out
duplicates from table2 if "field" isn't a primary key of table2, i.e.
if table2 has a many-to-one relationship to table1). For many-to-many
relationships yes, you're right, the "in" version returns a different
result.

Now that the executor understands about multiple join rules (for
OUTER JOIN support), I've been thinking about inventing a new join rule
that says "at most one output row per left-hand row" --- this'd be sort
of the opposite of the LEFT OUTER JOIN rule, "at least one output row
per left-hand row" --- and then transforming IN (sub-select) clauses
that appear at the top level of WHERE into this kind of join. Won't
happen for 7.1, though.

Same trick could be used for some classes of queries which do a SELECT DISTINCT
on the results of a join, too ...

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

Bruce Momjian

bruce@momjian.us

over 25 years ago

In reply to: Tom Lane (#7)

Re: SQL 'in' vs join.

Don Baccus <dhogaza@pacifier.com> writes:

The optimizer should do a better job on your first query, sure, but why
don't you like writing joins?

The join wouldn't give quite the same answers. If there are multiple
rows in table2 matching a particular table1 row, then a join would give
multiple copies of the table1 row, whereas the WHERE foo IN (sub-select)
way would give only one copy. SELECT DISTINCT can't be used to fix
this, because that would eliminate legitimate duplicates from identical
table1 rows.

Now that the executor understands about multiple join rules (for
OUTER JOIN support), I've been thinking about inventing a new join rule
that says "at most one output row per left-hand row" --- this'd be sort
of the opposite of the LEFT OUTER JOIN rule, "at least one output row
per left-hand row" --- and then transforming IN (sub-select) clauses
that appear at the top level of WHERE into this kind of join. Won't
happen for 7.1, though.

Of course, we will have the query tree redesign for 7.2, right, make
that unnecessary.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

#10

Tom Lane

tgl@sss.pgh.pa.us

over 25 years ago

In reply to: Bruce Momjian (#9)

Re: SQL 'in' vs join.

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Now that the executor understands about multiple join rules (for
OUTER JOIN support), I've been thinking about inventing a new join rule
that says "at most one output row per left-hand row" --- this'd be sort
of the opposite of the LEFT OUTER JOIN rule, "at least one output row
per left-hand row" --- and then transforming IN (sub-select) clauses
that appear at the top level of WHERE into this kind of join. Won't
happen for 7.1, though.

Of course, we will have the query tree redesign for 7.2, right, make
that unnecessary.

No, I see that as part of the query tree redesign. You'd still need
executor support as above, but what remains to be seen is how hard is it
for the planner to do the transformation I so blithely posited ... and
do we need to change the querytree structure to make it easier?

regards, tom lane