IN vs EXIIST

Started by Jean-Christian Imbeaultover 23 years ago24 messagesgeneral
Jump to latest
#1Jean-Christian Imbeault
jc@mega-bucks.co.jp

I find myself writing a lot of queries with this pattern:

select distinct key1 from A where id not it
(select distinct key1 from A where x='false');

The reason being that key1 is not a primary key (key1, key2 is the
primary key). i.e. I have a table like this

key1 key2 x
------------------
a 1 t
a 2 t
a 3 f
b 1 t
b 2 t
b 3 t
c 3 t
c 4 f

So basically I want key1 values for which all the X's are true.

I've seen many posts saying that using IN is not optimal and replacing
it with EXISTS is much better. I've read the only docs but I can't
understand the difference between the two or how to convert.

Can someone point me to some other docs or explain to me how to convert?
Or is my table schema wrong?

Thanks!

Jc

#2Darren Ferguson
darren@crystalballinc.com
In reply to: Jean-Christian Imbeault (#1)
Re: IN vs EXIIST

On Thu, 19 Sep 2002, Jean-Christian Imbeault wrote:

I find myself writing a lot of queries with this pattern:

select distinct key1 from A where id not it
(select distinct key1 from A where x='false');

The reason being that key1 is not a primary key (key1, key2 is the
primary key). i.e. I have a table like this

key1 key2 x
------------------
a 1 t
a 2 t
a 3 f
b 1 t
b 2 t
b 3 t
c 3 t
c 4 f

So basically I want key1 values for which all the X's are true.

Why areyou using the sub select. If you just want all the key1 where x is
true then the following will work

SELECT DISTINCT(key1) FROM a WHERE x = TRUE;

If you want all rows and don't care about duplicates then remove the
distinct.

Hope this helps

I've seen many posts saying that using IN is not optimal and replacing
it with EXISTS is much better. I've read the only docs but I can't
understand the difference between the two or how to convert.

Can someone point me to some other docs or explain to me how to convert?
Or is my table schema wrong?

Thanks!

Jc

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

--
Darren Ferguson

#3Bill Gribble
grib@linuxdevel.com
In reply to: Jean-Christian Imbeault (#1)
Re: IN vs EXIIST

On Wed, 2002-09-18 at 21:40, Jean-Christian Imbeault wrote:

select distinct key1 from A where id not it
(select distinct key1 from A where x='false');

I've seen many posts saying that using IN is not optimal and replacing
it with EXISTS is much better. I've read the only docs but I can't
understand the difference between the two or how to convert.

I just did this today for a query and got a 200x speedup when converting
from IN to EXISTS. It's dependent on your specific query exactly how to
do the conversion, but generally it's just a question of moving a field
from the "outside" to the "inside" of the subselect. For me,

SELECT * FROM inventory WHERE itemid IN (
SELECT itemid FROM osinfo WHERE ostype = 'linux' );

became

SELECT * FROM inventory WHERE EXISTS (
SELECT itemid FROM osinfo
WHERE ostype = 'linux' and inventory.itemid = osinfo.itemid);

Same results, same database; first query took 8 sec, second took 39
msec.

YMMV
b.g.

Show quoted text

Can someone point me to some other docs or explain to me how to convert?
Or is my table schema wrong?

Thanks!

Jc

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

#4Jean-Christian Imbeault
jc@mega-bucks.co.jp
In reply to: Jean-Christian Imbeault (#1)
Re: IN vs EXIIST

Bill Gribble wrote:

I just did this today for a query and got a 200x speedup when
converting from IN to EXISTS. It's dependent on your specific query
exactly how to do the conversion, but generally it's just a question

of >moving a field from the "outside" to the "inside" of the subselect.

If it's not asking too much could you recommend how I could fix this
query? It's a bit more complicated than yours and I can't seem to get
the syntax right.

select distinct invoice_id from invoice_li where received='true'
AND shipped='false' AND cancelled='false'
AND
(invoice_id not in
(
select distinct invoice_id from invoice_li where received='false'
AND cancelled='false'
)
OR ship_now='true'
)

Sorry for the formatting ...

Jc

#5Jean-Christian Imbeault
jc@mega-bucks.co.jp
In reply to: Jean-Christian Imbeault (#4)
Re: IN vs EXIIST

Henshall, Stuart - WCP wrote:

select distinct invoice_id from invoice_li where received='true'
AND shipped='false' AND cancelled='false'
AND
(NOT EXISTS
(
select * from invoice_li AS sq_inv_li where received='false'
AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id
)
OR ship_now='true'
)

Should work (but is untested).

I'll test it and let you know. Tanks!

As a side note there doesn't seem a point to having distinct on the
subquery in its original form either, so this could be removed to reduce
overhead.

Really? I though that reducing the number of results in the sub-select
would make the query more efficient since the outer query would have
less items to check.

I.e. it would be faster to check if something is in (1,2,3) than if it
is in (1,2,2,2,2,2,2,2,2,2,3). No?

Let me check that new optimized query!

Jc

#6Jean-Christian Imbeault
jc@mega-bucks.co.jp
In reply to: Jean-Christian Imbeault (#5)
Re: IN vs EXIIST

Henshall, Stuart - WCP wrote:

[deleted]

I tried your optimized query but it was muh slower. Here are the results:

##Query using EXISTS

$ !1173
time psql TMP -c "select count(distinct invoice_id) from invoice_li
where received='true'
AND shipped='false' AND cancelled='false'
AND
(NOT EXISTS
(
select * from invoice_li AS sq_inv_li where received='false'
AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id
)
OR ship_now='true'
) "
count
-------
170
(1 row)

real 0m8.322s
user 0m0.010s
sys 0m0.000s

##Query using IN

$ !1175
time psql TMP -c "select count(distinct invoice_id) from invoice_li
where received='true'
AND shipped='false' AND cancelled='false'
AND
(invoice_id not in
(
select distinct invoice_id from invoice_li where received='false'
AND cancelled='false'
)
OR ship_now='true'
) "
count
-------
170
(1 row)

real 0m0.234s
user 0m0.000s
sys 0m0.010s

Maybe EXISTS is not always faster than IN ?

After a "vacuum analyze" the numbers become:

#using EXISTS

real 0m3.229s
user 0m0.000s
sys 0m0.000s

#using IN

real 0m0.141s
user 0m0.000s
sys 0m0.000s

Jc

#7Jean-Christian Imbeault
jc@mega-bucks.co.jp
In reply to: Jean-Christian Imbeault (#6)
Re: IN vs EXIIST

Strangely enough doing an EXPLAIN on the two queries shows that using
EXISTS would be faster than IN ... even though it isn't ..

psql TMP -c "explain select count(distinct invoice_id) from invoice_li
where received='true' AND shipped='false' AND cancelled='false'
AND
(invoice_id not in
(
select distinct invoice_id from invoice_li where received='false'
AND cancelled='false'
)
OR ship_now='true'
) "
NOTICE: QUERY PLAN:

Aggregate (cost=17642485.70..17642485.70 rows=1 width=4)
-> Seq Scan on invoice_li (cost=0.00..17642460.40 rows=10120 width=4)
SubPlan
-> Materialize (cost=871.61..871.61 rows=1 width=4)
-> Unique (cost=871.61..871.61 rows=1 width=4)
-> Sort (cost=871.61..871.61 rows=1 width=4)
-> Seq Scan on invoice_li
(cost=0.00..871.60 rows=1 width=4)

EXPLAIN
$ psql TMP -c "explain select count(distinct invoice_id) from invoice_li
where received='true'
AND shipped='false' AND cancelled='false'
AND
(NOT EXISTS
(
select * from invoice_li AS sq_inv_li where received='false'
AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id
)
OR ship_now='true'
) "
NOTICE: QUERY PLAN:

Aggregate (cost=4955505.10..4955505.10 rows=1 width=4)
-> Seq Scan on invoice_li (cost=0.00..4955479.80 rows=10120 width=4)
SubPlan
-> Index Scan using invoice_li_pkey on invoice_li sq_inv_li
(cost=0.00..244.79 rows=1 width=80)

EXPLAIN

Jc

#8Jean-Christian Imbeault
jc@mega-bucks.co.jp
In reply to: Jean-Christian Imbeault (#7)
Re: IN vs EXIIST

Nigel J. Andrews wrote:

I think you've perhaps got a caching issue in those numbers.

I was thinking the same thing too but reversing the order I execute them
in doesn't change the numbers at all.

I also tried executing the slow query five times in a row (hoping the
results would be cahced) but the speed was always the same.

maybe there is something in the query using IN that allows it to be
cached while the query using the EXISTS cannot be cached?

How can I clear the cache so that I can re-run the fast query to see
what it's speed is withouth caching?

Jc

#9Nigel J. Andrews
nandrews@investsystems.co.uk
In reply to: Jean-Christian Imbeault (#6)
Re: IN vs EXIIST

I think you've perhaps got a caching issue in those numbers.

Note, I haven't been following this thread at all so I don't know if the IN
version is the typical timing for the query. However, I would suggest that you
try reversing the order you test those to see what effect the caching of data
from the EXISTS version has on the IN timing. Or just repeat the each version
and use the second run timings.

On Thu, 19 Sep 2002, Jean-Christian Imbeault wrote:

Show quoted text

Henshall, Stuart - WCP wrote:

[deleted]

I tried your optimized query but it was muh slower. Here are the results:

##Query using EXISTS

$ !1173
time psql TMP -c "select count(distinct invoice_id) from invoice_li
where received='true'
AND shipped='false' AND cancelled='false'
AND
(NOT EXISTS
(
select * from invoice_li AS sq_inv_li where received='false'
AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id
)
OR ship_now='true'
) "
count
-------
170
(1 row)

real 0m8.322s
user 0m0.010s
sys 0m0.000s

##Query using IN

$ !1175
time psql TMP -c "select count(distinct invoice_id) from invoice_li
where received='true'
AND shipped='false' AND cancelled='false'
AND
(invoice_id not in
(
select distinct invoice_id from invoice_li where received='false'
AND cancelled='false'
)
OR ship_now='true'
) "
count
-------
170
(1 row)

real 0m0.234s
user 0m0.000s
sys 0m0.010s

Maybe EXISTS is not always faster than IN ?

After a "vacuum analyze" the numbers become:

#using EXISTS

real 0m3.229s
user 0m0.000s
sys 0m0.000s

#using IN

real 0m0.141s
user 0m0.000s
sys 0m0.000s

#10Jean-Christian Imbeault
jc@mega-bucks.co.jp
In reply to: Nigel J. Andrews (#9)
Re: IN vs EXIIST

Henshall, Stuart - WCP wrote:

It might be interesting to try removing the distinct clause from the
first query and seeing what difference that makes, both real and
supposed. Also maybe try seeing what difference it would make to the
EXISTS sub query to have distinct invoice_id rather than *.

Removing or addin a distinct did not change the results but ...

Your query with EXISTS is this:

psql TMP -c "select count(distinct invoice_id) from invoice_li where
received='true'
AND shipped='false' AND cancelled='false'
AND
(NOT EXISTS
(
select * from invoice_li AS sq_inv_li where received='false'
AND cancelled='false' AND invoice_li.invoice_id=sq_inv_li.invoice_id
)
OR ship_now='true'
) "

I am no SQL expert so I took it as face value (especially since I don't
quite underst EXISTS).

But I removed this "AND invoice_li.invoice_id=sq_inv_li.invoice_id "
from the subquery and I got the correct count *and* the same speed as
the IN query ...

Jc

#11Jochem van Dieten
jochemd@oli.tudelft.nl
In reply to: Jean-Christian Imbeault (#1)
Re: IN vs EXIIST

Quoting Jean-Christian Imbeault <jc@mega-bucks.co.jp>:

The reason being that key1 is not a primary key (key1, key2 is the
primary key). i.e. I have a table like this

key1 key2 x
------------------
a 1 t
a 2 t
a 3 f
b 1 t
b 2 t
b 3 t
c 3 t
c 4 f

So basically I want key1 values for which all the X's are true.

SELECT key1, Min(CASE WHEN x THEN 1 ELSE 0 END) AS isTrue
FROM table
GROUP BY key1
HAVING isTrue = 1

Or is my table schema wrong?

I generally don't design tables with composite keys. I find it to
complicated in many operations. But on occasion I have seen them being
used by others very efficiently.

Jochem

#12Jean-Christian Imbeault
jc@mega-bucks.co.jp
In reply to: Jochem van Dieten (#11)
Re: IN vs EXIIST

Henshall, Stuart - WCP wrote:

AND invoice_li.invoice_id=sq_inv_li.invoice_id
is the bit that tells it to check the id of the sub query and parent are
the same.
If you don't have this bit I believe it makes the subquery kind of
piontless

You are right. I had "ordered" my data set in between running the
queries. When I added some new data running the EXISTS query without the
AND a.id=b.id bit the results were fast *and* wrong ;)

I'm still scratching my head as to why EXISTS isn't faster *and* why
EXPLAIN says it should be ...

Jc

#13Jan Weerts
j.weerts@i-views.de
In reply to: Jean-Christian Imbeault (#12)
Re: IN vs EXIIST

Strangely enough doing an EXPLAIN on the two queries shows that using
EXISTS would be faster than IN ... even though it isn't ..

A sad sidenote: I am stuck here with a similar IN/EXIST problem. One
of our expensive queries contains NOT IN and IN as subqueries. As I
was advised on this list, I tried to replace IN with EXISTS. When
doing so for part of the query (omitting one of the IN subqueries) the
IN and EXIST versions are both about the same speed in execution
(about 30sec).

EXPLAIN tells me, that the EXIST version should be 15 times faster,
which it is not. Caching is also not an issue here.

EXPLAIN also shows, that both queries want to perform a sequential
scan on the outermost query part, instead of an index scan (where
clause on the primary key). If I replace the innermost query by the
results it gives (splitting the request in two requests), than the
planner uses the index scan and is in fact much faster!

My next plan is to switch from 7.1.3 to 7.2, but that requires some
planning, as the database is permamently used.

Regards
Jan

#14Jochem van Dieten
jochemd@oli.tudelft.nl
In reply to: Jan Weerts (#13)
Re: IN vs EXIIST

Quoting Jan Weerts <j.weerts@i-views.de>:

One
of our expensive queries contains NOT IN and IN as subqueries. As I
was advised on this list, I tried to replace IN with EXISTS. When
doing so for part of the query (omitting one of the IN subqueries)
the IN and EXIST versions are both about the same speed in execution
(about 30sec).

I am wondering if the NOT might negate any positive effects from the
switch to EXIST.

My next plan is to switch from 7.1.3 to 7.2, but that requires some
planning, as the database is permamently used.

I would say that is a good idea anyway, but maybe you can post your
query. There might be a way to totally rewrite it like the CASE & GROUP
BY idea for Jean-Christian (not that that idea is faster by definition,
but you never know).

Jochem

#15Jean-Christian Imbeault
jc@mega-bucks.co.jp
In reply to: Jan Weerts (#13)
Re: IN vs EXIIST

Jan Weerts wrote:

A sad sidenote: I am stuck here with a similar IN/EXIST problem. One of
our expensive queries contains NOT IN and IN as subqueries. As I was
advised on this list, I tried to replace IN with EXISTS. When doing so
for part of the query (omitting one of the IN subqueries) the IN and
EXIST versions are both about the same speed in execution (about 30sec).

At least you get the same speed. In my case replacing IN with EXISTS
makes it about 25X slower!

EXPLAIN tells me, that the EXIST version should be 15 times faster,
which it is not. Caching is also not an issue here.

Yup, same as me. BTW how do I clear the cache?

EXPLAIN also shows, that both queries want to perform a sequential scan
on the outermost query part, instead of an index scan (where clause on
the primary key).

To test this I added an index on outer query search term and
lo-and-behold ...

Just like you I get a seq scan on the outer part for both but the IN
query does a seq scan on the inner query while the EXISTS uses an index
scan. The EXISTS is still just as slow though ..

My next plan is to switch from 7.1.3 to 7.2, but that requires some
planning, as the database is permamently used.

I'm on 7.2 so I don't know if that will help ;)

Jc

#16Jean-Christian Imbeault
jc@mega-bucks.co.jp
In reply to: Jean-Christian Imbeault (#1)
Re: IN vs EXIIST

Jochem van Dieten wrote:

SELECT key1, Min(CASE WHEN x THEN 1 ELSE 0 END) AS isTrue
FROM table
GROUP BY key1
HAVING isTrue = 1

I tried this but got an error:

psql TMP -c "select invoice_id, min(case when received then 1 else 0
end) as ok from invoice_li group by invoice_id having ok = 1"

ERROR: Attribute 'ok' not found

Jc

#17Jochem van Dieten
jochemd@oli.tudelft.nl
In reply to: Jean-Christian Imbeault (#16)
Re: IN vs EXIIST

Quoting Jean-Christian Imbeault <jc@mega-bucks.co.jp>:

Jochem van Dieten wrote:

SELECT key1, Min(CASE WHEN x THEN 1 ELSE 0 END) AS isTrue
FROM table
GROUP BY key1
HAVING isTrue = 1

I tried this but got an error:
psql TMP -c "select invoice_id, min(case when received then 1 else 0
end) as ok from invoice_li group by invoice_id having ok = 1"
ERROR: Attribute 'ok' not found

Sorry, not behind the console at the moment. Try either:

SELECT invoice_id
FROM (
SELECT invoice_id, Min(CASE WHEN received THEN 1 ELSE 0 END) AS ok
FROM invoice_li
GROUP BY invoice_id
) agg_invoices
WHERE ok = 1

or:

SELECT invoice_id, Min(CASE WHEN received THEN 1 ELSE 0 END) AS ok
FROM invoice_li
GROUP BY invoice_id
HAVING Min(CASE WHEN received THEN 1 ELSE 0 END) = 1

Make sure you have an index on invoice_id. I am not sure if it is
faster because of the case that needs to be done on each row, but I
think it is worth a shot.

Jochem

#18Andrew Sullivan
andrew@libertyrms.info
In reply to: Jean-Christian Imbeault (#6)
Re: IN vs EXIIST

On Thu, Sep 19, 2002 at 06:36:56PM +0900, Jean-Christian Imbeault wrote:

Maybe EXISTS is not always faster than IN ?

I haven't been reading this thread very carefully, but that one is
_certainly_ true. In testing, we have had a wide variety of results
with IN versus EXISTS. It seems to have to do with the relative size
of the two sides. I haven't come up with a perfect rule for deciding
yet.

NOT IN/NOT EXISTS is more often a clear-cut case: NOT EXISTS is
_almost_ (but only almost) always faster.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
#19Jean-Christian Imbeault
jc@mega-bucks.co.jp
In reply to: Jean-Christian Imbeault (#1)
Re: IN vs EXIIST

Jochem van Dieten wrote:

Sorry, not behind the console at the moment. Try either:

Ok :) I tried both new versions and they work great.

But ... they are not faster than the IN version. They run at exactly the
same speed.

Another thing is that the IN query and yours don't seem to be so
dependent on the DB having been vacuum analyzed while the query with the
EXISTS does (it goes from 15 secs to 5secs after vacuuming)

I guess using IN in this case is not making the query slow.

In case it's important the table contains 16800 rows and returns 300
matches.

I'll be sticking with my IN query ...

Jc

#20Darren Ferguson
darren@crystalballinc.com
In reply to: Jean-Christian Imbeault (#19)
Re: IN vs EXIIST

I realized after he sent his notice this morning.

On Thu, 19 Sep 2002, Jean-Luc Lachance wrote:

This is not what he wants.

He wants all keys that have only true for x.

Why areyou using the sub select. If you just want all the key1 where x is
true then the following will work

SELECT DISTINCT(key1) FROM a WHERE x = TRUE;

If you want all rows and don't care about duplicates then remove the
distinct.

Hope this helps

I've seen many posts saying that using IN is not optimal and replacing
it with EXISTS is much better. I've read the only docs but I can't
understand the difference between the two or how to convert.

Can someone point me to some other docs or explain to me how to convert?
Or is my table schema wrong?

Thanks!

Jc

--
Darren Ferguson

#21Jean-Luc Lachance
jllachan@nsd.ca
In reply to: Darren Ferguson (#2)
#22Jean-Luc Lachance
jllachan@nsd.ca
In reply to: Jean-Christian Imbeault (#6)
#23Jean-Luc Lachance
jllachan@nsd.ca
In reply to: Jean-Christian Imbeault (#1)
#24Bruce Momjian
bruce@momjian.us
In reply to: Jean-Christian Imbeault (#7)