Subqueries and indexes

Started by Bruce Momjianalmost 27 years ago7 messages
#1Bruce Momjian
maillist@candle.pha.pa.us

In this QUERY:

SELECT keyname
FROM markmain
WHERE mark_id NOT IN(SELECT mark_id
FROM markaty)

I have an index on markaty.mark_id, and have vacuum analyzed. EXPLAIN
shows:

Seq Scan on markmain (cost=2051.43 size=45225 width=12)
SubPlan
-> Seq Scan on markaty (cost=2017.41 size=52558 width=4)

Vadim, why isn't this using the index? Each table has 50k rows. Is it
NOT IN that is causing the problem? IN produces the same plan, though.
If I do a traditional join:

SELECT keyname
FROM markmain , markaty
WHERE markmain.mark_id = markaty.mark_id

I then get a hash join plan:

Hash Join (cost=10768.51 size=90519 width=20)
-> Seq Scan on markmain (cost=2051.43 size=45225 width=16)
-> Hash (cost=0.00 size=0 width=0)
-> Seq Scan on markaty (cost=2017.41 size=52558 width=4)

Seems the optimizer could either hash the subquery, or us an index.
Certainly would be faster than a sequental scan, no?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life IS a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#2Vadim Mikheev
vadim@krs.ru
In reply to: Bruce Momjian (#1)
Re: [HACKERS] Subqueries and indexes

Bruce Momjian wrote:

In this QUERY:

SELECT keyname
FROM markmain
WHERE mark_id NOT IN(SELECT mark_id
FROM markaty)

I have an index on markaty.mark_id, and have vacuum analyzed. EXPLAIN
shows:

Seq Scan on markmain (cost=2051.43 size=45225 width=12)
SubPlan
-> Seq Scan on markaty (cost=2017.41 size=52558 width=4)

Vadim, why isn't this using the index? Each table has 50k rows. Is it
NOT IN that is causing the problem? IN produces the same plan, though.

....

Seems the optimizer could either hash the subquery, or us an index.
Certainly would be faster than a sequental scan, no?

Optimizer should hash the subquery, but I didn't implement this -:(
Try to rewrite query using NOT EXISTS and index will be used.

Vadim

#3Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Vadim Mikheev (#2)
Re: [HACKERS] Subqueries and indexes

Bruce Momjian wrote:

In this QUERY:

SELECT keyname
FROM markmain
WHERE mark_id NOT IN(SELECT mark_id
FROM markaty)

I have an index on markaty.mark_id, and have vacuum analyzed. EXPLAIN
shows:

Seq Scan on markmain (cost=2051.43 size=45225 width=12)
SubPlan
-> Seq Scan on markaty (cost=2017.41 size=52558 width=4)

Vadim, why isn't this using the index? Each table has 50k rows. Is it
NOT IN that is causing the problem? IN produces the same plan, though.

....

Seems the optimizer could either hash the subquery, or us an index.
Certainly would be faster than a sequental scan, no?

Optimizer should hash the subquery, but I didn't implement this -:(
Try to rewrite query using NOT EXISTS and index will be used.

How hard would it be to implement it? I know you are deep into MVCC,
but doing a nested loop to join a subquery is really bad.

Now, in our defense, I tried this with commercial Ingres 6.4, and it
took so long I copied the data into PostgreSQL and tried to run it
there. Eventually, I copied the data into a second table, and did a
DELETE FROM using two tables in the WHERE clause, and the rows left
where my NOT IN result. It did use a hash join in that case.

Obviously, Ingres was doing a nested loop do, but I want to do better
than Ingres.

I think we really need to get that hash enabled. Is there something I
can do to enable it, or can I do something to help you enable it?

All queries can't be rewritten as EXISTS.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#4Vadim Mikheev
vadim@krs.ru
In reply to: Bruce Momjian (#3)
Re: [HACKERS] Subqueries and indexes

Bruce Momjian wrote:

Seems the optimizer could either hash the subquery, or us an index.
Certainly would be faster than a sequental scan, no?

Optimizer should hash the subquery, but I didn't implement this -:(
Try to rewrite query using NOT EXISTS and index will be used.

How hard would it be to implement it? I know you are deep into MVCC,
but doing a nested loop to join a subquery is really bad.

Not very hard, for un-correlated subqueries at least.
I have no time to do this for 6.5...

...

All queries can't be rewritten as EXISTS.

All except of subqueries with aggregates in target list.

Vadim

#5Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Vadim Mikheev (#4)
Re: [HACKERS] Subqueries and indexes

Bruce Momjian wrote:

Seems the optimizer could either hash the subquery, or us an index.
Certainly would be faster than a sequental scan, no?

Optimizer should hash the subquery, but I didn't implement this -:(
Try to rewrite query using NOT EXISTS and index will be used.

How hard would it be to implement it? I know you are deep into MVCC,
but doing a nested loop to join a subquery is really bad.

Not very hard, for un-correlated subqueries at least.
I have no time to do this for 6.5...

Is it possible before 6.5 final?

All queries can't be rewritten as EXISTS.

All except of subqueries with aggregates in target list.

I am confused. How do I rewrite this to use exists?

SELECT keyname
FROM markmain
WHERE mark_id NOT IN(SELECT mark_id
FROM markaty)

Even if I use IN instead of NOT IN, I don't see how to do it without
making it a correlated subquery.

SELECT keyname
FROM markmain
WHERE EXISTS (SELECT mark_id
FROM markaty
WHERE markmain.mark_id = markaty.mark_id)

This is a correlated subquery. It did not use hash, but it did use the
index on markaty:

Seq Scan on markmain (cost=16.02 size=334 width=12)
SubPlan
-> Index Scan using i_markaty on markaty (cost=2.10 size=3 width=4)

While the index usage is good, the fact is the subquery is executed for
every row of markmain, isn't it? That's one query executed for each row
in markmain, isn't it?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#6Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Bruce Momjian (#5)
Re: [HACKERS] Subqueries and indexes

All except of subqueries with aggregates in target list.

I am confused. How do I rewrite this to use exists?

SELECT keyname
FROM markmain
WHERE mark_id NOT IN(SELECT mark_id
FROM markaty)

Even if I use IN instead of NOT IN, I don't see how to do it without
making it a correlated subquery.

SELECT keyname
FROM markmain
WHERE EXISTS (SELECT mark_id
FROM markaty
WHERE markmain.mark_id = markaty.mark_id)

This is a correlated subquery. It did not use hash, but it did use the
index on markaty:

Seq Scan on markmain (cost=16.02 size=334 width=12)
SubPlan
-> Index Scan using i_markaty on markaty (cost=2.10 size=3 width=4)

While the index usage is good, the fact is the subquery is executed for
every row of markmain, isn't it? That's one query executed for each row
in markmain, isn't it?

I just tried this with NOT EXISTS, and it was VERY fast. Can we discuss
the issues, and perhaps auto-rewrite these as exists. Is that always
better than hash?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#7Vadim Mikheev
vadim@krs.ru
In reply to: Bruce Momjian (#6)
Re: [HACKERS] Subqueries and indexes

Bruce Momjian wrote:

While the index usage is good, the fact is the subquery is executed for
every row of markmain, isn't it? That's one query executed for each row
in markmain, isn't it?

I just tried this with NOT EXISTS, and it was VERY fast. Can we discuss
the issues, and perhaps auto-rewrite these as exists. Is that always
better than hash?

Not always, but there is no hashing currently, so you could try
re-writing for IN/NOT IN subqueries without aggregates...

Keep in mind that expression subqueries must return <= 1 rows,
so it's probably better don't rewrite them (or you'll have to
add this check to EXISTS code).

Vadim