Idea for speeding up uncorrelated subqueries
Someone was just complaining over in the sql list about the poor
performance of
select name,description from descriptions
where name in (select name
from descriptions
where description like '%Bankverbindung%');
Since the inner query is uncorrelated with the outer, there's really
no need to execute it more than once, but currently it's re-executed
each time through the outer plan.
I wonder whether it wouldn't be a good idea to force a Materialize
node to be added to the top of an uncorrelated subplan? Then at
least the re-executions would be pretty cheap...
regards, tom lane
Someone was just complaining over in the sql list about the poor
performance ofselect name,description from descriptions
where name in (select name
from descriptions
where description like '%Bankverbindung%');Since the inner query is uncorrelated with the outer, there's really
no need to execute it more than once, but currently it's re-executed
each time through the outer plan.I wonder whether it wouldn't be a good idea to force a Materialize
node to be added to the top of an uncorrelated subplan? Then at
least the re-executions would be pretty cheap...
Yes, the subqueries need work. We don't even do index lookups into the
inner plan, only sequential. Already on TODO. The multiple query
executions are not on the TODO list. Not sure why this is happening
here.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <maillist@candle.pha.pa.us> writes:
Yes, the subqueries need work. We don't even do index lookups into the
inner plan, only sequential. Already on TODO.
Huh? I don't follow that at all...
The multiple query executions are not on the TODO list. Not sure why
this is happening here.
After looking at subselect.c I think I understand why --- InitPlans are
only for subqueries that are known to return a *single* reslt. When you
have a subquery that might potentially return many, many tuples, you
need to scan through those tuples, so we use SubPlan tactics even if
there's not a query correlation.
But this neglects the cost of re-executing the subplan over and over.
Materializing the result should help, no? (Of course there are cases
where it won't, such as when the subplan is just an unqualified select,
but most of the time it should be a win, I think...)
regards, tom lane
Import Notes
Reply to msg id not found: YourmessageofThu5Aug1999110728-0400199908051507.LAA26036@candle.pha.pa.us | Resolved by subject fallback
Bruce Momjian <maillist@candle.pha.pa.us> writes:
Yes, the subqueries need work. We don't even do index lookups into the
inner plan, only sequential. Already on TODO.Huh? I don't follow that at all...
Suppose you have a subquery that returns 1000 rows. There is no code so
lookups of the inner table are indexed:
select *
from tab
where col in (select col2 from tab2)
In this case, a sequential scan of the subquery results are required. I
didn't think the subquery was executed every time it needed to see if
col1 was in the subquery.
The multiple query executions are not on the TODO list. Not sure why
this is happening here.After looking at subselect.c I think I understand why --- InitPlans are
only for subqueries that are known to return a *single* reslt. When you
have a subquery that might potentially return many, many tuples, you
need to scan through those tuples, so we use SubPlan tactics even if
there's not a query correlation.But this neglects the cost of re-executing the subplan over and over.
Materializing the result should help, no? (Of course there are cases
where it won't, such as when the subplan is just an unqualified select,
but most of the time it should be a win, I think...)
No what Vadim is done MVCC, I would like to bug him to improve subquery
performance. We are tweeking the optimizer, but we have this huge
subquery performance problem here.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <maillist@candle.pha.pa.us> writes:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
Yes, the subqueries need work. We don't even do index lookups into the
inner plan, only sequential. Already on TODO.Huh? I don't follow that at all...
Suppose you have a subquery that returns 1000 rows. There is no code so
lookups of the inner table are indexed:
select *
from tab
where col in (select col2 from tab2)
In this case, a sequential scan of the subquery results are required.
Well, yes, the subquery is a sequential scan. I guess what you are
envisioning is rewriting this into some kind of nested-loop join?
For simple cases that might be possible...
regards, tom lane
Import Notes
Reply to msg id not found: YourmessageofThu5Aug1999114806-0400199908051548.LAA27782@candle.pha.pa.us | Resolved by subject fallback
Bruce Momjian <maillist@candle.pha.pa.us> writes:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
Yes, the subqueries need work. We don't even do index lookups into the
inner plan, only sequential. Already on TODO.Huh? I don't follow that at all...
Suppose you have a subquery that returns 1000 rows. There is no code so
lookups of the inner table are indexed:select *
from tab
where col in (select col2 from tab2)In this case, a sequential scan of the subquery results are required.
Well, yes, the subquery is a sequential scan. I guess what you are
envisioning is rewriting this into some kind of nested-loop join?
For simple cases that might be possible...
Yes, or mergejoin/hashjoin.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
Yes, the subqueries need work. We don't even do index lookups into the
inner plan, only sequential. Already on TODO. The multiple query
^^^^^^^^^^^^^^^
What? Indices are used when appropriate.
Vadim
Bruce Momjian wrote:
Yes, the subqueries need work. We don't even do index lookups into the
inner plan, only sequential. Already on TODO. The multiple query^^^^^^^^^^^^^^^
What? Indices are used when appropriate.
Sorry, bad wording. My English should be better. :-)
I meant to say that joins from the outer plan to subplan are always
nested loops, not hash or mergejoins. If you have something like:
delete from tab where col in (select col2 from largetable)
it takes a long time, when it really should be quick. That is why
people are encouraged to use EXISTS.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
Suppose you have a subquery that returns 1000 rows. There is no code so
lookups of the inner table are indexed:select *
from tab
where col in (select col2 from tab2)In this case, a sequential scan of the subquery results are required. I
didn't think the subquery was executed every time it needed to see if
col1 was in the subquery.
Oh, well. These are cases when query may be rewritten with EXISTS.
This is possible when there are no aggregates in subquery.
After looking at subselect.c I think I understand why --- InitPlans are
only for subqueries that are known to return a *single* reslt. When you
have a subquery that might potentially return many, many tuples, you
need to scan through those tuples, so we use SubPlan tactics even if
there's not a query correlation.
Yes. But as I said already, you can use InitPlan (i.e. execute subquery
first) after removing duplicates from subquery results.
But this neglects the cost of re-executing the subplan over and over.
Materializing the result should help, no? (Of course there are cases
We could not only cache subquery results (materialization) but also
hash them.
where it won't, such as when the subplan is just an unqualified select,
but most of the time it should be a win, I think...)
In such cases, if there are no aggregates in subquery then EXISTS
could be used else materialization will still help.
No what Vadim is done MVCC, I would like to bug him to improve subquery
performance. We are tweeking the optimizer, but we have this huge
subquery performance problem here.
No, Bruce. I'm in WAL now. I think that we need in recovery
(remember that you'll lose indices being updated when some
crash took place), fast backup (it's easy to copy part of log
than dump 1Gb table), fast commits (<= 1 fsync per commit
using group commit, instead of >= 2 fsyncs now), savepoints
AND buffered logging, which you, Bruce, want so much,
and so long -:).
Vadim
where it won't, such as when the subplan is just an unqualified select,
but most of the time it should be a win, I think...)In such cases, if there are no aggregates in subquery then EXISTS
could be used else materialization will still help.No what Vadim is done MVCC, I would like to bug him to improve subquery
performance. We are tweeking the optimizer, but we have this huge
subquery performance problem here.No, Bruce. I'm in WAL now. I think that we need in recovery
(remember that you'll lose indices being updated when some
crash took place), fast backup (it's easy to copy part of log
than dump 1Gb table), fast commits (<= 1 fsync per commit
using group commit, instead of >= 2 fsyncs now), savepoints
AND buffered logging, which you, Bruce, want so much,
and so long -:).
Oh, no, I have been outmaneuvered by Vadim.
Help.
Isn't it something that takes only a few hours to implement. We can't
keep telling people to us EXISTS, especially because most SQL people
think correlated queries are slower that non-correlated ones. Can we
just on-the-fly rewrite the query to use exists?
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
Isn't it something that takes only a few hours to implement. We can't
keep telling people to us EXISTS, especially because most SQL people
think correlated queries are slower that non-correlated ones. Can we
just on-the-fly rewrite the query to use exists?
This seems easy to implement. We could look does subquery have
aggregates or not before calling union_planner() in
subselect.c:_make_subplan() and rewrite it (change
slink->subLinkType from IN to EXISTS and add quals).
Without caching implemented IN-->EXISTS rewriting always
has sence.
After implementation of caching we probably should call union_planner()
for both original/modified subqueries and compare costs/sizes
of EXISTS/IN_with_caching plans and maybe even make
decision what plan to use after parent query is planned
and we know for how many parent rows subplan will be executed.
Vadim
Bruce Momjian <maillist@candle.pha.pa.us> writes:
Isn't it something that takes only a few hours to implement. We can't
keep telling people to us EXISTS, especially because most SQL people
think correlated queries are slower that non-correlated ones. Can we
just on-the-fly rewrite the query to use exists?
I was just about to suggest exactly that. The "IN (subselect)"
notation seems to be a lot more intuitive --- at least, people
keep coming up with it --- so why not rewrite it to the EXISTS
form, if we can handle that more efficiently?
regards, tom lane
Import Notes
Reply to msg id not found: YourmessageofThu5Aug1999233101-0400199908060331.XAA22277@candle.pha.pa.us | Resolved by subject fallback
Bruce Momjian <maillist@candle.pha.pa.us> writes:
Isn't it something that takes only a few hours to implement. We can't
keep telling people to us EXISTS, especially because most SQL people
think correlated queries are slower that non-correlated ones. Can we
just on-the-fly rewrite the query to use exists?I was just about to suggest exactly that. The "IN (subselect)"
notation seems to be a lot more intuitive --- at least, people
keep coming up with it --- so why not rewrite it to the EXISTS
form, if we can handle that more efficiently?
Yes, we have the nice subselect feature. I just hate to see it not
completely finished, performance-wise.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
I thought that all correlated and uncorrelated sub-queries could be
rewritten as a join, simplifying the query tree. It should be a mechanical
process which can probably be performed in the rewriter.
Somebody put me right.
MikeA
Show quoted text
-----Original Message-----
From: Bruce Momjian [mailto:maillist@candle.pha.pa.us]
Sent: Friday, August 06, 1999 7:11 AM
To: Tom Lane
Cc: Vadim Mikheev; pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] Idea for speeding up uncorrelated subqueriesBruce Momjian <maillist@candle.pha.pa.us> writes:
Isn't it something that takes only a few hours to
implement. We can't
keep telling people to us EXISTS, especially because
most SQL people
think correlated queries are slower that non-correlated
ones. Can we
just on-the-fly rewrite the query to use exists?
I was just about to suggest exactly that. The "IN (subselect)"
notation seems to be a lot more intuitive --- at least, people
keep coming up with it --- so why not rewrite it to the EXISTS
form, if we can handle that more efficiently?Yes, we have the nice subselect feature. I just hate to see it not
completely finished, performance-wise.-- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Import Notes
Resolved by subject fallback
"Ansley, Michael" wrote:
I thought that all correlated and uncorrelated sub-queries could be
rewritten as a join, simplifying the query tree. It should be a mechanical
process which can probably be performed in the rewriter.
IN can't be rewritten as a join! Subquery may return duplicates
and join would return tuple for all of them.
And how about WHERE x = (select max(y) from ...) ?
And even for WHERE x = (select y from ...) we have to check
that subquery returns exactly ONE row, or abort.
Vadim