Recursive optimization of IN subqueries
Hi
As far as I can tell, the pull_up_IN_clauses does not optimize
recursively. Am I totally misguided here?
Index: plan/subselect.c
===================================================================
RCS file:
/projects/cvsroot/pgsql-server/src/backend/optimizer/plan/subselect.c,v
retrieving revision 1.85
diff -u -r1.85 subselect.c
--- plan/subselect.c 25 Nov 2003 23:59:12 -0000 1.85
+++ plan/subselect.c 20 Jan 2004 15:21:55 -0000
@@ -716,6 +716,14 @@
if (contain_volatile_functions((Node *) sublink->lefthand))
return NULL;
+ /*
+ * Optimize recursive
+ */
+ subselect->in_info_list = NIL;
+ if (subselect->hasSubLinks)
+ subselect->jointree->quals = pull_up_IN_clauses(subselect,
+
subselect->jointree->quals);
+
/*
* Okay, pull up the sub-select into top range table and jointree.
*
--
Dennis
Dennis Haney <davh@diku.dk> writes:
As far as I can tell, the pull_up_IN_clauses does not optimize
recursively. Am I totally misguided here?
Yes. The subquery is not being physically folded into the outer query
(so the comment about "pulling up" may be a poor choice of words).
It will still get planned separately by a recursive call to
subquery_planner, and any internal INs will get fixed at that time.
It is possible and even rather likely that the subsequent run of
pull_up_subqueries will flatten the subquery into the outer query,
and if so its internal INs are fixed during pull_up_subqueries.
But doing it here would be redundant.
You can easily prove by experiment that multi-level flattening does
happen, for instance:
regression=# explain select * from tenk1 a where unique1 in
regression-# (select unique2 from tenk1 b where unique1 in
regression(# (select thousand from tenk1 c where hundred = 99));
QUERY PLAN
--------------------------------------------------------------------------------
------------------------
Nested Loop (cost=411.66..471.82 rows=10 width=244)
-> HashAggregate (cost=411.66..411.66 rows=10 width=4)
-> Nested Loop (cost=351.47..411.63 rows=10 width=4)
-> HashAggregate (cost=351.47..351.47 rows=10 width=4)
-> Index Scan using tenk1_hundred on tenk1 c (cost=0.00..351.23 rows=99 width=4)
Index Cond: (hundred = 99)
-> Index Scan using tenk1_unique1 on tenk1 b (cost=0.00..6.00 rows=1 width=8)
Index Cond: (b.unique1 = "outer".thousand)
-> Index Scan using tenk1_unique1 on tenk1 a (cost=0.00..6.00 rows=1 width=244)
Index Cond: (a.unique1 = "outer".unique2)
(10 rows)
regards, tom lane
Tom Lane wrote:
Dennis Haney <davh@diku.dk> writes:
As far as I can tell, the pull_up_IN_clauses does not optimize
recursively. Am I totally misguided here?Yes. The subquery is not being physically folded into the outer query
(so the comment about "pulling up" may be a poor choice of words).
It will still get planned separately by a recursive call to
subquery_planner, and any internal INs will get fixed at that time.It is possible and even rather likely that the subsequent run of
pull_up_subqueries will flatten the subquery into the outer query,
and if so its internal INs are fixed during pull_up_subqueries.
But doing it here would be redundant.
I think I figured it out now, after looking at it for hours...
I saw it as though convert_IN_to_join rewrote the query from
select a.* from tenk1 a where a.unique1 in
(select c.thousand from tenk1 c where c.hundred = 99);
to
select a.* from tenk1 a, tenk1 c where a.unique1 = c.thousand AND
c.hundred = 99;
But after looking at it, I've reached the conclusion that the rewrite is
to this instead:
select a.* from tenk1 a, (select d.thousand from tenk1 d where
d.hundred = 99) as c where a.unique1 = c.thousand;
except the subselect is added as a range table entry instead of a
subselect in the from-list (not that I understand this particular part,
do you mind explaining?).
Or am I still totally lost?
--
Dennis
Dennis Haney <davh@diku.dk> writes:
I saw it as though convert_IN_to_join rewrote the query from
select a.* from tenk1 a where a.unique1 in
(select c.thousand from tenk1 c where c.hundred = 99);
to
select a.* from tenk1 a, tenk1 c where a.unique1 = c.thousand AND
c.hundred = 99;
But after looking at it, I've reached the conclusion that the rewrite is
to this instead:
select a.* from tenk1 a, (select d.thousand from tenk1 d where
d.hundred = 99) as c where a.unique1 = c.thousand;
Right. We do that, and then subsequently pull_up_subqueries transforms
it to the other representation. The reason for this two-step approach
is that the intermediate form is still a useful improvement if the
subquery can't be pulled up for some reason (e.g., it's got grouping).
except the subselect is added as a range table entry instead of a
subselect in the from-list (not that I understand this particular part,
do you mind explaining?).
Same thing. Every entry in the from-list will have both an RTE and an
entry in the join tree. This representation is partly historical
(before we had outer joins, there was only the range table and no join
tree at all), but it is convenient for many purposes.
regards, tom lane
PS: this is a bit off-topic for pgsql-general, please pursue it on
-hackers if you have more questions.
Tom Lane wrote:
Dennis Haney <davh@diku.dk> writes:
I saw it as though convert_IN_to_join rewrote the query from
select a.* from tenk1 a where a.unique1 in
(select c.thousand from tenk1 c where c.hundred = 99);to
select a.* from tenk1 a, tenk1 c where a.unique1 = c.thousand AND
c.hundred = 99;but after looking at it, I've reached the conclusion that the rewrite is
to this instead:select a.* from tenk1 a, (select d.thousand from tenk1 d where
d.hundred = 99) as c where a.unique1 = c.thousand;Right. We do that, and then subsequently pull_up_subqueries transforms
it to the other representation. The reason for this two-step approach
is that the intermediate form is still a useful improvement if the
subquery can't be pulled up for some reason (e.g., it's got grouping).
With improvement I can see that it can be materialized and thus used as
a normal table in the planner. Is there any additional reasons that I
can't see?
But this limited optimization makes me wonder, why the limitation to
optimizing '='?
And why must the lefthand of the sublink be a variable of the upper query?
except the subselect is added as a range table entry instead of a
subselect in the from-list (not that I understand this particular part,
do you mind explaining?).Same thing. Every entry in the from-list will have both an RTE and an
entry in the join tree. This representation is partly historical
(before we had outer joins, there was only the range table and no join
tree at all), but it is convenient for many purposes.
Then I don't understand why it gives two different execution plans? And
the Query* is totally different for the two, eg. there is no RTE for the
subquery in the first query:
davh=# explain select a.* from test1 a, (select num from test1 where id = 2) as b where a.num = b.num;
QUERY PLAN
------------------------------------------------------------------------------------
Hash Join (cost=4.83..29.94 rows=11 width=8)
Hash Cond: ("outer".num = "inner".num)
-> Seq Scan on test1 a (cost=0.00..20.00 rows=1000 width=8)
-> Hash (cost=4.82..4.82 rows=2 width=4)
-> Index Scan using test1_pkey on test1 (cost=0.00..4.82 rows=2 width=4)
Index Cond: (id = 2)
(6 rows)
davh=# explain select a.* from test1 a where a.num in (select num from test1 where id = 2);
QUERY PLAN
------------------------------------------------------------------------------------
Hash IN Join (cost=4.83..28.75 rows=6 width=8)
Hash Cond: ("outer".num = "inner".num)
-> Seq Scan on test1 a (cost=0.00..20.00 rows=1000 width=8)
-> Hash (cost=4.82..4.82 rows=2 width=4)
-> Index Scan using test1_pkey on test1 (cost=0.00..4.82 rows=2 width=4)
Index Cond: (id = 2)
(6 rows)
PS: this is a bit off-topic for pgsql-general, please pursue it on
-hackers if you have more questions.
ok
--
Dennis
Dennis Haney <davh@diku.dk> writes:
But this limited optimization makes me wonder, why the limitation to
optimizing '='?
In the first place, you wouldn't get any improvement anyway if the
combining operator is not '=' --- if it isn't, then merge and hash join
aren't applicable and so you're gonna end up with a nestloop anyhow,
which is no better than what the executor will do with a subselect.
In the second place, what the code is doing is dependent on an understanding
of the semantics of IN; I'm not sure it's applicable to, say,
WHERE outervar > ANY (SELECT innervar FROM ...)
and it's definitely not applicable to
WHERE outervar > ALL (SELECT innervar FROM ...)
In particular, the optimization paths that involve unique-ifying the
subselect output and then using it as the outer side of a join would
definitely not work for these sorts of things.
And why must the lefthand of the sublink be a variable of the upper query?
Otherwise the expression isn't a join and I don't think the semantics are
the same as the code is expecting.
Then I don't understand why it gives two different execution plans?
They look the same to me, other than that a different join rule is
needed (because after all IN is not the same thing as a straight join).
regards, tom lane
Tom Lane writes
In the second place, what the code is doing is dependent on an
understanding
of the semantics of IN; I'm not sure it's applicable to, say,
WHERE outervar > ANY (SELECT innervar FROM ...)
and it's definitely not applicable to
WHERE outervar > ALL (SELECT innervar FROM ...)
In particular, the optimization paths that involve unique-ifying the
subselect output and then using it as the outer side of a join would
definitely not work for these sorts of things.
I'm not sure if I've understood you correctly in the section above. Are
you saying that these types of queries don't have a meaningful or
defined response? Or just that they wouldn't be very well optimized as a
result of the unique-ifying code changes? Or have I just mis-read the
thread...
My understanding is that in ANSI SQL99, the expression
expression > ALL (subquery)
- is TRUE when expression is greater than every value in the set
of values returned by subquery.
- is TRUE if subquery returns no values.
The expression
expression > ANY (subquery)
- is TRUE when expression is greater than at least one value of
the set of values returned by subquery.
- is FALSE if subsquery returns no values.
(As supported by Oracle 9iv2 and Teradata v2r5.0.)
Best regards, Simon
Tom Lane writes
In the second place, what the code is doing is dependent on an
understanding
of the semantics of IN; I'm not sure it's applicable to, say,
WHERE outervar > ANY (SELECT innervar FROM ...)
and it's definitely not applicable to
WHERE outervar > ALL (SELECT innervar FROM ...)
In particular, the optimization paths that involve unique-ifying the
subselect output and then using it as the outer side of a join would
definitely not work for these sorts of things.
I'm not sure if I've understood you correctly in the section above. Are
you saying that these types of queries don't have a meaningful or
defined response? Or just that they wouldn't be very well optimized as a
result of the unique-ifying code changes? Or have I just mis-read the
thread...
My understanding is that in ANSI SQL99, the expression
expression > ALL (subquery)
- is TRUE when expression is greater than every value in the set
of values returned by subquery.
- is TRUE if subquery returns no values.
The expression
expression > ANY (subquery)
- is TRUE when expression is greater than at least one value of
the set of values returned by subquery.
- is FALSE if subsquery returns no values.
(As supported by Oracle 9iv2 and Teradata v2r5.0.)
Best regards, Simon
Simon Riggs wrote:
Tom Lane writes
In the second place, what the code is doing is dependent on an
understanding
of the semantics of IN; I'm not sure it's applicable to, say,
WHERE outervar > ANY (SELECT innervar FROM ...)
and it's definitely not applicable to
WHERE outervar > ALL (SELECT innervar FROM ...)
In particular, the optimization paths that involve unique-ifying the
subselect output and then using it as the outer side of a join would
definitely not work for these sorts of things.I'm not sure if I've understood you correctly in the section above. Are
you saying that these types of queries don't have a meaningful or
defined response? Or just that they wouldn't be very well optimized as a
result of the unique-ifying code changes? Or have I just mis-read the
thread...
I think Tom is refering to the context of the specific optimization.
The optimization we are discussing does nothing to correlated
subqueries, and a uncorrolated subquery with > ALL/ANY is actually a
computed constant and not a join.
--
Dennis
My mistake then. Better to check than let a logical hole in. Thanks for
letting me know, Simon
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Dennis Haney
Sent: Tuesday, January 27, 2004 14:33
To: simon@2ndquadrant.com
Cc: 'Tom Lane'; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Recursive optimization of IN subqueries
Simon Riggs wrote:
Tom Lane writes
In the second place, what the code is doing is dependent on an
understanding
of the semantics of IN; I'm not sure it's applicable to, say,
WHERE outervar > ANY (SELECT innervar FROM ...)
and it's definitely not applicable to
WHERE outervar > ALL (SELECT innervar FROM ...)
In particular, the optimization paths that involve unique-ifying the
subselect output and then using it as the outer side of a join would
definitely not work for these sorts of things.
I'm not sure if I've understood you correctly in the section above. Are
you saying that these types of queries don't have a meaningful or
defined response? Or just that they wouldn't be very well optimized as a
result of the unique-ifying code changes? Or have I just mis-read the
thread...
I think Tom is refering to the context of the specific optimization.
The optimization we are discussing does nothing to correlated
subqueries, and a uncorrolated subquery with > ALL/ANY is actually a
computed constant and not a join.
--
Dennis
"Simon Riggs" <simon@2ndquadrant.com> writes:
Tom Lane writes
In particular, the optimization paths that involve unique-ifying the
subselect output and then using it as the outer side of a join would
definitely not work for these sorts of things.
I'm not sure if I've understood you correctly in the section above. Are
you saying that these types of queries don't have a meaningful or
defined response? Or just that they wouldn't be very well optimized as a
result of the unique-ifying code changes?
I mean that if the unique-ifying implementation were used, it'd deliver
the wrong answer (too many rows out). You could possibly carry through
a set of extensions to check which kind of sub-SELECT was in use and not
apply transformations that aren't correct, but it'd be a great deal more
complexity for something of marginal value. As far as I've seen, people
don't use inequalities in ANY/ALL subselects very much, and so I'm not
excited about complicating the planner to support them better.
regards, tom lane