subselect

Started by Bruce Momjianabout 28 years ago13 messages

maillist@candle.pha.pa.us

about 28 years ago

I was thinking about subselects, and how to attach the two queries.

What if the subquery makes a range table entry in the outer query, and
the query is set up like the UNION queries where we put the scans in a
row, but in the case we put them over/under each other.

And we push a temp table into the catalog cache that represents the
result of the subquery, then we could join to it in the outer query as
though it was a real table.

Also, can't we do the correlated subqueries by adding the proper
target/output columns to the subquery, and have the outer query
reference those columns in the subquery range table entry.

Maybe I can write up a sample of this? Vadim, would this help? Is this
the point we are stuck at?

--
Bruce Momjian
maillist@candle.pha.pa.us

Vadim B. Mikheev

vadim@sable.krasnoyarsk.su

about 28 years ago

In reply to: Bruce Momjian (#1)

Re: [HACKERS] subselect

Bruce Momjian wrote:

I was thinking about subselects, and how to attach the two queries.

What if the subquery makes a range table entry in the outer query, and
the query is set up like the UNION queries where we put the scans in a
row, but in the case we put them over/under each other.

And we push a temp table into the catalog cache that represents the
result of the subquery, then we could join to it in the outer query as
though it was a real table.

Also, can't we do the correlated subqueries by adding the proper
target/output columns to the subquery, and have the outer query
reference those columns in the subquery range table entry.

Yes, this is a way to handle subqueries by joining to temp table.
After getting plan we could change temp table access path to
node material. On the other hand, it could be useful to let optimizer
know about cost of temp table creation (have to think more about it)...
Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN
is one example of this - joining by <> will give us invalid results.
Setting special NOT EQUAL flag is not enough: subquery plan must be
always inner one in this case. The same for handling ALL modifier.
Note, that we generaly can't use aggregates here: we can't add MAX to
subquery in the case of > ALL (subquery), because of > ALL should return FALSE
if subquery returns NULL(s) but aggregates don't take NULLs into account.

Maybe I can write up a sample of this? Vadim, would this help? Is this
the point we are stuck at?

Personally, I was stuck by holydays -:)
Now I can spend ~ 8 hours ~ each day for development...

Vadim

Goran Thyni

goran@bildbasen.se

about 28 years ago

In reply to: Vadim B. Mikheev (#2)

Re: [HACKERS] subselect

Vadim,

Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN
is one example of this - joining by <> will give us invalid results.

What is you approach towards this problem?
I got an idea that one could reverse the order,
that is execute the outer first into a temptable
and delete from that according to the result of the
subquery and then return it.
Probably this is too raw and slow. ;-)

Personally, I was stuck by holydays -:)
Now I can spend ~ 8 hours ~ each day for development...

Oh, isn't it christmas eve right now in Russia?

best regards,
--
---------------------------------------------
Gï¿½ran Thyni, sysadm, JMS Bildbasen, Kiruna

Bruce Momjian

maillist@candle.pha.pa.us

about 28 years ago

In reply to: Vadim B. Mikheev (#2)

Re: [HACKERS] subselect

Yes, this is a way to handle subqueries by joining to temp table.
After getting plan we could change temp table access path to
node material. On the other hand, it could be useful to let optimizer
know about cost of temp table creation (have to think more about it)...
Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN
is one example of this - joining by <> will give us invalid results.
Setting special NOT EQUAL flag is not enough: subquery plan must be
always inner one in this case. The same for handling ALL modifier.
Note, that we generaly can't use aggregates here: we can't add MAX to
subquery in the case of > ALL (subquery), because of > ALL should return FALSE
if subquery returns NULL(s) but aggregates don't take NULLs into account.

OK, here are my ideas. First, I think you have to handle subselects in
the outer node because a subquery could have its own subquery. Also, we
now have a field in Aggreg to all us to 'usenulls'.

OK, here it is. I recommend we pass the outer and subquery through
the parser and optimizer separately.

We parse the subquery first. If the subquery is not correlated, it
should parse fine. If it is correlated, any columns we find in the
subquery that are not already in the FROM list, we add the table to the
subquery FROM list, and add the referenced column to the target list of
the subquery.

When we are finished parsing the subquery, we create a catalog cache
entry for it called 'sub1' and make its fields match the target
list of the subquery.

In the outer query, we add 'sub1' to its target list, and change
the subquery reference to point to the new range table. We also add
WHERE clauses to do any correlated joins.

Here is a simple example:

select *
from taba
where col1 = (select col2
from tabb)

This is not correlated, and the subquery parser easily. We create a
'sub1' catalog cache entry, and add 'sub1' to the outer query FROM
clause. We also replace 'col1 = (subquery)' with 'col1 = sub1.col2'.

Here is a more complex correlated subquery:

select *
from taba
where col1 = (select col2
from tabb
where taba.col3 = tabb.col4)

Here we must add 'taba' to the subquery's FROM list, and add col3 to the
target list of the subquery. After we parse the subquery, add 'sub1' to
the FROM list of the outer query, change 'col1 = (subquery)' to 'col1 =
sub1.col2', and add to the outer WHERE clause 'AND taba.col3 = sub1.col3'.
THe optimizer will do the correlation for us.

In the optimizer, we can parse the subquery first, then the outer query,
and then replace all 'sub1' references in the outer query to use the
subquery plan.

I realize making merging the two plans and doing IN and NOT IN is the
real challenge, but I hoped this would give us a start.

What do you think?

--
Bruce Momjian
maillist@candle.pha.pa.us

Vadim B. Mikheev

vadim@sable.krasnoyarsk.su

about 28 years ago

In reply to: Bruce Momjian (#4)

Re: [HACKERS] subselect

Bruce Momjian wrote:

always inner one in this case. The same for handling ALL modifier.
Note, that we generaly can't use aggregates here: we can't add MAX to
subquery in the case of > ALL (subquery), because of > ALL should return FALSE
if subquery returns NULL(s) but aggregates don't take NULLs into account.

OK, here are my ideas. First, I think you have to handle subselects in
the outer node because a subquery could have its own subquery. Also, we

I hope that this is no matter: if results of subquery (with/without sub-subqueries)
will go into temp table then this table will be re-scanned for each outer tuple.

now have a field in Aggreg to all us to 'usenulls'.

^^^^^^^^
This can't help:

vac=> select * from x;
y
-
1
2
3
<<< this is NULL
(4 rows)

vac=> select max(y) from x;
max
---
3

==> we can't replace

select * from A where A.a > ALL (select y from x);
^^^^^^^^^^^^^^^
(NULL will be returned and so A.a > ALL is FALSE - this is what
Sybase does, is it right ?)
with

select * from A where A.a > (select max(y) from x);
^^^^^^^^^^^^^^^^^^^^
just because of we lose knowledge about NULLs here.

Also, I would like to handle ANY and ALL modifiers for all bool
operators, either built-in or user-defined, for all data types -
isn't PostgreSQL OO-like RDBMS -:)

OK, here it is. I recommend we pass the outer and subquery through
the parser and optimizer separately.

I don't like this. I would like to get parse-tree from parser for
entire query and let optimizer (on upper level) decide how to rewrite
parse-tree and what plans to produce and how these plans should be
merged. Note, that I don't object your methods below, but only where
to place handling of this. I don't understand why should we add
new part to the system which will do optimizer' work (parse-tree -->
execution plan) and deal with optimizer nodes. Imho, upper optimizer
level is nice place to do this.

We parse the subquery first. If the subquery is not correlated, it
should parse fine. If it is correlated, any columns we find in the
subquery that are not already in the FROM list, we add the table to the
subquery FROM list, and add the referenced column to the target list of
the subquery.

When we are finished parsing the subquery, we create a catalog cache
entry for it called 'sub1' and make its fields match the target
list of the subquery.

In the outer query, we add 'sub1' to its target list, and change
the subquery reference to point to the new range table. We also add
WHERE clauses to do any correlated joins.

...

Here is a more complex correlated subquery:

select *
from taba
where col1 = (select col2
from tabb
where taba.col3 = tabb.col4)

Here we must add 'taba' to the subquery's FROM list, and add col3 to the
target list of the subquery. After we parse the subquery, add 'sub1' to
the FROM list of the outer query, change 'col1 = (subquery)' to 'col1 =
sub1.col2', and add to the outer WHERE clause 'AND taba.col3 = sub1.col3'.
THe optimizer will do the correlation for us.

In the optimizer, we can parse the subquery first, then the outer query,
and then replace all 'sub1' references in the outer query to use the
subquery plan.

I realize making merging the two plans and doing IN and NOT IN is the

^^^^^^^^^^^^^^^^^^^^^
This is very easy to do! As I already said we have just change sub1
access path (SeqScan of sub1) with SeqScan of Material node with
subquery plan.

real challenge, but I hoped this would give us a start.

Decision about how to record subquery stuff in to parse-tree
would be very good start -:)

BTW, note that for _expression_ subqueries (which are introduced without
IN, EXISTS, ALL, ANY - this follows Sybase' naming) - as in your examples -
we have to check that subquery returns single tuple...

Vadim

Bruce Momjian

maillist@candle.pha.pa.us

about 28 years ago

In reply to: Vadim B. Mikheev (#5)

Re: [HACKERS] subselect

Bruce Momjian wrote:

always inner one in this case. The same for handling ALL modifier.
Note, that we generaly can't use aggregates here: we can't add MAX to
subquery in the case of > ALL (subquery), because of > ALL should return FALSE
if subquery returns NULL(s) but aggregates don't take NULLs into account.

OK, here are my ideas. First, I think you have to handle subselects in
the outer node because a subquery could have its own subquery. Also, we

I hope that this is no matter: if results of subquery (with/without sub-subqueries)
will go into temp table then this table will be re-scanned for each outer tuple.

OK, sounds good.

now have a field in Aggreg to all us to 'usenulls'.

^^^^^^^^
This can't help:

vac=> select * from x;
y
-
1
2
3
<<< this is NULL
(4 rows)

vac=> select max(y) from x;
max
---
3

==> we can't replace

select * from A where A.a > ALL (select y from x);
^^^^^^^^^^^^^^^
(NULL will be returned and so A.a > ALL is FALSE - this is what
Sybase does, is it right ?)
with

select * from A where A.a > (select max(y) from x);

I agree. I don't see how we can ever replace an '> ALL (y)' with '> ALL
(max(y))'. This sounds too detailed for the system to deal with. If
they do ALL, we have to implement ALL, without any use of aggregates to
try and second-guess their request.

^^^^^^^^^^^^^^^^^^^^
just because of we lose knowledge about NULLs here.

Yep. And it is too much work. If they want to replace the query with
max(), let them do it, if not, we do what they requested.

Also, I would like to handle ANY and ALL modifiers for all bool
operators, either built-in or user-defined, for all data types -
isn't PostgreSQL OO-like RDBMS -:)

OK, sounds good to me.

OK, here it is. I recommend we pass the outer and subquery through
the parser and optimizer separately.

I don't like this. I would like to get parse-tree from parser for
entire query and let optimizer (on upper level) decide how to rewrite
parse-tree and what plans to produce and how these plans should be
merged. Note, that I don't object your methods below, but only where
to place handling of this. I don't understand why should we add
new part to the system which will do optimizer' work (parse-tree -->
execution plan) and deal with optimizer nodes. Imho, upper optimizer
level is nice place to do this.

I am confused. Do you want one flat query and want to pass the whole
thing into the optimizer? That brings up some questions:

How do we want to do this? Well, we could easily have the two queries
share the same range table by making the subquery have the proper
alias's/refnames.

However, how do we represent the join and correlated joins to the
subquery. We can do the correlated stuff by having the outer columns
reference the inner queries range table entries that we added, but how
to represent the subquery WHERE clause, and the join of the outer to
inner queries?

In:

select *
from taba
where col1 = (select col2
from tabb
where taba.col3 = tabb.col4)

How do we represent join of col1 to tabb.col2? I guess we have a new
node type for IN and NOT IN and ANY, and we put that operator in the
parse grammar.

So I assume you are suggesting we flatten the query, to merge the range
tables of the two queries, and the WHERE clauses of the two queries, add
the proper WHERE conditionals to join the two range tables for
correlated queries, and have the IN, NOT IN, ALL nodes in the WHERE
clause, and have the optimizer figure out how to handle the issues.

How do we handle aggregates in the subquery? Currently the optimizer
does those last, but we must put them above the materialized node. And
if we merge the outer and subquery to produce one flat query, how do we
tell the optimizer to make sure the aggregate is in a node that can be
materialized?

---------------------------------------------------------------------------

If you don't want to flatten the outer query and subquery into one
query, I am really confused. There certainly will be stuff that needs
to be put into the upper optimizer, to properly handle the two plans and
make sure they are merged into one plan.

Are you suggesting we put the IN node in the upper optimizer, and the
correlation stuff. That sounds good.

I realize making merging the two plans and doing IN and NOT IN is the

^^^^^^^^^^^^^^^^^^^^^
This is very easy to do! As I already said we have just change sub1
access path (SeqScan of sub1) with SeqScan of Material node with
subquery plan.

Good. Makes sense. This is what I was suggesting.

real challenge, but I hoped this would give us a start.

Decision about how to record subquery stuff in to parse-tree
would be very good start -:)

BTW, note that for _expression_ subqueries (which are introduced without
IN, EXISTS, ALL, ANY - this follows Sybase' naming) - as in your examples -
we have to check that subquery returns single tuple...

Yes, I realize this.

--
Bruce Momjian
maillist@candle.pha.pa.us

Bruce Momjian

maillist@candle.pha.pa.us

about 28 years ago

In reply to: Bruce Momjian (#6)

Re: [HACKERS] subselect

I am confused. Do you want one flat query and want to pass the whole
thing into the optimizer? That brings up some questions:

No. I just want to follow Tom's way: I would like to see new
SubSelect node as shortened version of struct Query (or use
Query structure for each subquery - no matter for me), some
subquery-related stuff added to Query (and SubSelect) to help
optimizer to start, and see

OK, so you want the subquery to actually be INSIDE the outer query
expression. Do they share a common range table? If they don't, we
could very easily just fly through when processing the WHERE clause, and
start a new query using a new query structure for the subquery. Believe
me, you don't want a separate SubQuery-type, just re-use Query for it.
It allows you to call all the normal query stuff with a consistent
structure.

The parser will need to know it is in a subquery, so it can add the
proper target columns to the subquery, or are you going to do that in
the optimizer. You can do it in the optimizer, and join the range table
references there too.

typedef struct A_Expr
{
NodeTag type;
int oper; /* type of operation
* {OP,OR,AND,NOT,ISNULL,NOTNULL} */
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IN, NOT IN, ANY, ALL, EXISTS here,

char *opname; /* name of operator/function */
Node *lexpr; /* left argument */
Node *rexpr; /* right argument */
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
and SubSelect (Query) here (as possible case).

One thought to follow this way: RULEs (and so - VIEWs) are handled by using
Query - how else can we implement VIEWs on selects with subqueries ?

Views are stored as nodeout structures, and are merged into the query's
from list, target list, and where clause. I am working out
readfunc,outfunc now to make sure they are up-to-date with all the
current fields.

BTW, is

select * from A where (select TRUE from B);

valid syntax ?

I don't think so.

--
Bruce Momjian
maillist@candle.pha.pa.us

Import Notes

Reply to msg id not found: 34B15C23.B24D5CC@sable.krasnoyarsk.su | Resolved by subject fallback

Vadim B. Mikheev

vadim@sable.krasnoyarsk.su

about 28 years ago

In reply to: Bruce Momjian (#6)

Re: [HACKERS] subselect

Bruce Momjian wrote:

OK, here it is. I recommend we pass the outer and subquery through
the parser and optimizer separately.

I don't like this. I would like to get parse-tree from parser for
entire query and let optimizer (on upper level) decide how to rewrite
parse-tree and what plans to produce and how these plans should be
merged. Note, that I don't object your methods below, but only where
to place handling of this. I don't understand why should we add
new part to the system which will do optimizer' work (parse-tree -->
execution plan) and deal with optimizer nodes. Imho, upper optimizer
level is nice place to do this.

I am confused. Do you want one flat query and want to pass the whole
thing into the optimizer? That brings up some questions:

No. I just want to follow Tom's way: I would like to see new
SubSelect node as shortened version of struct Query (or use
Query structure for each subquery - no matter for me), some
subquery-related stuff added to Query (and SubSelect) to help
optimizer to start, and see

typedef struct A_Expr
{
NodeTag type;
int oper; /* type of operation
* {OP,OR,AND,NOT,ISNULL,NOTNULL} */
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IN, NOT IN, ANY, ALL, EXISTS here,

char *opname; /* name of operator/function */
Node *lexpr; /* left argument */
Node *rexpr; /* right argument */
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
and SubSelect (Query) here (as possible case).

One thought to follow this way: RULEs (and so - VIEWs) are handled by using
Query - how else can we implement VIEWs on selects with subqueries ?

BTW, is

select * from A where (select TRUE from B);

valid syntax ?

Vadim

Vadim B. Mikheev

vadim@sable.krasnoyarsk.su

about 28 years ago

In reply to: Bruce Momjian (#1)

Re: [HACKERS] subselect

Goran Thyni wrote:

Vadim,

Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN
is one example of this - joining by <> will give us invalid results.

What is you approach towards this problem?

Actually, this is problem of ALL modifier (NOT IN is _not_equal_ ALL)
and so, we have to have not just NOT EQUAL flag but some ALL node
with modified operator.

After that, one way is put subquery into inner plan of an join node
to be sure that for an outer tuple all corresponding subquery tuples
will be tested with modified operator (this will require either
changing code of all join nodes or addition of new plan type - we'll see)
and another way is ... suggested by you:

I got an idea that one could reverse the order,
that is execute the outer first into a temptable
and delete from that according to the result of the
subquery and then return it.
Probably this is too raw and slow. ;-)

This will be faster in some cases (when subquery returns many results
and there are "not so many" results from outer query) - thanks for idea!

Personally, I was stuck by holydays -:)
Now I can spend ~ 8 hours ~ each day for development...

Oh, isn't it christmas eve right now in Russia?

Due to historic reasons New Year is mu-u-u-uch popular
holiday in Russia -:)

Vadim

#10

Bruce Momjian

maillist@candle.pha.pa.us

about 28 years ago

In reply to: Vadim B. Mikheev (#8)

Re: [HACKERS] subselect

No. I just want to follow Tom's way: I would like to see new
SubSelect node as shortened version of struct Query (or use
Query structure for each subquery - no matter for me), some
subquery-related stuff added to Query (and SubSelect) to help
optimizer to start, and see

This is fine. I thought it would be too much work for the optimizer to
pass a subquery inside the WHERE clause, but if you think you can handle
it, great. I think it is more likely views will work with subqueries if
we do that too, and it is cleaner.

I recommend adding a boolean flag to the rangetable entries to show if
the range was added automatically, meaning it refers to an outer query.
Also, we will need a flag in the Query structure to tell if it is a
subquery, and a pointer to the parent's range table to resolve
references like:

select *
from taba
where col1 = (select col2
from tabb
where col3 = tabb.col4)

In this case, the proper table for col3 can not be determined from the
subquery range table, so we must search the parent range table to add
the proper entry to the child. If we add target entries at the same
time in the parser, we should add a flag to the targetentry structure to
identify it as an entry that will have to have additional WHERE clauses
added to the parent to restrict the join, or we could add those entries
in the parser, but at the time we are processing the subquery, we are
already inside the WHERE clause, so we must be careful.

--
Bruce Momjian
maillist@candle.pha.pa.us

#11

Vadim B. Mikheev

vadim@sable.krasnoyarsk.su

about 28 years ago

In reply to: Bruce Momjian (#7)

Re: [HACKERS] subselect

Bruce Momjian wrote:

I am confused. Do you want one flat query and want to pass the whole
thing into the optimizer? That brings up some questions:

No. I just want to follow Tom's way: I would like to see new
SubSelect node as shortened version of struct Query (or use
Query structure for each subquery - no matter for me), some
subquery-related stuff added to Query (and SubSelect) to help
optimizer to start, and see

OK, so you want the subquery to actually be INSIDE the outer query
expression. Do they share a common range table? If they don't, we

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
No.

could very easily just fly through when processing the WHERE clause, and
start a new query using a new query structure for the subquery. Believe

^^^^^^^^^^^^^^^^^^^^^^^^^^^^
... and filling some subquery-related stuff in upper query structure -
still don't know what exactly this could be -:)

me, you don't want a separate SubQuery-type, just re-use Query for it.
It allows you to call all the normal query stuff with a consistent
structure.

No objections.

The parser will need to know it is in a subquery, so it can add the
proper target columns to the subquery, or are you going to do that in

I don't think that we need in it, but list of correlation clauses
could be good thing - all in all parser has to check all column
references...

the optimizer. You can do it in the optimizer, and join the range table
references there too.

Yes.

typedef struct A_Expr
{
NodeTag type;
int oper; /* type of operation
* {OP,OR,AND,NOT,ISNULL,NOTNULL} */
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IN, NOT IN, ANY, ALL, EXISTS here,

char *opname; /* name of operator/function */
Node *lexpr; /* left argument */
Node *rexpr; /* right argument */
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
and SubSelect (Query) here (as possible case).

One thought to follow this way: RULEs (and so - VIEWs) are handled by using
Query - how else can we implement VIEWs on selects with subqueries ?

Views are stored as nodeout structures, and are merged into the query's
from list, target list, and where clause. I am working out
readfunc,outfunc now to make sure they are up-to-date with all the
current fields.

Nice! This stuff was out-of-date for too long time.

BTW, is

select * from A where (select TRUE from B);

valid syntax ?

I don't think so.

And so, *rexpr can be of Query type only for oper "in" OP, IN, NOT IN,
ANY, ALL, EXISTS - well.

(Time to sleep -:)

Vadim

#12

Thomas G. Lockhart

lockhart@alumni.caltech.edu

about 28 years ago

In reply to: Bruce Momjian (#4)

Re: [HACKERS] subselect

BTW, note that for _expression_ subqueries (which are introduced without
IN, EXISTS, ALL, ANY - this follows Sybase' naming) - as in your examples -
we have to check that subquery returns single tuple...

It might be nice to have a tuple-counting operation or query node (is this the right
terminology?) which could be used to help implement EXISTS. It might help to
re-implement the count(*) function also.

- Tom

#13

Bruce Momjian

maillist@candle.pha.pa.us

about 28 years ago

In reply to: Thomas G. Lockhart (#12)

Re: [HACKERS] subselect

BTW, note that for _expression_ subqueries (which are introduced without
IN, EXISTS, ALL, ANY - this follows Sybase' naming) - as in your examples -
we have to check that subquery returns single tuple...

It might be nice to have a tuple-counting operation or query node (is this the right
terminology?) which could be used to help implement EXISTS. It might help to
re-implement the count(*) function also.

In the new code, count(*) picks a column from one of the tables to count
on.

--
Bruce Momjian
maillist@candle.pha.pa.us