Allowing join removals for more join types
I'm currently in the early stages of looking into expanding join removals.
Currently left outer joins can be removed if none of the columns of the
table are required for anything and the table being joined is a base table
that contains a unique index on all columns in the join clause.
The case I would like to work on is to allow sub queries where the query is
grouped by or distinct on all of the join columns.
Take the following as an example:
CREATE TABLE products (productid integer NOT NULL, code character
varying(32) NOT NULL);
CREATE TABLE sales (saleid integer NOT NULL, productid integer NOT NULL,
qty integer NOT NULL);
CREATE VIEW product_sales AS
SELECT p.productid,
p.code,
s.qty
FROM (products p
LEFT JOIN ( SELECT sales.productid,
sum(sales.qty) AS qty
FROM sales
GROUP BY sales.productid) s ON ((p.productid = s.productid)));
If a user does:
SELECT productid,code FROM product_sales;
Then, if I'm correct, the join on sales can be removed.
As I said above, I'm in the early stages of looking at this and I'm
currently a bit confused. Basically I've put a breakpoint at the top of
the join_is_removable function and I can see that innerrel->rtekind
is RTE_SUBQUERY for my query, so the function is going to return false. So
what I need to so is get access to innerrel->subroot->parse so that I can
look at groupClause and distinctClause. The thing is innerrel->subroot is
NULL in this case, but I see a comment for subroot saying "subroot -
PlannerInfo for subquery (NULL if it's not a subquery)" so I guess this
does not also mean "subroot - PlannerInfo for subquery (NOT NULL if it's a
subquery)"?
Has anyone got any pointers to where I might be able to get the Query
details for the subquery? These structures are quite new to me.
Regards
David Rowley
On Sat, May 17, 2014 at 8:57 PM, David Rowley <dgrowleyml@gmail.com> wrote:
I'm currently in the early stages of looking into expanding join removals.
As I said above, I'm in the early stages of looking at this and I'm
currently a bit confused. Basically I've put a breakpoint at the top of
the join_is_removable function and I can see that innerrel->rtekind
is RTE_SUBQUERY for my query, so the function is going to return false. So
what I need to so is get access to innerrel->subroot->parse so that I can
look at groupClause and distinctClause. The thing is innerrel->subroot is
NULL in this case, but I see a comment for subroot saying "subroot -
PlannerInfo for subquery (NULL if it's not a subquery)" so I guess this
does not also mean "subroot - PlannerInfo for subquery (NOT NULL if it's a
subquery)"?Has anyone got any pointers to where I might be able to get the Query
details for the subquery? These structures are quite new to me.
I think I've managed to answer my own question here. But please someone
correct me if this sounds wrong.
It looks like the existing join removals are done quite early in the
planning and redundant joins are removed before any subqueries from that
query are planned. So this innerrel->subroot->parse has not been done yet.
It seems to be done later in query_planner() when make_one_rel() is called.
The best I can come up with on how to implement this is to have 2 stages of
join removals. Stage 1 would be the existing stage that attempts to remove
joins from non subqueries. Stage 2 would happen just after make_one_rel()
is called from query_planner(), this would be to attempt to remove any
subqueries that are not need, and if it managed to remove any it would
force a 2nd call to make_one_rel().
Does this sound reasonable or does it sound like complete non-sense?
Show quoted text
Regards
David Rowley
David Rowley <dgrowleyml@gmail.com> writes:
It looks like the existing join removals are done quite early in the
planning and redundant joins are removed before any subqueries from that
query are planned. So this innerrel->subroot->parse has not been done yet.
It seems to be done later in query_planner() when make_one_rel() is called.
It's true that we don't plan the subquery till later, but I don't see why
that's important here. Everything you want to know is available from the
subquery parsetree; so just look at the RTE, don't worry about how much
of the RelOptInfo has been filled in.
The best I can come up with on how to implement this is to have 2 stages of
join removals. Stage 1 would be the existing stage that attempts to remove
joins from non subqueries. Stage 2 would happen just after make_one_rel()
is called from query_planner(), this would be to attempt to remove any
subqueries that are not need, and if it managed to remove any it would
force a 2nd call to make_one_rel().
That sounds like a seriously bad idea. For one thing, it blows the
opportunity to not plan the subquery in the first place. For another,
most of these steps happen in a carefully chosen order because there
are interdependencies. You can't just go back and re-run some earlier
processing step. A large fraction of the complexity of analyzejoins.c
right now arises from the fact that it has to undo some earlier
processing; that would get enormously worse if you delayed it further.
BTW, just taking one step back ... this seems like a pretty specialized
requirement. Are you sure it wouldn't be easier to fix your app to
not generate such silly queries?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, May 18, 2014 at 2:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
It looks like the existing join removals are done quite early in the
planning and redundant joins are removed before any subqueries from that
query are planned. So this innerrel->subroot->parse has not been doneyet.
It seems to be done later in query_planner() when make_one_rel() is
called.
It's true that we don't plan the subquery till later, but I don't see why
that's important here. Everything you want to know is available from the
subquery parsetree; so just look at the RTE, don't worry about how much
of the RelOptInfo has been filled in.
Thanks. I think I've found what you're talking about in PlannerInfo
simple_rte_array.
That's got the ball rolling.
The best I can come up with on how to implement this is to have 2 stages
of
join removals. Stage 1 would be the existing stage that attempts to
remove
joins from non subqueries. Stage 2 would happen just after make_one_rel()
is called from query_planner(), this would be to attempt to remove any
subqueries that are not need, and if it managed to remove any it would
force a 2nd call to make_one_rel().That sounds like a seriously bad idea. For one thing, it blows the
opportunity to not plan the subquery in the first place. For another,
most of these steps happen in a carefully chosen order because there
are interdependencies. You can't just go back and re-run some earlier
processing step. A large fraction of the complexity of analyzejoins.c
right now arises from the fact that it has to undo some earlier
processing; that would get enormously worse if you delayed it further.
Agreed, but at the time I didn't know that the subquery information was
available elsewhere.
BTW, just taking one step back ... this seems like a pretty specialized
requirement. Are you sure it wouldn't be easier to fix your app to
not generate such silly queries?
Well, couldn't you say the same about any join removals? Of course the
query could be rewritten to not reference that relation, but there are
cases where removing the redundant join might be more silly, for example a
fairly complex view may exist and many use cases for the view don't require
all of the columns. It might be a bit of a pain to maintain a whole series
of views with each required subset of columns instead of just maintaining a
single view and allow callers to use what they need from it.
I came across the need for this at work this week where we have a grid in a
UI where the users can select columns that they need to see in the grid.
The data in each grid is supplied by a single view which contains all the
possible columns that they might need, if the user is just using a narrow
subset of those columns then it could seem quite wasteful to do more than
is required.
Regards
David Rowley
On Sat, May 17, 2014 at 8:57 PM, David Rowley <dgrowleyml@gmail.com> wrote:
I'm currently in the early stages of looking into expanding join removals.
Currently left outer joins can be removed if none of the columns of the
table are required for anything and the table being joined is a base table
that contains a unique index on all columns in the join clause.The case I would like to work on is to allow sub queries where the query
is grouped by or distinct on all of the join columns.Take the following as an example:
CREATE TABLE products (productid integer NOT NULL, code character
varying(32) NOT NULL);
CREATE TABLE sales (saleid integer NOT NULL, productid integer NOT NULL,
qty integer NOT NULL);CREATE VIEW product_sales AS
SELECT p.productid,
p.code,
s.qty
FROM (products p
LEFT JOIN ( SELECT sales.productid,
sum(sales.qty) AS qty
FROM sales
GROUP BY sales.productid) s ON ((p.productid = s.productid)));If a user does:
SELECT productid,code FROM product_sales;
Then, if I'm correct, the join on sales can be removed.
Attached is a patch which implements this. It's still a bit rough around
the edges and some names could likely do with being improved, but it at
least seems to work with all of the test cases that I've thrown at it so
far.
Comments are welcome, but the main purpose of the email is so I can
register the patch for the June commitfest.
Regards
David Rowley
Attachments:
subquery_leftjoin_removal_v0.5.patchapplication/octet-stream; name=subquery_leftjoin_removal_v0.5.patchDownload+299-9
On 18 May 2014 16:38 David Rowley Wrote
Sound like a good idea to me..
I have one doubt regarding the implementation, consider the below query
Create table t1 (a int, b int);
Create table t2 (a int, b int);
Create unique index on t2(b);
select x.a from t1 x left join (select distinct t2.a a1, t2.b b1 from t2) as y on x.a=y.b1; (because of distinct clause subquery will not be pulled up)
In this case, Distinct clause is used on t2.a, but t2.b is used for left Join (t2.b have unique index so this left join can be removed).
So I think now when you are considering this join removal for subqueries then this can consider other case also like unique index inside subquery,
because in attached patch unique index is considered only if its RTE_RELATION
+ if (innerrel->rtekind == RTE_RELATION &&
+ relation_has_unique_index_for(root, innerrel, clause_list, NIL, NIL))
return true;
Correct me if I am missing something..
CREATE TABLE products (productid integer NOT NULL, code character varying(32) NOT NULL);
CREATE TABLE sales (saleid integer NOT NULL, productid integer NOT NULL, qty integer NOT NULL);
CREATE VIEW product_sales AS
SELECT p.productid,
p.code,
s.qty
FROM (products p
LEFT JOIN ( SELECT sales.productid,
sum(sales.qty) AS qty
FROM sales
GROUP BY sales.productid) s ON ((p.productid = s.productid)));
If a user does:
SELECT productid,code FROM product_sales;
Then, if I'm correct, the join on sales can be removed.
Attached is a patch which implements this. It's still a bit rough around the edges and some names could likely do with being improved, but it at least seems to work with all of the test cases that I've thrown at it so far.
Comments are welcome, but the main purpose of the email is so I can register the patch for the June commitfest.
On Mon, May 19, 2014 at 5:47 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:
On 18 May 2014 16:38 David Rowley Wrote
Sound like a good idea to me..
I have one doubt regarding the implementation, consider the below query
Create table t1 (a int, b int);
Create table t2 (a int, b int);
Create unique index on t2(b);
select x.a from *t1 x* left join (select *distinct t2.a a1*, *t2.b b1*from t2) as y on x.a=y.b1; (*because
of distinct clause subquery will not be pulled up*)In this case, Distinct clause is used on *t2.a, *but* t2.b *is used for
left Join (t2.b have unique index so this left join can be removed).So I think now when you are considering this join removal for subqueries
then this can consider other case also like unique index inside subquery,because in attached patch unique index is considered only if its
RTE_RELATION+ if (innerrel->rtekind == RTE_RELATION &&
+ relation_has_unique_index_for(root, innerrel,
clause_list, NIL, NIL))return true;
Correct me if I am missing something..
I think you are right here, it would be correct to remove that join, but I
also think that the query in question could be quite easily be written as:
select t1.a from t1 left join t2 on t1.a=t2.b;
Where the join WILL be removed. The distinct clause here technically is a
no-op due to all the columns of a unique index being present in the clause.
Can you think of a use case for this where the sub query couldn't have been
written out as a direct join to the relation?
What would be the reason to make it a sub query with the distinct? or have
I gotten something wrong here?
I'm also thinking here that if we made the join removal code remove these
joins, then the join removal code would end up smarter than the rest of the
code as the current code seems not to remove the distinct clause for single
table queries where a subset of the columns of a distinct clause match all
the columns of a unique index.
create table pktest (id int primary key);
explain (costs off) select distinct id from pktest;
QUERY PLAN
--------------------------
HashAggregate
Group Key: id
-> Seq Scan on pktest
This could have been rewritten to become: select id from pktest
I feel if we did that sort of optimisation to the join removals, then I'd
guess we'd better put it into other parts of the code too, perhaps
something that could do this should be in the re-writer then once the join
removal code gets to it, the join could be removed.
Can you think of a similar example where the subquery could not have been
written as a direct join to the relation?
Regards
David Rowley
On 19 May 2014 12:15 David Rowley Wrote,
I think you are right here, it would be correct to remove that join, but I also think that the query in question could be quite easily be written as:
select t1.a from t1 left join t2 on t1.a=t2.b;
Where the join WILL be removed. The distinct clause here technically is a no-op due to all the columns of a unique index being present in the clause. Can you think of a use case for this where the sub query couldn't have been written out as a direct join to the relation?
What would be the reason to make it a sub query with the distinct? or have I gotten something wrong here?
I'm also thinking here that if we made the join removal code remove these joins, then the join removal code would end up smarter than the rest of the code as the current code seems not to remove the distinct clause for single table queries where a subset of the columns of a distinct clause match all the columns of a unique index.
Can you think of a similar example where the subquery could not have been written as a direct join to the relation?
I think, you are write that above given query and be written in very simple join.
But what my point is, In any case when optimizer cannot pull up the subquery (because it may have aggregate, group by, order by, limit, distinct etc.. clause),
That time even, It will check Whether join is removable or not only when distinct or group by clause is there if it has unique index then it will not be check, is there no scenario where it will be useful ?
May be we can convert my above example like below --> in this case we have unique index on field a and we are limiting it by first 100 tuple (record are already order because of index)
Create table t1 (a int, b int);
Create table t2 (a int, b int);
Create unique index on t2(a);
create view v1 as
select x.a, y.b
from t1 x left join (select t2.a a1, b from t2 limit 100) as y on x.a=y.a1;
select a from v1; --> for this query I think left join can be removed, But in view since non join field(b) is also projected so this cannot be simplified there.
In your patch, anyway we are having check for distinct and group clause inside subquery, can’t we have check for unique index also ?
Regards,
Dilip
On Mon, May 19, 2014 at 9:22 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:
On 19 May 2014 12:15 David Rowley Wrote,
May be we can convert my above example like below à in this case we
have unique index on field a and we are limiting it by first 100 tuple
(record are already order because of index)Create table t1 (a int, b int);
Create table t2 (a int, b int);
Create unique index on t2(a);
create view v1 as
select x.a, y.b
from t1 x left join (select t2.a a1, b from t2 limit 100) as y on
x.a=y.a1;select a from v1; à for this query I think left join can be removed, But
in view since non join field(b) is also projected so this cannot be
simplified there.
Ok I see what you mean.
I guess then that if we did that then we should also support removals of
join in subqueries of subqueries. e.g:
select t1.* from t1 left join (select t2.uniquecol from (select
t2.uniquecol from t2 limit 1000) t2 limit 100) t2 on t1.id = t2.uniquecol
On my first round of thoughts on this I thought that we could keep looking
into the sub queries until we find that the sub query only queries a single
table or it is not a base relation. If we find one with a single table and
the sub query has no distinct or group bys then I thought we could just
look at the unique indexes similar to how it's done now for a direct table
join. But after giving this more thought, I'm not quite sure if a lack of
DISTINCT and GROUP BY clause is enough for us to permit removing the join.
Would it matter if the sub query did a FOR UPDATE?
I started looking at is_simple_subquery() in prepjointree.c but if all
those conditions were met then the subquery would have been pulled up to a
direct join anyway.
I'm also now wondering if I need to do some extra tests in the existing
code to ensure that the subquery would have had no side affects.
For example:
SELECT t1.* FROM t1
LEFT OUTER JOIN (SELECT id,some_function_that_does_something(id) FROM t2
GROUP BY id) t2 ON t1.id = t2.id;
Regards
David Rowley
David Rowley <dgrowleyml@gmail.com> writes:
I'm also now wondering if I need to do some extra tests in the existing
code to ensure that the subquery would have had no side affects.
You should probably at least refuse the optimization if the subquery's
tlist contains volatile functions.
Functions that return sets might be problematic too [ experiments... ]
Yeah, they are. This behavior is actually a bit odd:
regression=# select q1 from int8_tbl;
q1
------------------
123
123
4567890123456789
4567890123456789
4567890123456789
(5 rows)
regression=# select q1 from int8_tbl group by 1;
q1
------------------
4567890123456789
123
(2 rows)
regression=# select q1,unnest(array[1,2]) as u from int8_tbl;
q1 | u
------------------+---
123 | 1
123 | 2
123 | 1
123 | 2
4567890123456789 | 1
4567890123456789 | 2
4567890123456789 | 1
4567890123456789 | 2
4567890123456789 | 1
4567890123456789 | 2
(10 rows)
regression=# select q1,unnest(array[1,2]) as u from int8_tbl group by 1;
q1 | u
------------------+---
4567890123456789 | 1
4567890123456789 | 2
123 | 1
123 | 2
(4 rows)
EXPLAIN shows that the reason the last case behaves like that is that
the SRF is expanded *after* the grouping step. I'm not entirely sure if
that's a bug --- given the lack of complaints, perhaps not. But it shows
you can't apply this optimization without changing the existing behavior.
I doubt you should drop a subquery containing FOR UPDATE, either.
That's a side effect, just as much as a volatile function would be.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, May 20, 2014 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
I'm also now wondering if I need to do some extra tests in the existing
code to ensure that the subquery would have had no side affects.You should probably at least refuse the optimization if the subquery's
tlist contains volatile functions.Functions that return sets might be problematic too [ experiments... ]
Yeah, they are. This behavior is actually a bit odd:
...
regression=# select q1,unnest(array[1,2]) as u from int8_tbl group by 1;
q1 | u
------------------+---
4567890123456789 | 1
4567890123456789 | 2
123 | 1
123 | 2
(4 rows)EXPLAIN shows that the reason the last case behaves like that is that
the SRF is expanded *after* the grouping step. I'm not entirely sure if
that's a bug --- given the lack of complaints, perhaps not. But it shows
you can't apply this optimization without changing the existing behavior.I doubt you should drop a subquery containing FOR UPDATE, either.
That's a side effect, just as much as a volatile function would be.regards, tom lane
Yeah that is strange indeed.
I've made some updates to the patch to add some extra checks for any
volatile functions in the target list and set returning functions.
The FOR UPDATE currently does not really need an explicit check as I'm
currently only supporting removals of sub queries that have either GROUP BY
or DISTINCT clauses, none of which allow FOR UPDATE anyway.
Regards
David Rowley
Attachments:
subquery_leftjoin_removal_v0.6.patchapplication/octet-stream; name=subquery_leftjoin_removal_v0.6.patchDownload+354-9
On Mon, May 19, 2014 at 5:47 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:
So I think now when you are considering this join removal for subqueries
then this can consider other case also like unique index inside subquery,because in attached patch unique index is considered only if its
RTE_RELATION+ if (innerrel->rtekind == RTE_RELATION &&
+ relation_has_unique_index_for(root, innerrel,
clause_list, NIL, NIL))return true;
I've just had a bit of a look at implementing checks allowing subqueries
with unique indexes on the join cols being removed, but I'm hitting a bit
of a problem and I'm not quite sure if this is possible at this stage of
planning.
In the function join_is_removable() the variable innerrel is set to the
RelOptInfo of the relation which we're checking if we can remove. In the
case of removing subqueries the innerrel->rtekind will be RTE_SUBQUERY. I
started going over the pre-conditions that the sub query will need to meet
for this to be possible and the list so far looks something like:
1. Only a single base table referenced in the sub query.
2. No FOR UPDATE clause
3. No GROUP BY or DISTINCT clause
4. No set returning functions
5. no volatile functions.
6. has unique index that covers the join conditions or a subset of.
I'm hitting a bit of a roadblock on point 1. Here's a snipped from my
latest attempt:
if (bms_membership(innerrel->relids) == BMS_SINGLETON)
{
int subqueryrelid = bms_singleton_member(innerrel->relids);
RelOptInfo *subqueryrel = find_base_rel(innerrel->subroot, subqueryrelid);
if (relation_has_unique_index_for(root, subqueryrel, clause_list, NIL,
NIL))
return true;
}
But it seems that innerrel->subroot is still NULL at this stage of planning
and from what I can tell does not exist anywhere else yet and is not
generated until make_one_rel() is called from query_planner()
Am I missing something major here,or does this sound about right?
Regards
David Rowley
On 23 May 2014 12:43 David Rowley Wrote,
I'm hitting a bit of a roadblock on point 1. Here's a snipped from my latest attempt:
if (bms_membership(innerrel->relids) == BMS_SINGLETON)
{
int subqueryrelid = bms_singleton_member(innerrel->relids);
RelOptInfo *subqueryrel = find_base_rel(innerrel->subroot, subqueryrelid);if (relation_has_unique_index_for(root, subqueryrel, clause_list, NIL, NIL))
return true;
}
But it seems that innerrel->subroot is still NULL at this stage of planning and from what I can tell does not exist anywhere else yet and is not generated until make_one_rel() is called from query_planner()
Am I missing something major here,or does this sound about right?
It’s true that, till this point of time we haven’t prepared the base relation list for the subquery, and that will be done from make_one_rel while generating the SUBQURY path list.
I can think of one solution but I think it will be messy…
We get the base relation info directly from subquery
Like currently in your patch (shown in below snippet) we are getting the distinct and groupby clause from sub Query, similarly we can get base relation info from (Query->jointree)
if (innerrel->rtekind == RTE_SUBQUERY)
{
Query *query = root->simple_rte_array[innerrelid]->subquery;
if (sortclause_is_unique_on_restrictinfo(query, clause_list, query->groupClause) ||
sortclause_is_unique_on_restrictinfo(query, clause_list, query->distinctClause))
return true;
}
Regards,
Dilip
On Fri, May 23, 2014 at 8:28 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:
On 23 May 2014 12:43 David Rowley Wrote,
I'm hitting a bit of a roadblock on point 1. Here's a snipped from my
latest attempt:
if (bms_membership(innerrel->relids) ==
BMS_SINGLETON)
{
int subqueryrelid =
bms_singleton_member(innerrel->relids);
RelOptInfo *subqueryrel =
find_base_rel(innerrel->subroot, subqueryrelid);
if (relation_has_unique_index_for(root,
subqueryrel, clause_list, NIL, NIL))
return true;
}
But it seems that innerrel->subroot is still NULL at this stage of
planning and from what I can tell does not exist anywhere else yet and is
not generated until make_one_rel() is called from query_planner()Am I missing something major here,or does this sound about right?
It’s true that, till this point of time we haven’t prepared the base
relation list for the subquery, and that will be done from make_one_rel
while generating the SUBQURY path list.I can think of one solution but I think it will be messy…
We get the base relation info directly from subquery
Like currently in your patch (shown in below snippet) we are getting the
distinct and groupby clause from sub Query, similarly we can get base
relation info from (Query->jointree)if (innerrel->rtekind == RTE_SUBQUERY)
{
Query *query =
root->simple_rte_array[innerrelid]->subquery;if (sortclause_is_unique_on_restrictinfo(query,
clause_list, query->groupClause) ||sortclause_is_unique_on_restrictinfo(query, clause_list,
query->distinctClause))return true;
}
I'm getting the idea that this is just not the right place in planning to
do this for subqueries.
You seem to be right about the messy part too
Here's a copy and paste of the kludge I've ended up with while testing this
out:
if (list_length(subquery->jointree->fromlist) == 1)
{
RangeTblEntry *base_rte;
RelOptInfo *subqueryrelid;
RangeTblRef *rtr = (RangeTblRef *) linitial(subquery->jointree->fromlist);
if (!IsA(rtr, RangeTblRef))
return false;
base_rte = rt_fetch(rtr->rtindex, subquery->rtable);
if (base_rte->relkind != RTE_RELATION)
return false;
subqueryrelid = build_simple_rel(<would have to fake this>, rtr->rtindex,
RELOPT_BASEREL);
I don't have a PlannerInfo to pass to build_simple_rel and it just seems
like a horrid hack to create one that we're not going to be keeping.
Plus It would be a real shame to have to call build_simple_rel() for the
same relation again when we plan the subquery later.
I'm getting the idea that looking for unique indexes on the sub query is
not worth the hassle for now. Don't get me wrong, they'd be nice to have,
but I just think that it's a less common use case and these are more likely
to have been pulled up anyway.
Unless there's a better way, I think I'm going to spend the time looking
into inner joins instead.
Regards
David Rowley
Show quoted text
Regards,
Dilip
David Rowley <dgrowleyml@gmail.com> writes:
I've just had a bit of a look at implementing checks allowing subqueries
with unique indexes on the join cols being removed,
I'm a bit confused by this statement of the problem. I thought the idea
was to recognize that subqueries with DISTINCT or GROUP BY clauses produce
known-unique output column(s), which permits join removal in the same way
that unique indexes on a base table allow us to deduce that certain
columns are known-unique and hence can offer no more than one match for
a join. That makes it primarily a syntactic check, which you can perform
despite the fact that the subquery hasn't been planned yet (since the
parser has done sufficient analysis to determine the semantics of
DISTINCT/GROUP BY).
Drilling down into the subquery is a whole different matter. For one
thing, there's no point in targeting cases in which the subquery would be
eligible to be flattened into the parent query, and your proposed list of
restrictions seems to eliminate most cases in which it couldn't be
flattened. For another, you don't have access to any planning results for
the subquery yet, which is the immediate problem you're complaining of.
Duplicating the work of looking up a relation's indexes seems like a
pretty high price to pay for whatever improvement you might get here.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, May 24, 2014 at 3:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
I've just had a bit of a look at implementing checks allowing subqueries
with unique indexes on the join cols being removed,I'm a bit confused by this statement of the problem. I thought the idea
was to recognize that subqueries with DISTINCT or GROUP BY clauses produce
known-unique output column(s), which permits join removal in the same way
that unique indexes on a base table allow us to deduce that certain
columns are known-unique and hence can offer no more than one match for
a join. That makes it primarily a syntactic check, which you can perform
despite the fact that the subquery hasn't been planned yet (since the
parser has done sufficient analysis to determine the semantics of
DISTINCT/GROUP BY).
Up thread a little Dilip was talking about in addition to checking that if
the sub query could be proved to be unique on the join condition using
DISTINCT/GROUP BY, we might also check unique indexes in the subquery to
see if they could prove the query is unique on the join condition.
For example a query such as:
SELECT a.* FROM a LEFT JOIN (SELECT b.* FROM b LIMIT 1) b ON a.column =
b.colwithuniqueidx
The presence of the LIMIT would be enough to stop the subquery being pulled
up, but there'd be no reason to why the join couldn't be removed.
I think the use case for this is likely a bit more narrow than the GROUP
BY/DISTINCT case, so I'm planning on using the time on looking into more
common cases such as INNER JOINs where we can prove the existence of the
row using a foreign key.
Drilling down into the subquery is a whole different matter. For one
thing, there's no point in targeting cases in which the subquery would be
eligible to be flattened into the parent query, and your proposed list of
restrictions seems to eliminate most cases in which it couldn't be
flattened. For another, you don't have access to any planning results for
the subquery yet, which is the immediate problem you're complaining of.
Duplicating the work of looking up a relation's indexes seems like a
pretty high price to pay for whatever improvement you might get here.
I agree that there are not many cases left to remove the join that remain
after is_simple_subquery() has decided not to pullup the subquery. Some of
the perhaps more common cases would be having windowing functions in the
subquery as this is what you need to do if you want to include the results
of a windowing function from within the where clause. Another case, though
I can't imagine it would be common, is ORDER BY in the subquery... But for
that one I can't quite understand why is_simple_subquery() stops that being
flattened in the first place.
Regards
David Rowley
David Rowley <dgrowleyml@gmail.com> writes:
I agree that there are not many cases left to remove the join that remain
after is_simple_subquery() has decided not to pullup the subquery. Some of
the perhaps more common cases would be having windowing functions in the
subquery as this is what you need to do if you want to include the results
of a windowing function from within the where clause. Another case, though
I can't imagine it would be common, is ORDER BY in the subquery... But for
that one I can't quite understand why is_simple_subquery() stops that being
flattened in the first place.
The problem there is that (in general) pushing qual conditions to below a
window function will change the window function's results. If we flatten
such a subquery then the outer query's quals can get evaluated before
the window function, so that's no good. Another issue is that flattening
might cause the window function call to get copied to places in the outer
query where it can't legally go, such as the WHERE clause.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, May 23, 2014 at 11:45 PM, David Rowley <dgrowleyml@gmail.com> wrote:
I'm getting the idea that looking for unique indexes on the sub query is
not worth the hassle for now. Don't get me wrong, they'd be nice to have,
but I just think that it's a less common use case and these are more likely
to have been pulled up anyway.Unless there's a better way, I think I'm going to spend the time looking
into inner joins instead.
I've been working on adding join removal for join types other than left
outer joins.
The attached patch allows join removals for both sub queries with left
joins and also semi joins where a foreign key can prove the existence of
the record.
My longer term plan is to include inner joins too, but now that I have
something to show for semi joins, I thought this would be a good time to
post the patch just in case anyone can see any show stopper's with using
foreign keys this way.
So with the attached you can do:
CREATE TABLE b (id INT NOT NULL PRIMARY KEY);
CREATE TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT NOT NULL REFERENCES
b(id));
EXPLAIN (COSTS OFF)
SELECT id FROM a WHERE b_id IN(SELECT id FROM b);
QUERY PLAN
---------------
Seq Scan on a
(1 row)
I think anti joins could use the same infrastructure but I'm not quite sure
yet how to go about replacing the join with something like WHERE false.
I do think semi and anti joins are a far less useful case for join removals
as inner joins are, but if we're already loading the foreign key
constraints at plan time, then it seems like something that might be worth
while checking.
Oh, quite likely the code that loads the foreign key constraints needs more
work and probably included in the rel cache, but I don't want to go and to
that until I get some feedback on the work so far.
Any comments are welcome.
Thanks
David Rowley
Attachments:
join_removal_793f19f_2014-05-28.patchapplication/octet-stream; name=join_removal_793f19f_2014-05-28.patchDownload+1099-25
On Sun, May 25, 2014 at 5:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
I agree that there are not many cases left to remove the join that remain
after is_simple_subquery() has decided not to pullup the subquery. Someof
the perhaps more common cases would be having windowing functions in the
subquery as this is what you need to do if you want to include theresults
of a windowing function from within the where clause. Another case,
though
I can't imagine it would be common, is ORDER BY in the subquery... But
for
that one I can't quite understand why is_simple_subquery() stops that
being
flattened in the first place.
The problem there is that (in general) pushing qual conditions to below a
window function will change the window function's results. If we flatten
such a subquery then the outer query's quals can get evaluated before
the window function, so that's no good. Another issue is that flattening
might cause the window function call to get copied to places in the outer
query where it can't legally go, such as the WHERE clause.
I should have explained more clearly. I was meaning that a query such as
this:
SELECT a.* FROM a LEFT OUTER JOIN (SELECT id,LAG(id) OVER (ORDER BY id) AS
prev_id FROM b) b ON a.id=b.id;
assuming that id is the primary key, could have the join removed.
I was just commenting on this as it's probably a fairly common thing to
have a subquery with windowing functions in order to perform some sort of
filtering of window function columns in the outer query.
The other use cases for example:
SELECT a.* FROM a LEFT OUTER JOIN (SELECT id FROM b LIMIT 10) b ON a.id=b.id
;
Are likely less common.
Regards
David Rowley
On Wed, May 28, 2014 at 8:39 PM, David Rowley <dgrowleyml@gmail.com> wrote:
I've been working on adding join removal for join types other than left
outer joins.The attached patch allows join removals for both sub queries with left
joins and also semi joins where a foreign key can prove the existence of
the record.My longer term plan is to include inner joins too, but now that I have
something to show for semi joins, I thought this would be a good time to
post the patch just in case anyone can see any show stopper's with using
foreign keys this way.So with the attached you can do:
CREATE TABLE b (id INT NOT NULL PRIMARY KEY);
CREATE TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT NOT NULL REFERENCES
b(id));EXPLAIN (COSTS OFF)
SELECT id FROM a WHERE b_id IN(SELECT id FROM b);
QUERY PLAN
---------------
Seq Scan on a
(1 row)I think anti joins could use the same infrastructure but I'm not quite
sure yet how to go about replacing the join with something like WHERE false.I do think semi and anti joins are a far less useful case for join
removals as inner joins are, but if we're already loading the foreign key
constraints at plan time, then it seems like something that might be worth
while checking.Oh, quite likely the code that loads the foreign key constraints needs
more work and probably included in the rel cache, but I don't want to go
and to that until I get some feedback on the work so far.Any comments are welcome.
The attached patch fixes a problem with SEMI join removal where I was
missing adding a WHERE col IS NOT NULL check after a successful join
removal. This filter is required to keep the query equivalent as the semi
join would have filtered out the rows with a NULL join condition columns on
the left side of the join.
In the attached I've also added support for ANTI joins, where the join can
be removed it is replaced with a WHERE col IS NULL on the relation on the
left side of the join. This is required as the only possible columns that
could have matched would be NULL valued columns that are part of the
foreign key.
I'm not quite there with inner joins yet. I'm still getting my head around
just where the join quals are actually stored.
This area of the code is quite new to me, so I'm not quite sure I'm going
about things in the correct way.
To make my intentions clean with this patch I've marked the file name with
WIP.
Comments are welcome.
Regards
David Rowley