Parameterized aggregate subquery (was: Pull up aggregate subquery)

Started by Hitoshi Haradaalmost 15 years ago26 messageshackers
Jump to latest
#1Hitoshi Harada
umi.tanuki@gmail.com

Purpose & Goal
--------------
Allow planner to create NestLoop with parameterized aggregate subquery
just like inner IndexScan pattern. This helps to avoid unnecessary
aggregate process that would be filtered at the stage of upper join
filter in such case:

create table size_m as select i as id, repeat(i::text, i % 100) as val
from generate_series(1, 20000) i;
create table size_l as select i as id, m.id as m_id, repeat(i::text, i
% 100) as val from generate_series(1, 1000000) i, size_m m where i.i /
10 = m.id;
analyze size_m;
analyze size_l;
---- make this query faster
select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '10101';

In addition, it might be the first step to take account of "general
parameterized scan" design, although this proposal contains only
narrow cases of aggregate subquery.

Related information
-------------------
This work is the next step of the previously proposed "Pull up
aggregate subqery" thread.
http://archives.postgresql.org/message-id/BANLkTimW-EqS_9hk5AYj14R8U%3DuQoc6tuA%40mail.gmail.com

Design
------
In contract to the previous design, I made aggregate subquery
"parameterized" instead of pulling it up. The design is based on the
discussion with Tom Lane and Rober Haas et al. It has some benefits
compared with the pulling up appoach as, 1) parameterizing any scan
node other than index scan as nest loop's inner is our way to go, 2)
pulling up or pushing down any aggregate query has potential risk that
the aggregate results are wrong (, which may be solvable by adding
some constraints like "only when unique" etc,) 3) the decision whether
to make parameterized or not can be made by only cost without any
heuristics.

The idea is very similar to inner IndexScan. When NestLoop path is
created in match_unsorted_outer(), parameterized SubqueryScan path is
also considered, similarly to inner IndexScan paths. While IndexScan
is simple since its information like costs are well known by the base
relation, SubqueryScan should re-plan its Query to gain that, which is
expensive.

To do so, I added SubqueryPath node to store each subquery
information. Before patching, subplan, subrtable and subrowmark is
stored in RelOptInfo, but now we need to save them separately for each
parameterized one. This part might be split from the original patch
for easy review.

Result
------
Without breakage of regression test, it successfully makes subquery as
parameterized as the inner of NestLoop. Query performance is as aimed.

(without patch)
db1=# explain analyze
select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '10101';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=79393.64..82937.42 rows=100 width=12) (actual
time=1256.569..1733.903 rows=1 loops=1)
Join Filter: (m.id = size_l.m_id)
-> Seq Scan on size_m m (cost=0.00..897.00 rows=1 width=4)
(actual time=25.182..32.237 rows=1 loops=1)
Filter: (val = '10101'::text)
-> GroupAggregate (cost=79393.64..81592.65 rows=19901 width=277)
(actual time=772.791..1694.848 rows=20000 loops=1)
-> Sort (cost=79393.64..79893.64 rows=200000 width=277)
(actual time=772.754..929.225 rows=200000 loops=1)
Sort Key: size_l.m_id
Sort Method: external sort Disk: 56848kB
-> Seq Scan on size_l (cost=0.00..9830.00 rows=200000
width=277) (actual time=0.030..198.093 rows=200000 loops=1)
Total runtime: 1754.111 ms
(10 rows)

(with patch)
db1=# explain analyze
select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '10101';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..11674.86 rows=100 width=12) (actual
time=119.386..122.478 rows=1 loops=1)
Join Filter: (m.id = size_l.m_id)
-> Seq Scan on size_m m (cost=0.00..897.00 rows=1 width=4)
(actual time=9.720..12.811 rows=1 loops=1)
Filter: (val = '10101'::text)
-> GroupAggregate (cost=0.00..10330.08 rows=1 width=277) (actual
time=109.639..109.640 rows=1 loops=1)
-> Seq Scan on size_l (cost=0.00..10330.00 rows=10
width=277) (actual time=59.138..109.611 rows=10 loops=1)
Filter: (m.id = m_id)
Total runtime: 122.612 ms
(8 rows)

I tested some more cases like "where val IN ('1', '10101')", "where
val between '1' and '10101'", which shows as good as I expected.

Concern
-------
Although execution time will be shorter in many cases, planning time
will be longer. Especially re-planning subquery with pushing join
quals for each joinrel step. I didn't measure the additional cost in
planner stage.

BTW, as I changed title and design from the previous post, should I
throw away the old commit fest entry and make the new one?

Regards,

--
Hitoshi Harada

Attachments:

aggjoin-20110609.patchapplication/octet-stream; name=aggjoin-20110609.patchDownload+311-56
#2Robert Haas
robertmhaas@gmail.com
In reply to: Hitoshi Harada (#1)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

On Thu, Jun 9, 2011 at 2:28 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

BTW, as I changed title and design from the previous post, should I
throw away the old commit fest entry and make the new one?

Nah, just edit the existing entry and change the title.

Also add a link to the new patch, of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Robert Haas (#2)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011/6/9 Robert Haas <robertmhaas@gmail.com>:

On Thu, Jun 9, 2011 at 2:28 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

BTW, as I changed title and design from the previous post, should I
throw away the old commit fest entry and make the new one?

Nah, just edit the existing entry and change the title.

Also add a link to the new patch, of course.

Ok, done.

--
Hitoshi Harada

#4Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Hitoshi Harada (#3)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011/6/10 Hitoshi Harada <umi.tanuki@gmail.com>:

2011/6/9 Robert Haas <robertmhaas@gmail.com>:

On Thu, Jun 9, 2011 at 2:28 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

BTW, as I changed title and design from the previous post, should I
throw away the old commit fest entry and make the new one?

Nah, just edit the existing entry and change the title.

Also add a link to the new patch, of course.

Ok, done.

While reviewing the gist/box patch, I found some planner APIs that can
replace parts in my patch. Also, comments in includes wasn't updated
appropriately. Revised patch attached.

Regards,

--
Hitoshi Harada

Attachments:

aggjoin-20110617.patchapplication/octet-stream; name=aggjoin-20110617.patchDownload+318-63
#5Yeb Havinga
yebhavinga@gmail.com
In reply to: Hitoshi Harada (#4)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

On 2011-06-17 09:54, Hitoshi Harada wrote:

While reviewing the gist/box patch, I found some planner APIs that can
replace parts in my patch. Also, comments in includes wasn't updated
appropriately. Revised patch attached.

Hello Hitoshi-san,

I read your latest patch implementing parameterizing subquery scans.

1)
In the email from june 9 with the patch You wrote: "While IndexScan
is simple since its information like costs are well known by the base
relation, SubqueryScan should re-plan its Query to gain that, which is
expensive."

Initial concerns I had were caused by misinterpreting 'replanning' as:
for each outer tuple, replan the subquery (it sounds a bit like
'ReScan'). Though the general comments in the patch are helpful, I think
it would be an improvement to describe why subqueries are planned more
than once, i.e. something like
"best_inner_subqueryscan
Try to find a better subqueryscan path and its associated plan for
each join clause that can be pushed down, in addition to the path that
is already calculated by set_subquery_pathlist."

The consequences of having multiple subquery paths and plans for each
new subquery path makes the bulk of the remainder of the patch.

2)
Since 'subquery_is_pushdown_safe' is invariant under join clause push
down, it might be possible to have it called only once in
set_subquery_pathlist, i.e. by only attaching rel->preprocessed_subquery
if the subquery_is_pushdown_safe.

3)
/*
* set_subquery_pathlist
* Build the (single) access path for a subquery RTE
*/
This unchanged comment is still correct, but 'the (single) access path'
might have become a bit misleading now there are multiple paths
possible, though not by set_subquery_pathlist.

4) somewhere in the patch
s/regsitered/registered/

5) Regression tests are missing; I think at this point they'd aid in
speeding up development/test cycles.

6) Before patching Postgres, I could execute the test with the size_l
and size_m tables, after patching against current git HEAD (patch
without errors), I get the following error when running the example query:
ERROR: plan should not reference subplan's variable

I get the same error with the version from june 9 with current git HEAD.

7) Since both set_subquery_pathlist and best_inner_subqueryscan push
down clauses, as well as add a path and subplan, could a generalized
function be made to support both, to reduce duplicate code?

regards,
Yeb Havinga

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

#6Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Yeb Havinga (#5)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011/6/29 Yeb Havinga <yebhavinga@gmail.com>:

On 2011-06-17 09:54, Hitoshi Harada wrote:

While reviewing the gist/box patch, I found some planner APIs that can
replace parts in my patch. Also, comments in includes wasn't updated
appropriately. Revised patch attached.

Hello Hitoshi-san,

Hi Yeb,

I read your latest patch implementing parameterizing subquery scans.

6) Before patching Postgres, I could execute the test with the size_l and
size_m tables, after patching against current git HEAD (patch without
errors), I get the following error when running the example query:
ERROR:  plan should not reference subplan's variable

I get the same error with the version from june 9 with current git HEAD.

It seems that something has changed since I developed the patch at
first. The last one and the one before are not so different with each
other, especially in terms of runtime but might not be tested enough.
I tried time-slip of the local git branch to around june 10, but the
same error occurs. The error message itself is not in parts changed
recently, so I guess some surrounding change would affect it.

7) Since both set_subquery_pathlist and best_inner_subqueryscan push down
clauses, as well as add a path and subplan, could a generalized function be
made to support both, to reduce duplicate code?

I tried it but decided as it is, for the future extensibility. The
subquery parameterizing will (can) be extended more complicatedly. I'm
not sure if we'd better gathering some small parts into one by
throwing future capability.

Other things are all good points. Thanks for elaborate review!
More than anything, I'm going to fix the 6) issue, at least to find the cause.

Regards,
--
Hitoshi Harada

#7Yeb Havinga
yebhavinga@gmail.com
In reply to: Hitoshi Harada (#6)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

On 2011-06-29 19:22, Hitoshi Harada wrote:

Other things are all good points. Thanks for elaborate review!
More than anything, I'm going to fix the 6) issue, at least to find the cause.

Some more questions:
8) why are cheapest start path and cheapest total path in
best_inner_subqueryscan the same?

9) as remarked up a different thread, more than one clause could be
pushed down a subquery. The current patch only considers alternative
paths that have at most one clause pushed down. Is this because of the
call site of best_inner_subqueryscan, i.e. under one join rel? Would it
be an idea to consider, for each subquery, only a single alternative
plan that tries to have all suitable clauses pushed down?

10) I have a hard time imagining use cases that will actually result in
a alternative plan, especially since not all subqueries are allowed to
have quals pushed down into, and like Simon Riggs pointed out that many
users will write queries like this with the subqueries pulled up. If it
is the case that the subqueries that can't be pulled up have a large
overlap with the ones that are not pushdown safe (limit, set operations
etc), there might be little actual use cases for this patch.

I think the most important thing for this patch to go forward is to have
a few examples, from which it's clear that the patch is beneficial.

regards,

--

Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

#8Yeb Havinga
yebhavinga@gmail.com
In reply to: Yeb Havinga (#7)
Re: Parameterized aggregate subquery

On 2011-06-30 09:39, Yeb Havinga wrote:

9) as remarked up a different thread, more than one clause could be
pushed down a subquery. The current patch only considers alternative
paths that have at most one clause pushed down. Is this because of the
call site of best_inner_subqueryscan, i.e. under one join rel? Would
it be an idea to consider, for each subquery, only a single
alternative plan that tries to have all suitable clauses pushed down?

Ehm, please forget this remark, I've mistaken join rel for base rel.

-- Yeb

#9Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Yeb Havinga (#7)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011/6/30 Yeb Havinga <yebhavinga@gmail.com>:

On 2011-06-29 19:22, Hitoshi Harada wrote:

Other things are all good points. Thanks for elaborate review!
More than anything, I'm going to fix the 6) issue, at least to find the
cause.

Some more questions:
8) why are cheapest start path and cheapest total path in
best_inner_subqueryscan the same?

Because best_inner_indexscan has the two. Actually one of them is
enough so far. I aligned it as the existing interface but they might
be one.

10) I have a hard time imagining use cases that will actually result in a
alternative plan, especially since not all subqueries are allowed to have
quals pushed down into, and like Simon Riggs pointed out that many users
will write queries like this with the subqueries pulled up. If it is the
case that the subqueries that can't be pulled up have a large overlap with
the ones that are not pushdown safe (limit, set operations etc), there might
be little actual use cases for this patch.

I have seen many cases that this planner hack would help
significantly, which were difficult to rewrite. Why were they
difficult to write? Because, quals on size_m (and they have quals on
size_l in fact) are usually very complicated (5-10 op clauses) and the
join+agg part itself is kind of subquery in other big query. Of course
there were workaround like split the statement to two, filtering
size_m then aggregate size_l by the result of the first statement.
However, it's against instinct. The reason why planner is in RDBMS is
to let users to write simple (as needed) statements. I don't know if
the example I raise here is common or not, but I believe the example
represents "one to many" relation simply, therefore there should be
many users who just don't find themselves currently in the slow query
performance.

I think the most important thing for this patch to go forward is to have a
few examples, from which it's clear that the patch is beneficial.

What will be good examples to show benefit of the patch? I guess the
test case of size_m/size_l shows it. What lacks on the case, do you
think?

Regards,

--
Hitoshi Harada

#10Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Yeb Havinga (#5)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011/6/29 Yeb Havinga <yebhavinga@gmail.com>:

On 2011-06-17 09:54, Hitoshi Harada wrote:

While reviewing the gist/box patch, I found some planner APIs that can
replace parts in my patch. Also, comments in includes wasn't updated
appropriately. Revised patch attached.

Hello Hitoshi-san,

I read your latest patch implementing parameterizing subquery scans.

Attached is revised version.

1)
In the email from june 9 with the patch You wrote: "While IndexScan
is simple since its information like costs are well known by the base
relation, SubqueryScan should re-plan its Query to gain that, which is
expensive."

Initial concerns I had were caused by misinterpreting 'replanning' as: for
each outer tuple, replan the subquery (it sounds a bit like 'ReScan').
Though the general comments in the patch are helpful, I think it would be an
improvement to describe why subqueries are planned more than once, i.e.
something like
"best_inner_subqueryscan
  Try to find a better subqueryscan path and its associated plan for each
join clause that can be pushed down, in addition to the path that is already
calculated by set_subquery_pathlist."

I changed comments around set_subquery_pathlist and best_inner_subqueryscan.

2)
Since 'subquery_is_pushdown_safe' is invariant under join clause push down,
it might be possible to have it called only once in set_subquery_pathlist,
i.e. by only attaching rel->preprocessed_subquery if the
subquery_is_pushdown_safe.

I modified as you suggested.

3)
/*
 * set_subquery_pathlist
 *        Build the (single) access path for a subquery RTE
 */
This unchanged comment is still correct, but 'the (single) access path'
might have become a bit misleading now there are multiple paths possible,
though not by set_subquery_pathlist.

As noted #1.

4) somewhere in the patch
s/regsitered/registered/

Fixed.

5) Regression tests are missing; I think at this point they'd aid in
speeding up development/test cycles.

I'm still thinking about it. I can add complex test but the concept of
regression test focuses on small pieces of simple cases. I don't want
take pg_regress much more than before.

6) Before patching Postgres, I could execute the test with the size_l and
size_m tables, after patching against current git HEAD (patch without
errors), I get the following error when running the example query:
ERROR:  plan should not reference subplan's variable

I get the same error with the version from june 9 with current git HEAD.

Fixed. Some variable initializing was wrong.

7) Since both set_subquery_pathlist and best_inner_subqueryscan push down
clauses, as well as add a path and subplan, could a generalized function be
made to support both, to reduce duplicate code?

No touch as answered before.

Although I still need to think about suitable regression test case,
the patch itself can be reviewed again. You may want to try some
additional tests as you imagine after finding my test case gets
quicker.

Thanks,

--
Hitoshi Harada

#11Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Hitoshi Harada (#10)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011/7/2 Hitoshi Harada <umi.tanuki@gmail.com>:

2011/6/29 Yeb Havinga <yebhavinga@gmail.com>:

On 2011-06-17 09:54, Hitoshi Harada wrote:

While reviewing the gist/box patch, I found some planner APIs that can
replace parts in my patch. Also, comments in includes wasn't updated
appropriately. Revised patch attached.

Hello Hitoshi-san,

I read your latest patch implementing parameterizing subquery scans.

Attached is revised version.

I failed to attached the patch. I'm trying again.

1)
In the email from june 9 with the patch You wrote: "While IndexScan
is simple since its information like costs are well known by the base
relation, SubqueryScan should re-plan its Query to gain that, which is
expensive."

Initial concerns I had were caused by misinterpreting 'replanning' as: for
each outer tuple, replan the subquery (it sounds a bit like 'ReScan').
Though the general comments in the patch are helpful, I think it would be an
improvement to describe why subqueries are planned more than once, i.e.
something like
"best_inner_subqueryscan
  Try to find a better subqueryscan path and its associated plan for each
join clause that can be pushed down, in addition to the path that is already
calculated by set_subquery_pathlist."

I changed comments around set_subquery_pathlist and best_inner_subqueryscan.

2)
Since 'subquery_is_pushdown_safe' is invariant under join clause push down,
it might be possible to have it called only once in set_subquery_pathlist,
i.e. by only attaching rel->preprocessed_subquery if the
subquery_is_pushdown_safe.

I modified as you suggested.

3)
/*
 * set_subquery_pathlist
 *        Build the (single) access path for a subquery RTE
 */
This unchanged comment is still correct, but 'the (single) access path'
might have become a bit misleading now there are multiple paths possible,
though not by set_subquery_pathlist.

As noted #1.

4) somewhere in the patch
s/regsitered/registered/

Fixed.

5) Regression tests are missing; I think at this point they'd aid in
speeding up development/test cycles.

I'm still thinking about it. I can add complex test but the concept of
regression test focuses on small pieces of simple cases. I don't want
take pg_regress much more than before.

6) Before patching Postgres, I could execute the test with the size_l and
size_m tables, after patching against current git HEAD (patch without
errors), I get the following error when running the example query:
ERROR:  plan should not reference subplan's variable

I get the same error with the version from june 9 with current git HEAD.

Fixed. Some variable initializing was wrong.

7) Since both set_subquery_pathlist and best_inner_subqueryscan push down
clauses, as well as add a path and subplan, could a generalized function be
made to support both, to reduce duplicate code?

No touch as answered before.

Although I still need to think about suitable regression test case,
the patch itself can be reviewed again. You may want to try some
additional tests as you imagine after finding my test case gets
quicker.

Thanks,

--
Hitoshi Harada

--
Hitoshi Harada

Attachments:

aggjoin-20110702.patchtext/x-patch; charset=US-ASCII; name=aggjoin-20110702.patchDownload+332-64
#12Yeb Havinga
yebhavinga@gmail.com
In reply to: Hitoshi Harada (#11)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hello Hitosh, list,

Attached is revised version.

I failed to attached the patch. I'm trying again.

I'm currently unable to test, since on holiday. I'm happy to continue

testing once returned but it may not be within the bounds of the current
commitfest, sorry.

5) Regression tests are missing; I think at this point they'd aid in
speeding up development/test cycles.

I'm still thinking about it. I can add complex test but the concept of
regression test focuses on small pieces of simple cases. I don't want
take pg_regress much more than before.

IMHO, at least one query where the new code is touched is a good idea.

I get the same error with the version from june 9 with current git HEAD.

Fixed. Some variable initializing was wrong.

Thanks - will test when I am back from holiday and other duties.

regards,
Yeb

#13Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Yeb Havinga (#12)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011/7/5 Yeb Havinga <yebhavinga@gmail.com>:

Hello Hitosh, list,

Attached is revised version.

I failed to attached the patch. I'm trying again.

I'm currently unable to test, since on holiday. I'm happy to continue
testing once returned but it may not be within the bounds of the current
commitfest, sorry.

Thanks for replying. Regarding the time remained until the end of this
commitfest, I'd think we should mark this item "Returned with
Feedback" if no objection appears. I will be very happy if Yeb tests
my latest patch after he comes back. I'll make up my mind around
regression test.

Regards,

--
Hitoshi Harada

#14Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Hitoshi Harada (#13)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011/7/5 Hitoshi Harada <umi.tanuki@gmail.com>:

2011/7/5 Yeb Havinga <yebhavinga@gmail.com>:

Hello Hitosh, list,

Attached is revised version.

I failed to attached the patch. I'm trying again.

I'm currently unable to test, since on holiday. I'm happy to continue
testing once returned but it may not be within the bounds of the current
commitfest, sorry.

Thanks for replying. Regarding the time remained until the end of this
commitfest, I'd think we should mark this item "Returned with
Feedback" if no objection appears. I will be very happy if Yeb tests
my latest patch after he comes back. I'll make up my mind around
regression test.

It seems that Yeb marked "Returned with Feedback". Thanks for the
review again. Still, I'd appreciate if further review is done on my
latest patch.

Thanks,
--
Hitoshi Harada

#15Yeb Havinga
yebhavinga@gmail.com
In reply to: Hitoshi Harada (#14)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

On 2011-07-09 16:23, Hitoshi Harada wrote:

2011/7/5 Hitoshi Harada<umi.tanuki@gmail.com>:

2011/7/5 Yeb Havinga<yebhavinga@gmail.com>:

Hello Hitosh, list,

Attached is revised version.

I failed to attached the patch. I'm trying again.

I'm currently unable to test, since on holiday. I'm happy to continue
testing once returned but it may not be within the bounds of the current
commitfest, sorry.

Thanks for replying. Regarding the time remained until the end of this
commitfest, I'd think we should mark this item "Returned with
Feedback" if no objection appears. I will be very happy if Yeb tests
my latest patch after he comes back. I'll make up my mind around
regression test.

It seems that Yeb marked "Returned with Feedback". Thanks for the
review again. Still, I'd appreciate if further review is done on my
latest patch.

Yes, I did so after you suggested to return it to feedback. I'll review
your latest patch as soon as there is enough time to do a proper review,
which is probably after next week.

regards,
Yeb Havinga

#16Yeb Havinga
yebhavinga@gmail.com
In reply to: Hitoshi Harada (#11)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

On 2011-07-02 10:02, Hitoshi Harada wrote:

Although I still need to think about suitable regression test case,
the patch itself can be reviewed again. You may want to try some
additional tests as you imagine after finding my test case gets
quicker.

Hello Hitoshi-san,

I took a look at your latest patch and it looks good, no comments.
However I also tried it against current 9.2 HEAD and the test query of
the start of this thread.

Before and after applying the patch, I get the same result for the test
query.

postgres=# explain select m_id, sum_len from size_m m inner join(select
m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '10101';
QUERY PLAN
----------------------------------------------------------------------------------
Nested Loop (cost=79392.64..82938.05 rows=100 width=12)
Join Filter: (m.id = size_l.m_id)
-> Seq Scan on size_m m (cost=0.00..897.00 rows=1 width=4)
Filter: (val = '10101'::text)
-> GroupAggregate (cost=79392.64..81592.15 rows=19951 width=277)
-> Sort (cost=79392.64..79892.64 rows=200000 width=277)
Sort Key: size_l.m_id
-> Seq Scan on size_l (cost=0.00..9829.00 rows=200000
width=277)

I double checked that I had applied the patch (git diff shows the
patch), installed and restarted postgres. The database is a fresh
created database with no edits in postgresql.conf.

regards,

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

#17Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Yeb Havinga (#16)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011/7/22 Yeb Havinga <yebhavinga@gmail.com>:

On 2011-07-02 10:02, Hitoshi Harada wrote:

Although I still need to think about suitable regression test case,
the patch itself can be reviewed again. You may want to try some
additional tests as you imagine after finding my test case gets
quicker.

Hello Hitoshi-san,

Hi,

I double checked that I had applied the patch (git diff shows the patch),
installed and restarted postgres. The database is a fresh created database
with no edits in postgresql.conf.

:(
I updated the patch. Could you try attached once more? The "issafe"
switch seems wrong.

Thanks,
--
Hitoshi Harada

Attachments:

aggjoin-20110722.patchapplication/octet-stream; name=aggjoin-20110722.patchDownload+323-63
#18Yeb Havinga
yebhavinga@gmail.com
In reply to: Hitoshi Harada (#17)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

On 2011-07-22 16:17, Hitoshi Harada wrote:

:(
I updated the patch. Could you try attached once more? The "issafe"
switch seems wrong.

Works like a charm :-). However, now there is always a copyObject of a
subquery even when the subquery is not safe for qual pushdown. The
problem with the previous issafe was that it was only assigned for
rel->baserestrictinfo != NIL. If it is assigned before the if statement,
it still works. See attached patch that avoids subquery copy for unsafe
subqueries, and also exits best_inner_subqueryscan before palloc of
differenttypes in case of unsafe queries.

regards,
--

Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

Attachments:

aggjoin-v2.patchtext/x-patch; name=aggjoin-v2.patchDownload+381-177
#19Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Yeb Havinga (#18)
Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011/7/23 Yeb Havinga <yebhavinga@gmail.com>:

On 2011-07-22 16:17, Hitoshi Harada wrote:

:(
I updated the patch. Could you try attached once more? The "issafe"
switch seems wrong.

Works like a charm :-). However, now there is always a copyObject of a
subquery even when the subquery is not safe for qual pushdown. The problem
with the previous issafe was that it was only assigned for
rel->baserestrictinfo != NIL. If it is assigned before the if statement, it
still works. See attached patch that avoids subquery copy for unsafe
subqueries, and also exits best_inner_subqueryscan before palloc of
differenttypes in case of unsafe queries.

Ah, yeah, right. Too quick fix bloated my brain :P Thanks for testing!
I'll check it more.

--
Hitoshi Harada

#20Yeb Havinga
yebhavinga@gmail.com
In reply to: Hitoshi Harada (#19)
Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

On 2011-07-22 17:35, Hitoshi Harada wrote:

2011/7/23 Yeb Havinga<yebhavinga@gmail.com>:

Works like a charm :-). However, now there is always a copyObject of a
subquery even when the subquery is not safe for qual pushdown. The problem
with the previous issafe was that it was only assigned for
rel->baserestrictinfo != NIL. If it is assigned before the if statement, it
still works. See attached patch that avoids subquery copy for unsafe
subqueries, and also exits best_inner_subqueryscan before palloc of
differenttypes in case of unsafe queries.

Ah, yeah, right. Too quick fix bloated my brain :P Thanks for testing!
I'll check it more.

A few days ago I read Tomas Vondra's blog post about dss tpc-h queries
on PostgreSQL at
http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in
which he showed how to manually pull up a dss subquery to get a large
speed up. Initially I thought: cool, this is probably now handled by
Hitoshi's patch, but it turns out the subquery type in the dss query is
different.

The original and rewritten queries are below. The debug_print_plan
output shows the subquery is called from a opexpr (< l_quantity,
subquery output) and the sublink type is EXPR_SUBLINK. Looking at the
source code; pull_up_sublink only considers ANY and EXISTS sublinks. I'm
wondering if this could be expanded to deal with EXPR sublinks. Clearly
in the example Tomas has given this can be done. I'm wondering if there
are show stoppers that prevent this to be possible in the general case,
but can't think of any, other than the case of a sublink returning NULL
and the opexpr is part of a larger OR expression or IS NULL; in which
case it should not be pulled op, or perhaps it could be pulled up as
outer join.

Thoughts?

regards,
Yeb

The original query:

tpch=# explain select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part
where
p_partkey = l_partkey
and p_brand = 'Brand#13'
and p_container = 'JUMBO PKG'
and l_quantity < (
select
0.2 * avg(l_quantity)
from
lineitem
where
l_partkey = p_partkey
)
LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Limit (cost=183345309.79..183345309.81 rows=1 width=8)
-> Aggregate (cost=183345309.79..183345309.81 rows=1 width=8)
-> Hash Join (cost=2839.99..183345307.76 rows=815 width=8)
Hash Cond: (public.lineitem.l_partkey = part.p_partkey)
Join Filter: (public.lineitem.l_quantity < (SubPlan 1))
-> Seq Scan on lineitem (cost=0.00..68985.69
rows=2399869 width=17)
-> Hash (cost=2839.00..2839.00 rows=79 width=4)
-> Seq Scan on part (cost=0.00..2839.00 rows=79
width=4)
Filter: ((p_brand = 'Brand#13'::bpchar) AND
(p_container = 'JUMBO PKG'::bpchar))
SubPlan 1
-> Aggregate (cost=74985.44..74985.46 rows=1 width=5)
-> Seq Scan on lineitem (cost=0.00..74985.36
rows=31 width=5)
Filter: (l_partkey = part.p_partkey)

manually rewritten variant:

tpch=# explain select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part,
(SELECT l_partkey AS agg_partkey,
0.2 * avg(l_quantity) AS avg_quantity
FROM lineitem GROUP BY l_partkey) part_agg
where
p_partkey = l_partkey
and agg_partkey = l_partkey
and p_brand = 'Brand#13'
and p_container = 'JUMBO PKG'
and l_quantity < avg_quantity
LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Limit (cost=179643.88..179643.89 rows=1 width=8)
-> Aggregate (cost=179643.88..179643.89 rows=1 width=8)
-> Hash Join (cost=161865.21..178853.91 rows=315985 width=8)
Hash Cond: (public.lineitem.l_partkey = part.p_partkey)
Join Filter: (public.lineitem.l_quantity < ((0.2 *
avg(public.lineitem.l_quantity))))
-> HashAggregate (cost=80985.04..82148.65 rows=77574
width=9)
-> Seq Scan on lineitem (cost=0.00..68985.69
rows=2399869 width=9)
-> Hash (cost=80849.63..80849.63 rows=2444 width=21)
-> Hash Join (cost=2839.99..80849.63 rows=2444
width=21)
Hash Cond: (public.lineitem.l_partkey =
part.p_partkey)
-> Seq Scan on lineitem
(cost=0.00..68985.69 rows=2399869 width=17)
-> Hash (cost=2839.00..2839.00 rows=79
width=4)
-> Seq Scan on part
(cost=0.00..2839.00 rows=79 width=4)
Filter: ((p_brand =
'Brand#13'::bpchar) AND (p_container = 'JUMBO PKG'::bpchar))

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Yeb Havinga (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#21)
#23Yeb Havinga
yebhavinga@gmail.com
In reply to: Robert Haas (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Yeb Havinga (#23)
#25Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Yeb Havinga (#20)
#26Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Tom Lane (#21)