Pull up aggregate subquery

Started by Hitoshi Haradaalmost 15 years ago25 messageshackers
Jump to latest
#1Hitoshi Harada
umi.tanuki@gmail.com

I sometimes wonder if we could pull up aggregate query in the optimizer.

My typical problem is:

Consider two relations, medium size M and large size L. L has
reference column to M's primary key. Say,

create table size_m as select i as id, repeat(i::text, i % 100) as
val from generate_series(1, 20000) i;
create table size_l as select i as id, m.id as m_id,
repeat(i::text, i % 100) as val from generate_series(1, 100000) i,
size_m m where i.i / 10 = m.id;

Now, you want to query M under some condition with join aggregate L
group by M's primary key.

select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '1';

The generated plan is:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=36116.92..38339.67 rows=50 width=235) (actual
time=440.679..1039.964 rows=1 loops=1)
Join Filter: (m.id = size_l.m_id)
-> Seq Scan on size_m m (cost=0.00..897.00 rows=1 width=227)
(actual time=0.021..16.698 rows=1 loops=1)
Filter: (val = '1'::text)
-> GroupAggregate (cost=36116.92..37217.09 rows=10026 width=248)
(actual time=440.651..1013.704 rows=10000 loops=1)
-> Sort (cost=36116.92..36366.90 rows=99991 width=248)
(actual time=440.619..593.062 rows=99991 loops=1)
Sort Key: size_l.m_id
Sort Method: external sort Disk: 25736kB
-> Seq Scan on size_l (cost=0.00..4565.91 rows=99991
width=248) (actual time=0.011..138.243 rows=99991 loops=1)
Total runtime: 1044.039 ms
(10 rows)

Note that most of the result of aggregate is discarded on join M,
because M resulted in small output with filter by M.val. If we can
filter M first and filter L by the M's result before running
aggregate, the response may dramatically get faster.

If you filter by M.id instead of M.val the optimizer is smart enough
to push down the condition to L, which is filtered before aggregate.

select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where id = 1;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..5713.02 rows=2 width=235) (actual
time=72.121..82.364 rows=1 loops=1)
-> Seq Scan on size_m m (cost=0.00..897.00 rows=1 width=227)
(actual time=0.028..10.252 rows=1 loops=1)
Filter: (id = 1)
-> GroupAggregate (cost=0.00..4815.98 rows=2 width=248) (actual
time=72.065..72.067 rows=1 loops=1)
-> Seq Scan on size_l (cost=0.00..4815.89 rows=10
width=248) (actual time=0.051..71.968 rows=10 loops=1)
Filter: (m_id = 1)
Total runtime: 82.525 ms
(7 rows)

This seems like the benefit of EquivalentClass. 1 = M.id = L.m_id is
implied and the optimizer adds implicit constant filter L.m_id = 1 in
the plan tree. In contrast, in the former case of M.val = '1' doesn't
imply any condition for L.m_id. That's fair enough.

However, I think we can filter L by L.m_id before aggregate because
L.m_id is of the group clause as well as the join condition and M.id
is unique in M. In such cases, the query can be transform something
like:
GroupAggregate
-> NestLoop (L.m_id = M.id)
-> SeqScan L
-> SeqScan M (filter: M.val = '1')
This transformation can be done by pulling up aggregate Query in
pull_up_subqueries(). Currently the optimizer doesn't pull up any
queries which contains aggregate, but as shown above in some cases we
can do it. Attached is WIP proof of concept patch to do it. I know it
breaks general queries but it transforms as I described above. I
suppose the missing piece is adding condition of "when" to pull up
aggregate. "how" is done in the patch.

db1=# explain select m_id, sum_len from size_m m inner join(select
m_id, sum(length(val)) as sum_len from size_l group by m_id)l on m.id
= l.m_id where val = '1';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=6712.96..6713.16 rows=10 width=471) (actual
time=125.496..125.499 rows=1 loops=1)
-> Sort (cost=6712.96..6712.99 rows=10 width=471) (actual
time=125.228..125.288 rows=10 loops=1)
Sort Key: size_l.m_id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=0.00..6712.80 rows=10 width=471)
(actual time=0.142..125.089 rows=10 loops=1)
Join Filter: (m.id = size_l.m_id)
-> Seq Scan on size_m m (cost=0.00..897.00 rows=1
width=227) (actual time=0.037..8.956 rows=1 loops=1)
Filter: (val = '1'::text)
-> Seq Scan on size_l (cost=0.00..4565.91 rows=99991
width=248) (actual time=0.060..65.910 rows=99991 loops=1)
Total runtime: 125.658 ms
(10 rows)

(You may notice that by the transformation non-group/non-agg
TargetEntry is in the Agg node, but actually there's no check for that
during optimizer and executor. May be thanks to Functional
Dependency.)

Comments?

Regards,

--
Hitoshi Harada

Attachments:

pullup-group.patchtext/x-patch; charset=US-ASCII; name=pullup-group.patchDownload+24-5
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hitoshi Harada (#1)
Re: Pull up aggregate subquery

Hitoshi Harada <umi.tanuki@gmail.com> writes:

I sometimes wonder if we could pull up aggregate query in the optimizer.

I don't have time to look at this right now, but please add to the
upcoming commitfest.

regards, tom lane

#3Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Tom Lane (#2)
Re: Pull up aggregate subquery

2011/5/5 Tom Lane <tgl@sss.pgh.pa.us>:

Hitoshi Harada <umi.tanuki@gmail.com> writes:

I sometimes wonder if we could pull up aggregate query in the optimizer.

I don't have time to look at this right now, but please add to the
upcoming commitfest.

Done.
https://commitfest.postgresql.org/action/patch_view?id=548

I'll work further if I find time.

Regards,

--
Hitoshi Harada

#4Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Hitoshi Harada (#3)
Re: Pull up aggregate subquery

2011/5/5 Hitoshi Harada <umi.tanuki@gmail.com>:

https://commitfest.postgresql.org/action/patch_view?id=548

I'll work further if I find time.

After more thought, pulling up aggregate subquery in narrow
conditional cases is quite hard path, especially when the joinrel is
more than 2. It will be hard to check pulling up is safe for other
relations than the target relation.

It was a big shame I missed Tom Lane's session in PGCon, but finding
"Parameterized Scan" in his slides, it occurred to me that it might
help my problem, too. Before hitting the "pull up" idea, I once
thought if it would be possible to push outer Var of join down to
Agg's HAVING, which is transferred to underlying SeqScan's filter.
Resulted in something like:

NestLoop
-> SeqScan M (filter: M.val = '1')
-> GroupAggregate
-> SeqScan M (filter: L.m_id = M.id)

However, currently we don't have such mechanism to push down Var as a
qual to non-NestLoop. Yeah, it could be even now, but we should avoid
N-loop of Agg. We want to scan Agg once, with Param $1 = M.id =
multiple values. Since I didn't attend his session I'm afraid I don't
understand "Parameterized Scan" correctly, but once we've got such
mechanism, one example introduced in Robert Haas's blog[1]http://rhaas.blogspot.com/2010/04/finding-our-way-to-lateral.html (originally
shown by Andrew Gierth[2]http://archives.postgresql.org/pgsql-hackers/2009-09/msg00525.php) and LATERAL maybe.

Do I understand correctly? If so, could someone explain more detail of
how to get Parameterized Scan in the planner?

Regards,

[1]: http://rhaas.blogspot.com/2010/04/finding-our-way-to-lateral.html
[2]: http://archives.postgresql.org/pgsql-hackers/2009-09/msg00525.php

--
Hitoshi Harada

#5Robert Haas
robertmhaas@gmail.com
In reply to: Hitoshi Harada (#4)
Re: Pull up aggregate subquery

On Sat, May 21, 2011 at 12:49 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

2011/5/5 Hitoshi Harada <umi.tanuki@gmail.com>:

https://commitfest.postgresql.org/action/patch_view?id=548

I'll work further if I find time.

After more thought, pulling up aggregate subquery in narrow
conditional cases is quite hard path, especially when the joinrel is
more than 2. It will be hard to check pulling up is safe for other
relations than the target relation.

It was a big shame I missed Tom Lane's session in PGCon, but finding
"Parameterized Scan" in his slides, it occurred to me that it might
help my problem, too. Before hitting the "pull up" idea, I once
thought if it would be possible to push outer Var of join down to
Agg's HAVING, which is transferred to underlying SeqScan's filter.
Resulted in something like:

 NestLoop
   -> SeqScan M (filter: M.val = '1')
   -> GroupAggregate
     -> SeqScan M (filter: L.m_id = M.id)

However, currently we don't have such mechanism to push down Var as a
qual to non-NestLoop. Yeah, it could be even now, but we should avoid
N-loop of Agg. We want to scan Agg once, with Param $1 = M.id =
multiple values. Since I didn't attend his session I'm afraid I don't
understand "Parameterized Scan" correctly, but once we've got such
mechanism, one example introduced in Robert Haas's blog[1] (originally
shown by Andrew Gierth[2])  and LATERAL maybe.

Do I understand correctly? If so, could someone explain more detail of
how to get Parameterized Scan in the planner?

I think we're going to need Tom to give the definitive word on this,
but I believe that the current situation is that the executor is
capable of handling a parameterized scan (yeah!) but the planner
doesn't know how to generate them (boo!). This is an improvement of
a sort over the 9.0 code base, where neither the planner nor the
executor could handle this case, but we need planner to support in
order to get anywhere useful with it.

The problem is how to figure out whether a parameterized scan is a win
without expending too much planning time. For example, in the case
you mention:

select m_id, sum_len from size_m m inner join (select m_id,
sum(length(val)) as sum_len from size_l group by m_id) l on m.id =
l.m_id where val = '1';

...we'd need to plan the subquery twice, once with a parameterized
qual m_id = $1 pushed down, and once without that. We could then
compare the cost of a nest-loop with the qual to the cost of a merge
or hash join without it. But this seems very expensive. In the
particular case you have here, the subquery is simple enough that this
probably wouldn't be any big deal, but in general there's no reason
why that subquery couldn't be quite complex - or why it couldn't have
subqueries of its own that would requite the same treatment
recursively.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#5)
Re: Pull up aggregate subquery

Robert Haas <robertmhaas@gmail.com> writes:

On Sat, May 21, 2011 at 12:49 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

Do I understand correctly? If so, could someone explain more detail of
how to get Parameterized Scan in the planner?

I think we're going to need Tom to give the definitive word on this,
but I believe that the current situation is that the executor is
capable of handling a parameterized scan (yeah!) but the planner
doesn't know how to generate them (boo!). This is an improvement of
a sort over the 9.0 code base, where neither the planner nor the
executor could handle this case, but we need planner to support in
order to get anywhere useful with it.

Yeah. I fixed the executor in 9.1, but got hung up on how to fix the
planner. So the planner still only knows how to do this for the case of
an inner indexscan, ie, it can't handle parameterizing any piece of a
plan larger than a single indexscan. I have some ideas about fixing
that but have had no time to pursue them since December :-(

...we'd need to plan the subquery twice, once with a parameterized
qual m_id = $1 pushed down, and once without that. We could then
compare the cost of a nest-loop with the qual to the cost of a merge
or hash join without it. But this seems very expensive. In the
particular case you have here, the subquery is simple enough that this
probably wouldn't be any big deal, but in general there's no reason
why that subquery couldn't be quite complex - or why it couldn't have
subqueries of its own that would requite the same treatment
recursively.

Yeah. For simple scan/join queries it seems likely that we only care
about parameterizing indexscans, since the main opportunity for a win is
to not scan all of a large table. Restricting things that way would
help reduce the number of extra Paths to carry around. But I'm not sure
whether the same argument can be made for arbitrary subqueries.

regards, tom lane

#7Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Robert Haas (#5)
Re: Pull up aggregate subquery

2011/5/24 Robert Haas <robertmhaas@gmail.com>:

On Sat, May 21, 2011 at 12:49 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

2011/5/5 Hitoshi Harada <umi.tanuki@gmail.com>:
Do I understand correctly? If so, could someone explain more detail of
how to get Parameterized Scan in the planner?

I think we're going to need Tom to give the definitive word on this,
but I believe that the current situation is that the executor is
capable of handling a parameterized scan (yeah!)

Ah, greping git log pointed me to
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=53e757689ce94520f1c53a89dbaa14ea57b09da7

The problem is how to figure out whether a parameterized scan is a win
without expending too much planning time.  For example, in the case
you mention:

select m_id, sum_len from size_m m inner join (select m_id,
sum(length(val)) as sum_len from size_l group by m_id) l on m.id =
l.m_id where val = '1';

...we'd need to plan the subquery twice, once with a parameterized
qual m_id = $1 pushed down, and once without that.  We could then
compare the cost of a nest-loop with the qual to the cost of a merge
or hash join without it.  But this seems very expensive.  In the
particular case you have here, the subquery is simple enough that this
probably wouldn't be any big deal, but in general there's no reason
why that subquery couldn't be quite complex - or why it couldn't have
subqueries of its own that would requite the same treatment
recursively.

That's true. But if the planning cost is an only issue, why not adding
new GUC for user to choose if they prefer it or not? Of course if we
have some method to predict which way to go before proving both ways,
it's great. Do you have some blue picture on it?

In addition, I wonder if the "generalized nestloop" pattern can do
whole outer scan before proving inner? I mean, in my example if the
outer scan qualified more than one tuple I'm afraid that inner Agg and
its underlying SeqScan are re-scanned more than one. That will bloat
performance if the Agg is expensive. It would be great if we can outer
scan can do scan to the end and stores param variables in something
like array and prove once inner Agg, so that we can ensure the Agg is
computed once. To do that, do we need more work on executor?

Regards,

--
Hitoshi Harada

#8Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Tom Lane (#6)
Re: Pull up aggregate subquery

2011/5/24 Tom Lane <tgl@sss.pgh.pa.us>:

Robert Haas <robertmhaas@gmail.com> writes:
Yeah.  I fixed the executor in 9.1, but got hung up on how to fix the
planner.  So the planner still only knows how to do this for the case of
an inner indexscan, ie, it can't handle parameterizing any piece of a
plan larger than a single indexscan.  I have some ideas about fixing
that but have had no time to pursue them since December :-(

If you can afford, could you describe the ideas? Or I'm glad if you
tell where to start looking. Anyway I'm going to dig into optimizer
more by myself.

But I'm not sure
whether the same argument can be made for arbitrary subqueries.

I guess it's hard to apply the mechanism to arbitrary subqueries, but
there are many use cases like my example, Agg node in inner.

Regards,

--
Hitoshi Harada

#9Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: Pull up aggregate subquery

On Mon, May 23, 2011 at 4:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

...we'd need to plan the subquery twice, once with a parameterized
qual m_id = $1 pushed down, and once without that.  We could then
compare the cost of a nest-loop with the qual to the cost of a merge
or hash join without it.  But this seems very expensive.  In the
particular case you have here, the subquery is simple enough that this
probably wouldn't be any big deal, but in general there's no reason
why that subquery couldn't be quite complex - or why it couldn't have
subqueries of its own that would requite the same treatment
recursively.

Yeah.  For simple scan/join queries it seems likely that we only care
about parameterizing indexscans, since the main opportunity for a win is
to not scan all of a large table.  Restricting things that way would
help reduce the number of extra Paths to carry around.  But I'm not sure
whether the same argument can be made for arbitrary subqueries.

I must be misunderstanding you, because index scans are the thing we
already *do* parameterize; and what else would make any sense?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Robert Haas
robertmhaas@gmail.com
In reply to: Hitoshi Harada (#7)
Re: Pull up aggregate subquery

On Mon, May 23, 2011 at 10:47 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

That's true. But if the planning cost is an only issue, why not adding
new GUC for user to choose if they prefer it or not? Of course if we
have some method to predict which way to go before proving both ways,
it's great. Do you have some blue picture on it?

I don't really like the idea of adding a GUC for this, unless we
convince ourselves that nothing else is sensible. I mean, that leads
to conversations like this:

Newbie: My query is slow.
Hacker: Turn on enable_magic_pixie_dust and it'll get better.
Newbie: Oh, yeah, that is better. Why isn't this turned on by default, anyway?
Hacker: Well, on pathological queries, it makes planning take way too
long, so we think it's not really safe to enable it by default.
Newbie: Wait... I thought you just told me to enable it. It's not safe?
Hacker: Well, sort of. I mean, uh... hey, look, an elephant!

I think that if you're interested in working on this, the first step
would be to get it to work at all, even if it's a bit inefficient.
Even though in theory you could get exponential planning time growth
if you have an aggregate inside of an aggregate inside of an aggregate
inside of an aggregate, very few people are going to do that. If it
turns out to be a problem, we can dream up some suitable defense; as I
think about it a little more, I don't think that would be terribly
difficult to work out. The hard part is probably to get this to work
in the first place.

In addition, I wonder if the "generalized nestloop" pattern can do
whole outer scan before proving inner? I mean, in my example if the
outer scan qualified more than one tuple I'm afraid that inner Agg and
its underlying SeqScan are re-scanned more than one. That will bloat
performance if the Agg is expensive. It would be great if we can outer
scan can do scan to the end and stores param variables in something
like array and prove once inner Agg, so that we can ensure the Agg is
computed once. To do that, do we need more work on executor?

That sounds to me like way too much complexity for the first
go-around. We should either push the agg down or not; it sounds like
you're proposing some kind of half-way state. That might not be
useful in practice, and it will certainly be a lot more work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: Pull up aggregate subquery

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, May 23, 2011 at 4:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah. For simple scan/join queries it seems likely that we only care
about parameterizing indexscans, since the main opportunity for a win is
to not scan all of a large table. Restricting things that way would
help reduce the number of extra Paths to carry around. But I'm not sure
whether the same argument can be made for arbitrary subqueries.

I must be misunderstanding you, because index scans are the thing we
already *do* parameterize; and what else would make any sense?

The point I was trying to make is that the ultimate reason for having a
parameterized portion-of-a-plan will be that there's a parameterized
indexscan somewhere down at the bottom. I had originally imagined that
we might parameterize any old scan; for example consider replacing

Nestloop
Join Filter: a.x = b.y
-> Seqscan on a
-> Seqscan on b

with

Nestloop
-> Seqscan on a
-> Seqscan on b
Filter: b.y = a.x

Although this isn't nearly as useful as if we could be using an index on
b.y, there would still be some marginal gain to be had, because we'd be
able to reject rows of b without first passing them up to the join node.
But I'm afraid that going all-out like that would slow the planner down
far too much (too many Paths to consider) to be justified by a marginal
runtime gain.

So the idea I have at the moment is that we'll still only parameterize
indexscans, but then allow those to be joined to unrelated relations
while still remaining parameterized. That should reduce the number
of parameterized Paths hanging around, because only joinclauses that
match indexes will give rise to such Paths.

But I think this is all fairly unrelated to the case that Hitoshi is on
about. As you said earlier, it seems like we'd have to derive both
parameterized and unparameterized plans for the subquery, which seems
mighty expensive.

regards, tom lane

#12Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#11)
Re: Pull up aggregate subquery

On Tue, May 24, 2011 at 11:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I must be misunderstanding you, because index scans are the thing we
already *do* parameterize; and what else would make any sense?

The point I was trying to make is that the ultimate reason for having a
parameterized portion-of-a-plan will be that there's a parameterized
indexscan somewhere down at the bottom.  I had originally imagined that
we might parameterize any old scan; for example consider replacing

       Nestloop
         Join Filter: a.x = b.y
         -> Seqscan on a
         -> Seqscan on b

with

       Nestloop
         -> Seqscan on a
         -> Seqscan on b
              Filter: b.y = a.x

Oh, I see. I have a general gripe with nested loop plans: we already
consider too many of them. IIRC, when I last fooled around with this,
the number of nested loop paths that we generate far exceeded the
number of merge or hash join paths, and most of those paths suck and
are a complete waste of time. It strikes me that we ought to be
trying to find ways to get rid of some of the paths we're already
considering, rather than adding any more. In this particular case, if
the second plan is actually faster, it probably won't be by much; we
could think about trying to make some kind of ex-post-facto
transformation instead of throwing everything into the path machinery.

Although this isn't nearly as useful as if we could be using an index on
b.y, there would still be some marginal gain to be had, because we'd be
able to reject rows of b without first passing them up to the join node.
But I'm afraid that going all-out like that would slow the planner down
far too much (too many Paths to consider) to be justified by a marginal
runtime gain.

Agreed.

So the idea I have at the moment is that we'll still only parameterize
indexscans, but then allow those to be joined to unrelated relations
while still remaining parameterized.  That should reduce the number
of parameterized Paths hanging around, because only joinclauses that
match indexes will give rise to such Paths.

That seems fine, yeah. If anything, we might want to limit it even
more, but certainly that's a good place to start, and see how it
shakes out.

But I think this is all fairly unrelated to the case that Hitoshi is on
about.  As you said earlier, it seems like we'd have to derive both
parameterized and unparameterized plans for the subquery, which seems
mighty expensive.

That was my first thought, too, but then I wondered if I was getting
cheap. Most of the time, the subquery will be something simple, and
replanning it twice won't really matter much. If it happens to be
something complicated, then it will take longer, but on the other hand
that's exactly the sort of byzantine query where you probably want the
planner to pull out all the stops. Aggregates tend to "feel" slow
almost invariably, because the amount of data being processed under
the hood is much larger than what actually gets emitted, so I think we
should at least consider the possibility that users really won't care
about a bit of extra work. The case I'm concerned about is where you
have several levels of nested aggregates, and the effect starts to
multiply. But even if that turns out to be a problem, we could handle
it by limiting consideration of the alternate path to the top 1 or 2
levels and handle the rest as we do now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#12)
Re: Pull up aggregate subquery

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, May 24, 2011 at 11:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The point I was trying to make is that the ultimate reason for having a
parameterized portion-of-a-plan will be that there's a parameterized
indexscan somewhere down at the bottom.

Oh, I see. I have a general gripe with nested loop plans: we already
consider too many of them. IIRC, when I last fooled around with this,
the number of nested loop paths that we generate far exceeded the
number of merge or hash join paths, and most of those paths suck and
are a complete waste of time.

Hm, really? My experience is that it's the mergejoin paths that breed
like rabbits, because there are so many potential sort orders.

But I think this is all fairly unrelated to the case that Hitoshi is on
about. As you said earlier, it seems like we'd have to derive both
parameterized and unparameterized plans for the subquery, which seems
mighty expensive.

That was my first thought, too, but then I wondered if I was getting
cheap.

Yeah, it's certainly possible that we're worrying too much. Usually
I only get concerned about added planner logic if it will impact the
planning time for simple queries. "Simple" tends to be in the eye of
the beholder, but something with a complicated aggregate subquery is
probably not simple by anyone's definition.

In this case the sticky point is that there could be multiple possible
sets of clauses available to be pushed down, depending on what you
assume is the outer relation for the eventual upper-level nestloop.
So worst case, you could have not just one parameterized plan to
generate in addition to the regular kind, but 2^N of them ...

regards, tom lane

#14Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Tom Lane (#13)
Re: Pull up aggregate subquery

2011/5/25 Tom Lane <tgl@sss.pgh.pa.us>:

Robert Haas <robertmhaas@gmail.com> writes:

That was my first thought, too, but then I wondered if I was getting
cheap.

Yeah, it's certainly possible that we're worrying too much.  Usually
I only get concerned about added planner logic if it will impact the
planning time for simple queries.  "Simple" tends to be in the eye of
the beholder, but something with a complicated aggregate subquery is
probably not simple by anyone's definition.

In this case the sticky point is that there could be multiple possible
sets of clauses available to be pushed down, depending on what you
assume is the outer relation for the eventual upper-level nestloop.
So worst case, you could have not just one parameterized plan to
generate in addition to the regular kind, but 2^N of them ...

My intention is that if join qual matches subqury Agg's grouping keys
then the Var can be pushed down, so I'm not worried about the
exponential possibilities of paths growth.

And I found the right place to hack, where set_subquery_pathlist()
pushes down some baseristrictinfo. We don't have Var in the
RestrictInfo now, but I guess we can put them in it somehow before
reaching there.

Even if I can do it, the effective case is only outer is only one
tuple case. As I noted earlier this optimization will complete by
executor's cooperation, which is something like
gather-param-values-as-array before starting Agg scan. So I'm still
thinking which of pulling up and parameterized scan is better.

Regards,

--
Hitoshi Harada

#15Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: Pull up aggregate subquery

On Tue, May 24, 2011 at 12:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oh, I see.  I have a general gripe with nested loop plans: we already
consider too many of them.  IIRC, when I last fooled around with this,
the number of nested loop paths that we generate far exceeded the
number of merge or hash join paths, and most of those paths suck and
are a complete waste of time.

Hm, really?  My experience is that it's the mergejoin paths that breed
like rabbits, because there are so many potential sort orders.

*scratches head*

Well, I'm pretty sure that's how it looked when I was testing it. I
wonder how this could be different for the two of us. Or maybe one of
us is confused. Admittedly, I haven't looked at it in a while.

But I think this is all fairly unrelated to the case that Hitoshi is on
about.  As you said earlier, it seems like we'd have to derive both
parameterized and unparameterized plans for the subquery, which seems
mighty expensive.

That was my first thought, too, but then I wondered if I was getting
cheap.

Yeah, it's certainly possible that we're worrying too much.  Usually
I only get concerned about added planner logic if it will impact the
planning time for simple queries.  "Simple" tends to be in the eye of
the beholder, but something with a complicated aggregate subquery is
probably not simple by anyone's definition.

In this case the sticky point is that there could be multiple possible
sets of clauses available to be pushed down, depending on what you
assume is the outer relation for the eventual upper-level nestloop.
So worst case, you could have not just one parameterized plan to
generate in addition to the regular kind, but 2^N of them ...

Hmm. Well, 2^N is more than 2. But I bet most of them are boring.
Judging by his followup email, Hitoshi Harada seems to think we can
just look at the case where we can parameterize on all of the grouping
columns. The only other case that seems like it might be interesting
is parameterizing on any single one of the grouping columns. I can't
get excited about pushing down arbitrary subsets. Of course, even
O(N) in the number of grouping columns might be too much, but then we
could fall back to just all or nothing. I think the "all" case by
itself would probably extract 90%+ of the benefit, especially since
"all" will often mean "the only one there is".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Hitoshi Harada (#14)
Re: Pull up aggregate subquery

2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:

So I'm still
thinking which of pulling up and parameterized scan is better.

After more investigation I came up with third idea, pushing down
RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
but it seems to me like it is slightly easier than pulling up
agg-subquery. The main point is that when you want to pull up, you
must care if the final output would be correct with other join
relations than the aggregate-related one. In contrast, when pushing
down the target relation to agg-subquery it is simple to ensure the
result.

I'm looking around pull_up_subqueries() in subquery_planner() to add
the pushing down logic. It could be possible to do it around
make_one_rel() but I bet query structure changes are doable against
Query, not PlannerInfo.

I appreciate any comments.

Regards,

--
Hitoshi Harada

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hitoshi Harada (#16)
Re: Pull up aggregate subquery

Hitoshi Harada <umi.tanuki@gmail.com> writes:

2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:

So I'm still
thinking which of pulling up and parameterized scan is better.

After more investigation I came up with third idea, pushing down
RangeTblEntry to aggregate subquery. This sounds like a crazy idea,

Yes, it sure does. Won't that typically change the number of rows
produced by the subquery's jointree? And thus possibly change the
aggregate's result?

regards, tom lane

#18Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Tom Lane (#17)
Re: Pull up aggregate subquery

2011/5/26 Tom Lane <tgl@sss.pgh.pa.us>:

Hitoshi Harada <umi.tanuki@gmail.com> writes:

2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:

So I'm still
thinking which of pulling up and parameterized scan is better.

After more investigation I came up with third idea, pushing down
RangeTblEntry to aggregate subquery. This sounds like a crazy idea,

Yes, it sure does.  Won't that typically change the number of rows
produced by the subquery's jointree?  And thus possibly change the
aggregate's result?

No, because it will be restricted to assumption that join qual ==
grouping key and outer relation is unique on the var in that join
qual. Pushing down the baserel to sub-aggregate is equivalent to
pulling up sub-aggregate eventually if there are only two relations
(which is my first example.) The difference is if the optimizer needs
to care about other relations which may change the number of rows
produced.

Regards,

--
Hitoshi Harada

#19Robert Haas
robertmhaas@gmail.com
In reply to: Hitoshi Harada (#16)
Re: Pull up aggregate subquery

On Wed, May 25, 2011 at 10:35 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

2011/5/25 Hitoshi Harada <umi.tanuki@gmail.com>:

So I'm still
thinking which of pulling up and parameterized scan is better.

After more investigation I came up with third idea, pushing down
RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
but it seems to me like it is slightly easier than pulling up
agg-subquery. The main point is that when you want to pull up, you
must care if the final output would be correct with other join
relations than the aggregate-related one. In contrast, when pushing
down the target relation to agg-subquery it is simple to ensure the
result.

I'm looking around pull_up_subqueries() in subquery_planner() to add
the pushing down logic. It could be possible to do it around
make_one_rel() but I bet query structure changes are doable against
Query, not PlannerInfo.

How do you decide whether or not to push down?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Robert Haas (#10)
Re: Pull up aggregate subquery

On Mon, May 23, 2011 at 11:08:40PM -0400, Robert Haas wrote:

I don't really like the idea of adding a GUC for this, unless we
convince ourselves that nothing else is sensible. I mean, that leads
to conversations like this:

Newbie: My query is slow.
Hacker: Turn on enable_magic_pixie_dust and it'll get better.
Newbie: Oh, yeah, that is better. Why isn't this turned on by default, anyway?
Hacker: Well, on pathological queries, it makes planning take way too
long, so we think it's not really safe to enable it by default.
Newbie: Wait... I thought you just told me to enable it. It's not safe?
Hacker: Well, sort of. I mean, uh... hey, look, an elephant!

ROTFL! This needs to go on the wiki somewhere discussing why HACKERs
rejects tunable knobs as often as happens.

Ross
--
Ross Reedstrom, Ph.D. reedstrm@rice.edu
Systems Engineer & Admin, Research Scientist phone: 713-348-6166
Connexions http://cnx.org fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE

#21Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Robert Haas (#19)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Hitoshi Harada (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#22)
#24Simon Riggs
simon@2ndQuadrant.com
In reply to: Hitoshi Harada (#7)
#25Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Simon Riggs (#24)