Aggregate Supporting Functions
I believe this is an idea that's been discussed before, but I'm not exactly
sure where that happened.
Overview:
The idea is that we skip a major chunk of processing in situations like:
SELECT avg(x), sum(x), count(x) FROM bigtable;
because avg(x) already technically knows what the values of sum(x) and
count(x) are.
The performance improvement of this particular case is as follows:
create table bigtable as select generate_series(1,1000000)::numeric as x;
vacuum bigtable;
SELECT avg(x),sum(x),count(x) FROM bigtable; -- Query 1
Time: 390.325 ms
Time: 392.297 ms
Time: 400.790 ms
SELECT avg(x) FROM bigtable; -- Query 2
Time: 219.700 ms
Time: 215.285 ms
Time: 233.691 ms
With the implementation I'm proposing below, I believe that query 1 should
perform almost as well as query 2. The only extra CPU work would be some
extra checks during planning, plus calls to two simple new final functions
which extract the count(x) and sum(x) values from avg's transition state.
Purpose of this Email:
For technical review of the proposed implementation.
Implementation:
1. Add a new boolean column to pg_aggregate named hassuppagg, which will be
set to true if the aggregate supports other aggregates. For example, avg(int)
will support count(int) and sum(int).
2. Add a new system table named pg_aggregate_support (or some better,
shorter name).
This system table will be defined as follows:
aspfnoid regproc,
aspfnsupported regproc,
aspfinalfn regproc,
primary key (aspfnoid, aspfnsupported)
In the above example, aspfnoid will be avg(int) and two rows will exist:
one with count(int) in aspfnsupported, and one with sum(int). aspfinalfn
will be a new final function which extracts the required portion of avg's
aggregate state. (A rough sketch of this catalog appears just after this
list.)
3. Add logic in the planner to look for supporting cases, with logic
something along the lines of:
a. Does the query have any aggregates? If not -> return;
b. Does the query have more than 1 aggregate? If not -> return;
c. Does at least one of the aggregates have hassuppagg set to true?
If not -> return;
d. Analyze the aggregates to eliminate those that are covered by another
aggregate. We should use the aggregate which eliminates the most other
aggregates*
* For example, stddev(x) supports avg(x), sum(x) and count(x), so a query
such as select stddev(x), avg(x), sum(x), count(x) will eliminate avg(x),
sum(x) and count(x): stddev(x) supports 3 other aggregates while avg(x)
supports only 2, so avg(x) is itself eliminated.
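To make the shape of this more concrete, here is a rough, purely
illustrative sketch of the proposed catalog and the avg(int) rows (the
int8_avg_countfn / int8_avg_sumfn final function names are hypothetical,
borrowed from the CREATE AGGREGATE example further down):

CREATE TABLE pg_aggregate_support (
    aspfnoid       regproc,  -- the supporting aggregate, e.g. avg(int)
    aspfnsupported regproc,  -- an aggregate it can answer, e.g. count(int)
    aspfinalfn     regproc,  -- final function extracting that result from
                             -- the supporting aggregate's transition state
    primary key (aspfnoid, aspfnsupported)
);

-- Conceptually, for avg(int) two rows would exist:
--   (avg(int), count(int), int8_avg_countfn)
--   (avg(int), sum(int),   int8_avg_sumfn)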
Concerns:
I'm a little bit concerned that someone will one day report that:
SELECT avg(x), sum(x), count(x) from bigtable;
Is faster than:
SELECT sum(x), count(x) from bigtable;
Of course, this will be just because we've made case 1 faster, NOT because
we've slowed down case 2.
I can't immediately think of a way to fix that without risking slowing
down: select count(x) from bigtable;
CREATE AGGREGATE Syntax:
To allow users to implement aggregates which take advantage of this we'd
better also expand the CREATE AGGREGATE syntax.
I've not given this a huge amount of thought. The only thing I've come up
with so far is;
CREATE AGGREGATE avg(bigint)
(FINALFUNC = avgfinal)
SUPPORTS (count(bigint) = int8_avg_countfn, sum(bigint) = int8_avg_sumfn);
Can anyone think of anything that I've not accounted for before I go off
and work on this?
Regards
David Rowley
"The research leading to these results has received funding from the
European Union’s
Seventh Framework Programme (FP7/2007-2015) under grant agreement n° 318633"
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> wrote:
The idea is that we skip a major chunk of processing in
situations like:
SELECT avg(x),sum(x),count(x) FROM bigtable;
Because avg(x) already technically knows what the values of
sum(x) and count(x) are.
That has occurred to me as a possible optimization, but I hadn't
gone so far as to try to determine what the possible performance
improvement would be.
The performance improvement of this particular case is as
follows:
create table bigtable as select generate_series(1,1000000)::numeric as x;
vacuum bigtable;
SELECT avg(x),sum(x),count(x) FROM bigtable; -- Query 1
Time: 390.325 ms
Time: 392.297 ms
Time: 400.790 ms
SELECT avg(x) FROM bigtable; -- Query 2
Time: 219.700 ms
Time: 215.285 ms
Time: 233.691 ms
With the implementation I'm proposing below, I believe that
query 1 should perform almost as well as query 2. The only extra
CPU work that would be done would be some extra checks during
planning, and the calling of 2 simple new final functions which
will extract the count(x) and sum(x) from the avg transition's
state.
I agree that if the increase in planning time is negligible, the
performance to be gained in this example is a reduction of about
45% from the original run time. That certainly seems to make it
worth considering.
Implementation:
1. Add a new boolean column pg_aggregate named hassuppagg which
will be set to true if the aggregate supports other aggregates.
For example avg(int) will support count(int) and sum(int)
2. Add new system table named pg_aggregate_support (Or some
better shorter name)
This system table will be defined as follows:
aspfnoid regproc,
aspfnsupported regproc,
aspfinalfn regproc,
primary key (aspfnoid, aspfnsupported)
Where in the above example aspfnoid will be avg(int) and 2 rows
will exist. 1 with count(int) in aspfnsupported, and one with
sum(int) in the aspfnsupported column. aspfinalfn will be a new
final function which extracts the required portion of the avg's
aggregate state.
The caffeine hasn't had a chance to fully kick in yet, but that
seems like enough information to optimize out the count() and sum()
aggregations.
3. Add logic in the planner to look for supporting
cases. With logic something along the lines of:
a. Does the query have any aggregates? If not -> return;
b. Does the query have more than 1 aggregate? If not -> return;
c. Does at least one of the aggregates have hassuppagg set
to true? If not -> return;
d. Analyze aggregates to eliminate aggregates that are covered
by another aggregate. We should use the aggregate which
eliminates the most other aggregates*
* For example stddev(x) will support avg(x), sum(x) and count(x)
so a query such as select stddev(x), avg(x), sum(x), count(x)
will eliminate avg(x), sum(x), count(x) as stddev(x) supports 3,
avg(x) only supports 2, so will have to be eliminated.
Are you assuming that "x" must match exactly among the aggregates
to allow coverage? Have you given any thought to whether ties are
possible and how they should be resolved?
I'm a little bit concerned that someone will one day report that:
SELECT avg(x), sum(x), count(x) from bigtable;
Is faster than:
SELECT sum(x), count(x) from bigtable;
Of course, this will be just because we've made case 1 faster,
NOT because we've slowed down case 2.
Yeah, that seems possible (if not with these functions, at least
technically possible with *some* functions), and hard to explain to
a novice.
I can't immediately think of a way to fix that without risking
slowing down: select count(x) from bigtable;
From the information you have proposed storing, with cost factors
associated with the functions, it seems technically possible to
infer that you could run (for example) the avg() aggregate to
accumulate both but only run the final functions of the aggregates
referenced by the query. That seems like an optimization to try
hard to forget about until you have at least one real-world use
case where it would yield a significant benefit. It seems
premature to optimize for that before having the rest working.
To allow users to implement aggregates which take advantage of
this we'd better also expand the CREATE AGGREGATE syntax.
I've not given this a huge amount of thought. The only thing
I've come up with so far is:
CREATE AGGREGATE avg(bigint)
(FINALFUNC = avgfinal)
SUPPORTS (count(bigint) = int8_avg_countfn,
sum(bigint) = int8_avg_sumfn);
I would try to find an existing keyword which could be used instead
of adding SUPPORTS to the list. On a quick look over the keyword
list EXTRACT jumped out at me as a candidate. There should be
something usable in there somewhere.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> writes:
David Rowley <david.rowley@2ndquadrant.com> wrote:
[ avoid duplicate calculations for related aggregates ]
From the information you have proposed storing, with cost factors
associated with the functions, it seems technically possible to
infer that you could run (for example) the avg() aggregate to
accumulate both but only run the final functions of the aggregates
referenced by the query. That seems like an optimization to try
hard to forget about until you have at least one real-world use
case where it would yield a significant benefit. It seems
premature to optimize for that before having the rest working.
Actually, I would suggest that you forget about all the other aspects
and *just* do that, because it could be made to work today on existing
aggregate functions, and it would not require hundreds-to-thousands
of lines of boilerplate support code in the grammar, catalog support,
pg_dump, yadda yadda. That is, look to see which aggregates use the
same transition function and run that just once. We already have the
rule that the final function can't destroy the transition output,
so running two different final functions on the same transition result
should Just Work.
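For example, a quick catalog query along these lines (illustrative only; a
real check would also have to compare input expressions, transition types
and initial values) shows which existing aggregates already share a
transition function:

-- List aggregates sharing a transition function, i.e. the cases where
-- the transition work could potentially be done just once.
SELECT aggtransfn,
       array_agg(aggfnoid::oid::regprocedure) AS aggregates
FROM pg_aggregate
GROUP BY aggtransfn
HAVING count(*) > 1;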
The rest of what David is thinking about could be done in a followon
version by allowing the same aggregate to be implemented by any of several
transition-function/final-function pairs, then teaching the planner to
prefer pairs that let the same transition function be used for several
aggregates. But I'd see that as a later refinement that might well fail
the bang-for-buck test, and hence shouldn't be the first step.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kevin Grittner <kgrittn@ymail.com> writes:
David Rowley <david.rowley@2ndquadrant.com> wrote:
[ avoid duplicate calculations for related aggregates ]
From the information you have proposed storing, with cost factors
associated with the functions, it seems technically possible to
infer that you could run (for example) the avg() aggregate to
accumulate both but only run the final functions of the aggregates
referenced by the query. That seems like an optimization to try
hard to forget about until you have at least one real-world use
case where it would yield a significant benefit. It seems
premature to optimize for that before having the rest working.
Actually, I would suggest that you forget about all the other aspects
and *just* do that, because it could be made to work today on existing
aggregate functions, and it would not require hundreds-to-thousands
of lines of boilerplate support code in the grammar, catalog support,
pg_dump, yadda yadda. That is, look to see which aggregates use the
same transition function and run that just once.
I was responding to David's suggestion that this particular query
be optimized by using the transition function from avg():
SELECT sum(x), count(x) from bigtable;
Reviewing what you said caused me to notice something that I had
missed before -- that sum() and avg() share a transition function.
Still, that function is not used for count(), so I don't see how
that fits into what you're saying above.
I agree that what you're suggesting does allow access to some very
low-hanging fruit that I had not noticed; it makes sense to get
that first.
The rest of what David is thinking about could be done in a followon
version by allowing the same aggregate to be implemented by any of several
transition-function/final-function pairs, then teaching the planner to
prefer pairs that let the same transition function be used for several
aggregates. But I'd see that as a later refinement that might well fail
the bang-for-buck test, and hence shouldn't be the first step.
Well, that part will be a little tricky, because it would be
infrastructure which would allow what are likely significant
optimizations in several other features. There's a bit of a
chicken-and-egg problem, because these other optimizations can't be
written without the infrastructure, and the infrastructure will not
show its full worth until the other optimizations are able to take
advantage of it. But we can cross that bridge when we come to it.
This doesn't look to me like the traditional 80/20 rule. I think
the easy stuff might be 5% of the benefit for 1% of the work; but
still a better bang for the buck than the other work.
What you are proposing as an alternative to what David proposed for
the later work looks (on the face of it) like a bigger, more
complicated mechanism than he proposed, but more flexible if it can
be made to work. What I'd hate to see is failure to get a toaster
because it's too expensive to get one that also bakes pizza. We're
gonna want to make a lot of toast over the next few years.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10 June 2015 at 01:53, Kevin Grittner <kgrittn@ymail.com> wrote:
David Rowley <david.rowley@2ndquadrant.com> wrote:
3. Add logic in the planner to look for supporting
cases. With logic something along the lines of:
a. Does the query have any aggregates? If not -> return;
b. Does the query have more than 1 aggregate? If not -> return;
c. Does at least one of the aggregates have hassuppagg set
to true? If not -> return;
d. Analyze aggregates to eliminate aggregates that are covered
by another aggregate. We should use the aggregate which
eliminates the most other aggregates*
* For example stddev(x) will support avg(x), sum(x) and count(x)
so a query such as select stddev(x), avg(x), sum(x), count(x)
will eliminate avg(x), sum(x), count(x) as stddev(x) supports 3,
avg(x) only supports 2, so will have to be eliminated.
Are you assuming that "x" must match exactly among the aggregates
to allow coverage?
In my haste I neglected to mention that critical part :) Yeah, the
expression within the aggregate function must match exactly.
This would mean that count(*) would not be optimised but count(x) would. I
believe that's an OK restriction for a first-cut implementation, as that
rewrite requires some more NULLability analysis.
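A quick example of the NULL issue (illustrative only):

-- count(*) counts every row, while count(x) and avg(x) skip NULLs, so
-- count(*) cannot simply be taken from avg(x)'s state when x is nullable.
SELECT count(*), count(x), avg(x)
FROM (VALUES (1), (NULL), (3)) AS t(x);
-- count(*) = 3, count(x) = 2, avg(x) = 2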
Have you given any thought to whether ties are
possible and how they should be resolved?
I guess ties are possible, although I can't immediately think of which of
the standard aggregates could cause that. I've not thought about it a great
deal, to be honest. I'm not too sure whether pg_proc.procost or
pg_aggregate.aggtransspace would be the better thing to check. I see procost
is not very well set for quite a lot of functions, e.g. float4pl and
numeric_sum currently both have a cost of 1, which certainly does not
reflect their real relative costs. So perhaps it would be easier to just use
aggtransspace, and if there's still a tie then just prefer the one that
comes first in the target list. Likely that would be good as it keeps things
predictable and allows users to have some influence on the decision.
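(For reference, the procost values mentioned above can be checked with
something like the following; both functions currently report a cost of 1
despite very different real per-call costs.)

SELECT proname, procost
FROM pg_proc
WHERE proname IN ('float4pl', 'numeric_sum');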
I'm a little bit concerned that someone will one day report that:
SELECT avg(x), sum(x), count(x) from bigtable;
Is faster than:
SELECT sum(x), count(x) from bigtable;
Of course, this will be just because we've made case 1 faster,
NOT because we've slowed down case 2.
Yeah, that seems possible (if not with these functions, at least
technically possible with *some* functions), and hard to explain to
a novice.
I can't immediately think of a way to fix that without risking
slowing down: select count(x) from bigtable;
From the information you have proposed storing, with cost factors
associated with the functions, it seems technically possible to
infer that you could run (for example) the avg() aggregate to
accumulate both but only run the final functions of the aggregates
referenced by the query. That seems like an optimization to try
hard to forget about until you have at least one real-world use
case where it would yield a significant benefit. It seems
premature to optimize for that before having the rest working.
I see your point, and in my example I think your idea works well, but
there's a risk of slowing things down here too for other cases.
With the "supporting aggregates" we're just listing the aggregates we
happen to calculate as a byproduct. Using their value is about as cheap as
calling the final function, but with this, if we decided to use avg(x)
instead of sum(x) and count(x) we really have no way to understand what
else avg(x) might be doing. For example, if we pretend avg() does not
exist, and we have stddev(x), sum(x) and count(x) as aggregates, given the
query "select sum(x), count(x) from t"... stddev(x) gives us count(x) and
sum(x) as a byproduct, but stddev also calculates sum(x*x), which quite
likely ends up slower than just doing sum(x) and count(x) like normal.
Perhaps the code that looks for these patterns could choose the aggregate
that supports the smallest number of other aggregates while still covering
the requirements, but still...
We could perhaps add some smarts that analyse the costs of the transition
function calls to ensure that sort of thing doesn't actually happen.
Maybe that's phase 3, though. I don't think I want to touch it at this
stage, but it's certainly something to remember for the future!
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10 June 2015 at 02:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kevin Grittner <kgrittn@ymail.com> writes:
David Rowley <david.rowley@2ndquadrant.com> wrote:
[ avoid duplicate calculations for related aggregates ]
From the information you have proposed storing, with cost factors
associated with the functions, it seems technically possible to
infer that you could run (for example) the avg() aggregate to
accumulate both but only run the final functions of the aggregates
referenced by the query. That seems like an optimization to try
hard to forget about until you have at least one real-world use
case where it would yield a significant benefit. It seems
premature to optimize for that before having the rest working.
Actually, I would suggest that you forget about all the other aspects
and *just* do that, because it could be made to work today on existing
aggregate functions, and it would not require hundreds-to-thousands
of lines of boilerplate support code in the grammar, catalog support,
pg_dump, yadda yadda. That is, look to see which aggregates use the
same transition function and run that just once. We already have the
rule that the final function can't destroy the transition output,
so running two different final functions on the same transition result
should Just Work.
Good idea.
I believe the only extra check, besides whether they use the same transfn,
would be the initvalue of the aggregate.
I'll write a patch and post in the next couple of days.
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services