Functional dependency in GROUP BY through JOINs

Started by David Rowley about 13 years ago; 9 messages
#1 David Rowley
dgrowleyml@gmail.com
1 attachment(s)

Some idle thoughts have provoked me into sending this email. I'm wondering
whether what I'm seeing here is a simple missed optimisation, something that
would be very difficult to implement, or perhaps something I've completely
misunderstood.

To set the scene, let's go with the old products and sales example. In this
example each product has been given a unique ID, perhaps because
product_codes are a bit long, and to keep the sales table as narrow as
possible.
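
Roughly, the schema looks something like this (just a sketch; the actual
table definitions are in the attached groupby.sql):

CREATE TABLE products (
    productid    int PRIMARY KEY,
    product_code char(64) NOT NULL  -- assumed width; the real column is a CHAR of some length
);

CREATE TABLE bigsalestable (
    productid int NOT NULL REFERENCES products (productid),
    quantity  int NOT NULL
);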

If we wanted to see the sales per product we could write something like
this:

SELECT p.product_code,SUM(s.quantity)
FROM products p
INNER JOIN bigsalestable s ON p.productid = s.productid
GROUP BY p.product_code;

Though this plan might not be quite as optimal as it could be as it performs
the grouping after the join.

                                   QUERY PLAN
-------------------------------------------------------------------------------------
 HashAggregate  (cost=33195.13..33199.63 rows=450 width=150)
   ->  Hash Join  (cost=20.13..28195.13 rows=1000000 width=150)
         Hash Cond: (s.productid = p.productid)
         ->  Seq Scan on bigsalestable s  (cost=0.00..14425.00 rows=1000000 width=8)
         ->  Hash  (cost=14.50..14.50 rows=450 width=150)
               ->  Seq Scan on products p  (cost=0.00..14.50 rows=450 width=150)
(6 rows)

Of course the query could have been written in the first place as:

SELECT p.product_code,s.quantity
FROM products AS p
INNER JOIN (SELECT productid,SUM(quantity) AS quantity
            FROM bigsalestable GROUP BY productid) AS s
  ON p.productid = s.productid;

And that would have given us a more efficient plan.

Of course, for these actual plans to be equivalent there would naturally
have to be a unique index on product_code in the products table.
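
That is, something along these lines (the index name here is just for
illustration):

CREATE UNIQUE INDEX products_product_code_uq ON products (product_code);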

I think I'm right in thinking that if a unique index exists to match the
group by clause, and the join condition is equality (probably using the same
operator class as the unique btree index?), then the grouping could be
pushed up to before the join.

In the attached script, the hand-optimised query runs about twice as fast
with a 1-million-row sales table.

The existence of join removal makes me think we don't want to leave it up to
whoever, or whatever, is writing the query to make the best choice of plan.
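
For instance, given the unique index behind products' primary key, join
removal can already drop the join from a query like this one entirely, since
no products columns are referenced and the join cannot duplicate rows (my
example here, not one from the attached script):

SELECT s.productid, s.quantity
FROM bigsalestable s
LEFT JOIN products p ON p.productid = s.productid;

It would seem consistent for the planner to make this grouping decision too,
rather than relying on the query author.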

I'm not really skilled enough in the arts of PostgreSQL's planner to look
into how hard it would be to implement this, but I thought I'd throw it out
there to collect some comments on the idea.

Regards

David Rowley

Attachments:

groupby.sql (text/plain)
#2 Kevin Grittner
kgrittn@mail.com
In reply to: David Rowley (#1)
Re: Functional dependency in GROUP BY through JOINs

David Rowley wrote:

If we wanted to see the sales per product we could write
something like this:

SELECT p.product_code,SUM(s.quantity)
FROM products p
INNER JOIN bigsalestable s ON p.productid = s.productid
GROUP BY p.product_code;

Though this plan might not be quite as optimal as it could be as
it performs the grouping after the join.

Of course the query could have been written in the first place
as:

SELECT p.product_code,s.quantity
FROM products AS p
INNER JOIN (SELECT productid,SUM(quantity) AS quantity
            FROM bigsalestable GROUP BY productid) AS s
  ON p.productid = s.productid;

And that would have given us a more efficient plan.

Of course, for these actual plans to be equivalent there would
naturally have to be a unique index on product_code in the
products table.

I think I'm right in thinking that if a unique index exists to
match the group by clause, and the join condition is equality
(probably using the same operator class as the unique btree
index?), then the grouping could be pushed up to before the join.

Off-hand, it seems equivalent to me; I don't know how much work it
would be.

Out of curiosity, does the first query's plan change if you run
this instead?:

SELECT s.product_code,SUM(s.quantity)
FROM products p
INNER JOIN bigsalestable s ON p.productid = s.productid
GROUP BY s.product_code;

-Kevin


#3 Simon Riggs
simon@2ndQuadrant.com
In reply to: David Rowley (#1)
Re: Functional dependency in GROUP BY through JOINs

On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:

Though this plan might not be quite as optimal as it could be as it performs
the grouping after the join.

PostgreSQL always calculates aggregation as the last step.

It's a well known optimisation to push-down GROUP BY clauses to the
lowest level, but we don't do that, yet.

You're right that it can make a massive difference to many queries.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#3)
Re: Functional dependency in GROUP BY through JOINs

Simon Riggs <simon@2ndQuadrant.com> writes:

On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:

Though this plan might not be quite as optimal as it could be as it performs
the grouping after the join.

PostgreSQL always calculates aggregation as the last step.

It's a well known optimisation to push-down GROUP BY clauses to the
lowest level, but we don't do that, yet.

You're right that it can make a massive difference to many queries.

In the case being presented here, it's not apparent to me that there's
any advantage to be had at all. You still need to aggregate over the
rows joining to each uniquely-keyed row. So how exactly are you going
to "push down the GROUP BY", and where does the savings come from?

regards, tom lane


#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#4)
Re: Functional dependency in GROUP BY through JOINs

On 6 December 2012 17:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:

Though this plan might not be quite as optimal as it could be as it performs
the grouping after the join.

PostgreSQL always calculates aggregation as the last step.

It's a well known optimisation to push-down GROUP BY clauses to the
lowest level, but we don't do that, yet.

You're right that it can make a massive difference to many queries.

In the case being presented here, it's not apparent to me that there's
any advantage to be had at all. You still need to aggregate over the
rows joining to each uniquely-keyed row. So how exactly are you going
to "push down the GROUP BY", and where does the savings come from?

David presents SQL that shows how that is possible.

In terms of operators, after push-down we aggregate 1 million rows and then
join 450, which seems cheaper than joining 1 million rows and then
aggregating 1 million. So we're passing nearly 1 million fewer rows into
the join.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Kevin Grittner
kgrittn@mail.com
In reply to: Simon Riggs (#5)
Re: Functional dependency in GROUP BY through JOINs

Tom Lane wrote:

In the case being presented here, it's not apparent to me that
there's any advantage to be had at all.

The OP reported a different plan which was twice as fast, although
showing EXPLAIN ANALYZE results for both would be nice confirmation
of that.

You still need to aggregate over the rows joining to each
uniquely-keyed row.

Yes.

So how exactly are you going to "push down the GROUP BY", and
where does the savings come from?

There are 1,000,000 rows in bigsalestable that fall into 450
groups. The question is whether you look up related data in the
products table once per row or once per group. Apparently those
extra 999,550 lookups take enough time to matter.

-Kevin


#7 David Rowley
dgrowleyml@gmail.com
In reply to: Kevin Grittner (#2)
Re: Functional dependency in GROUP BY through JOINs

From: Kevin Grittner [mailto:kgrittn@mail.com]

I think I'm right in thinking that if a unique index exists to match
the group by clause, and the join condition is equality (probably
using the same operator class as the unique btree index?), then the
grouping could be pushed up to before the join.

Off-hand, it seems equivalent to me; I don't know how much work it would be.

Out of curiosity, does the first query's plan change if you run this instead?:

SELECT s.product_code,SUM(s.quantity)
FROM products p
INNER JOIN bigsalestable s ON p.productid = s.productid
GROUP BY s.product_code;

I should have made it clearer that there is no product_code column in bigsalestable; it has only productid. The lookup to the products table was to obtain the more user-friendly product_code.

The actual table def's are in the attachment of the original email.

Regards

David Rowley


#8 David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#4)
Re: Functional dependency in GROUP BY through JOINs

From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 07 December 2012 06:22
To: Simon Riggs
Cc: David Rowley; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Functional dependency in GROUP BY through JOINs

Simon Riggs <simon@2ndQuadrant.com> writes:

On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:

Though this plan might not be quite as optimal as it could be as it
performs the grouping after the join.

PostgreSQL always calculates aggregation as the last step.

It's a well known optimisation to push-down GROUP BY clauses to the
lowest level, but we don't do that, yet.

You're right that it can make a massive difference to many queries.

In the case being presented here, it's not apparent to me that there's any
advantage to be had at all. You still need to aggregate over the rows joining
to each uniquely-keyed row. So how exactly are you going to "push down
the GROUP BY", and where does the savings come from?

I guess the saving, in this case at least, comes from the join only joining
10 rows on either side, but I think also because the grouping is cheaper
when done on an INT rather than a CHAR().
But I'm thinking you mean the planner would have to know this is cheaper
and compare both versions of the plan.

I should have shown the plan of the other, nested query originally, so here
it is this time.

Version = 9.2.1

test=# EXPLAIN ANALYZE
test-# SELECT p.product_code,SUM(s.quantity)
test-# FROM products p
test-# INNER JOIN bigsalestable s ON p.productid = s.productid
test-# GROUP BY p.product_code;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=18926.22..18926.32 rows=10 width=150) (actual time=553.403..553.405 rows=10 loops=1)
   ->  Hash Join  (cost=1.23..18676.22 rows=50000 width=150) (actual time=0.041..324.970 rows=1000000 loops=1)
         Hash Cond: (s.productid = p.productid)
         ->  Seq Scan on bigsalestable s  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.012..70.627 rows=1000000 loops=1)
         ->  Hash  (cost=1.10..1.10 rows=10 width=150) (actual time=0.013..0.013 rows=10 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               ->  Seq Scan on products p  (cost=0.00..1.10 rows=10 width=150) (actual time=0.004..0.007 rows=10 loops=1)
 Total runtime: 553.483 ms
(8 rows)

test=# EXPLAIN ANALYZE
test-# SELECT p.product_code,s.quantity
test-# FROM products AS p
test-# INNER JOIN (SELECT productid,SUM(quantity) AS quantity FROM
test-# bigsalestable GROUP BY productid) AS s ON p.productid = s.productid;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=19426.22..19431.07 rows=10 width=154) (actual time=295.548..295.557 rows=10 loops=1)
   Hash Cond: (bigsalestable.productid = p.productid)
   ->  HashAggregate  (cost=19425.00..19427.00 rows=200 width=8) (actual time=295.514..295.518 rows=10 loops=1)
         ->  Seq Scan on bigsalestable  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.004..59.330 rows=1000000 loops=1)
   ->  Hash  (cost=1.10..1.10 rows=10 width=150) (actual time=0.017..0.017 rows=10 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Seq Scan on products p  (cost=0.00..1.10 rows=10 width=150) (actual time=0.010..0.012 rows=10 loops=1)
 Total runtime: 295.612 ms
(8 rows)

Regards

David Rowley


#9 David Rowley
dgrowleyml@gmail.com
In reply to: Simon Riggs (#3)
Re: Functional dependency in GROUP BY through JOINs

From: Simon Riggs [mailto:simon@2ndQuadrant.com]
Sent: 07 December 2012 05:44
To: David Rowley
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Functional dependency in GROUP BY through JOINs

On 5 December 2012 23:37, David Rowley <dgrowleyml@gmail.com> wrote:

Though this plan might not be quite as optimal as it could be as it
performs the grouping after the join.

PostgreSQL always calculates aggregation as the last step.

I didn't know that was always the case, but it makes sense I guess.
This is probably a bigger project than I imagined it would be then.

It's a well known optimisation to push-down GROUP BY clauses to the lowest
level, but we don't do that, yet.

You're right that it can make a massive difference to many queries.

I agree.

Maybe it'd be something worthwhile for the future, then. Perhaps, if others
agree, it should go on the TODO list?

Regards

David Rowley
