GROUP BY before JOIN

Started by David Rowleyover 10 years ago4 messageshackers
Jump to latest
#1David Rowley
dgrowleyml@gmail.com

== Overview ==

As of today we always perform GROUP BY at the final stage, after each
relation in the query has been joined. This of course works, but it's not
always the most optimal way of executing the query.

Consider the following two relations:

create table product (product_id int primary key, code varchar(32) not
null);
create table sale (sale_id int primary key, product_id int not null, qty
int not null);
create index on sale (product_id);

Populated with:

insert into product select x.x,'ABC' || x.x from generate_series(1,100)
x(x);
insert into sale select x.x,x.x%100+1, 10 from generate_series(1,1000000)
x(x);

Here we have a table with 100 products and another with 1 million sales
records.

If we wanted to see the total sales for each product we may write a query
such as:

select s.product_id,sum(qty) as qty from sale s inner join product p on
s.product_id = p.product_id group by s.product_id;

If we look at the explain for this query we can see that the grouping
happened after the join took place:

QUERY PLAN
--------------------------------------------------
HashAggregate
Group Key: s.product_id
-> Hash Join
Hash Cond: (s.product_id = p.product_id)
-> Seq Scan on sale s
-> Hash
-> Seq Scan on product p

The join here would product exactly 1 million rows, then hashaggregate will
group 1 million rows.

If it were possible for PostgreSQL to perform the GROUP BY before the join
we could change that to Grouping 1 million rows, then joining 100 rows.

Of course we could have written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale
group by product_id) s inner join product p on s.product_id = p.product_id;

Which would do exactly that, and in this particular case it is much faster,
but it's not all rainbows and unicorns, as if the join happened to filter
out many of the groups, then it might not be a win at all.

Consider:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id where p.code = 'ABC1' group by s.product_id;

Since the product.code has no equivalence member in the sale table the
predicate cannot be applied to the sale table, so in this case if we'd
written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale
group by product_id) s inner join product p on s.product_id = p.product_id
where p.code = 'ABC1';

We'd have thrown away 99% of the groups created in the subquery.

== Proposal ==

I've been working on allowing the planner to properly cost which option is
better and choose to perform the GROUP BY at estimated most efficient join
level.

So far I've not gotten as far as supporting queries which contain aggregate
functions, as I need the ability to combine aggregate states (Simon and I
proposed a patch for that here https://commitfest.postgresql.org/5/131/).

Performance of the above test case is looking quite good so far:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id group by s.product_id;

-- Master = 312.294 ms
-- Patched = 95.328 ms

So a 327% performance increase in this particular case.

== Implementation ==

I've had to make some choices about how exactly to implement this feature.
As I explain above, we can't just always do the grouping at the first
possible opportunity due to the possibility of joins eliminating groups and
needless groups being created.

To support this in the planner I've invented a GroupingPath type and I've
also added 'cheapest_sorted_group_path' and 'cheapest_group_path' to
RelOptInfo. This part I wasn't too sure about and I wondered if these
GroupingPaths should just be added to the normal list with add_path(), but
since these paths perform more work than normal paths that would give them
an unfair disadvantage and they could be thrown out. These paths are only
possibly cheaper after subsequent JOIN has taken place.

There's also an issue with row estimates being wrong on joinrels, and the
row estimate is using the base rel's row count rather than the number of
groups. As yet I'm unsure the best way to fix this.

I wanted to post this early so as to register the fact that I'm working on
this, but I'm also posting in hope to get some early feedback on what I
have so far.

Of course there's lots more work to do here, aggregates need to be
supported, functional dependencies and transitive closures will also need
to be detected in a more complete implementation.

Here's what I have so far (attached)

Comments are most welcome. Thanks.

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

groupbeforejoin_v0.1.patchapplication/octet-stream; name=groupbeforejoin_v0.1.patchDownload+480-3
#2Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: David Rowley (#1)
Re: GROUP BY before JOIN

This looks like one example of general problem of finding optimal order for
SQL operations. Consider a query defined as sql_op1(sql_op2(sql_op3(A, B),
sql_op4(C, D), sql_op5(E, F)))) where sql_op can be SQL operation like
join, grouping, aggregation, projection, union, intersection, limit,
ordering etc. and A, B, C, D are tables. Given such a representation, how
do we find an optimal order of operation?

In you example the query which is defined like GROUPBY(JOIN(A, B)) might
have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY
is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an
optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint
groups i.e. UNION is distributive over GROUPBY.

Right now, we use dynamic programming to find the optimal order for join
tree (make_one_rel). In the language above, we find the optimal order for
evaluating an expression consisting of JOIN operators by exploiting their
commutative and associative properties. If we could extend that to SQL
expression tree representing the query, we will be able to solve many of
such re-ordering problems using current path infrastructure. The dynamic
program for this would look something like below

root = top operator of the query
child = the operands of the root

cost of query = minimum of
1. cost of query without any mutation
2. cost of root and its child operator commuted, if there is only one child
for root operator and root operator and child operator are commutable
3. cost of root distributed over children operands if there are multiple
operands and root operator is distributive over child operators
4. ...
5. ...

Each operator root will have a RelOptInfo to which we add many paths
possible for performing that operation using different orders of
evaluation. Current add_path should eliminate the expensive ones.

This is going to explode the number of path (plan) trees considered by
planner, so we need to be judicious when to apply this technique or have a
rigorous pruning technique to discard costly paths early enough. This
approach might be better than trying to solve each re-ordering problem
separately and might go well with upper planner pathification being
discussed at
/messages/by-id/CA+TgmoaqgzhUX7frzddpGhkqT3rcNx-f0dF+ZqzsEfW-ARXAww@mail.gmail.com
.

On Tue, Aug 4, 2015 at 1:57 PM, David Rowley <david.rowley@2ndquadrant.com>
wrote:

== Overview ==

As of today we always perform GROUP BY at the final stage, after each
relation in the query has been joined. This of course works, but it's not
always the most optimal way of executing the query.

Consider the following two relations:

create table product (product_id int primary key, code varchar(32) not
null);
create table sale (sale_id int primary key, product_id int not null, qty
int not null);
create index on sale (product_id);

Populated with:

insert into product select x.x,'ABC' || x.x from generate_series(1,100)
x(x);
insert into sale select x.x,x.x%100+1, 10 from generate_series(1,1000000)
x(x);

Here we have a table with 100 products and another with 1 million sales
records.

If we wanted to see the total sales for each product we may write a query
such as:

select s.product_id,sum(qty) as qty from sale s inner join product p on
s.product_id = p.product_id group by s.product_id;

If we look at the explain for this query we can see that the grouping
happened after the join took place:

QUERY PLAN
--------------------------------------------------
HashAggregate
Group Key: s.product_id
-> Hash Join
Hash Cond: (s.product_id = p.product_id)
-> Seq Scan on sale s
-> Hash
-> Seq Scan on product p

The join here would product exactly 1 million rows, then hashaggregate
will group 1 million rows.

If it were possible for PostgreSQL to perform the GROUP BY before the join
we could change that to Grouping 1 million rows, then joining 100 rows.

Of course we could have written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale
group by product_id) s inner join product p on s.product_id = p.product_id;

Which would do exactly that, and in this particular case it is much
faster, but it's not all rainbows and unicorns, as if the join happened to
filter out many of the groups, then it might not be a win at all.

Consider:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id where p.code = 'ABC1' group by s.product_id;

Since the product.code has no equivalence member in the sale table the
predicate cannot be applied to the sale table, so in this case if we'd
written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale
group by product_id) s inner join product p on s.product_id = p.product_id
where p.code = 'ABC1';

We'd have thrown away 99% of the groups created in the subquery.

== Proposal ==

I've been working on allowing the planner to properly cost which option is
better and choose to perform the GROUP BY at estimated most efficient join
level.

So far I've not gotten as far as supporting queries which contain
aggregate functions, as I need the ability to combine aggregate states
(Simon and I proposed a patch for that here
https://commitfest.postgresql.org/5/131/).

Performance of the above test case is looking quite good so far:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id group by s.product_id;

-- Master = 312.294 ms
-- Patched = 95.328 ms

So a 327% performance increase in this particular case.

== Implementation ==

I've had to make some choices about how exactly to implement this feature.
As I explain above, we can't just always do the grouping at the first
possible opportunity due to the possibility of joins eliminating groups and
needless groups being created.

To support this in the planner I've invented a GroupingPath type and I've
also added 'cheapest_sorted_group_path' and 'cheapest_group_path' to
RelOptInfo. This part I wasn't too sure about and I wondered if these
GroupingPaths should just be added to the normal list with add_path(), but
since these paths perform more work than normal paths that would give them
an unfair disadvantage and they could be thrown out. These paths are only
possibly cheaper after subsequent JOIN has taken place.

There's also an issue with row estimates being wrong on joinrels, and the
row estimate is using the base rel's row count rather than the number of
groups. As yet I'm unsure the best way to fix this.

I wanted to post this early so as to register the fact that I'm working on
this, but I'm also posting in hope to get some early feedback on what I
have so far.

Of course there's lots more work to do here, aggregates need to be
supported, functional dependencies and transitive closures will also need
to be detected in a more complete implementation.

Here's what I have so far (attached)

Comments are most welcome. Thanks.

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#3David Rowley
dgrowleyml@gmail.com
In reply to: Ashutosh Bapat (#2)
Re: GROUP BY before JOIN

On 4 August 2015 at 21:56, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>
wrote:

This looks like one example of general problem of finding optimal order
for SQL operations. Consider a query defined as sql_op1(sql_op2(sql_op3(A,
B), sql_op4(C, D), sql_op5(E, F)))) where sql_op can be SQL operation like
join, grouping, aggregation, projection, union, intersection, limit,
ordering etc. and A, B, C, D are tables. Given such a representation, how
do we find an optimal order of operation?

In you example the query which is defined like GROUPBY(JOIN(A, B)) might
have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY
is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an
optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint
groups i.e. UNION is distributive over GROUPBY.

I'm not really sure that I understand why it's GROUPBY(A) and GROUPBY(B).
If the group by contained columns from relations A and B, then it wouldn't
be possible to group each one individually. We would have to wait until the
join was performed in that case.

Right now, we use dynamic programming to find the optimal order for join

tree (make_one_rel). In the language above, we find the optimal order for
evaluating an expression consisting of JOIN operators by exploiting their
commutative and associative properties. If we could extend that to SQL
expression tree representing the query, we will be able to solve many of
such re-ordering problems using current path infrastructure. The dynamic
program for this would look something like below

root = top operator of the query
child = the operands of the root

cost of query = minimum of
1. cost of query without any mutation
2. cost of root and its child operator commuted, if there is only one
child for root operator and root operator and child operator are commutable
3. cost of root distributed over children operands if there are multiple
operands and root operator is distributive over child operators
4. ...
5. ...

Each operator root will have a RelOptInfo to which we add many paths
possible for performing that operation using different orders of
evaluation. Current add_path should eliminate the expensive ones.

So in a query such as:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id group by s.product_id;

Do you mean that there would be 2 RelOptInfos for the sale relation, one
without the GROUP BY, and one with? That would likely solve the issue I
have with join row estimates, but which of the 2 RelOptInfo would all of
the Vars in the parse tree be pointing to?

This is going to explode the number of path (plan) trees considered by
planner, so we need to be judicious when to apply this technique or have a
rigorous pruning technique to discard costly paths early enough. This
approach might be better than trying to solve each re-ordering problem
separately and might go well with upper planner pathification being
discussed at
/messages/by-id/CA+TgmoaqgzhUX7frzddpGhkqT3rcNx-f0dF+ZqzsEfW-ARXAww@mail.gmail.com
.

It's certainly going to increase that, but I'm not sure if 'explode' is the
best word as we at least require all of the rels seen in the GROUP BY
expressions and aggregate function parameters to be joined before we can
consider a GroupingPath. The patch
uses bms_is_subset(root->minimumGroupingRels, rel->relids) to determine
that.

On Tue, Aug 4, 2015 at 1:57 PM, David Rowley <david.rowley@2ndquadrant.com

wrote:

== Overview ==

As of today we always perform GROUP BY at the final stage, after each
relation in the query has been joined. This of course works, but it's not
always the most optimal way of executing the query.

Consider the following two relations:

create table product (product_id int primary key, code varchar(32) not
null);
create table sale (sale_id int primary key, product_id int not null, qty
int not null);
create index on sale (product_id);

Populated with:

insert into product select x.x,'ABC' || x.x from generate_series(1,100)
x(x);
insert into sale select x.x,x.x%100+1, 10 from generate_series(1,1000000)
x(x);

Here we have a table with 100 products and another with 1 million sales
records.

If we wanted to see the total sales for each product we may write a query
such as:

select s.product_id,sum(qty) as qty from sale s inner join product p on
s.product_id = p.product_id group by s.product_id;

If we look at the explain for this query we can see that the grouping
happened after the join took place:

QUERY PLAN
--------------------------------------------------
HashAggregate
Group Key: s.product_id
-> Hash Join
Hash Cond: (s.product_id = p.product_id)
-> Seq Scan on sale s
-> Hash
-> Seq Scan on product p

The join here would product exactly 1 million rows, then hashaggregate
will group 1 million rows.

If it were possible for PostgreSQL to perform the GROUP BY before the
join we could change that to Grouping 1 million rows, then joining 100 rows.

Of course we could have written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale
group by product_id) s inner join product p on s.product_id = p.product_id;

Which would do exactly that, and in this particular case it is much
faster, but it's not all rainbows and unicorns, as if the join happened to
filter out many of the groups, then it might not be a win at all.

Consider:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id where p.code = 'ABC1' group by s.product_id;

Since the product.code has no equivalence member in the sale table the
predicate cannot be applied to the sale table, so in this case if we'd
written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale
group by product_id) s inner join product p on s.product_id = p.product_id
where p.code = 'ABC1';

We'd have thrown away 99% of the groups created in the subquery.

== Proposal ==

I've been working on allowing the planner to properly cost which option
is better and choose to perform the GROUP BY at estimated most efficient
join level.

So far I've not gotten as far as supporting queries which contain
aggregate functions, as I need the ability to combine aggregate states
(Simon and I proposed a patch for that here
https://commitfest.postgresql.org/5/131/).

Performance of the above test case is looking quite good so far:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id group by s.product_id;

-- Master = 312.294 ms
-- Patched = 95.328 ms

So a 327% performance increase in this particular case.

== Implementation ==

I've had to make some choices about how exactly to implement this
feature. As I explain above, we can't just always do the grouping at the
first possible opportunity due to the possibility of joins eliminating
groups and needless groups being created.

To support this in the planner I've invented a GroupingPath type and I've
also added 'cheapest_sorted_group_path' and 'cheapest_group_path' to
RelOptInfo. This part I wasn't too sure about and I wondered if these
GroupingPaths should just be added to the normal list with add_path(), but
since these paths perform more work than normal paths that would give them
an unfair disadvantage and they could be thrown out. These paths are only
possibly cheaper after subsequent JOIN has taken place.

There's also an issue with row estimates being wrong on joinrels, and the
row estimate is using the base rel's row count rather than the number of
groups. As yet I'm unsure the best way to fix this.

I wanted to post this early so as to register the fact that I'm working
on this, but I'm also posting in hope to get some early feedback on what I
have so far.

Of course there's lots more work to do here, aggregates need to be
supported, functional dependencies and transitive closures will also need
to be detected in a more complete implementation.

--
David Rowley http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Training & Services

#4Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: David Rowley (#3)
Re: GROUP BY before JOIN

On Tue, Aug 4, 2015 at 4:00 PM, David Rowley <david.rowley@2ndquadrant.com>
wrote:

On 4 August 2015 at 21:56, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com

wrote:

This looks like one example of general problem of finding optimal order
for SQL operations. Consider a query defined as sql_op1(sql_op2(sql_op3(A,
B), sql_op4(C, D), sql_op5(E, F)))) where sql_op can be SQL operation like
join, grouping, aggregation, projection, union, intersection, limit,
ordering etc. and A, B, C, D are tables. Given such a representation, how
do we find an optimal order of operation?

In you example the query which is defined like GROUPBY(JOIN(A, B)) might
have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY
is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an
optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint
groups i.e. UNION is distributive over GROUPBY.

I'm not really sure that I understand why it's GROUPBY(A) and GROUPBY(B).
If the group by contained columns from relations A and B, then it wouldn't
be possible to group each one individually. We would have to wait until the
join was performed in that case.

Sorry for not being clear enough. GROUPBY(A) implies GROUPBY on relation A,
using some columns (unspecified but common to both A and B here). GROUPBY
is used as an operator over relations, in classic sense of operator. This
representation does not have any columns, join conditions specified for
simplicity. It specifies SQL operators on relations as operands.

Right now, we use dynamic programming to find the optimal order for join

tree (make_one_rel). In the language above, we find the optimal order for
evaluating an expression consisting of JOIN operators by exploiting their
commutative and associative properties. If we could extend that to SQL
expression tree representing the query, we will be able to solve many of
such re-ordering problems using current path infrastructure. The dynamic
program for this would look something like below

root = top operator of the query
child = the operands of the root

cost of query = minimum of
1. cost of query without any mutation
2. cost of root and its child operator commuted, if there is only one
child for root operator and root operator and child operator are commutable
3. cost of root distributed over children operands if there are multiple
operands and root operator is distributive over child operators
4. ...
5. ...

Each operator root will have a RelOptInfo to which we add many paths
possible for performing that operation using different orders of
evaluation. Current add_path should eliminate the expensive ones.

So in a query such as:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id group by s.product_id;

Do you mean that there would be 2 RelOptInfos for the sale relation, one
without the GROUP BY, and one with? That would likely solve the issue I
have with join row estimates, but which of the 2 RelOptInfo would all of
the Vars in the parse tree be pointing to?

This query would be represented as GROUPBY(JOIN(sales, product)), grouping
happens on s.product_id and JOIN on s.product_id = p.product_id. The
dynamic programming would consider GROUPBY(JOIN(sales, product)) and
JOIN(GROUPBY(sales), GROUPBY(product)) as possible evaluation orders, since
GROUPBY is distributive over JOIN in this case and choose the order with
the least cost.

This is going to explode the number of path (plan) trees considered by
planner, so we need to be judicious when to apply this technique or have a
rigorous pruning technique to discard costly paths early enough. This
approach might be better than trying to solve each re-ordering problem
separately and might go well with upper planner pathification being
discussed at
/messages/by-id/CA+TgmoaqgzhUX7frzddpGhkqT3rcNx-f0dF+ZqzsEfW-ARXAww@mail.gmail.com
.

It's certainly going to increase that, but I'm not sure if 'explode' is
the best word as we at least require all of the rels seen in the GROUP BY
expressions and aggregate function parameters to be joined before we can
consider a GroupingPath. The patch
uses bms_is_subset(root->minimumGroupingRels, rel->relids) to determine
that.

On Tue, Aug 4, 2015 at 1:57 PM, David Rowley <
david.rowley@2ndquadrant.com> wrote:

== Overview ==

As of today we always perform GROUP BY at the final stage, after each
relation in the query has been joined. This of course works, but it's not
always the most optimal way of executing the query.

Consider the following two relations:

create table product (product_id int primary key, code varchar(32) not
null);
create table sale (sale_id int primary key, product_id int not null, qty
int not null);
create index on sale (product_id);

Populated with:

insert into product select x.x,'ABC' || x.x from generate_series(1,100)
x(x);
insert into sale select x.x,x.x%100+1, 10 from
generate_series(1,1000000) x(x);

Here we have a table with 100 products and another with 1 million sales
records.

If we wanted to see the total sales for each product we may write a
query such as:

select s.product_id,sum(qty) as qty from sale s inner join product p on
s.product_id = p.product_id group by s.product_id;

If we look at the explain for this query we can see that the grouping
happened after the join took place:

QUERY PLAN
--------------------------------------------------
HashAggregate
Group Key: s.product_id
-> Hash Join
Hash Cond: (s.product_id = p.product_id)
-> Seq Scan on sale s
-> Hash
-> Seq Scan on product p

The join here would product exactly 1 million rows, then hashaggregate
will group 1 million rows.

If it were possible for PostgreSQL to perform the GROUP BY before the
join we could change that to Grouping 1 million rows, then joining 100 rows.

Of course we could have written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from
sale group by product_id) s inner join product p on s.product_id =
p.product_id;

Which would do exactly that, and in this particular case it is much
faster, but it's not all rainbows and unicorns, as if the join happened to
filter out many of the groups, then it might not be a win at all.

Consider:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id where p.code = 'ABC1' group by s.product_id;

Since the product.code has no equivalence member in the sale table the
predicate cannot be applied to the sale table, so in this case if we'd
written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from
sale group by product_id) s inner join product p on s.product_id =
p.product_id where p.code = 'ABC1';

We'd have thrown away 99% of the groups created in the subquery.

== Proposal ==

I've been working on allowing the planner to properly cost which option
is better and choose to perform the GROUP BY at estimated most efficient
join level.

So far I've not gotten as far as supporting queries which contain
aggregate functions, as I need the ability to combine aggregate states
(Simon and I proposed a patch for that here
https://commitfest.postgresql.org/5/131/).

Performance of the above test case is looking quite good so far:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id group by s.product_id;

-- Master = 312.294 ms
-- Patched = 95.328 ms

So a 327% performance increase in this particular case.

== Implementation ==

I've had to make some choices about how exactly to implement this
feature. As I explain above, we can't just always do the grouping at the
first possible opportunity due to the possibility of joins eliminating
groups and needless groups being created.

To support this in the planner I've invented a GroupingPath type and
I've also added 'cheapest_sorted_group_path' and 'cheapest_group_path' to
RelOptInfo. This part I wasn't too sure about and I wondered if these
GroupingPaths should just be added to the normal list with add_path(), but
since these paths perform more work than normal paths that would give them
an unfair disadvantage and they could be thrown out. These paths are only
possibly cheaper after subsequent JOIN has taken place.

There's also an issue with row estimates being wrong on joinrels, and
the row estimate is using the base rel's row count rather than the number
of groups. As yet I'm unsure the best way to fix this.

I wanted to post this early so as to register the fact that I'm working
on this, but I'm also posting in hope to get some early feedback on what I
have so far.

Of course there's lots more work to do here, aggregates need to be
supported, functional dependencies and transitive closures will also need
to be detected in a more complete implementation.

--
David Rowley http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Training & Services

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company