Removing Functionally Dependent GROUP BY Columns
We already allow a SELECT's target list to contain non-aggregated columns
in a GROUP BY query in cases where the non-aggregated column is
functionally dependent on the GROUP BY clause.
For example, a query such as:
SELECT p.product_id, p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;
is perfectly fine in PostgreSQL, as p.description is functionally dependent
on p.product_id (assuming product_id is the PRIMARY KEY of product).
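To see the equivalence concretely, here is a small sketch (not PostgreSQL itself; it uses Python's stdlib sqlite3, which also permits bare functionally dependent columns) showing that grouping by the key alone and grouping by the key plus the dependent column return identical results. Table and column names follow the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, description TEXT)")
cur.execute("CREATE TABLE sale (product_id INTEGER, quantity INTEGER)")
cur.executemany("INSERT INTO product VALUES (?, ?)", [(1, "widget"), (2, "gadget")])
cur.executemany("INSERT INTO sale VALUES (?, ?)", [(1, 5), (1, 3), (2, 7)])

base = """SELECT p.product_id, p.description, SUM(s.quantity)
          FROM product p INNER JOIN sale s ON p.product_id = s.product_id
          GROUP BY p.product_id"""
# Adding the functionally dependent column cannot split any group further,
# so both forms of the query return the same rows.
short_form = cur.execute(base + " ORDER BY p.product_id").fetchall()
long_form = cur.execute(base + ", p.description ORDER BY p.product_id").fetchall()
```

Both `short_form` and `long_form` come back as the same two rows, one per product_id.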
It seems that there's no shortage of relational databases in existence
today which don't support this. These databases would require the GROUP BY
clause to include the p.description column too.
It seems rather unfortunate that people who migrate applications to
PostgreSQL may not be aware that we support this: currently, if the
p.description column is needlessly included, PostgreSQL naively includes
this column while grouping. These people could well be incurring a
performance penalty due to our planner not removing the useless items from
the list. If the primary key is present, then including any other columns
from the *same relation* won't split the groups any further, so those
columns can simply be removed from the GROUP BY clause.
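The pruning rule described above can be sketched in a few lines. This is a hypothetical simplification, not the patch's actual C code: (relation, column) pairs stand in for Vars, and `pkeys` stands in for the pg_constraint lookup:

```python
def prune_groupby(groupby_cols, pkeys):
    """Drop GROUP BY columns that are functionally dependent on a primary
    key which is itself fully present in the GROUP BY clause.

    groupby_cols: ordered list of (relation, column) pairs
    pkeys: dict mapping relation -> frozenset of its primary-key columns
    """
    # Collect, per relation, the set of its columns appearing in GROUP BY.
    grouped = {}
    for rel, col in groupby_cols:
        grouped.setdefault(rel, set()).add(col)

    kept = []
    for rel, col in groupby_cols:
        pk = pkeys.get(rel)
        # If the whole primary key is grouped on, any other column of the
        # same relation cannot split the groups further, so drop it.
        if pk and pk <= grouped[rel] and col not in pk:
            continue
        kept.append((rel, col))
    return kept
```

For the example query, `prune_groupby([("p", "product_id"), ("p", "description")], {"p": frozenset({"product_id"})})` keeps only `("p", "product_id")`.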
There are in fact also two queries in TPC-H (Q10 and Q18) which are written
to include all of the non-aggregated columns in the GROUP BY list. During a
recent test I witnessed a 50% gain in performance in Q10 by removing the
unneeded columns from the GROUP BY clause.
I've attached a patch which implements this in PostgreSQL.
The patch may need a little more work in order to adjust the targetlist's
tleSortGroupRefs to remove invalid ones and perhaps also remove the gaps.
I'm posting this now so that I can gauge the community interest in this.
Is it something that we'd like to have in PostgreSQL?
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment: prune_group_by_clause_2027f512_2015-12-01.patch (+285, -6)
On 2015-12-01 05:00, David Rowley wrote:
We already allow a SELECT's target list to contain non-aggregated columns
in a GROUP BY query in cases where the non-aggregated column is
functionally dependent on the GROUP BY clause.
For example, a query such as:
SELECT p.product_id,p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;
is perfectly fine in PostgreSQL, as p.description is functionally dependent
on p.product_id (assuming product_id is the PRIMARY KEY of product).
This has come up before (on other forums, at least), and my main concern
has been that unlike the case where we go from throwing an error to
allowing a query, this has a chance to make the planning of currently
legal queries slower. Have you tried to measure the impact of this on
queries where there's no runtime gains to be had?
.m
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 1 December 2015 at 17:09, Marko Tiikkaja <marko@joh.to> wrote:
On 2015-12-01 05:00, David Rowley wrote:
We already allow a SELECT's target list to contain non-aggregated columns
in a GROUP BY query in cases where the non-aggregated column is
functionally dependent on the GROUP BY clause.
For example, a query such as:
SELECT p.product_id,p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;
is perfectly fine in PostgreSQL, as p.description is functionally dependent
on p.product_id (assuming product_id is the PRIMARY KEY of product).
This has come up before (on other forums, at least), and my main concern
has been that unlike the case where we go from throwing an error to
allowing a query, this has a chance to make the planning of currently legal
queries slower. Have you tried to measure the impact of this on queries
where there's no runtime gains to be had?
I've performed a series of benchmarks on the following queries:
Test1: explain select id1,id2 from t1 group by id1,id2;
Test2: explain select id from t2 group by id;
Test3: explain select t1.id1,t1.id2 from t2 inner join t1 on t1.id1=t2.id
group by t1.id1,t1.id2;
I ran each of these with pgbench for 60 seconds, 3 runs per query. In each
case below I've converted the TPS into seconds using the average TPS over
the 3 runs.
In summary:
Test1 is the worst case test. It's a very simple query so planning overhead
of join searching is non-existent. The fact that there's 2 columns in the
GROUP BY means that the fast path cannot be used. I added this test
because, if there's only 1 column in the GROUP BY, there's no point in
searching for something to remove.
Average (Sec)
Master 0.0001043117
Patched 0.0001118961
Performance 93.22%
Microseconds of planning overhead 7.5844326722
Test2 is a simple query with a GROUP BY which can fast path due to there
being only 1 GROUP BY column.
Average (Sec)
Master 0.000099374448
Patched 0.000099670124
Performance 99.70%
Microseconds of planning overhead 0.2956763193
Test3 is slightly more complex and is aimed at showing that the percentage
of planning overhead is smaller when joins exist and the overall planning
cost becomes higher.
Average (Sec)
Master 0.0001797165
Patched 0.0001798406
Performance 99.93%
Microseconds of planning overhead 0.1240776236
Test3's results seem a bit strange; I would have expected more of a
slowdown. I ran the test again to make sure, and it came back with the same
results the second time.
I've attached the spreadsheet that I used to collect the results, and also
the raw pgbench output.
It seems that the worst case test adds about 7.6 microseconds onto planning
time. To get this worst case result I had to use two GROUP BY columns, as
having only 1 triggers a fast path, since the code knows it can't remove
any columns when there's only 1. A similar fast path also exists which will
only look up the PRIMARY KEY details if there's more than 1 column per
relation in the GROUP BY, so for example GROUP BY rel1.col1, rel2.col1
won't look up any PRIMARY KEY constraint.
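The two fast paths described here could be sketched as guard clauses (a hypothetical simplification of the patch's behaviour, with (relation, column) pairs standing in for the GROUP BY Vars):

```python
def worth_searching(groupby_cols):
    """Cheap pre-checks before doing any primary-key constraint lookups."""
    # Fast path 1: with fewer than two GROUP BY columns overall, there is
    # nothing that could possibly be removed.
    if len(groupby_cols) < 2:
        return False
    # Fast path 2: only relations contributing two or more GROUP BY
    # columns can have a surplus column, so only those would need a
    # primary-key lookup at all.
    per_rel = {}
    for rel, _col in groupby_cols:
        per_rel[rel] = per_rel.get(rel, 0) + 1
    return any(n >= 2 for n in per_rel.values())
```

So GROUP BY rel1.col1, rel2.col1 is rejected by the second check without any constraint lookup, while GROUP BY t.id1, t.id2 proceeds to the search.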
Given that the extra code really only does anything if the GROUP BY has 2
or more expressions, are you still worried that this will negatively affect
too many short, fast-to-execute queries?
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Nov 30, 2015 at 11:00 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
There are in fact also two queries in TPC-H (Q10 and Q18) which are written
to include all of the non-aggregated column in the GROUP BY list. During a
recent test I witnessed a 50% gain in performance in Q10 by removing the
unneeded columns from the GROUP BY clause.
Wow, that's pretty impressive. +1 for trying to do something about this.
(Full disclosure: I have not read your patch.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/30/15 11:00 PM, David Rowley wrote:
It seems that there's no shortage of relational databases in existence
today which don't support this. These databases would require the GROUP
BY clause to include the p.description column too.
Well, actually, we implemented this because other databases had it and
users kept complaining to us.
On 3 December 2015 at 14:28, Peter Eisentraut <peter_e@gmx.net> wrote:
On 11/30/15 11:00 PM, David Rowley wrote:
It seems that there's no shortage of relational databases in existence
today which don't support this. These databases would require the GROUP
BY clause to include the p.description column too.
Well, actually, we implemented this because other databases had it and
users kept complaining to us.
Most likely you mean MySQL? I believe there was a bit of an influx of
people migrating away from that database in the not too distant past. It's
not too surprising that we picked up some of those people, and some of the
features that they were used to, at the same time... IF NOT EXISTS is
perhaps a good example to back that up.
However, I think your comment makes it quite clear that we're not talking
about the same database management systems.
I tested the two most popular commercial ones and found that neither
supported columns in the SELECT list which were functionally dependent on
the GROUP BY clause, unless the columns themselves were in the GROUP BY
clause. I admit my "no shortage" comment was based on these two only. I
also admit that I don't possess any statistics showing which RDBMSs users
are migrating from. I merely thought that testing the two most popular
commercial ones was enough justification to write what I wrote.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 01/12/2015 12:07, David Rowley wrote:
On 1 December 2015 at 17:09, Marko Tiikkaja <marko@joh.to> wrote:
On 2015-12-01 05:00, David Rowley wrote:
We already allow a SELECT's target list to contain non-aggregated columns
in a GROUP BY query in cases where the non-aggregated column is
functionally dependent on the GROUP BY clause.
For example, a query such as:
SELECT p.product_id,p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;
is perfectly fine in PostgreSQL, as p.description is functionally dependent
on p.product_id (assuming product_id is the PRIMARY KEY of product).
This has come up before (on other forums, at least), and my main
concern has been that unlike the case where we go from throwing an
error to allowing a query, this has a chance to make the planning of
currently legal queries slower. Have you tried to measure the
impact of this on queries where there's no runtime gains to be had?
I've performed a series of benchmarks on the following queries:
Test1: explain select id1,id2 from t1 group by id1,id2;
Test2: explain select id from t2 group by id;
Test3: explain select t1.id1,t1.id2 from t2 inner join t1 on
t1.id1=t2.id group by t1.id1,t1.id2;
I ran each of these with pgbench for 60 seconds, 3 runs per query. In
each case below I've converted the TPS into seconds using the average
TPS over the 3 runs.
In summary:
Test1 is the worst case test. It's a very simple query so planning
overhead of join searching is non-existent. The fact that there's 2
columns in the GROUP BY means that the fast path cannot be used. I added
this as if there's only 1 column in the GROUP BY then there's no point
in searching for something to remove.
[...]
It seems that the worst case test adds about 7.6 microseconds onto
planning time. To get this worse case result I had to add two GROUP BY
columns, as having only 1 triggers a fast path as the code knows it
can't remove any columns, since there's only 1. A similar fast path also
exists which will only lookup the PRIMARY KEY details if there's more
than 1 column per relation in the GROUP BY, so for example GROUP BY
rel1.col1, rel2.col1 won't lookup any PRIMARY KEY constraint.
Given that the extra code really only does anything if the GROUP BY has
2 or more expressions, are you worried that this will affect too many
short and fast to execute queries negatively?
In identify_key_vars()
+ /* Only PK constraints are of interest for now, see comment above */
+ if (con->contype != CONSTRAINT_PRIMARY)
+ continue;
I think the comment should better refer to
remove_useless_groupby_columns() (don't optimize further than what
check_functional_grouping() does).
+ /*
+ * Exit if the constraint is deferrable, there's no point in looking
+ * for another PK constriant, as there can only be one.
+ */
small typo on "constriant"
+ {
+     varlistmatches = bms_add_member(varlistmatches, varlistidx);
+ found_col = true;
+ /* don't break, there might be dupicates */
+ }
small typo on "dupicates". Also, I thought transformGroupClauseExpr()
called in the parse analysis make sure that there isn't any duplicate in
the GROUP BY clause. Do I miss something?
in remove_useless_groupby_columns()
+ /*
+  * Skip if there were no Vars found in the GROUP BY clause which belong
+  * to this relation. We can also skip if there's only 1 Var, as there
+  * needs to be more than one Var for there to be any columns which are
+  * functionally dependent on another column.
+  */
This comment doesn't sound very clear. I'm not a native english speaker,
so maybe it's just me.
+ /*
+ * If we found any surplus Vars in the GROUP BY clause, then we'll build
+ * a new GROUP BY clause without these surplus Vars.
+ */
+ if (anysurplus)
+ {
+ List *new_groupby = NIL;
+
+ foreach (lc, root->parse->groupClause)
+ {
+ SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+ TargetEntry *tle;
+ Var *var;
+
+ tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+ var = (Var *) tle->expr;
+
+ if (!IsA(var, Var))
+ continue;
+ [...]
if var isn't a Var, it needs to be added in new_groupby.
+ /* XXX do we to alter tleSortGroupRef to remove gaps? */
no idea on that :/
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
Many thanks for the thorough review!
On 12 January 2016 at 23:37, Julien Rouhaud <julien.rouhaud@dalibo.com>
wrote:
In identify_key_vars()
+ /* Only PK constraints are of interest for now, see comment above */
+ if (con->contype != CONSTRAINT_PRIMARY)
+     continue;
I think the comment should better refer to
remove_useless_groupby_columns() (don't optimize further than what
check_functional_grouping() does).
I've improved this by explaining it more clearly.
+ /*
+  * Exit if the constraint is deferrable, there's no point in looking
+  * for another PK constriant, as there can only be one.
+  */
small typo on "constriant"
Fixed.
+ {
+     varlistmatches = bms_add_member(varlistmatches, varlistidx);
+     found_col = true;
+     /* don't break, there might be dupicates */
+ }
small typo on "dupicates". Also, I thought transformGroupClauseExpr()
called in the parse analysis make sure that there isn't any duplicate in
the GROUP BY clause. Do I miss something?
No I missed something :) You are right. I've put a break; here and a
comment to explain why.
in remove_useless_groupby_columns()
+ /*
+  * Skip if there were no Vars found in the GROUP BY clause which belong
+  * to this relation. We can also skip if there's only 1 Var, as there
+  * needs to be more than one Var for there to be any columns which are
+  * functionally dependent on another column.
+  */
This comment doesn't sound very clear. I'm not a native english speaker,
so maybe it's just me.
Yes it was badly worded. I've fixed in the attached.
+ /*
+  * If we found any surplus Vars in the GROUP BY clause, then we'll build
+  * a new GROUP BY clause without these surplus Vars.
+  */
+ if (anysurplus)
+ {
+     List *new_groupby = NIL;
+
+     foreach (lc, root->parse->groupClause)
+     {
+         SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+         TargetEntry *tle;
+         Var *var;
+
+         tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+         var = (Var *) tle->expr;
+
+         if (!IsA(var, Var))
+             continue;
+ [...]
if var isn't a Var, it needs to be added in new_groupby.
Yes you're right, well, at least I've written enough code to ensure that
it's not needed.
Technically we could look inside non-Vars and check if the Expr is just
made up of Vars from a single relation, and then we could also treat that
as surplus if we find other Vars which make up the table's primary key. I
didn't make these changes as I thought it was a less likely scenario. It
wouldn't be too much extra code to add, however. I've gone and added an XXX
comment to explain that there might be future optimisation opportunities
around this.
I've attached an updated patch.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment: prune_group_by_clause_90b5f3c_2016-01-14.patch (+310, -6)
On 14/01/2016 01:30, David Rowley wrote:
Many thanks for the thorough review!
And thanks for the patch and fast answer!
On 12 January 2016 at 23:37, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
In identify_key_vars()
+ /* Only PK constraints are of interest for now, see comment above */
+ if (con->contype != CONSTRAINT_PRIMARY)
+     continue;
I think the comment should better refer to
remove_useless_groupby_columns() (don't optimize further than what
check_functional_grouping() does).
I've improved this by explaining it more clearly.
Very nice.
Small typo though:
+ * Technically we could look at UNIQUE indexes too, however we'd also
+ * have to ensure that each column of the unique index had a NOT NULL
s/had/has/
+ * constraint, however since NOT NULL constraints currently are don't
s/are //
[...]
+ {
+     varlistmatches = bms_add_member(varlistmatches, varlistidx);
+     found_col = true;
+     /* don't break, there might be dupicates */
+ }
small typo on "dupicates". Also, I thought transformGroupClauseExpr()
called in the parse analysis make sure that there isn't any duplicate in
the GROUP BY clause. Do I miss something?
No I missed something :) You are right. I've put a break; here and a
comment to explain why.
ok :)
[...]
+ /*
+  * If we found any surplus Vars in the GROUP BY clause, then we'll build
+  * a new GROUP BY clause without these surplus Vars.
+  */
+ if (anysurplus)
+ {
+     List *new_groupby = NIL;
+
+     foreach (lc, root->parse->groupClause)
+     {
+         SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+         TargetEntry *tle;
+         Var *var;
+
+         tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+         var = (Var *) tle->expr;
+
+         if (!IsA(var, Var))
+             continue;
+ [...]
if var isn't a Var, it needs to be added in new_groupby.
Yes you're right, well, at least I've written enough code to ensure that
it's not needed.
Oh yes, I realize that now.
Technically we could look inside non-Vars and check if the Expr is just
made up of Vars from a single relation, and then we could also make that
surplus if we find other Vars which make up the table's primary key. I
didn't make these changes as I thought it was a less likely scenario. It
wouldn't be too much extra code to add however. I've went and added an
XXX comment to explain that there might be future optimisation
opportunities in the future around this.
Agreed.
I've attached an updated patch.
+ /* don't try anything unless there's two Vars */
+ if (varlist == NULL || list_length(varlist) < 2)
+ continue;
To be perfectly correct, the comment should say "at least two Vars".
Except the small remaining typos, this patch looks very fine to me. I
flag it as ready for committer.
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
On 14 January 2016 at 11:19, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
+ /* don't try anything unless there's two Vars */
+ if (varlist == NULL || list_length(varlist) < 2)
+     continue;
To be perfectly correct, the comment should say "at least two Vars".
Apologies for butting in and I appreciate I don't have any ownership
over this codebase or right to suggest any changes, but this just
caught my eye before I could hit "delete".
My mantra tends to be "why, not what" for inline comments; in this
case you can get the same information from the next line of code as
you get from the comment.
Perhaps something like
/* it's clearly impossible to remove duplicates if there are fewer
than two GROUPBY columns */
might be more helpful?
(also sorry if I've misunderstood what it _actually_ does, I just made
an assumption based on reading this thread!)
Geoff
On 14/01/2016 14:04, Geoff Winkless wrote:
On 14 January 2016 at 11:19, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
+ /* don't try anything unless there's two Vars */
+ if (varlist == NULL || list_length(varlist) < 2)
+     continue;
To be perfectly correct, the comment should say "at least two Vars".
Apologies for butting in and I appreciate I don't have any ownership
over this codebase or right to suggest any changes, but this just
caught my eye before I could hit "delete".
My mantra tends to be "why, not what" for inline comments; in this
case you can get the same information from the next line of code as
you get from the comment.
You're absolutely right, but in this case the comment is more like a
reminder of a bigger comment a few lines before that wasn't quoted in my mail:
+ * [...] If there are no Vars then
+ * nothing need be done for this rel. If there's less than two Vars then
+ * there's no room to remove any, as the PRIMARY KEY must have at least one
+ * Var, so we can safely also do nothing if there's less than two Vars.
so I assume it's ok to keep it this way.
Perhaps something like
/* it's clearly impossible to remove duplicates if there are fewer
than two GROUPBY columns */
might be more helpful?
(also sorry if I've misunderstood what it _actually_ does, I just made
an assumption based on reading this thread!)
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
On 14 January 2016 at 13:16, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
You're absolutely right, but in this case the comment is more like a
reminder of a bigger comment few lines before that wasn't quoted in my mail
Fair enough, although I have two niggles with that:
a) the second comment could become physically separated from the first
by later additions of extra code, or by refactoring;
b) if you don't need the comment because the explanation for it is
local anyway and the comment tells you nothing that the code doesn't,
why have it at all?
so I assume it's ok to keep it this way.
Of course it's ok to do whatever you decide is best: as I said
previously, I fully appreciate that I have no ownership over any of
the code.
Geoff
On 14/01/2016 14:29, Geoff Winkless wrote:
On 14 January 2016 at 13:16, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
You're absolutely right, but in this case the comment is more like a
reminder of a bigger comment few lines before that wasn't quoted in my mail
Fair enough, although I have two niggles with that:
a) the second comment could become physically separated from the first
by later additions of extra code, or by refactoring;
b) if you don't need the comment because the explanation for it is
local anyway and the comment tells you nothing that the code doesn't,
why have it at all?
I agree. If I had to choose, I'd vote for removing it.
so I assume it's ok to keep it this way.
Of course it's ok to do whatever you decide is best: as I said
previously, I fully appreciate that I have no ownership over any of
the code.
Neither do I, I'm just a reviewer here, same as you :)
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
On 15 January 2016 at 00:19, Julien Rouhaud <julien.rouhaud@dalibo.com>
wrote:
+ * Technically we could look at UNIQUE indexes too, however we'd also
+ * have to ensure that each column of the unique index had a NOT NULL
s/had/has/
+ * constraint, however since NOT NULL constraints currently are don't
s/are //
Both fixed. Thanks.
+ /*
+  * If we found any surplus Vars in the GROUP BY clause, then we'll build
+  * a new GROUP BY clause without these surplus Vars.
+  */
+ if (anysurplus)
+ {
+     List *new_groupby = NIL;
+
+     foreach (lc, root->parse->groupClause)
+     {
+         SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+         TargetEntry *tle;
+         Var *var;
+
+         tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+         var = (Var *) tle->expr;
+
+         if (!IsA(var, Var))
+             continue;
+ [...]
if var isn't a Var, it needs to be added in new_groupby.
Yes you're right, well, at least I've written enough code to ensure that
it's not needed.Oh yes, I realize that now.
I meant to say "I've not written enough code" ...
Technically we could look inside non-Vars and check if the Expr is just
made up of Vars from a single relation, and then we could also make that
surplus if we find other Vars which make up the table's primary key. I
didn't make these changes as I thought it was a less likely scenario. It
wouldn't be too much extra code to add however. I've went and added an
XXX comment to explain that there might be future optimisation
opportunities in the future around this.Agreed.
I've attached an updated patch.
+ /* don't try anything unless there's two Vars */
+ if (varlist == NULL || list_length(varlist) < 2)
+     continue;
To be perfectly correct, the comment should say "at least two Vars".
Changed per discussion from you and Geoff
Except the small remaining typos, this patch looks very fine to me. I
flag it as ready for committer.
Many thanks for your followup review.
I've attached an updated patch to address what you highlighted above.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment: prune_group_by_clause_59f15cf_2016-01-15.patch (+310, -6)
On 14/01/2016 23:05, David Rowley wrote:
On 15 January 2016 at 00:19, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
+ * Technically we could look at UNIQUE indexes too, however we'd also
+ * have to ensure that each column of the unique index had a NOT NULL
s/had/has/
+ * constraint, however since NOT NULL constraints currently are don't
s/are //
Both fixed. Thanks.
+ /*
+  * If we found any surplus Vars in the GROUP BY clause, then we'll build
+  * a new GROUP BY clause without these surplus Vars.
+  */
+ if (anysurplus)
+ {
+     List *new_groupby = NIL;
+
+     foreach (lc, root->parse->groupClause)
+     {
+         SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+         TargetEntry *tle;
+         Var *var;
+
+         tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+         var = (Var *) tle->expr;
+
+         if (!IsA(var, Var))
+             continue;
+ [...]
if var isn't a Var, it needs to be added in new_groupby.
Yes you're right, well, at least I've written enough code to ensure that
it's not needed.
Oh yes, I realize that now.
I meant to say "I've not written enough code" ...
Yes, that makes sense with the explanation you wrote just after :)
Technically we could look inside non-Vars and check if the Expr is just
made up of Vars from a single relation, and then we could also make that
surplus if we find other Vars which make up the table's primary key. I
didn't make these changes as I thought it was a less likely scenario. It
wouldn't be too much extra code to add however. I've went and added an XXX
comment to explain that there might be future optimisation opportunities
in the future around this.
Agreed.
I've attached an updated patch.
+ /* don't try anything unless there's two Vars */
+ if (varlist == NULL || list_length(varlist) < 2)
+     continue;
To be perfectly correct, the comment should say "at least two Vars".
Changed per discussion from you and Geoff
Except the small remaining typos, this patch looks very fine to me.
I flag it as ready for committer.
Many thanks for your followup review.
I've attached an updated patch to address what you highlighted
above.
Thanks!
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
David Rowley <david.rowley@2ndquadrant.com> writes:
[ prune_group_by_clause_59f15cf_2016-01-15.patch ]
I spent some time looking through this but decided that it needs some work
to be committable.
The main thing I didn't like was that identify_key_vars seems to have a
rather poorly chosen API: it mixes up identifying a rel's pkey with
sorting through the GROUP BY list. I think it would be better to create
a function that, given a relid, hands back the OID of the pkey constraint
and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
no pkey or it's deferrable). The column numbers should be offset by
FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
represented correctly --- compare RelationGetIndexAttrBitmap().
The reason this seems like a more attractive solution is that the output
of the pg_constraint lookup becomes independent of the particular query
and thus is potentially cacheable (in the relcache, say). I do not think
we need to cache it right now but I'd like to have the option to do so.
(I wonder BTW whether check_functional_grouping couldn't be refactored
to use the output of such a function, too.)
Some lesser points:
* I did not like where you plugged in the call to
remove_useless_groupby_columns; there are weird interactions with grouping
sets as to whether it will get called or not, and also that whole chunk
of code is due for refactoring. I'd be inclined to call it from
subquery_planner instead, maybe near where the HAVING clause preprocessing
happens.
* You've got it iterating over every RTE, even those that aren't
RTE_RELATION RTEs. It's pure luck that identify_key_vars doesn't fail
outright when passed a bogus relid, and it's certainly wasteful.
* Both of the loops iterating over the groupClause neglect to check
varlevelsup, thus leading to assert failures or worse if an outer-level
Var is present in the GROUP BY list. (I'm pretty sure outer Vars can
still be present at this point, though I might be wrong.)
* I'd be inclined to see if the relvarlists couldn't be converted into
bitmapsets too. Then the test to see if the pkey is a subset of the
grouping columns becomes a simple and cheap bms_is_subset test. You don't
really need surplusvars either, because you can instead test Vars for
membership in the pkey bitmapsets that pg_constraint.c will be handing
back.
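The bitmapset approach suggested here amounts to something like the following (a rough Python analogue, not the actual C code: Python sets stand in for Bitmapsets, and the constant's value, -8 in PostgreSQL releases of that era, is taken as an assumption; the real definition lives in access/sysattr.h):

```python
# Illustrative value; the real constant is defined in access/sysattr.h.
FIRST_LOW_INVALID_HEAP_ATTRIBUTE_NUMBER = -8

def attno_to_bit(attno):
    # Offset attribute numbers so that system columns (negative attnos,
    # such as a pkey on OID) map to valid non-negative bit positions,
    # mirroring what RelationGetIndexAttrBitmap() does.
    return attno - FIRST_LOW_INVALID_HEAP_ATTRIBUTE_NUMBER

# Pkey on column 1; GROUP BY references columns 1 and 2 of the relation.
pkey_attnos = {attno_to_bit(1)}
grouped_attnos = {attno_to_bit(1), attno_to_bit(2)}

# bms_is_subset analogue: if the pkey columns are contained in the grouped
# columns, every other column of this relation is removable.
can_prune = pkey_attnos <= grouped_attnos
```

With the pkey and grouping columns in this form, the "is the pkey fully grouped on" test is a single cheap subset check, with no need to track surplus Vars separately.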
* What you did to join.sql/join.out seems a bit bizarre. The existing
test case that you modified was meant to test something else, and
conflating this behavior with the pre-existing one (and not documenting
it) is confusing as can be. A bit more work on regression test cases
seems indicated.
I'm going to set this back to "Waiting on Author". I think the basic
idea is good but it needs a rewrite.
regards, tom lane
On 23 January 2016 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I spent some time looking through this but decided that it needs some work
to be committable.

The main thing I didn't like was that identify_key_vars seems to have a
rather poorly chosen API: it mixes up identifying a rel's pkey with
sorting through the GROUP BY list. I think it would be better to create
a function that, given a relid, hands back the OID of the pkey constraint
and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
no pkey or it's deferrable). The column numbers should be offset by
FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
represented correctly --- compare RelationGetIndexAttrBitmap().
That seems like a much better design to me too. I've made changes and
attached the updated patch.
The reason this seems like a more attractive solution is that the output
of the pg_constraint lookup becomes independent of the particular query
and thus is potentially cacheable (in the relcache, say). I do not think
we need to cache it right now but I'd like to have the option to do so.
I've not touched that yet, but good idea.
(I wonder BTW whether check_functional_grouping couldn't be refactored
to use the output of such a function, too.)
I've attached a separate patch for that too. It applies after the
prunegroupby patch.
Some lesser points:
* I did not like where you plugged in the call to
remove_useless_groupby_columns; there are weird interactions with grouping
sets as to whether it will get called or not, and also that whole chunk
of code is due for refactoring. I'd be inclined to call it from
subquery_planner instead, maybe near where the HAVING clause preprocessing
happens.
I've moved the call into subquery_planner().
* You've got it iterating over every RTE, even those that aren't
RTE_RELATION RTEs. It's pure luck that identify_key_vars doesn't fail
outright when passed a bogus relid, and it's certainly wasteful.
:-( I've fixed that now.
* Both of the loops iterating over the groupClause neglect to check
varlevelsup, thus leading to assert failures or worse if an outer-level
Var is present in the GROUP BY list. (I'm pretty sure outer Vars can
still be present at this point, though I might be wrong.)
Fixed in the first loop, and the way I've rewritten the code to use
bms_difference, there's no need to check again in the 2nd loop.
* I'd be inclined to see if the relvarlists couldn't be converted into
bitmapsets too. Then the test to see if the pkey is a subset of the
grouping columns becomes a simple and cheap bms_is_subset test. You don't
really need surplusvars either, because you can instead test Vars for
membership in the pkey bitmapsets that pg_constraint.c will be handing
back.
I've changed it to use a bitmapset now, but I've kept surplusvars;
having it allows a single pass over the GROUP BY clause to remove the
surplus Vars, rather than doing it again and again for each relation.
I've also found a better way so that the array is only allocated if some
surplus Vars are found. I understand what you mean, and yes, it is
possible to get rid of it, but I'd still need to add something else to
mark that this rel's extra Vars are okay to be removed. I could do
that by adding another bitmapset and marking the relid in that, but I
quite like how I've changed it now, as it saves having to check
varlevelsup again on the Vars in the GROUP BY clause: I just use
bms_difference() on the originally found Vars, subtracting off the PK
vars, and assume all of those are surplus.
* What you did to join.sql/join.out seems a bit bizarre. The existing
test case that you modified was meant to test something else, and
conflating this behavior with the pre-existing one (and not documenting
it) is confusing as can be. A bit more work on regression test cases
seems indicated.
The history behind that is that at one point while developing the
patch, that test had started failing: the GROUP BY item was removed,
which in turn allowed the join removal conditions to be met. On
testing again with the old test query I see this no longer happens, so
I've removed the change, although the expected output still differs
due to the GROUP BY item being removed.
I'm going to set this back to "Waiting on Author". I think the basic
idea is good but it needs a rewrite.
Thanks for the review and the good ideas.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 23/01/2016 10:44, David Rowley wrote:
(I wonder BTW whether check_functional_grouping couldn't be refactored
to use the output of such a function, too.)

I've attached a separate patch for that too. It applies after the
prunegroupby patch.
This refactoring makes the code much better, +1 from me.
I also reviewed it, and it looks fine. However, the following removed
comment could be kept for clarity:
- /* The PK is a subset of grouping_columns, so we win */
* I'd be inclined to see if the relvarlists couldn't be converted into
bitmapsets too. Then the test to see if the pkey is a subset of the
grouping columns becomes a simple and cheap bms_is_subset test. You don't
really need surplusvars either, because you can instead test Vars for
membership in the pkey bitmapsets that pg_constraint.c will be handing
back.

I've changed to use a bitmapset now, but I've kept surplusvars, having
this allows a single pass over the group by clause to remove the
surplus Vars, rather than doing it again and again for each relation.
I've also found a better way so that array is only allocated if some
surplus Vars are found. I understand what you mean, and yes, it is
possible to get rid of it, but I'd need to still add something else to
mark that this rel's extra Vars are okay to be removed. I could do
that by adding another bitmapset and marking the relid in that, but I
quite like how I've changed it now as it saves having to check
varlevelsup again on the Vars in the GROUP BY clause, as I just use
bms_difference() on the originally found Vars subtracting off the PK
vars, and assume all of those are surplus.
I wonder if in remove_useless_groupby_columns(), in the foreach loop you
could change the
+ if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+ {
by something like
+ if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
+ continue;
+
+ if (bms_is_subset(pkattnos, relattnos))
+ {
which may be cheaper.
Otherwise, I couldn't find any issue with this new version of the patch.
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
On 24 January 2016 at 00:56, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
I wonder if in remove_useless_groupby_columns(), in the foreach loop you
could change the

+ if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+ {

by something like

+ if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
+ continue;
+
+ if (bms_is_subset(pkattnos, relattnos))
+ {

which may be cheaper.
Thanks for looking over this again.
I actually had that part written quite a few different ways before I
finally settled on bms_subset_compare(). I didn't benchmark, but I
figured one function call was better than two: I had it as
bms_is_subset(pkattnos, relattnos) && !bms_is_subset(relattnos,
pkattnos), and again with !bms_equal() instead of the 2nd subset test.
I'm thinking that three function calls likely won't make things
better, and I can't imagine it's going to cost much either way, so I
doubt it's worth trying to check what's faster. Although the thing
about bms_num_members() is that it's going to loop over each word in
the bitmap no matter what, whereas a subset check can abort early in
some cases.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 23/01/2016 14:51, David Rowley wrote:
Although the thing about bms_num_members() is that it's going to loop
over each word in the bitmap no matter what, whereas a subset check
can abort early in some cases.
FWIW, that code was a simplified example. bms_num_members(relattnos) is
already called a few lines before, so it'd be one function call against
two function calls (and a variable assignment).
--
Julien Rouhaud
http://dalibo.com - http://dalibo.org