Functional dependencies and GROUP BY

Started by Peter Eisentrautalmost 16 years ago38 messageshackers
Jump to latest
#1Peter Eisentraut
peter_e@gmx.net

I have developed a patch that partially implements the "functional
dependency" feature that allows some columns to be omitted from the
GROUP BY clause if it can be shown that the columns are functionally
dependent on the columns in the group by clause and therefore guaranteed
to be unique per group. The full functional dependency deduction rules
are pretty big and arcane, so I concentrated on getting a useful subset
working. In particular:

When grouping by primary key, the other columns can be omitted, e.g.,

CREATE TABLE tab1 (a int PRIMARY KEY, b int);

SELECT a, b FROM tab1 GROUP BY a;

This is frequently requested by MySQL converts (and possibly others).

Also, when a column is compared with a constant, it can appear
ungrouped:

SELECT x, y FROM tab2 WHERE y = 5 GROUP BY x;

For lack of a better idea, I have made it so that merge-joinable
operators qualify as equality operators. Better ideas welcome.

Other rules could be added over time (but I'm current not planning to
work on that myself).

At this point, this patch could use some review and testing with unusual
queries that break my implementation. ;-)

Attachments:

functional-deps.patchtext/x-patch; charset=UTF-8; name=functional-deps.patchDownload+721-22
#2Bruce Momjian
bruce@momjian.us
In reply to: Peter Eisentraut (#1)
Re: Functional dependencies and GROUP BY

On Mon, Jun 7, 2010 at 7:33 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

I have developed a patch that partially implements the "functional
dependency" feature

Nice! :)

--
greg

#3Hitoshi Harada
umi.tanuki@gmail.com
In reply to: Peter Eisentraut (#1)
Re: Functional dependencies and GROUP BY

2010/6/8 Peter Eisentraut <peter_e@gmx.net>:

I have developed a patch that partially implements the "functional
dependency" feature that allows some columns to be omitted from the
GROUP BY clause if it can be shown that the columns are functionally
dependent on the columns in the group by clause and therefore guaranteed
to be unique per group.  The full functional dependency deduction rules
are pretty big and arcane, so I concentrated on getting a useful subset
working.  In particular:

Also, when a column is compared with a constant, it can appear
ungrouped:

SELECT x, y FROM tab2 WHERE y = 5 GROUP BY x;

I don't see why it should be allowed. I see the insist that y must be
unique value so it is ok to be ungrouped but the point of discussion
is far from that; Semantically y is not grouping key.

In addition, what if y is implicitly a constant? For example,

SELECT x, y FROM tab2 WHERE y = a AND a = 5 GROUP BY x;

or there should be more complicated example including JOIN cases. I
don't believe we can detect all of such cases. If the simple case is
allowed, users don't understand why the complicated case doesn't allow
sometimes. So it'll not be consistent.

Finally, it may hide unintended bugs. ORM tools may make WHERE clause
in some conditions and don't in other conditions.

Regards,

--
Hitoshi Harada

#4Stephen Frost
sfrost@snowman.net
In reply to: Hitoshi Harada (#3)
Re: Functional dependencies and GROUP BY

* Hitoshi Harada (umi.tanuki@gmail.com) wrote:

I don't see why it should be allowed. I see the insist that y must be
unique value so it is ok to be ungrouped but the point of discussion
is far from that; Semantically y is not grouping key.

Ignoring the fact that it's terribly useful- isn't it part of the SQL
spec?

In addition, what if y is implicitly a constant? For example,

SELECT x, y FROM tab2 WHERE y = a AND a = 5 GROUP BY x;

Not sure I see the issue here?

Finally, it may hide unintended bugs. ORM tools may make WHERE clause
in some conditions and don't in other conditions.

Yeah, this one I really just done buy.. If an ORM tool doesn't write
correct SQL, then it's the ORM's fault, not ours.

Thanks,

Stephen

#5Stephen Frost
sfrost@snowman.net
In reply to: Peter Eisentraut (#1)
Re: Functional dependencies and GROUP BY

* Peter Eisentraut (peter_e@gmx.net) wrote:

This is frequently requested by MySQL converts (and possibly others).

I'd certainly love to see it- but let's not confuse people by implying
that it would actually act the way MySQL does. It wouldn't, because
what MySQL does is alot closer to 'distinct on' and is patently insane
to boot. Again, I'd *love* to see this be done in PG, but when we
document it and tell people about it, *please* don't say it's similar in
any way to MySQL's "oh, we'll just pick a random value from the columns
that aren't group'd on" implementation.

At this point, this patch could use some review and testing with unusual
queries that break my implementation. ;-)

I'll give it a shot... :)

Thanks!

Stephen

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#1)
Re: Functional dependencies and GROUP BY

Peter Eisentraut <peter_e@gmx.net> writes:

I have developed a patch that partially implements the "functional
dependency" feature that allows some columns to be omitted from the
GROUP BY clause if it can be shown that the columns are functionally
dependent on the columns in the group by clause and therefore guaranteed
to be unique per group.

The main objection to this is the same one I've had all along: it makes
the syntactic validity of a query dependent on what indexes exist for
the table. At minimum, that means that enforcing the check at parse
time is the Wrong Thing.

The var-compared-with-constant case seems like a big crock. Are we
really required to provide such a thing per spec?

I'm also fairly concerned about the performance of a check implemented
this way --- it's going to do a lot of work, and do it over and over
again as it traverses the query tree. At least some of that could be
alleviated after you move the check to the planner, just by virtue of
the index information already having been acquired ... but I'd still
suggest expending more than no effort on caching the results. For
instance, given "SELECT * FROM a_very_wide_table GROUP BY pk" you
shouldn't have to prove more than once that a_very_wide_table is
grouped by its PK.

regards, tom lane

#7Pavel Stehule
pavel.stehule@gmail.com
In reply to: Bruce Momjian (#2)
Re: Functional dependencies and GROUP BY

2010/6/7 Greg Stark <gsstark@mit.edu>:

On Mon, Jun 7, 2010 at 7:33 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

I have developed a patch that partially implements the "functional
dependency" feature

Nice! :)

I like this idea too. It can simplify some queries and I believe - it
is very good marketing bonus for us.

Pavel

Show quoted text

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Peter Eisentraut
peter_e@gmx.net
In reply to: Hitoshi Harada (#3)
Re: Functional dependencies and GROUP BY

On tis, 2010-06-08 at 09:59 +0900, Hitoshi Harada wrote:

Also, when a column is compared with a constant, it can appear
ungrouped:

SELECT x, y FROM tab2 WHERE y = 5 GROUP BY x;

I don't see why it should be allowed. I see the insist that y must be
unique value so it is ok to be ungrouped but the point of discussion
is far from that; Semantically y is not grouping key.

I'm not sure what your argument is. If y is uniquely determined within
each group, then it's OK for it to be ungrouped. What other criteria do
you have in mind for determining that instead? It looks like you are
going by aesthetics. ;-)

In addition, what if y is implicitly a constant? For example,

SELECT x, y FROM tab2 WHERE y = a AND a = 5 GROUP BY x;

or there should be more complicated example including JOIN cases. I
don't believe we can detect all of such cases. If the simple case is
allowed, users don't understand why the complicated case doesn't allow
sometimes. So it'll not be consistent.

Yes, as I said, my implementation is incomplete in the sense that it
only recognizes some functional dependencies. To recognize the sort of
thing you show, you would need some kind of complex deduction or proof
engine, and that doesn't seem worthwhile, at least for me, at this
point.

#9Rob Wultsch
wultsch@gmail.com
In reply to: Stephen Frost (#5)
Re: Functional dependencies and GROUP BY

On Mon, Jun 7, 2010 at 6:41 PM, Stephen Frost <sfrost@snowman.net> wrote:

* Peter Eisentraut (peter_e@gmx.net) wrote:

This is frequently requested by MySQL converts (and possibly others).

I'd certainly love to see it- but let's not confuse people by implying
that it would actually act the way MySQL does.  It wouldn't, because
what MySQL does is alot closer to 'distinct on' and is patently insane
to boot.  Again, I'd *love* to see this be done in PG, but when we
document it and tell people about it, *please* don't say it's similar in
any way to MySQL's "oh, we'll just pick a random value from the columns
that aren't group'd on" implementation.

Preface: I work as a MySQL DBA (yeah, yeah, laugh it up...).

It has been my experience that the vast majority of the time when a
MySQL users make use of the "fine feature" which allows them to use
unaggregated columns which is not present in the GROUP BY clause in an
aggregating query they have made an error because they do not
understand GROUP BY. I have found this lack of understanding to be
very wide spread across the MySQL developer and *DBA* communities. I
also would really hesitate to compare this useful feature to the *fine
feature* present in MySQL. Due to a long standing bug
(http://bugs.mysql.com/bug.php?id=8510) it really is not possible to
get MySQL to behave sanely. It is my impression that many programs of
significant size that interact with MySQL have errors because of this
issue and it would be good to not claim to have made Postgres
compatible.

That said, I imagine if this feature could make it into the Postgres
tree it would be very useful.

Would I be correct in assuming that while this feature would make
query planning more expensive, it would also often decrease the cost
of execution?

Best,

Rob Wultsch
wultsch@gmail.com

#10Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#6)
Re: Functional dependencies and GROUP BY

On Tue, Jun 8, 2010 at 4:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Eisentraut <peter_e@gmx.net> writes:

I have developed a patch that partially implements the "functional
dependency" feature that allows some columns to be omitted from the
GROUP BY clause if it can be shown that the columns are functionally
dependent on the columns in the group by clause and therefore guaranteed
to be unique per group.

The main objection to this is the same one I've had all along: it makes
the syntactic validity of a query dependent on what indexes exist for
the table.  At minimum, that means that enforcing the check at parse
time is the Wrong Thing.

It also needs to ensure that the plan is invalidated if the constraint
is dropped, which I assume amounts to the same thing.

--
greg

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#10)
Re: Functional dependencies and GROUP BY

Greg Stark <gsstark@mit.edu> writes:

On Tue, Jun 8, 2010 at 4:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The main objection to this is the same one I've had all along: it makes
the syntactic validity of a query dependent on what indexes exist for
the table. �At minimum, that means that enforcing the check at parse
time is the Wrong Thing.

It also needs to ensure that the plan is invalidated if the constraint
is dropped, which I assume amounts to the same thing.

Well, no, any cached plan will get invalidated if the index goes away.
The big problem with this implementation is that you could create a
*rule* (eg a view) containing a query whose validity depends on the
existence of an index. Dropping the index will not cause the rule
to be invalidated.

Perhaps the correct fix would be to mark stored query trees as having a
dependency on the index, so that dropping the index/constraint would
force a drop of the rule too. Just pushing the check to plan time, as
I suggested yesterday, isn't a very nice fix because it would result
in the rule unexpectedly starting to fail at execution.

regards, tom lane

#12Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#11)
Re: Functional dependencies and GROUP BY

On Tue, Jun 8, 2010 at 3:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, no, any cached plan will get invalidated if the index goes away.
The big problem with this implementation is that you could create a
*rule* (eg a view) containing a query whose validity depends on the
existence of an index.  Dropping the index will not cause the rule
to be invalidated.

Hm, I was incorrectly thinking of this as analogous to the cases of
plans that could be optimized based on the existence of a constraint.
For example removing columns from a sort key because they're unique.
But this is different because not just the plan but the validity of
the query itself is dependent on the constraint.

--
greg

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#8)
Re: Functional dependencies and GROUP BY

Peter Eisentraut <peter_e@gmx.net> writes:

On tis, 2010-06-08 at 09:59 +0900, Hitoshi Harada wrote:

In addition, what if y is implicitly a constant? For example,

SELECT x, y FROM tab2 WHERE y = a AND a = 5 GROUP BY x;

Yes, as I said, my implementation is incomplete in the sense that it
only recognizes some functional dependencies. To recognize the sort of
thing you show, you would need some kind of complex deduction or proof
engine, and that doesn't seem worthwhile, at least for me, at this
point.

The question is why bother to recognize *any* cases of this form.
I find it really semantically ugly to have the parser effectively
doing one deduction of this form when the main engine for that type
of deduction is elsewhere; so unless there is a really good argument
why we have to do this case (and NOT "it was pretty easy"), I don't
want to do it.

As far as I recall, at least 99% of the user requests for this type
of behavior, maybe 100%, would be satisfied by recognizing the
group-by-primary-key case. So I think we should do that and be happy.

regards, tom lane

#14Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#13)
Re: Functional dependencies and GROUP BY

On 6/8/10 5:21 PM +0300, Tom Lane wrote:

Peter Eisentraut<peter_e@gmx.net> writes:

On tis, 2010-06-08 at 09:59 +0900, Hitoshi Harada wrote:

In addition, what if y is implicitly a constant? For example,

SELECT x, y FROM tab2 WHERE y = a AND a = 5 GROUP BY x;

Yes, as I said, my implementation is incomplete in the sense that it
only recognizes some functional dependencies. To recognize the sort of
thing you show, you would need some kind of complex deduction or proof
engine, and that doesn't seem worthwhile, at least for me, at this
point.

The question is why bother to recognize *any* cases of this form.
I find it really semantically ugly to have the parser effectively
doing one deduction of this form when the main engine for that type
of deduction is elsewhere; so unless there is a really good argument
why we have to do this case (and NOT "it was pretty easy"), I don't
want to do it.

As far as I recall, at least 99% of the user requests for this type
of behavior, maybe 100%, would be satisfied by recognizing the
group-by-primary-key case. So I think we should do that and be happy.

+1

Regards,
Marko Tiikkaja

#15Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#11)
Re: Functional dependencies and GROUP BY

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Perhaps the correct fix would be to mark stored query trees as having a
dependency on the index, so that dropping the index/constraint would
force a drop of the rule too. Just pushing the check to plan time, as
I suggested yesterday, isn't a very nice fix because it would result
in the rule unexpectedly starting to fail at execution.

Alternatively, we could rewrite the rule (not unlike what we do for
"SELECT *") to actually add on the other implicitly grouped-by columns..
I don't know if that's better or worse than creating a dependency,
since if the constraint were dropped/changed, people might expect the
rule's output to change. Of course, as you mention, the alternative
would really be for the rule to just start failing.. Still, if I wanted
to change the constraint, it'd be alot nicer to just be able to change
it and, presuming I'm just adding a column to it or doing some other
change which wouldn't invalidate the rule, not have to drop/recreate
the rule.

Thanks,

Stephen

#16Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#13)
Re: Functional dependencies and GROUP BY

On Tue, Jun 8, 2010 at 3:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The question is why bother to recognize *any* cases of this form.
I find it really semantically ugly to have the parser effectively
doing one deduction of this form when the main engine for that type
of deduction is elsewhere; so unless there is a really good argument
why we have to do this case (and NOT "it was pretty easy"), I don't
want to do it.

Well it does appear to be there:

4.18.11 Known functional dependencies in the result of a <where clause>
...
If AP is an equality AND-component of the <search condition> simply
contained in the <where clause> and one comparand of AP is a column
reference CR, and the other comparand of AP is a <literal>, then let
CRC be the counterpart of CR in R. {} {CRC} is a known functional
dependency in R, where {} denotes the empty set.

NOTE 43 — An SQL-implementation may also choose to recognize {}
{CRC} as a known functional dependency if the other comparand is a
deterministic expression containing no column references.
...

Since Peter's not eager to implement the whole section -- which does
seem pretty baroque -- it's up to us to draw the line where we stop
coding and declare it good enough. I think we're all agreed that
grouping by a pk is clearly the most important case. It may be
important to get some other cases just so that the PK property carries
through other clauses such as joins and group bys. But ultimately the
only thing stopping us from implementing the whole thing is our
threshold of pain for writing and maintaining the extra code.

--
greg

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#15)
Re: Functional dependencies and GROUP BY

Stephen Frost <sfrost@snowman.net> writes:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Perhaps the correct fix would be to mark stored query trees as having a
dependency on the index, so that dropping the index/constraint would
force a drop of the rule too.

Alternatively, we could rewrite the rule (not unlike what we do for
"SELECT *") to actually add on the other implicitly grouped-by columns..
I don't know if that's better or worse than creating a dependency,
since if the constraint were dropped/changed, people might expect the
rule's output to change.

Hm. The problem with that is that one of the benefits we'd like to get
from this is an efficiency win: the generated plan ought to only group
by the PK, not uselessly sort/group by everything in the row. I suppose
we could have the planner reverse-engineer its way to that, but it seems
awfully slow and clunky to add on the extra columns and then reason our
way to removing them again.

regards, tom lane

#18Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#17)
Re: Functional dependencies and GROUP BY

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Hm. The problem with that is that one of the benefits we'd like to get
from this is an efficiency win: the generated plan ought to only group
by the PK, not uselessly sort/group by everything in the row. I suppose
we could have the planner reverse-engineer its way to that, but it seems
awfully slow and clunky to add on the extra columns and then reason our
way to removing them again.

That's certainly a good point. Another issue that I realized when
thinking about this again- if someone wanted to *drop* a column that's
part of a PK (since it turned out to not be necessary, for example), and
then wanted to recreate the rule based on what was stored in the
catalog, they wouldn't be able to without modifying it, and that's
certainly be annoying too.

Guess my 2c would be for creating the dependency. I really dislike the
idea of the rule just all of a sudden breaking.

Thanks,

Stephen

#19Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#13)
Re: Functional dependencies and GROUP BY

On tis, 2010-06-08 at 10:21 -0400, Tom Lane wrote:

The question is why bother to recognize *any* cases of this form.
I find it really semantically ugly to have the parser effectively
doing one deduction of this form when the main engine for that type
of deduction is elsewhere; so unless there is a really good argument
why we have to do this case (and NOT "it was pretty easy"), I don't
want to do it.

Yeah, I'm not horribly attached to it. I began to structure the code to
support multiple kinds of checks, and at the end only two kinds were
reasonably doable and useful. We can remove it if no one is interested
in it, which appears to be the case.

#20Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#11)
Re: Functional dependencies and GROUP BY

On tis, 2010-06-08 at 10:05 -0400, Tom Lane wrote:

Perhaps the correct fix would be to mark stored query trees as having
a
dependency on the index, so that dropping the index/constraint would
force a drop of the rule too. Just pushing the check to plan time, as
I suggested yesterday, isn't a very nice fix because it would result
in the rule unexpectedly starting to fail at execution.

There are actually pretty explicit instructions about this in the SQL
standard:

<drop table constraint definition>

4) If QS is a <query specification> that contains an implicit or
explicit <group by clause> and that contains a column reference to a
column C in its <select list> that is not contained in an aggregated
argument of a <set function specification>, and if G is the set of
grouping columns of QS, and if the table constraint TC is needed to
conclude that G ↦ C is a known functional dependency in QS, then QS is
said to be dependent on TC.

5) If V is a view that contains a <query specification> that is
dependent on a table constraint TC, then V is said to be dependent on
TC.

So the dependency between the view/rule and the constraint/index needs
to be stored in the dependency system, and RESTRICT/CASCADE will take
effect.

#21Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#1)
#22Alex Hunsaker
badalex@gmail.com
In reply to: Peter Eisentraut (#21)
#23Peter Eisentraut
peter_e@gmx.net
In reply to: Alex Hunsaker (#22)
#24Alex Hunsaker
badalex@gmail.com
In reply to: Peter Eisentraut (#23)
#25Alex Hunsaker
badalex@gmail.com
In reply to: Alex Hunsaker (#22)
#26Alex Hunsaker
badalex@gmail.com
In reply to: Alex Hunsaker (#24)
#27Peter Eisentraut
peter_e@gmx.net
In reply to: Alex Hunsaker (#24)
#28Alex Hunsaker
badalex@gmail.com
In reply to: Peter Eisentraut (#27)
#29Alex Hunsaker
badalex@gmail.com
In reply to: Alex Hunsaker (#28)
#30Peter Eisentraut
peter_e@gmx.net
In reply to: Alex Hunsaker (#28)
#31Alex Hunsaker
badalex@gmail.com
In reply to: Peter Eisentraut (#30)
#32Peter Eisentraut
peter_e@gmx.net
In reply to: Alex Hunsaker (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#32)
#34Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#34)
#36Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#35)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#36)
#38Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#37)