GROUP BY DISTINCT
When combining multiple grouping items, such as rollups and cubes, the
resulting flattened grouping sets can contain duplicate items. The
standard provides for this by allowing GROUP BY DISTINCT to deduplicate
them prior to doing the actual work.
For example:
GROUP BY ROLLUP (a,b), ROLLUP (a,c)
expands to the sets:
(a,b,c), (a,b), (a,b), (a,c), (a), (a), (a,c), (a), ()
but:
GROUP BY DISTINCT ROLLUP (a,b), ROLLUP (a,c)
expands to just the sets:
(a,b,c), (a,b), (a,c), (a), ()
Attached is a patch to implement this for PostgreSQL.
--
Vik Fearing
Attachments:
0001-implement-GROUP-BY-DISTINCT.v01.patchtext/x-patch; charset=UTF-8; name=0001-implement-GROUP-BY-DISTINCT.v01.patchDownload+284-17
On 2021.02.21. 13:52 Vik Fearing <vik@postgresfriends.org> wrote:
Attached is a patch to implement this for PostgreSQL.
[]
The changed line that gets stuffed into sql_features is missing a terminal value (to fill the 'comments' column).
This line:
'+T434 GROUP BY DISTINCT YES'
(A tab at the end will do, I suppose; that's how I fixed the patch locally)
Erik Rijkers
On 2/21/21 3:06 PM, er@xs4all.nl wrote:
On 2021.02.21. 13:52 Vik Fearing <vik@postgresfriends.org> wrote:
Attached is a patch to implement this for PostgreSQL.
[]The changed line that gets stuffed into sql_features is missing a terminal value (to fill the 'comments' column).
This line:
'+T434 GROUP BY DISTINCT YES'(A tab at the end will do, I suppose; that's how I fixed the patch locally)
Argh. Fixed.
Thank you for looking at it!
--
Vik Fearing
Attachments:
0001-implement-GROUP-BY-DISTINCT.v02.patchtext/x-patch; charset=UTF-8; name=0001-implement-GROUP-BY-DISTINCT.v02.patchDownload+284-17
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested
Hi,
this is a useful feature, thank you for implementing. I gather that it follows the standard, if so,
then there are definitely no objections from me.
The patch in version 2, applies cleanly and passes all the tests.
It contains documentation which seems correct to a non native speaker.
As a minor gripe, I would note the addition of list_int_cmp.
The block
+ /* Sort each groupset individually */
+ foreach(cell, result)
+ list_sort(lfirst(cell), list_int_cmp);
Can follow suit from the rest of the code, and define a static cmp_list_int_asc(), as
indeed the same patch does for cmp_list_len_contents_asc.
This is indeed point of which I will not hold a too strong opinion.
Overall :+1: from me.
I will be bumping to 'Ready for Committer' unless objections.
On 3/2/21 4:06 PM, Georgios Kokolatos wrote:
As a minor gripe, I would note the addition of list_int_cmp.
The block+ /* Sort each groupset individually */ + foreach(cell, result) + list_sort(lfirst(cell), list_int_cmp);Can follow suit from the rest of the code, and define a static cmp_list_int_asc(), as
indeed the same patch does for cmp_list_len_contents_asc.
This is indeed point of which I will not hold a too strong opinion.
I did it this way because list_int_cmp is a general purpose function for
int lists that can be reused elsewhere in the future. Whereas
cmp_list_len_contents_asc is very specific to this case so I kept it local.
I'm happy to change it around if that's what consensus wants.
Overall :+1: from me.
Thanks for looking at it!
I will be bumping to 'Ready for Committer' unless objections.
In that case, here is another patch that fixes a typo in the docs
mentioned privately to me by Erik. The typo (and a gratuitous rebase)
is the only change to what you just reviewed.
--
Vik Fearing
Attachments:
0001-implement-GROUP-BY-DISTINCT.v03.patchtext/x-patch; charset=UTF-8; name=0001-implement-GROUP-BY-DISTINCT.v03.patchDownload+284-17
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, March 2, 2021 5:51 PM, Vik Fearing <vik@postgresfriends.org> wrote:
On 3/2/21 4:06 PM, Georgios Kokolatos wrote:
As a minor gripe, I would note the addition of list_int_cmp.
The block- /* Sort each groupset individually */
- foreach(cell, result)
- list_sort(lfirst(cell), list_int_cmp);
Can follow suit from the rest of the code, and define a static cmp_list_int_asc(), as
indeed the same patch does for cmp_list_len_contents_asc.
This is indeed point of which I will not hold a too strong opinion.I did it this way because list_int_cmp is a general purpose function for
int lists that can be reused elsewhere in the future. Whereas
cmp_list_len_contents_asc is very specific to this case so I kept it local.
Of course. I got the intention and I have noted my opinion.
I'm happy to change it around if that's what consensus wants.
As before, I will not hold a too strong opinion.
Overall :+1: from me.
Thanks for looking at it!
I will be bumping to 'Ready for Committer' unless objections.
In that case, here is another patch that fixes a typo in the docs
mentioned privately to me by Erik. The typo (and a gratuitous rebase)
is the only change to what you just reviewed.
Thank you. The typo was indistiguishable to me too.
My :+1: stands for version 3 of the patch. Updating status in the
commitfest to reflect that.
//Georgios -- https://www.vmware.com
Show quoted text
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Vik Fearing
Hi Vik,
The patch seems quite ready, I have just two comments.
1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.
2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
in order to stash it in the group lists is rather ugly, IMHO. It forces
all the places handling the list to be aware of this (there are not
many, but still ...). And there are no other places doing (bool) intVal
so it's not like there's a precedent for this.
I think the clean solution is to make group_clause produce a struct with
two fields, and just use that. Not sure how invasive that will be
outside gram.y, though.
Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
wonder if we can come up with some clearer names, describing the context
of those types.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
---------- Forwarded message ---------
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
Date: Fri, Mar 12, 2021 at 11:33 PM
Subject: Re: GROUP BY DISTINCT
To: Vik Fearing <vik@postgresfriends.org>, Georgios Kokolatos <
gkokolatos@protonmail.com>, <pgsql-hackers@lists.postgresql.org>
Cc: Erik Rijkers <er@xs4all.nl>
Hi Vik,
The patch seems quite ready, I have just two comments.
1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.
.....
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
After reading the above thread in hackers, I noticed that the index does
not point to aggrgeate functions either and DISTINCT is not mentioned in
the aggregate functions page either:
https://www.postgresql.org/docs/current/functions-aggregate.html
Shouldn't it be mentioned with an example of COUNT(DISTINCT ...) or
aggregate_function(DISTINCT ...) in general ?
Best regards
Pantelis Theodosiou
On 3/13/21 12:33 AM, Tomas Vondra wrote:
Hi Vik,
The patch seems quite ready, I have just two comments.
Thanks for taking a look.
1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.
Good idea; I never think about the index.
2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
in order to stash it in the group lists is rather ugly, IMHO. It forces
all the places handling the list to be aware of this (there are not
many, but still ...). And there are no other places doing (bool) intVal
so it's not like there's a precedent for this.
There is kind of a precedent for it, I was copying off of TriggerEvents
and func_alias_clause.
I think the clean solution is to make group_clause produce a struct with
two fields, and just use that. Not sure how invasive that will be
outside gram.y, though.
I didn't want to create a whole new parse node for it, but Andrew Gierth
pointed me towards SelectLimit so I did it like that and I agree it is
much cleaner.
Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
wonder if we can come up with some clearer names, describing the context
of those types.
I turned this into an enum for ALL/DISTINCT/default and the caller can
choose what it wants to do with default. I think that's a lot cleaner,
too. Maybe DISTINCT ON should be changed to fit in that? I left it
alone for now.
I also snuck in something that all of us overlooked which is outputting
the DISTINCT in ruleutils.c. I didn't add a test for it but that would
have been an unfortunate bug.
New patch attached, rebased on 15639d5e8f.
--
Vik Fearing
Attachments:
0001-implement-GROUP-BY-DISTINCT.v04.patchtext/x-patch; charset=UTF-8; name=0001-implement-GROUP-BY-DISTINCT.v04.patchDownload+333-28
On 3/16/21 9:21 AM, Vik Fearing wrote:
On 3/13/21 12:33 AM, Tomas Vondra wrote:
Hi Vik,
The patch seems quite ready, I have just two comments.
Thanks for taking a look.
1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.Good idea; I never think about the index.
2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
in order to stash it in the group lists is rather ugly, IMHO. It forces
all the places handling the list to be aware of this (there are not
many, but still ...). And there are no other places doing (bool) intVal
so it's not like there's a precedent for this.There is kind of a precedent for it, I was copying off of TriggerEvents
and func_alias_clause.
I see. I was looking for "(bool) intVal" but you're right TriggerEvents
code does something similar.
I think the clean solution is to make group_clause produce a struct with
two fields, and just use that. Not sure how invasive that will be
outside gram.y, though.I didn't want to create a whole new parse node for it, but Andrew Gierth
pointed me towards SelectLimit so I did it like that and I agree it is
much cleaner.
I agree, that's much cleaner.
Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
wonder if we can come up with some clearer names, describing the context
of those types.I turned this into an enum for ALL/DISTINCT/default and the caller can
choose what it wants to do with default. I think that's a lot cleaner,
too. Maybe DISTINCT ON should be changed to fit in that? I left it
alone for now.
I think DISTINCT ON is a different kind of animal, because that is a
list of expressions, not just a simple enum state.
I also snuck in something that all of us overlooked which is outputting
the DISTINCT in ruleutils.c. I didn't add a test for it but that would
have been an unfortunate bug.
Oh!
New patch attached, rebased on 15639d5e8f.
Thanks. At this point it seems fine to me, no further comments.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/16/21 3:52 PM, Tomas Vondra wrote:
On 3/16/21 9:21 AM, Vik Fearing wrote:
On 3/13/21 12:33 AM, Tomas Vondra wrote:
Hi Vik,
The patch seems quite ready, I have just two comments.
Thanks for taking a look.
1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.Good idea; I never think about the index.
2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
in order to stash it in the group lists is rather ugly, IMHO. It forces
all the places handling the list to be aware of this (there are not
many, but still ...). And there are no other places doing (bool) intVal
so it's not like there's a precedent for this.There is kind of a precedent for it, I was copying off of TriggerEvents
and func_alias_clause.I see. I was looking for "(bool) intVal" but you're right TriggerEvents
code does something similar.I think the clean solution is to make group_clause produce a struct with
two fields, and just use that. Not sure how invasive that will be
outside gram.y, though.I didn't want to create a whole new parse node for it, but Andrew Gierth
pointed me towards SelectLimit so I did it like that and I agree it is
much cleaner.I agree, that's much cleaner.
Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
wonder if we can come up with some clearer names, describing the context
of those types.I turned this into an enum for ALL/DISTINCT/default and the caller can
choose what it wants to do with default. I think that's a lot cleaner,
too. Maybe DISTINCT ON should be changed to fit in that? I left it
alone for now.I think DISTINCT ON is a different kind of animal, because that is a
list of expressions, not just a simple enum state.I also snuck in something that all of us overlooked which is outputting
the DISTINCT in ruleutils.c. I didn't add a test for it but that would
have been an unfortunate bug.Oh!
New patch attached, rebased on 15639d5e8f.
Thanks. At this point it seems fine to me, no further comments.
Pushed. Thanks for the patch.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi, I didn't think of including you in this suggestion.
Or the pdsql-docs was not the right list to post? I didn't want to mix it
with the GROUP BY DISTINCT patch.
Please check my suggestion.
Best regards
Pantelis Theodosiou
---------- Forwarded message ---------
From: Pantelis Theodosiou <ypercube@gmail.com>
Date: Sat, Mar 13, 2021 at 1:03 AM
Subject: Fwd: GROUP BY DISTINCT
To: <pgsql-docs@lists.postgresql.org>
---------- Forwarded message ---------
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
Date: Fri, Mar 12, 2021 at 11:33 PM
Subject: Re: GROUP BY DISTINCT
To: Vik Fearing <vik@postgresfriends.org>, Georgios Kokolatos <
gkokolatos@protonmail.com>, <pgsql-hackers@lists.postgresql.org>
Cc: Erik Rijkers <er@xs4all.nl>
Hi Vik,
The patch seems quite ready, I have just two comments.
1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.
.....
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
After reading the above thread in hackers, I noticed that the index does
not point to aggrgeate functions either and DISTINCT is not mentioned in
the aggregate functions page either:
https://www.postgresql.org/docs/current/functions-aggregate.html
Shouldn't it be mentioned with an example of COUNT(DISTINCT ...) or
aggregate_function(DISTINCT ...) in general ?
Best regards
Pantelis Theodosiou
Sorry, I'm not reading pgsql-docs very often, so I missed the post.
Yeah, we should probably add an indexterm to the other places too.
regards
On 3/18/21 7:03 PM, Pantelis Theodosiou wrote:
Hi, I didn't think of including you in this suggestion.
Or the pdsql-docs was not the right list to post? I didn't want to mix
it with the GROUP BY DISTINCT patch.Please check my suggestion.
Best regards
Pantelis Theodosiou---------- Forwarded message ---------
From: *Pantelis Theodosiou* <ypercube@gmail.com <mailto:ypercube@gmail.com>>
Date: Sat, Mar 13, 2021 at 1:03 AM
Subject: Fwd: GROUP BY DISTINCT
To: <pgsql-docs@lists.postgresql.org
<mailto:pgsql-docs@lists.postgresql.org>>---------- Forwarded message ---------
From: *Tomas Vondra* <tomas.vondra@enterprisedb.com
<mailto:tomas.vondra@enterprisedb.com>>
Date: Fri, Mar 12, 2021 at 11:33 PM
Subject: Re: GROUP BY DISTINCT
To: Vik Fearing <vik@postgresfriends.org
<mailto:vik@postgresfriends.org>>, Georgios Kokolatos
<gkokolatos@protonmail.com <mailto:gkokolatos@protonmail.com>>,
<pgsql-hackers@lists.postgresql.org
<mailto:pgsql-hackers@lists.postgresql.org>>
Cc: Erik Rijkers <er@xs4all.nl <mailto:er@xs4all.nl>>Hi Vik,
The patch seems quite ready, I have just two comments.
1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part......
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com <http://www.enterprisedb.com>
The Enterprise PostgreSQL CompanyAfter reading the above thread in hackers, I noticed that the index does
not point to aggrgeate functions either and DISTINCT is not mentioned in
the aggregate functions page
either: https://www.postgresql.org/docs/current/functions-aggregate.html
<https://www.postgresql.org/docs/current/functions-aggregate.html>
Shouldn't it be mentioned with an example of COUNT(DISTINCT ...) or
aggregate_function(DISTINCT ...) in general ?Best regards
Pantelis Theodosiou
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3/18/21 6:25 PM, Tomas Vondra wrote:
On 3/16/21 3:52 PM, Tomas Vondra wrote:
On 3/16/21 9:21 AM, Vik Fearing wrote:
On 3/13/21 12:33 AM, Tomas Vondra wrote:
Hi Vik,
The patch seems quite ready, I have just two comments.
Thanks for taking a look.
1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.Good idea; I never think about the index.
2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
in order to stash it in the group lists is rather ugly, IMHO. It forces
all the places handling the list to be aware of this (there are not
many, but still ...). And there are no other places doing (bool) intVal
so it's not like there's a precedent for this.There is kind of a precedent for it, I was copying off of TriggerEvents
and func_alias_clause.I see. I was looking for "(bool) intVal" but you're right TriggerEvents
code does something similar.I think the clean solution is to make group_clause produce a struct with
two fields, and just use that. Not sure how invasive that will be
outside gram.y, though.I didn't want to create a whole new parse node for it, but Andrew Gierth
pointed me towards SelectLimit so I did it like that and I agree it is
much cleaner.I agree, that's much cleaner.
Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
wonder if we can come up with some clearer names, describing the context
of those types.I turned this into an enum for ALL/DISTINCT/default and the caller can
choose what it wants to do with default. I think that's a lot cleaner,
too. Maybe DISTINCT ON should be changed to fit in that? I left it
alone for now.I think DISTINCT ON is a different kind of animal, because that is a
list of expressions, not just a simple enum state.I also snuck in something that all of us overlooked which is outputting
the DISTINCT in ruleutils.c. I didn't add a test for it but that would
have been an unfortunate bug.Oh!
New patch attached, rebased on 15639d5e8f.
Thanks. At this point it seems fine to me, no further comments.
Pushed. Thanks for the patch.
Hmmm, this seems to fail on lapwing with this error:
parse_agg.c: In function 'expand_grouping_sets':
parse_agg.c:1851:23: error: value computed is not used
[-Werror=unused-value]
cc1: all warnings being treated as errors
That line is this:
foreach_delete_current(result, cell);
and I don't see how any of the values close by could be unused ...
The only possibility I can think of is some sort of issue in the old-ish
gcc release (4.7.2).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 19, 2021 at 8:27 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Hmmm, this seems to fail on lapwing with this error:
parse_agg.c: In function 'expand_grouping_sets':
parse_agg.c:1851:23: error: value computed is not used
[-Werror=unused-value]
cc1: all warnings being treated as errorsThat line is this:
foreach_delete_current(result, cell);
and I don't see how any of the values close by could be unused ...
The only possibility I can think of is some sort of issue in the old-ish
gcc release (4.7.2).
No sure what's going on there, but data points: I tried a 32 bit build
here (that's the other special thing about lapwing) and didn't see the
warning. GCC 10. Also curculio (gcc 4.2) and snapper (gcc 4.7) are
also showing this warning, but they don't have -Werror so they don't
fail. sidewinder (gcc 4.8) is not showing the warning.
On 3/18/21 10:02 PM, Thomas Munro wrote:
On Fri, Mar 19, 2021 at 8:27 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:Hmmm, this seems to fail on lapwing with this error:
parse_agg.c: In function 'expand_grouping_sets':
parse_agg.c:1851:23: error: value computed is not used
[-Werror=unused-value]
cc1: all warnings being treated as errorsThat line is this:
foreach_delete_current(result, cell);
and I don't see how any of the values close by could be unused ...
The only possibility I can think of is some sort of issue in the old-ish
gcc release (4.7.2).No sure what's going on there, but data points: I tried a 32 bit build
here (that's the other special thing about lapwing) and didn't see the
warning. GCC 10. Also curculio (gcc 4.2) and snapper (gcc 4.7) are
also showing this warning, but they don't have -Werror so they don't
fail. sidewinder (gcc 4.8) is not showing the warning.
Thanks for the info. So it's likely related to older gcc releases. The
question is how to tweak the code to get rid of this ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 19, 2021 at 10:14 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
The only possibility I can think of is some sort of issue in the old-ish
gcc release (4.7.2).No sure what's going on there, but data points: I tried a 32 bit build
here (that's the other special thing about lapwing) and didn't see the
warning. GCC 10. Also curculio (gcc 4.2) and snapper (gcc 4.7) are
also showing this warning, but they don't have -Werror so they don't
fail. sidewinder (gcc 4.8) is not showing the warning.Thanks for the info. So it's likely related to older gcc releases. The
question is how to tweak the code to get rid of this ...
It's frustrating to have to do press-ups to fix a problem because a
zombie Debian 7 system is running with -Werror (though it's always
possible that it's telling us something interesting...). Anyway, I
think someone with a GCC < 4.8 compiler would have to investigate. I
was hoping to help, but none of my systems have one in easy-to-install
format...
Thomas Munro <thomas.munro@gmail.com> writes:
On Fri, Mar 19, 2021 at 10:14 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:Thanks for the info. So it's likely related to older gcc releases. The
question is how to tweak the code to get rid of this ...
It's frustrating to have to do press-ups to fix a problem because a
zombie Debian 7 system is running with -Werror (though it's always
possible that it's telling us something interesting...). Anyway, I
think someone with a GCC < 4.8 compiler would have to investigate. I
was hoping to help, but none of my systems have one in easy-to-install
format...
Hmm ... prairiedog isn't showing the warning, but maybe gaur will.
I can take a look if nobody else is stepping up.
regards, tom lane
I wrote:
Hmm ... prairiedog isn't showing the warning, but maybe gaur will.
Bingo:
parse_agg.c: In function 'expand_grouping_sets':
parse_agg.c:1851:5: warning: value computed is not used
This is gcc 4.5, but hopefully whatever shuts it up will also work on 4.7.
I'll work on figuring that out.
regards, tom lane
I wrote:
This is gcc 4.5, but hopefully whatever shuts it up will also work on 4.7.
I'll work on figuring that out.
Actually, the problem is pretty obvious after comparing this use
of foreach_delete_current() to every other one. I'm not sure why
the compiler warnings are phrased just as they are, but the fix
I just pushed does make 4.5 happy.
regards, tom lane