GROUP BY DISTINCT

Started by Vik Fearingalmost 5 years ago22 messages

vik@postgresfriends.org

almost 5 years ago

1 attachment(s)

When combining multiple grouping items, such as rollups and cubes, the
resulting flattened grouping sets can contain duplicate items. The
standard provides for this by allowing GROUP BY DISTINCT to deduplicate
them prior to doing the actual work.

For example:
GROUP BY ROLLUP (a,b), ROLLUP (a,c)

expands to the sets:
(a,b,c), (a,b), (a,b), (a,c), (a), (a), (a,c), (a), ()

but:
GROUP BY DISTINCT ROLLUP (a,b), ROLLUP (a,c)

expands to just the sets:
(a,b,c), (a,b), (a,c), (a), ()

Attached is a patch to implement this for PostgreSQL.
--
Vik Fearing

Attachments:

0001-implement-GROUP-BY-DISTINCT.v01.patchtext/x-patch; charset=UTF-8; name=0001-implement-GROUP-BY-DISTINCT.v01.patchDownload

From cfb7fc82dd4da1ea0b985ab4b909fac100acc130 Mon Sep 17 00:00:00 2001
From: Vik Fearing <vik@postgresfriends.org>
Date: Sun, 21 Feb 2021 10:26:57 +0100
Subject: [PATCH] implement GROUP BY DISTINCT

---
 doc/src/sgml/queries.sgml                  |  41 ++++++++
 doc/src/sgml/ref/select.sgml               |   9 +-
 src/backend/catalog/sql_features.txt       |   2 +-
 src/backend/nodes/copyfuncs.c              |   2 +
 src/backend/nodes/equalfuncs.c             |   2 +
 src/backend/nodes/list.c                   |  16 +++
 src/backend/nodes/outfuncs.c               |   2 +
 src/backend/nodes/readfuncs.c              |   1 +
 src/backend/optimizer/plan/planner.c       |   2 +-
 src/backend/parser/analyze.c               |   1 +
 src/backend/parser/gram.y                  |  22 ++--
 src/backend/parser/parse_agg.c             |  58 ++++++++++-
 src/include/nodes/parsenodes.h             |   2 +
 src/include/nodes/pg_list.h                |   1 +
 src/include/parser/parse_agg.h             |   2 +-
 src/test/regress/expected/groupingsets.out | 111 +++++++++++++++++++++
 src/test/regress/sql/groupingsets.sql      |  26 +++++
 17 files changed, 284 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml
index 4741506eb5..3c8facff87 100644
--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@@ -1372,6 +1372,47 @@ GROUP BY GROUPING SETS (
 </programlisting>
    </para>
 
+   <para>
+    When specifying multiple grouping items together, the final set of grouping
+    sets might contain duplicates. For example:
+<programlisting>
+GROUP BY ROLLUP (a, b), ROLLUP (a, c)
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUP BY GROUPING SETS (
+    (a, b, c),
+    (a, b),
+    (a, b),
+    (a, c),
+    (a),
+    (a),
+    (a, c),
+    (a),
+    ()
+)
+</programlisting>
+    If these duplicates are undesirable, they can be removed using the
+    <literal>DISTINCT</literal> clause directly on the <literal>GROUP BY</literal>.
+    Therefore:
+<programlisting>
+GROUP BY <emphasis>DISTINCT</emphasis> ROLLUP (a, b), ROLLUP (a, c)
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUP BY GROUPING SETS (
+    (a, b, c),
+    (a, b),
+    (a, c),
+    (a),
+    ()
+)
+</programlisting>
+    This is not the same as using <literal>SELECT DISTINCT</literal> because the output
+    rows may still contain duplicates.  If any of the ungrouped columns contains NULL,
+    it will be indistiguishable from the NULL used when that same column is grouped.
+   </para>
+
   <note>
    <para>
     The construct <literal>(a, b)</literal> is normally recognized in expressions as
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index eb8b524951..0e2b2bef49 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -37,7 +37,7 @@ SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replac
     [ * | <replaceable class="parameter">expression</replaceable> [ [ AS ] <replaceable class="parameter">output_name</replaceable> ] [, ...] ]
     [ FROM <replaceable class="parameter">from_item</replaceable> [, ...] ]
     [ WHERE <replaceable class="parameter">condition</replaceable> ]
-    [ GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...] ]
+    [ GROUP BY [ ALL | DISTINCT ] <replaceable class="parameter">grouping_element</replaceable> [, ...] ]
     [ HAVING <replaceable class="parameter">condition</replaceable> ]
     [ WINDOW <replaceable class="parameter">window_name</replaceable> AS ( <replaceable class="parameter">window_definition</replaceable> ) [, ...] ]
     [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] <replaceable class="parameter">select</replaceable> ]
@@ -776,7 +776,7 @@ WHERE <replaceable class="parameter">condition</replaceable>
    <para>
     The optional <literal>GROUP BY</literal> clause has the general form
 <synopsis>
-GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...]
+GROUP BY [ ALL | DISTINCT ] <replaceable class="parameter">grouping_element</replaceable> [, ...]
 </synopsis>
    </para>
 
@@ -800,7 +800,10 @@ GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...]
     independent <replaceable>grouping sets</replaceable>.  The effect of this is
     equivalent to constructing a <literal>UNION ALL</literal> between
     subqueries with the individual grouping sets as their
-    <literal>GROUP BY</literal> clauses.  For further details on the handling
+    <literal>GROUP BY</literal> clauses.  The optional <literal>DISTINCT</literal>
+    clause removes duplicate sets before processing; it does <emphasis>not</emphasis>
+    transform the <literal>UNION ALL</literal> into a <literal>UNION DISTINCT</literal>.
+    For further details on the handling
     of grouping sets see <xref linkend="queries-grouping-sets"/>.
    </para>
 
diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt
index a24387c1e7..1ab7ef1f86 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -481,7 +481,7 @@ T351	Bracketed SQL comments (/*...*/ comments)			YES
 T431	Extended grouping capabilities			YES	
 T432	Nested and concatenated GROUPING SETS			YES	
 T433	Multiargument GROUPING function			YES	
-T434	GROUP BY DISTINCT			NO	
+T434	GROUP BY DISTINCT			YES
 T441	ABS and MOD functions			YES	
 T461	Symmetric BETWEEN predicate			YES	
 T471	Result sets return value			NO	
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 65bbc18ecb..9be432ea7b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3113,6 +3113,7 @@ _copyQuery(const Query *from)
 	COPY_NODE_FIELD(onConflict);
 	COPY_NODE_FIELD(returningList);
 	COPY_NODE_FIELD(groupClause);
+	COPY_SCALAR_FIELD(groupDistinct);
 	COPY_NODE_FIELD(groupingSets);
 	COPY_NODE_FIELD(havingQual);
 	COPY_NODE_FIELD(windowClause);
@@ -3199,6 +3200,7 @@ _copySelectStmt(const SelectStmt *from)
 	COPY_NODE_FIELD(fromClause);
 	COPY_NODE_FIELD(whereClause);
 	COPY_NODE_FIELD(groupClause);
+	COPY_SCALAR_FIELD(groupDistinct);
 	COPY_NODE_FIELD(havingClause);
 	COPY_NODE_FIELD(windowClause);
 	COPY_NODE_FIELD(valuesLists);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index c2d73626fc..bc5e9e52fe 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -977,6 +977,7 @@ _equalQuery(const Query *a, const Query *b)
 	COMPARE_NODE_FIELD(onConflict);
 	COMPARE_NODE_FIELD(returningList);
 	COMPARE_NODE_FIELD(groupClause);
+	COMPARE_SCALAR_FIELD(groupDistinct);
 	COMPARE_NODE_FIELD(groupingSets);
 	COMPARE_NODE_FIELD(havingQual);
 	COMPARE_NODE_FIELD(windowClause);
@@ -1053,6 +1054,7 @@ _equalSelectStmt(const SelectStmt *a, const SelectStmt *b)
 	COMPARE_NODE_FIELD(fromClause);
 	COMPARE_NODE_FIELD(whereClause);
 	COMPARE_NODE_FIELD(groupClause);
+	COMPARE_SCALAR_FIELD(groupDistinct);
 	COMPARE_NODE_FIELD(havingClause);
 	COMPARE_NODE_FIELD(windowClause);
 	COMPARE_NODE_FIELD(valuesLists);
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index dbf6b30233..94fb236daf 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1506,6 +1506,22 @@ list_sort(List *list, list_sort_comparator cmp)
 		qsort(list->elements, len, sizeof(ListCell), (qsort_comparator) cmp);
 }
 
+/*
+ * list_sort comparator for sorting a list into ascending int order.
+ */
+int
+list_int_cmp(const ListCell *p1, const ListCell *p2)
+{
+	int			v1 = lfirst_int(p1);
+	int			v2 = lfirst_int(p2);
+
+	if (v1 < v2)
+		return -1;
+	if (v1 > v2)
+		return 1;
+	return 0;
+}
+
 /*
  * list_sort comparator for sorting a list into ascending OID order.
  */
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f5dcedf6e8..7c7ef697f9 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2758,6 +2758,7 @@ _outSelectStmt(StringInfo str, const SelectStmt *node)
 	WRITE_NODE_FIELD(fromClause);
 	WRITE_NODE_FIELD(whereClause);
 	WRITE_NODE_FIELD(groupClause);
+	WRITE_BOOL_FIELD(groupDistinct);
 	WRITE_NODE_FIELD(havingClause);
 	WRITE_NODE_FIELD(windowClause);
 	WRITE_NODE_FIELD(valuesLists);
@@ -2983,6 +2984,7 @@ _outQuery(StringInfo str, const Query *node)
 	WRITE_NODE_FIELD(onConflict);
 	WRITE_NODE_FIELD(returningList);
 	WRITE_NODE_FIELD(groupClause);
+	WRITE_BOOL_FIELD(groupDistinct);
 	WRITE_NODE_FIELD(groupingSets);
 	WRITE_NODE_FIELD(havingQual);
 	WRITE_NODE_FIELD(windowClause);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 4388aae71d..0bc9429d67 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -271,6 +271,7 @@ _readQuery(void)
 	READ_NODE_FIELD(onConflict);
 	READ_NODE_FIELD(returningList);
 	READ_NODE_FIELD(groupClause);
+	READ_BOOL_FIELD(groupDistinct);
 	READ_NODE_FIELD(groupingSets);
 	READ_NODE_FIELD(havingQual);
 	READ_NODE_FIELD(windowClause);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index adf68d8790..66bad90f60 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2427,7 +2427,7 @@ preprocess_grouping_sets(PlannerInfo *root)
 	ListCell   *lc_set;
 	grouping_sets_data *gd = palloc0(sizeof(grouping_sets_data));
 
-	parse->groupingSets = expand_grouping_sets(parse->groupingSets, -1);
+	parse->groupingSets = expand_grouping_sets(parse->groupingSets, parse->groupDistinct, -1);
 
 	gd->any_hashable = false;
 	gd->unhashable_refs = NULL;
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 0f3a70c49a..7149724953 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -1264,6 +1264,7 @@ transformSelectStmt(ParseState *pstate, SelectStmt *stmt)
 											qry->sortClause,
 											EXPR_KIND_GROUP_BY,
 											false /* allow SQL92 rules */ );
+	qry->groupDistinct = stmt->groupDistinct;
 
 	if (stmt->distinctClause == NIL)
 	{
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dd72a9fc3c..66e5401684 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -443,7 +443,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>	for_locking_item
 %type <list>	for_locking_clause opt_for_locking_clause for_locking_items
 %type <list>	locked_rels_list
-%type <boolean>	all_or_distinct
+%type <boolean>	all_or_distinct distinct_or_all
 
 %type <node>	join_qual
 %type <jtype>	join_type
@@ -11294,7 +11294,8 @@ simple_select:
 					n->intoClause = $4;
 					n->fromClause = $5;
 					n->whereClause = $6;
-					n->groupClause = $7;
+					n->groupClause = linitial($7);
+					n->groupDistinct = (bool) intVal(lsecond($7));
 					n->havingClause = $8;
 					n->windowClause = $9;
 					$$ = (Node *)n;
@@ -11309,7 +11310,8 @@ simple_select:
 					n->intoClause = $4;
 					n->fromClause = $5;
 					n->whereClause = $6;
-					n->groupClause = $7;
+					n->groupClause = linitial($7);
+					n->groupDistinct = (bool) intVal(lsecond($7));
 					n->havingClause = $8;
 					n->windowClause = $9;
 					$$ = (Node *)n;
@@ -11531,12 +11533,19 @@ opt_table:	TABLE
 			| /*EMPTY*/
 		;
 
+/* The difference between these two is the default */
 all_or_distinct:
 			ALL										{ $$ = true; }
 			| DISTINCT								{ $$ = false; }
 			| /*EMPTY*/								{ $$ = false; }
 		;
 
+distinct_or_all:
+			DISTINCT								{ $$ = true; }
+			| ALL									{ $$ = false; }
+			| /*EMPTY*/								{ $$ = false; }
+		;
+
 /* We use (NIL) as a placeholder to indicate that all target expressions
  * should be placed in the DISTINCT list during parsetree analysis.
  */
@@ -11760,8 +11769,8 @@ first_or_next: FIRST_P								{ $$ = 0; }
  * GroupingSet node of some type.
  */
 group_clause:
-			GROUP_P BY group_by_list				{ $$ = $3; }
-			| /*EMPTY*/								{ $$ = NIL; }
+			GROUP_P BY distinct_or_all group_by_list	{ $$ = list_make2($4, makeInteger((int) $3)); }
+			| /*EMPTY*/									{ $$ = list_make2(NIL, makeInteger((int) false)); }
 		;
 
 group_by_list:
@@ -15134,7 +15143,8 @@ PLpgSQL_Expr: opt_distinct_clause opt_target_list
 					n->targetList = $2;
 					n->fromClause = $3;
 					n->whereClause = $4;
-					n->groupClause = $5;
+					n->groupClause = linitial($5);
+					n->groupDistinct = (bool) intVal(lsecond($5));
 					n->havingClause = $6;
 					n->windowClause = $7;
 					n->sortClause = $8;
diff --git a/src/backend/parser/parse_agg.c b/src/backend/parser/parse_agg.c
index fd08b9eeff..899327aaf4 100644
--- a/src/backend/parser/parse_agg.c
+++ b/src/backend/parser/parse_agg.c
@@ -1071,7 +1071,7 @@ parseCheckAggregates(ParseState *pstate, Query *qry)
 		 * The limit of 4096 is arbitrary and exists simply to avoid resource
 		 * issues from pathological constructs.
 		 */
-		List	   *gsets = expand_grouping_sets(qry->groupingSets, 4096);
+		List	   *gsets = expand_grouping_sets(qry->groupingSets, qry->groupDistinct, 4096);
 
 		if (!gsets)
 			ereport(ERROR,
@@ -1735,6 +1735,33 @@ cmp_list_len_asc(const ListCell *a, const ListCell *b)
 	return (la > lb) ? 1 : (la == lb) ? 0 : -1;
 }
 
+/* list_sort comparator to sort sub-lists by length and contents */
+static int
+cmp_list_len_contents_asc(const ListCell *a, const ListCell *b)
+{
+	int		res = cmp_list_len_asc(a, b);
+
+	if (res == 0)
+	{
+		List		   *la = (List *) lfirst(a);
+		List		   *lb = (List *) lfirst(b);
+		ListCell	   *lca;
+		ListCell	   *lcb;
+
+		forboth(lca, la, lcb, lb)
+		{
+			int		va = intVal(lca);
+			int		vb = intVal(lcb);
+			if (va > vb)
+				return 1;
+			if (va < vb)
+				return -1;
+		}
+	}
+
+	return res;
+}
+
 /*
  * Expand a groupingSets clause to a flat list of grouping sets.
  * The returned list is sorted by length, shortest sets first.
@@ -1743,7 +1770,7 @@ cmp_list_len_asc(const ListCell *a, const ListCell *b)
  * some consistency checks.
  */
 List *
-expand_grouping_sets(List *groupingSets, int limit)
+expand_grouping_sets(List *groupingSets, bool groupDistinct, int limit)
 {
 	List	   *expanded_groups = NIL;
 	List	   *result = NIL;
@@ -1801,8 +1828,31 @@ expand_grouping_sets(List *groupingSets, int limit)
 		result = new_result;
 	}
 
-	/* Now sort the lists by length */
-	list_sort(result, cmp_list_len_asc);
+	/* Now sort the lists by length and deduplicate if necessary */
+	if (!groupDistinct || list_length(result) < 2)
+		list_sort(result, cmp_list_len_asc);
+	else
+	{
+		ListCell   *cell;
+		List	   *prev;
+
+		/* Sort each groupset individually */
+		foreach(cell, result)
+			list_sort(lfirst(cell), list_int_cmp);
+
+		/* Now sort the list of groupsets by length and contents */
+		list_sort(result, cmp_list_len_contents_asc);
+
+		/* Finally, remove duplicates */
+		prev = list_nth_node(List, result, 0);
+		for_each_from(cell, result, 1)
+		{
+			if (equal(lfirst(cell), prev))
+				foreach_delete_current(result, cell);
+			else
+				prev = lfirst(cell);
+		}
+	}
 
 	return result;
 }
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 236832a2ca..5af9f6bd9d 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -146,6 +146,7 @@ typedef struct Query
 	List	   *returningList;	/* return-values list (of TargetEntry) */
 
 	List	   *groupClause;	/* a list of SortGroupClause's */
+	bool		groupDistinct;	/* is the group by clause distinct? */
 
 	List	   *groupingSets;	/* a list of GroupingSet's if present */
 
@@ -1629,6 +1630,7 @@ typedef struct SelectStmt
 	List	   *fromClause;		/* the FROM clause */
 	Node	   *whereClause;	/* WHERE qualification */
 	List	   *groupClause;	/* GROUP BY clauses */
+	bool		groupDistinct;	/* Is this GROUP BY DISTINCT? */
 	Node	   *havingClause;	/* HAVING conditional-expression */
 	List	   *windowClause;	/* WINDOW window_name AS (...), ... */
 
diff --git a/src/include/nodes/pg_list.h b/src/include/nodes/pg_list.h
index 404e03f132..30f98c4595 100644
--- a/src/include/nodes/pg_list.h
+++ b/src/include/nodes/pg_list.h
@@ -604,6 +604,7 @@ extern pg_nodiscard List *list_copy_deep(const List *oldlist);
 typedef int (*list_sort_comparator) (const ListCell *a, const ListCell *b);
 extern void list_sort(List *list, list_sort_comparator cmp);
 
+extern int	list_int_cmp(const ListCell *p1, const ListCell *p2);
 extern int	list_oid_cmp(const ListCell *p1, const ListCell *p2);
 
 #endif							/* PG_LIST_H */
diff --git a/src/include/parser/parse_agg.h b/src/include/parser/parse_agg.h
index 0a2546c3ea..4dea01752a 100644
--- a/src/include/parser/parse_agg.h
+++ b/src/include/parser/parse_agg.h
@@ -26,7 +26,7 @@ extern void transformWindowFuncCall(ParseState *pstate, WindowFunc *wfunc,
 
 extern void parseCheckAggregates(ParseState *pstate, Query *qry);
 
-extern List *expand_grouping_sets(List *groupingSets, int limit);
+extern List *expand_grouping_sets(List *groupingSets, bool groupDistinct, int limit);
 
 extern int	get_aggregate_argtypes(Aggref *aggref, Oid *inputTypes);
 
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index 7c844c6e09..4c467c1b15 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1929,4 +1929,115 @@ set work_mem to default;
 
 drop table gs_group_1;
 drop table gs_hash_1;
+-- GROUP BY DISTINCT
+-- "normal" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by all rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   | 3
+ 1 |   |  
+ 1 |   |  
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   | 9
+ 7 |   |  
+ 7 |   |  
+ 7 |   |  
+   |   |  
+(25 rows)
+
+-- ...which is also the default
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   | 3
+ 1 |   |  
+ 1 |   |  
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   | 9
+ 7 |   |  
+ 7 |   |  
+ 7 |   |  
+   |   |  
+(25 rows)
+
+-- "group by distinct" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by distinct rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   |  
+   |   |  
+(13 rows)
+
+-- ...which is not the same as "select distinct"
+select distinct a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   |  
+ 4 |   | 6
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   |  
+   |   |  
+(11 rows)
+
 -- end
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 18ae803e9d..3944944704 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -529,4 +529,30 @@ set work_mem to default;
 drop table gs_group_1;
 drop table gs_hash_1;
 
+-- GROUP BY DISTINCT
+
+-- "normal" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by all rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- ...which is also the default
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- "group by distinct" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by distinct rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- ...which is not the same as "select distinct"
+select distinct a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+
 -- end
-- 
2.25.1

Noname

er@xs4all.nl

almost 5 years ago

In reply to: Vik Fearing (#1)

Re: GROUP BY DISTINCT

On 2021.02.21. 13:52 Vik Fearing <vik@postgresfriends.org> wrote:

Attached is a patch to implement this for PostgreSQL.
[]

The changed line that gets stuffed into sql_features is missing a terminal value (to fill the 'comments' column).
This line:
'+T434 GROUP BY DISTINCT YES'

(A tab at the end will do, I suppose; that's how I fixed the patch locally)

Erik Rijkers

Vik Fearing

vik@postgresfriends.org

almost 5 years ago

In reply to: Noname (#2)

1 attachment(s)

Re: GROUP BY DISTINCT

On 2/21/21 3:06 PM, er@xs4all.nl wrote:

On 2021.02.21. 13:52 Vik Fearing <vik@postgresfriends.org> wrote:

Attached is a patch to implement this for PostgreSQL.
[]

The changed line that gets stuffed into sql_features is missing a terminal value (to fill the 'comments' column).
This line:
'+T434 GROUP BY DISTINCT YES'

(A tab at the end will do, I suppose; that's how I fixed the patch locally)

Argh. Fixed.

Thank you for looking at it!
--
Vik Fearing

Attachments:

0001-implement-GROUP-BY-DISTINCT.v02.patchtext/x-patch; charset=UTF-8; name=0001-implement-GROUP-BY-DISTINCT.v02.patchDownload

From d5587ece129fbaf4309ddf48d44b9d0249d9cf56 Mon Sep 17 00:00:00 2001
From: Vik Fearing <vik@postgresfriends.org>
Date: Sun, 21 Feb 2021 10:26:57 +0100
Subject: [PATCH] implement GROUP BY DISTINCT

---
 doc/src/sgml/queries.sgml                  |  41 ++++++++
 doc/src/sgml/ref/select.sgml               |   9 +-
 src/backend/catalog/sql_features.txt       |   2 +-
 src/backend/nodes/copyfuncs.c              |   2 +
 src/backend/nodes/equalfuncs.c             |   2 +
 src/backend/nodes/list.c                   |  16 +++
 src/backend/nodes/outfuncs.c               |   2 +
 src/backend/nodes/readfuncs.c              |   1 +
 src/backend/optimizer/plan/planner.c       |   2 +-
 src/backend/parser/analyze.c               |   1 +
 src/backend/parser/gram.y                  |  22 ++--
 src/backend/parser/parse_agg.c             |  58 ++++++++++-
 src/include/nodes/parsenodes.h             |   2 +
 src/include/nodes/pg_list.h                |   1 +
 src/include/parser/parse_agg.h             |   2 +-
 src/test/regress/expected/groupingsets.out | 111 +++++++++++++++++++++
 src/test/regress/sql/groupingsets.sql      |  26 +++++
 17 files changed, 284 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml
index 4741506eb5..3c8facff87 100644
--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@@ -1372,6 +1372,47 @@ GROUP BY GROUPING SETS (
 </programlisting>
    </para>
 
+   <para>
+    When specifying multiple grouping items together, the final set of grouping
+    sets might contain duplicates. For example:
+<programlisting>
+GROUP BY ROLLUP (a, b), ROLLUP (a, c)
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUP BY GROUPING SETS (
+    (a, b, c),
+    (a, b),
+    (a, b),
+    (a, c),
+    (a),
+    (a),
+    (a, c),
+    (a),
+    ()
+)
+</programlisting>
+    If these duplicates are undesirable, they can be removed using the
+    <literal>DISTINCT</literal> clause directly on the <literal>GROUP BY</literal>.
+    Therefore:
+<programlisting>
+GROUP BY <emphasis>DISTINCT</emphasis> ROLLUP (a, b), ROLLUP (a, c)
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUP BY GROUPING SETS (
+    (a, b, c),
+    (a, b),
+    (a, c),
+    (a),
+    ()
+)
+</programlisting>
+    This is not the same as using <literal>SELECT DISTINCT</literal> because the output
+    rows may still contain duplicates.  If any of the ungrouped columns contains NULL,
+    it will be indistiguishable from the NULL used when that same column is grouped.
+   </para>
+
   <note>
    <para>
     The construct <literal>(a, b)</literal> is normally recognized in expressions as
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index eb8b524951..0e2b2bef49 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -37,7 +37,7 @@ SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replac
     [ * | <replaceable class="parameter">expression</replaceable> [ [ AS ] <replaceable class="parameter">output_name</replaceable> ] [, ...] ]
     [ FROM <replaceable class="parameter">from_item</replaceable> [, ...] ]
     [ WHERE <replaceable class="parameter">condition</replaceable> ]
-    [ GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...] ]
+    [ GROUP BY [ ALL | DISTINCT ] <replaceable class="parameter">grouping_element</replaceable> [, ...] ]
     [ HAVING <replaceable class="parameter">condition</replaceable> ]
     [ WINDOW <replaceable class="parameter">window_name</replaceable> AS ( <replaceable class="parameter">window_definition</replaceable> ) [, ...] ]
     [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] <replaceable class="parameter">select</replaceable> ]
@@ -776,7 +776,7 @@ WHERE <replaceable class="parameter">condition</replaceable>
    <para>
     The optional <literal>GROUP BY</literal> clause has the general form
 <synopsis>
-GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...]
+GROUP BY [ ALL | DISTINCT ] <replaceable class="parameter">grouping_element</replaceable> [, ...]
 </synopsis>
    </para>
 
@@ -800,7 +800,10 @@ GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...]
     independent <replaceable>grouping sets</replaceable>.  The effect of this is
     equivalent to constructing a <literal>UNION ALL</literal> between
     subqueries with the individual grouping sets as their
-    <literal>GROUP BY</literal> clauses.  For further details on the handling
+    <literal>GROUP BY</literal> clauses.  The optional <literal>DISTINCT</literal>
+    clause removes duplicate sets before processing; it does <emphasis>not</emphasis>
+    transform the <literal>UNION ALL</literal> into a <literal>UNION DISTINCT</literal>.
+    For further details on the handling
     of grouping sets see <xref linkend="queries-grouping-sets"/>.
    </para>
 
diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt
index a24387c1e7..2e111ba755 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -481,7 +481,7 @@ T351	Bracketed SQL comments (/*...*/ comments)			YES
 T431	Extended grouping capabilities			YES	
 T432	Nested and concatenated GROUPING SETS			YES	
 T433	Multiargument GROUPING function			YES	
-T434	GROUP BY DISTINCT			NO	
+T434	GROUP BY DISTINCT			YES	
 T441	ABS and MOD functions			YES	
 T461	Symmetric BETWEEN predicate			YES	
 T471	Result sets return value			NO	
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 65bbc18ecb..9be432ea7b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3113,6 +3113,7 @@ _copyQuery(const Query *from)
 	COPY_NODE_FIELD(onConflict);
 	COPY_NODE_FIELD(returningList);
 	COPY_NODE_FIELD(groupClause);
+	COPY_SCALAR_FIELD(groupDistinct);
 	COPY_NODE_FIELD(groupingSets);
 	COPY_NODE_FIELD(havingQual);
 	COPY_NODE_FIELD(windowClause);
@@ -3199,6 +3200,7 @@ _copySelectStmt(const SelectStmt *from)
 	COPY_NODE_FIELD(fromClause);
 	COPY_NODE_FIELD(whereClause);
 	COPY_NODE_FIELD(groupClause);
+	COPY_SCALAR_FIELD(groupDistinct);
 	COPY_NODE_FIELD(havingClause);
 	COPY_NODE_FIELD(windowClause);
 	COPY_NODE_FIELD(valuesLists);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index c2d73626fc..bc5e9e52fe 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -977,6 +977,7 @@ _equalQuery(const Query *a, const Query *b)
 	COMPARE_NODE_FIELD(onConflict);
 	COMPARE_NODE_FIELD(returningList);
 	COMPARE_NODE_FIELD(groupClause);
+	COMPARE_SCALAR_FIELD(groupDistinct);
 	COMPARE_NODE_FIELD(groupingSets);
 	COMPARE_NODE_FIELD(havingQual);
 	COMPARE_NODE_FIELD(windowClause);
@@ -1053,6 +1054,7 @@ _equalSelectStmt(const SelectStmt *a, const SelectStmt *b)
 	COMPARE_NODE_FIELD(fromClause);
 	COMPARE_NODE_FIELD(whereClause);
 	COMPARE_NODE_FIELD(groupClause);
+	COMPARE_SCALAR_FIELD(groupDistinct);
 	COMPARE_NODE_FIELD(havingClause);
 	COMPARE_NODE_FIELD(windowClause);
 	COMPARE_NODE_FIELD(valuesLists);
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index dbf6b30233..94fb236daf 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1506,6 +1506,22 @@ list_sort(List *list, list_sort_comparator cmp)
 		qsort(list->elements, len, sizeof(ListCell), (qsort_comparator) cmp);
 }
 
+/*
+ * list_sort comparator for sorting a list into ascending int order.
+ */
+int
+list_int_cmp(const ListCell *p1, const ListCell *p2)
+{
+	int			v1 = lfirst_int(p1);
+	int			v2 = lfirst_int(p2);
+
+	if (v1 < v2)
+		return -1;
+	if (v1 > v2)
+		return 1;
+	return 0;
+}
+
 /*
  * list_sort comparator for sorting a list into ascending OID order.
  */
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index f5dcedf6e8..7c7ef697f9 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2758,6 +2758,7 @@ _outSelectStmt(StringInfo str, const SelectStmt *node)
 	WRITE_NODE_FIELD(fromClause);
 	WRITE_NODE_FIELD(whereClause);
 	WRITE_NODE_FIELD(groupClause);
+	WRITE_BOOL_FIELD(groupDistinct);
 	WRITE_NODE_FIELD(havingClause);
 	WRITE_NODE_FIELD(windowClause);
 	WRITE_NODE_FIELD(valuesLists);
@@ -2983,6 +2984,7 @@ _outQuery(StringInfo str, const Query *node)
 	WRITE_NODE_FIELD(onConflict);
 	WRITE_NODE_FIELD(returningList);
 	WRITE_NODE_FIELD(groupClause);
+	WRITE_BOOL_FIELD(groupDistinct);
 	WRITE_NODE_FIELD(groupingSets);
 	WRITE_NODE_FIELD(havingQual);
 	WRITE_NODE_FIELD(windowClause);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 4388aae71d..0bc9429d67 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -271,6 +271,7 @@ _readQuery(void)
 	READ_NODE_FIELD(onConflict);
 	READ_NODE_FIELD(returningList);
 	READ_NODE_FIELD(groupClause);
+	READ_BOOL_FIELD(groupDistinct);
 	READ_NODE_FIELD(groupingSets);
 	READ_NODE_FIELD(havingQual);
 	READ_NODE_FIELD(windowClause);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index adf68d8790..66bad90f60 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2427,7 +2427,7 @@ preprocess_grouping_sets(PlannerInfo *root)
 	ListCell   *lc_set;
 	grouping_sets_data *gd = palloc0(sizeof(grouping_sets_data));
 
-	parse->groupingSets = expand_grouping_sets(parse->groupingSets, -1);
+	parse->groupingSets = expand_grouping_sets(parse->groupingSets, parse->groupDistinct, -1);
 
 	gd->any_hashable = false;
 	gd->unhashable_refs = NULL;
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 0f3a70c49a..7149724953 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -1264,6 +1264,7 @@ transformSelectStmt(ParseState *pstate, SelectStmt *stmt)
 											qry->sortClause,
 											EXPR_KIND_GROUP_BY,
 											false /* allow SQL92 rules */ );
+	qry->groupDistinct = stmt->groupDistinct;
 
 	if (stmt->distinctClause == NIL)
 	{
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dd72a9fc3c..66e5401684 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -443,7 +443,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>	for_locking_item
 %type <list>	for_locking_clause opt_for_locking_clause for_locking_items
 %type <list>	locked_rels_list
-%type <boolean>	all_or_distinct
+%type <boolean>	all_or_distinct distinct_or_all
 
 %type <node>	join_qual
 %type <jtype>	join_type
@@ -11294,7 +11294,8 @@ simple_select:
 					n->intoClause = $4;
 					n->fromClause = $5;
 					n->whereClause = $6;
-					n->groupClause = $7;
+					n->groupClause = linitial($7);
+					n->groupDistinct = (bool) intVal(lsecond($7));
 					n->havingClause = $8;
 					n->windowClause = $9;
 					$$ = (Node *)n;
@@ -11309,7 +11310,8 @@ simple_select:
 					n->intoClause = $4;
 					n->fromClause = $5;
 					n->whereClause = $6;
-					n->groupClause = $7;
+					n->groupClause = linitial($7);
+					n->groupDistinct = (bool) intVal(lsecond($7));
 					n->havingClause = $8;
 					n->windowClause = $9;
 					$$ = (Node *)n;
@@ -11531,12 +11533,19 @@ opt_table:	TABLE
 			| /*EMPTY*/
 		;
 
+/* The difference between these two is the default */
 all_or_distinct:
 			ALL										{ $$ = true; }
 			| DISTINCT								{ $$ = false; }
 			| /*EMPTY*/								{ $$ = false; }
 		;
 
+distinct_or_all:
+			DISTINCT								{ $$ = true; }
+			| ALL									{ $$ = false; }
+			| /*EMPTY*/								{ $$ = false; }
+		;
+
 /* We use (NIL) as a placeholder to indicate that all target expressions
  * should be placed in the DISTINCT list during parsetree analysis.
  */
@@ -11760,8 +11769,8 @@ first_or_next: FIRST_P								{ $$ = 0; }
  * GroupingSet node of some type.
  */
 group_clause:
-			GROUP_P BY group_by_list				{ $$ = $3; }
-			| /*EMPTY*/								{ $$ = NIL; }
+			GROUP_P BY distinct_or_all group_by_list	{ $$ = list_make2($4, makeInteger((int) $3)); }
+			| /*EMPTY*/									{ $$ = list_make2(NIL, makeInteger((int) false)); }
 		;
 
 group_by_list:
@@ -15134,7 +15143,8 @@ PLpgSQL_Expr: opt_distinct_clause opt_target_list
 					n->targetList = $2;
 					n->fromClause = $3;
 					n->whereClause = $4;
-					n->groupClause = $5;
+					n->groupClause = linitial($5);
+					n->groupDistinct = (bool) intVal(lsecond($5));
 					n->havingClause = $6;
 					n->windowClause = $7;
 					n->sortClause = $8;
diff --git a/src/backend/parser/parse_agg.c b/src/backend/parser/parse_agg.c
index fd08b9eeff..899327aaf4 100644
--- a/src/backend/parser/parse_agg.c
+++ b/src/backend/parser/parse_agg.c
@@ -1071,7 +1071,7 @@ parseCheckAggregates(ParseState *pstate, Query *qry)
 		 * The limit of 4096 is arbitrary and exists simply to avoid resource
 		 * issues from pathological constructs.
 		 */
-		List	   *gsets = expand_grouping_sets(qry->groupingSets, 4096);
+		List	   *gsets = expand_grouping_sets(qry->groupingSets, qry->groupDistinct, 4096);
 
 		if (!gsets)
 			ereport(ERROR,
@@ -1735,6 +1735,33 @@ cmp_list_len_asc(const ListCell *a, const ListCell *b)
 	return (la > lb) ? 1 : (la == lb) ? 0 : -1;
 }
 
+/* list_sort comparator to sort sub-lists by length and contents */
+static int
+cmp_list_len_contents_asc(const ListCell *a, const ListCell *b)
+{
+	int		res = cmp_list_len_asc(a, b);
+
+	if (res == 0)
+	{
+		List		   *la = (List *) lfirst(a);
+		List		   *lb = (List *) lfirst(b);
+		ListCell	   *lca;
+		ListCell	   *lcb;
+
+		forboth(lca, la, lcb, lb)
+		{
+			int		va = intVal(lca);
+			int		vb = intVal(lcb);
+			if (va > vb)
+				return 1;
+			if (va < vb)
+				return -1;
+		}
+	}
+
+	return res;
+}
+
 /*
  * Expand a groupingSets clause to a flat list of grouping sets.
  * The returned list is sorted by length, shortest sets first.
@@ -1743,7 +1770,7 @@ cmp_list_len_asc(const ListCell *a, const ListCell *b)
  * some consistency checks.
  */
 List *
-expand_grouping_sets(List *groupingSets, int limit)
+expand_grouping_sets(List *groupingSets, bool groupDistinct, int limit)
 {
 	List	   *expanded_groups = NIL;
 	List	   *result = NIL;
@@ -1801,8 +1828,31 @@ expand_grouping_sets(List *groupingSets, int limit)
 		result = new_result;
 	}
 
-	/* Now sort the lists by length */
-	list_sort(result, cmp_list_len_asc);
+	/* Now sort the lists by length and deduplicate if necessary */
+	if (!groupDistinct || list_length(result) < 2)
+		list_sort(result, cmp_list_len_asc);
+	else
+	{
+		ListCell   *cell;
+		List	   *prev;
+
+		/* Sort each groupset individually */
+		foreach(cell, result)
+			list_sort(lfirst(cell), list_int_cmp);
+
+		/* Now sort the list of groupsets by length and contents */
+		list_sort(result, cmp_list_len_contents_asc);
+
+		/* Finally, remove duplicates */
+		prev = list_nth_node(List, result, 0);
+		for_each_from(cell, result, 1)
+		{
+			if (equal(lfirst(cell), prev))
+				foreach_delete_current(result, cell);
+			else
+				prev = lfirst(cell);
+		}
+	}
 
 	return result;
 }
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 236832a2ca..5af9f6bd9d 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -146,6 +146,7 @@ typedef struct Query
 	List	   *returningList;	/* return-values list (of TargetEntry) */
 
 	List	   *groupClause;	/* a list of SortGroupClause's */
+	bool		groupDistinct;	/* is the group by clause distinct? */
 
 	List	   *groupingSets;	/* a list of GroupingSet's if present */
 
@@ -1629,6 +1630,7 @@ typedef struct SelectStmt
 	List	   *fromClause;		/* the FROM clause */
 	Node	   *whereClause;	/* WHERE qualification */
 	List	   *groupClause;	/* GROUP BY clauses */
+	bool		groupDistinct;	/* Is this GROUP BY DISTINCT? */
 	Node	   *havingClause;	/* HAVING conditional-expression */
 	List	   *windowClause;	/* WINDOW window_name AS (...), ... */
 
diff --git a/src/include/nodes/pg_list.h b/src/include/nodes/pg_list.h
index 404e03f132..30f98c4595 100644
--- a/src/include/nodes/pg_list.h
+++ b/src/include/nodes/pg_list.h
@@ -604,6 +604,7 @@ extern pg_nodiscard List *list_copy_deep(const List *oldlist);
 typedef int (*list_sort_comparator) (const ListCell *a, const ListCell *b);
 extern void list_sort(List *list, list_sort_comparator cmp);
 
+extern int	list_int_cmp(const ListCell *p1, const ListCell *p2);
 extern int	list_oid_cmp(const ListCell *p1, const ListCell *p2);
 
 #endif							/* PG_LIST_H */
diff --git a/src/include/parser/parse_agg.h b/src/include/parser/parse_agg.h
index 0a2546c3ea..4dea01752a 100644
--- a/src/include/parser/parse_agg.h
+++ b/src/include/parser/parse_agg.h
@@ -26,7 +26,7 @@ extern void transformWindowFuncCall(ParseState *pstate, WindowFunc *wfunc,
 
 extern void parseCheckAggregates(ParseState *pstate, Query *qry);
 
-extern List *expand_grouping_sets(List *groupingSets, int limit);
+extern List *expand_grouping_sets(List *groupingSets, bool groupDistinct, int limit);
 
 extern int	get_aggregate_argtypes(Aggref *aggref, Oid *inputTypes);
 
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index 7c844c6e09..4c467c1b15 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1929,4 +1929,115 @@ set work_mem to default;
 
 drop table gs_group_1;
 drop table gs_hash_1;
+-- GROUP BY DISTINCT
+-- "normal" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by all rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   | 3
+ 1 |   |  
+ 1 |   |  
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   | 9
+ 7 |   |  
+ 7 |   |  
+ 7 |   |  
+   |   |  
+(25 rows)
+
+-- ...which is also the default
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   | 3
+ 1 |   |  
+ 1 |   |  
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   | 9
+ 7 |   |  
+ 7 |   |  
+ 7 |   |  
+   |   |  
+(25 rows)
+
+-- "group by distinct" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by distinct rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   |  
+   |   |  
+(13 rows)
+
+-- ...which is not the same as "select distinct"
+select distinct a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   |  
+ 4 |   | 6
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   |  
+   |   |  
+(11 rows)
+
 -- end
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 18ae803e9d..3944944704 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -529,4 +529,30 @@ set work_mem to default;
 drop table gs_group_1;
 drop table gs_hash_1;
 
+-- GROUP BY DISTINCT
+
+-- "normal" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by all rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- ...which is also the default
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- "group by distinct" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by distinct rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- ...which is not the same as "select distinct"
+select distinct a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+
 -- end
-- 
2.25.1

Georgios Kokolatos

gkokolatos@protonmail.com

almost 5 years ago

In reply to: Vik Fearing (#3)

Re: GROUP BY DISTINCT

The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Hi,

this is a useful feature, thank you for implementing. I gather that it follows the standard, if so,
then there are definitely no objections from me.

The patch in version 2, applies cleanly and passes all the tests.
It contains documentation which seems correct to a non native speaker.

As a minor gripe, I would note the addition of list_int_cmp.
The block

+               /* Sort each groupset individually */
+               foreach(cell, result)
+                       list_sort(lfirst(cell), list_int_cmp);

Can follow suit from the rest of the code, and define a static cmp_list_int_asc(), as
indeed the same patch does for cmp_list_len_contents_asc.
This is indeed point of which I will not hold a too strong opinion.

Overall :+1: from me.

I will be bumping to 'Ready for Committer' unless objections.

Vik Fearing

vik@postgresfriends.org

almost 5 years ago

In reply to: Georgios Kokolatos (#4)

1 attachment(s)

Re: GROUP BY DISTINCT

On 3/2/21 4:06 PM, Georgios Kokolatos wrote:

As a minor gripe, I would note the addition of list_int_cmp.
The block
+               /* Sort each groupset individually */
+               foreach(cell, result)
+                       list_sort(lfirst(cell), list_int_cmp);
Can follow suit from the rest of the code, and define a static cmp_list_int_asc(), as
indeed the same patch does for cmp_list_len_contents_asc.
This is indeed point of which I will not hold a too strong opinion.

I did it this way because list_int_cmp is a general purpose function for
int lists that can be reused elsewhere in the future. Whereas
cmp_list_len_contents_asc is very specific to this case so I kept it local.

I'm happy to change it around if that's what consensus wants.

Overall :+1: from me.

Thanks for looking at it!

I will be bumping to 'Ready for Committer' unless objections.

In that case, here is another patch that fixes a typo in the docs
mentioned privately to me by Erik. The typo (and a gratuitous rebase)
is the only change to what you just reviewed.
--
Vik Fearing

Attachments:

0001-implement-GROUP-BY-DISTINCT.v03.patchtext/x-patch; charset=UTF-8; name=0001-implement-GROUP-BY-DISTINCT.v03.patchDownload

From e8c0042267abc2dedea8ecca2c6bfde22227118d Mon Sep 17 00:00:00 2001
From: Vik Fearing <vik@postgresfriends.org>
Date: Sun, 21 Feb 2021 10:26:57 +0100
Subject: [PATCH] implement GROUP BY DISTINCT

---
 doc/src/sgml/queries.sgml                  |  41 ++++++++
 doc/src/sgml/ref/select.sgml               |   9 +-
 src/backend/catalog/sql_features.txt       |   2 +-
 src/backend/nodes/copyfuncs.c              |   2 +
 src/backend/nodes/equalfuncs.c             |   2 +
 src/backend/nodes/list.c                   |  16 +++
 src/backend/nodes/outfuncs.c               |   2 +
 src/backend/nodes/readfuncs.c              |   1 +
 src/backend/optimizer/plan/planner.c       |   2 +-
 src/backend/parser/analyze.c               |   1 +
 src/backend/parser/gram.y                  |  22 ++--
 src/backend/parser/parse_agg.c             |  58 ++++++++++-
 src/include/nodes/parsenodes.h             |   2 +
 src/include/nodes/pg_list.h                |   1 +
 src/include/parser/parse_agg.h             |   2 +-
 src/test/regress/expected/groupingsets.out | 111 +++++++++++++++++++++
 src/test/regress/sql/groupingsets.sql      |  26 +++++
 17 files changed, 284 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml
index bc0b3cc9fe..ac88df4391 100644
--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@@ -1372,6 +1372,47 @@ GROUP BY GROUPING SETS (
 </programlisting>
    </para>
 
+   <para>
+    When specifying multiple grouping items together, the final set of grouping
+    sets might contain duplicates. For example:
+<programlisting>
+GROUP BY ROLLUP (a, b), ROLLUP (a, c)
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUP BY GROUPING SETS (
+    (a, b, c),
+    (a, b),
+    (a, b),
+    (a, c),
+    (a),
+    (a),
+    (a, c),
+    (a),
+    ()
+)
+</programlisting>
+    If these duplicates are undesirable, they can be removed using the
+    <literal>DISTINCT</literal> clause directly on the <literal>GROUP BY</literal>.
+    Therefore:
+<programlisting>
+GROUP BY <emphasis>DISTINCT</emphasis> ROLLUP (a, b), ROLLUP (a, c)
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUP BY GROUPING SETS (
+    (a, b, c),
+    (a, b),
+    (a, c),
+    (a),
+    ()
+)
+</programlisting>
+    This is not the same as using <literal>SELECT DISTINCT</literal> because the output
+    rows may still contain duplicates.  If any of the ungrouped columns contains NULL,
+    it will be indistinguishable from the NULL used when that same column is grouped.
+   </para>
+
   <note>
    <para>
     The construct <literal>(a, b)</literal> is normally recognized in expressions as
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index ab91105599..9c5cf50ef0 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -37,7 +37,7 @@ SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replac
     [ * | <replaceable class="parameter">expression</replaceable> [ [ AS ] <replaceable class="parameter">output_name</replaceable> ] [, ...] ]
     [ FROM <replaceable class="parameter">from_item</replaceable> [, ...] ]
     [ WHERE <replaceable class="parameter">condition</replaceable> ]
-    [ GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...] ]
+    [ GROUP BY [ ALL | DISTINCT ] <replaceable class="parameter">grouping_element</replaceable> [, ...] ]
     [ HAVING <replaceable class="parameter">condition</replaceable> ]
     [ WINDOW <replaceable class="parameter">window_name</replaceable> AS ( <replaceable class="parameter">window_definition</replaceable> ) [, ...] ]
     [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] <replaceable class="parameter">select</replaceable> ]
@@ -778,7 +778,7 @@ WHERE <replaceable class="parameter">condition</replaceable>
    <para>
     The optional <literal>GROUP BY</literal> clause has the general form
 <synopsis>
-GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...]
+GROUP BY [ ALL | DISTINCT ] <replaceable class="parameter">grouping_element</replaceable> [, ...]
 </synopsis>
    </para>
 
@@ -802,7 +802,10 @@ GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...]
     independent <replaceable>grouping sets</replaceable>.  The effect of this is
     equivalent to constructing a <literal>UNION ALL</literal> between
     subqueries with the individual grouping sets as their
-    <literal>GROUP BY</literal> clauses.  For further details on the handling
+    <literal>GROUP BY</literal> clauses.  The optional <literal>DISTINCT</literal>
+    clause removes duplicate sets before processing; it does <emphasis>not</emphasis>
+    transform the <literal>UNION ALL</literal> into a <literal>UNION DISTINCT</literal>.
+    For further details on the handling
     of grouping sets see <xref linkend="queries-grouping-sets"/>.
    </para>
 
diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt
index ab0895ce3c..65b5bfdd69 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -482,7 +482,7 @@ T351	Bracketed SQL comments (/*...*/ comments)			YES
 T431	Extended grouping capabilities			YES	
 T432	Nested and concatenated GROUPING SETS			YES	
 T433	Multiargument GROUPING function			YES	
-T434	GROUP BY DISTINCT			NO	
+T434	GROUP BY DISTINCT			YES	
 T441	ABS and MOD functions			YES	
 T461	Symmetric BETWEEN predicate			YES	
 T471	Result sets return value			NO	
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index aaba1ec2c4..f3625ca761 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3134,6 +3134,7 @@ _copyQuery(const Query *from)
 	COPY_NODE_FIELD(onConflict);
 	COPY_NODE_FIELD(returningList);
 	COPY_NODE_FIELD(groupClause);
+	COPY_SCALAR_FIELD(groupDistinct);
 	COPY_NODE_FIELD(groupingSets);
 	COPY_NODE_FIELD(havingQual);
 	COPY_NODE_FIELD(windowClause);
@@ -3220,6 +3221,7 @@ _copySelectStmt(const SelectStmt *from)
 	COPY_NODE_FIELD(fromClause);
 	COPY_NODE_FIELD(whereClause);
 	COPY_NODE_FIELD(groupClause);
+	COPY_SCALAR_FIELD(groupDistinct);
 	COPY_NODE_FIELD(havingClause);
 	COPY_NODE_FIELD(windowClause);
 	COPY_NODE_FIELD(valuesLists);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index c2d73626fc..bc5e9e52fe 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -977,6 +977,7 @@ _equalQuery(const Query *a, const Query *b)
 	COMPARE_NODE_FIELD(onConflict);
 	COMPARE_NODE_FIELD(returningList);
 	COMPARE_NODE_FIELD(groupClause);
+	COMPARE_SCALAR_FIELD(groupDistinct);
 	COMPARE_NODE_FIELD(groupingSets);
 	COMPARE_NODE_FIELD(havingQual);
 	COMPARE_NODE_FIELD(windowClause);
@@ -1053,6 +1054,7 @@ _equalSelectStmt(const SelectStmt *a, const SelectStmt *b)
 	COMPARE_NODE_FIELD(fromClause);
 	COMPARE_NODE_FIELD(whereClause);
 	COMPARE_NODE_FIELD(groupClause);
+	COMPARE_SCALAR_FIELD(groupDistinct);
 	COMPARE_NODE_FIELD(havingClause);
 	COMPARE_NODE_FIELD(windowClause);
 	COMPARE_NODE_FIELD(valuesLists);
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index dbf6b30233..94fb236daf 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1506,6 +1506,22 @@ list_sort(List *list, list_sort_comparator cmp)
 		qsort(list->elements, len, sizeof(ListCell), (qsort_comparator) cmp);
 }
 
+/*
+ * list_sort comparator for sorting a list into ascending int order.
+ */
+int
+list_int_cmp(const ListCell *p1, const ListCell *p2)
+{
+	int			v1 = lfirst_int(p1);
+	int			v2 = lfirst_int(p2);
+
+	if (v1 < v2)
+		return -1;
+	if (v1 > v2)
+		return 1;
+	return 0;
+}
+
 /*
  * list_sort comparator for sorting a list into ascending OID order.
  */
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8fc432bfe1..2d8c487358 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2769,6 +2769,7 @@ _outSelectStmt(StringInfo str, const SelectStmt *node)
 	WRITE_NODE_FIELD(fromClause);
 	WRITE_NODE_FIELD(whereClause);
 	WRITE_NODE_FIELD(groupClause);
+	WRITE_BOOL_FIELD(groupDistinct);
 	WRITE_NODE_FIELD(havingClause);
 	WRITE_NODE_FIELD(windowClause);
 	WRITE_NODE_FIELD(valuesLists);
@@ -2994,6 +2995,7 @@ _outQuery(StringInfo str, const Query *node)
 	WRITE_NODE_FIELD(onConflict);
 	WRITE_NODE_FIELD(returningList);
 	WRITE_NODE_FIELD(groupClause);
+	WRITE_BOOL_FIELD(groupDistinct);
 	WRITE_NODE_FIELD(groupingSets);
 	WRITE_NODE_FIELD(havingQual);
 	WRITE_NODE_FIELD(windowClause);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 718fb58e86..377185f7c6 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -271,6 +271,7 @@ _readQuery(void)
 	READ_NODE_FIELD(onConflict);
 	READ_NODE_FIELD(returningList);
 	READ_NODE_FIELD(groupClause);
+	READ_BOOL_FIELD(groupDistinct);
 	READ_NODE_FIELD(groupingSets);
 	READ_NODE_FIELD(havingQual);
 	READ_NODE_FIELD(windowClause);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 545b56bcaf..f529d107d2 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2427,7 +2427,7 @@ preprocess_grouping_sets(PlannerInfo *root)
 	ListCell   *lc_set;
 	grouping_sets_data *gd = palloc0(sizeof(grouping_sets_data));
 
-	parse->groupingSets = expand_grouping_sets(parse->groupingSets, -1);
+	parse->groupingSets = expand_grouping_sets(parse->groupingSets, parse->groupDistinct, -1);
 
 	gd->any_hashable = false;
 	gd->unhashable_refs = NULL;
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 0f3a70c49a..7149724953 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -1264,6 +1264,7 @@ transformSelectStmt(ParseState *pstate, SelectStmt *stmt)
 											qry->sortClause,
 											EXPR_KIND_GROUP_BY,
 											false /* allow SQL92 rules */ );
+	qry->groupDistinct = stmt->groupDistinct;
 
 	if (stmt->distinctClause == NIL)
 	{
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 652be0b96d..f54e74134b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -443,7 +443,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>	for_locking_item
 %type <list>	for_locking_clause opt_for_locking_clause for_locking_items
 %type <list>	locked_rels_list
-%type <boolean>	all_or_distinct
+%type <boolean>	all_or_distinct distinct_or_all
 
 %type <node>	join_qual
 %type <jtype>	join_type
@@ -11294,7 +11294,8 @@ simple_select:
 					n->intoClause = $4;
 					n->fromClause = $5;
 					n->whereClause = $6;
-					n->groupClause = $7;
+					n->groupClause = linitial($7);
+					n->groupDistinct = (bool) intVal(lsecond($7));
 					n->havingClause = $8;
 					n->windowClause = $9;
 					$$ = (Node *)n;
@@ -11309,7 +11310,8 @@ simple_select:
 					n->intoClause = $4;
 					n->fromClause = $5;
 					n->whereClause = $6;
-					n->groupClause = $7;
+					n->groupClause = linitial($7);
+					n->groupDistinct = (bool) intVal(lsecond($7));
 					n->havingClause = $8;
 					n->windowClause = $9;
 					$$ = (Node *)n;
@@ -11542,12 +11544,19 @@ opt_table:	TABLE
 			| /*EMPTY*/
 		;
 
+/* The difference between these two is the default */
 all_or_distinct:
 			ALL										{ $$ = true; }
 			| DISTINCT								{ $$ = false; }
 			| /*EMPTY*/								{ $$ = false; }
 		;
 
+distinct_or_all:
+			DISTINCT								{ $$ = true; }
+			| ALL									{ $$ = false; }
+			| /*EMPTY*/								{ $$ = false; }
+		;
+
 /* We use (NIL) as a placeholder to indicate that all target expressions
  * should be placed in the DISTINCT list during parsetree analysis.
  */
@@ -11771,8 +11780,8 @@ first_or_next: FIRST_P								{ $$ = 0; }
  * GroupingSet node of some type.
  */
 group_clause:
-			GROUP_P BY group_by_list				{ $$ = $3; }
-			| /*EMPTY*/								{ $$ = NIL; }
+			GROUP_P BY distinct_or_all group_by_list	{ $$ = list_make2($4, makeInteger((int) $3)); }
+			| /*EMPTY*/									{ $$ = list_make2(NIL, makeInteger((int) false)); }
 		;
 
 group_by_list:
@@ -15145,7 +15154,8 @@ PLpgSQL_Expr: opt_distinct_clause opt_target_list
 					n->targetList = $2;
 					n->fromClause = $3;
 					n->whereClause = $4;
-					n->groupClause = $5;
+					n->groupClause = linitial($5);
+					n->groupDistinct = (bool) intVal(lsecond($5));
 					n->havingClause = $6;
 					n->windowClause = $7;
 					n->sortClause = $8;
diff --git a/src/backend/parser/parse_agg.c b/src/backend/parser/parse_agg.c
index fd08b9eeff..899327aaf4 100644
--- a/src/backend/parser/parse_agg.c
+++ b/src/backend/parser/parse_agg.c
@@ -1071,7 +1071,7 @@ parseCheckAggregates(ParseState *pstate, Query *qry)
 		 * The limit of 4096 is arbitrary and exists simply to avoid resource
 		 * issues from pathological constructs.
 		 */
-		List	   *gsets = expand_grouping_sets(qry->groupingSets, 4096);
+		List	   *gsets = expand_grouping_sets(qry->groupingSets, qry->groupDistinct, 4096);
 
 		if (!gsets)
 			ereport(ERROR,
@@ -1735,6 +1735,33 @@ cmp_list_len_asc(const ListCell *a, const ListCell *b)
 	return (la > lb) ? 1 : (la == lb) ? 0 : -1;
 }
 
+/* list_sort comparator to sort sub-lists by length and contents */
+static int
+cmp_list_len_contents_asc(const ListCell *a, const ListCell *b)
+{
+	int		res = cmp_list_len_asc(a, b);
+
+	if (res == 0)
+	{
+		List		   *la = (List *) lfirst(a);
+		List		   *lb = (List *) lfirst(b);
+		ListCell	   *lca;
+		ListCell	   *lcb;
+
+		forboth(lca, la, lcb, lb)
+		{
+			int		va = intVal(lca);
+			int		vb = intVal(lcb);
+			if (va > vb)
+				return 1;
+			if (va < vb)
+				return -1;
+		}
+	}
+
+	return res;
+}
+
 /*
  * Expand a groupingSets clause to a flat list of grouping sets.
  * The returned list is sorted by length, shortest sets first.
@@ -1743,7 +1770,7 @@ cmp_list_len_asc(const ListCell *a, const ListCell *b)
  * some consistency checks.
  */
 List *
-expand_grouping_sets(List *groupingSets, int limit)
+expand_grouping_sets(List *groupingSets, bool groupDistinct, int limit)
 {
 	List	   *expanded_groups = NIL;
 	List	   *result = NIL;
@@ -1801,8 +1828,31 @@ expand_grouping_sets(List *groupingSets, int limit)
 		result = new_result;
 	}
 
-	/* Now sort the lists by length */
-	list_sort(result, cmp_list_len_asc);
+	/* Now sort the lists by length and deduplicate if necessary */
+	if (!groupDistinct || list_length(result) < 2)
+		list_sort(result, cmp_list_len_asc);
+	else
+	{
+		ListCell   *cell;
+		List	   *prev;
+
+		/* Sort each groupset individually */
+		foreach(cell, result)
+			list_sort(lfirst(cell), list_int_cmp);
+
+		/* Now sort the list of groupsets by length and contents */
+		list_sort(result, cmp_list_len_contents_asc);
+
+		/* Finally, remove duplicates */
+		prev = list_nth_node(List, result, 0);
+		for_each_from(cell, result, 1)
+		{
+			if (equal(lfirst(cell), prev))
+				foreach_delete_current(result, cell);
+			else
+				prev = lfirst(cell);
+		}
+	}
 
 	return result;
 }
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 236832a2ca..5af9f6bd9d 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -146,6 +146,7 @@ typedef struct Query
 	List	   *returningList;	/* return-values list (of TargetEntry) */
 
 	List	   *groupClause;	/* a list of SortGroupClause's */
+	bool		groupDistinct;	/* is the group by clause distinct? */
 
 	List	   *groupingSets;	/* a list of GroupingSet's if present */
 
@@ -1629,6 +1630,7 @@ typedef struct SelectStmt
 	List	   *fromClause;		/* the FROM clause */
 	Node	   *whereClause;	/* WHERE qualification */
 	List	   *groupClause;	/* GROUP BY clauses */
+	bool		groupDistinct;	/* Is this GROUP BY DISTINCT? */
 	Node	   *havingClause;	/* HAVING conditional-expression */
 	List	   *windowClause;	/* WINDOW window_name AS (...), ... */
 
diff --git a/src/include/nodes/pg_list.h b/src/include/nodes/pg_list.h
index 404e03f132..30f98c4595 100644
--- a/src/include/nodes/pg_list.h
+++ b/src/include/nodes/pg_list.h
@@ -604,6 +604,7 @@ extern pg_nodiscard List *list_copy_deep(const List *oldlist);
 typedef int (*list_sort_comparator) (const ListCell *a, const ListCell *b);
 extern void list_sort(List *list, list_sort_comparator cmp);
 
+extern int	list_int_cmp(const ListCell *p1, const ListCell *p2);
 extern int	list_oid_cmp(const ListCell *p1, const ListCell *p2);
 
 #endif							/* PG_LIST_H */
diff --git a/src/include/parser/parse_agg.h b/src/include/parser/parse_agg.h
index 0a2546c3ea..4dea01752a 100644
--- a/src/include/parser/parse_agg.h
+++ b/src/include/parser/parse_agg.h
@@ -26,7 +26,7 @@ extern void transformWindowFuncCall(ParseState *pstate, WindowFunc *wfunc,
 
 extern void parseCheckAggregates(ParseState *pstate, Query *qry);
 
-extern List *expand_grouping_sets(List *groupingSets, int limit);
+extern List *expand_grouping_sets(List *groupingSets, bool groupDistinct, int limit);
 
 extern int	get_aggregate_argtypes(Aggref *aggref, Oid *inputTypes);
 
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index 7c844c6e09..4c467c1b15 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1929,4 +1929,115 @@ set work_mem to default;
 
 drop table gs_group_1;
 drop table gs_hash_1;
+-- GROUP BY DISTINCT
+-- "normal" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by all rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   | 3
+ 1 |   |  
+ 1 |   |  
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   | 9
+ 7 |   |  
+ 7 |   |  
+ 7 |   |  
+   |   |  
+(25 rows)
+
+-- ...which is also the default
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   | 3
+ 1 |   |  
+ 1 |   |  
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   | 9
+ 7 |   |  
+ 7 |   |  
+ 7 |   |  
+   |   |  
+(25 rows)
+
+-- "group by distinct" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by distinct rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   |  
+   |   |  
+(13 rows)
+
+-- ...which is not the same as "select distinct"
+select distinct a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   |  
+ 4 |   | 6
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   |  
+   |   |  
+(11 rows)
+
 -- end
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 18ae803e9d..3944944704 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -529,4 +529,30 @@ set work_mem to default;
 drop table gs_group_1;
 drop table gs_hash_1;
 
+-- GROUP BY DISTINCT
+
+-- "normal" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by all rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- ...which is also the default
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- "group by distinct" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by distinct rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- ...which is not the same as "select distinct"
+select distinct a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+
 -- end
-- 
2.25.1

Georgios

gkokolatos@protonmail.com

almost 5 years ago

In reply to: Vik Fearing (#5)

Re: GROUP BY DISTINCT

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, March 2, 2021 5:51 PM, Vik Fearing <vik@postgresfriends.org> wrote:

On 3/2/21 4:06 PM, Georgios Kokolatos wrote:

As a minor gripe, I would note the addition of list_int_cmp.
The block

- /* Sort each groupset individually */

- foreach(cell, result)

- list_sort(lfirst(cell), list_int_cmp);

Can follow suit from the rest of the code, and define a static cmp_list_int_asc(), as
indeed the same patch does for cmp_list_len_contents_asc.
This is indeed point of which I will not hold a too strong opinion.

I did it this way because list_int_cmp is a general purpose function for
int lists that can be reused elsewhere in the future. Whereas
cmp_list_len_contents_asc is very specific to this case so I kept it local.

Of course. I got the intention and I have noted my opinion.

I'm happy to change it around if that's what consensus wants.

As before, I will not hold a too strong opinion.

Overall :+1: from me.

Thanks for looking at it!

I will be bumping to 'Ready for Committer' unless objections.

In that case, here is another patch that fixes a typo in the docs
mentioned privately to me by Erik. The typo (and a gratuitous rebase)
is the only change to what you just reviewed.

Thank you. The typo was indistiguishable to me too.

My :+1: stands for version 3 of the patch. Updating status in the
commitfest to reflect that.

//Georgios -- https://www.vmware.com

Show quoted text

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Vik Fearing

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 5 years ago

In reply to: Vik Fearing (#5)

Re: GROUP BY DISTINCT

Hi Vik,

The patch seems quite ready, I have just two comments.

1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.

2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
in order to stash it in the group lists is rather ugly, IMHO. It forces
all the places handling the list to be aware of this (there are not
many, but still ...). And there are no other places doing (bool) intVal
so it's not like there's a precedent for this.

I think the clean solution is to make group_clause produce a struct with
two fields, and just use that. Not sure how invasive that will be
outside gram.y, though.

Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
wonder if we can come up with some clearer names, describing the context
of those types.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Pantelis Theodosiou

ypercube@gmail.com

almost 5 years ago

In reply to: Tomas Vondra (#7)

Fwd: GROUP BY DISTINCT

---------- Forwarded message ---------
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
Date: Fri, Mar 12, 2021 at 11:33 PM
Subject: Re: GROUP BY DISTINCT
To: Vik Fearing <vik@postgresfriends.org>, Georgios Kokolatos <
gkokolatos@protonmail.com>, <pgsql-hackers@lists.postgresql.org>
Cc: Erik Rijkers <er@xs4all.nl>

Hi Vik,

The patch seems quite ready, I have just two comments.

1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.

.....

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

After reading the above thread in hackers, I noticed that the index does
not point to aggrgeate functions either and DISTINCT is not mentioned in
the aggregate functions page either:
https://www.postgresql.org/docs/current/functions-aggregate.html
Shouldn't it be mentioned with an example of COUNT(DISTINCT ...) or
aggregate_function(DISTINCT ...) in general ?

Best regards

Pantelis Theodosiou

Vik Fearing

vik@postgresfriends.org

almost 5 years ago

In reply to: Tomas Vondra (#7)

1 attachment(s)

Re: GROUP BY DISTINCT

On 3/13/21 12:33 AM, Tomas Vondra wrote:

Hi Vik,

The patch seems quite ready, I have just two comments.

Thanks for taking a look.

1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.

Good idea; I never think about the index.

2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
in order to stash it in the group lists is rather ugly, IMHO. It forces
all the places handling the list to be aware of this (there are not
many, but still ...). And there are no other places doing (bool) intVal
so it's not like there's a precedent for this.

There is kind of a precedent for it, I was copying off of TriggerEvents
and func_alias_clause.

I think the clean solution is to make group_clause produce a struct with
two fields, and just use that. Not sure how invasive that will be
outside gram.y, though.

I didn't want to create a whole new parse node for it, but Andrew Gierth
pointed me towards SelectLimit so I did it like that and I agree it is
much cleaner.

Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
wonder if we can come up with some clearer names, describing the context
of those types.

I turned this into an enum for ALL/DISTINCT/default and the caller can
choose what it wants to do with default. I think that's a lot cleaner,
too. Maybe DISTINCT ON should be changed to fit in that? I left it
alone for now.

I also snuck in something that all of us overlooked which is outputting
the DISTINCT in ruleutils.c. I didn't add a test for it but that would
have been an unfortunate bug.

New patch attached, rebased on 15639d5e8f.
--
Vik Fearing

Attachments:

0001-implement-GROUP-BY-DISTINCT.v04.patchtext/x-patch; charset=UTF-8; name=0001-implement-GROUP-BY-DISTINCT.v04.patchDownload

From d83e2237578f774aaf356525ce05e00b74081b8c Mon Sep 17 00:00:00 2001
From: Vik Fearing <vik@postgresfriends.org>
Date: Sun, 21 Feb 2021 10:26:57 +0100
Subject: [PATCH] implement GROUP BY DISTINCT

---
 doc/src/sgml/queries.sgml                  |  54 ++++++++++
 doc/src/sgml/ref/select.sgml               |   9 +-
 src/backend/catalog/sql_features.txt       |   2 +-
 src/backend/nodes/copyfuncs.c              |   2 +
 src/backend/nodes/equalfuncs.c             |   2 +
 src/backend/nodes/list.c                   |  16 +++
 src/backend/nodes/outfuncs.c               |   2 +
 src/backend/nodes/readfuncs.c              |   1 +
 src/backend/optimizer/plan/planner.c       |   2 +-
 src/backend/parser/analyze.c               |   1 +
 src/backend/parser/gram.y                  |  59 +++++++----
 src/backend/parser/parse_agg.c             |  58 ++++++++++-
 src/backend/utils/adt/ruleutils.c          |   2 +
 src/include/nodes/parsenodes.h             |  10 ++
 src/include/nodes/pg_list.h                |   1 +
 src/include/parser/parse_agg.h             |   2 +-
 src/test/regress/expected/groupingsets.out | 111 +++++++++++++++++++++
 src/test/regress/sql/groupingsets.sql      |  26 +++++
 18 files changed, 333 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml
index bc0b3cc9fe..834b83b509 100644
--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@@ -1372,6 +1372,55 @@ GROUP BY GROUPING SETS (
 </programlisting>
    </para>
 
+   <para>
+    <indexterm zone="queries-grouping-sets">
+     <primary>ALL</primary>
+     <secondary>GROUP BY ALL</secondary>
+    </indexterm>
+    <indexterm zone="queries-grouping-sets">
+     <primary>DISTINCT</primary>
+     <secondary>GROUP BY DISTINCT</secondary>
+    </indexterm>
+    When specifying multiple grouping items together, the final set of grouping
+    sets might contain duplicates. For example:
+<programlisting>
+GROUP BY ROLLUP (a, b), ROLLUP (a, c)
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUP BY GROUPING SETS (
+    (a, b, c),
+    (a, b),
+    (a, b),
+    (a, c),
+    (a),
+    (a),
+    (a, c),
+    (a),
+    ()
+)
+</programlisting>
+    If these duplicates are undesirable, they can be removed using the
+    <literal>DISTINCT</literal> clause directly on the <literal>GROUP BY</literal>.
+    Therefore:
+<programlisting>
+GROUP BY <emphasis>DISTINCT</emphasis> ROLLUP (a, b), ROLLUP (a, c)
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUP BY GROUPING SETS (
+    (a, b, c),
+    (a, b),
+    (a, c),
+    (a),
+    ()
+)
+</programlisting>
+    This is not the same as using <literal>SELECT DISTINCT</literal> because the output
+    rows may still contain duplicates.  If any of the ungrouped columns contains NULL,
+    it will be indistinguishable from the NULL used when that same column is grouped.
+   </para>
+
   <note>
    <para>
     The construct <literal>(a, b)</literal> is normally recognized in expressions as
@@ -1560,8 +1609,13 @@ SELECT a "from", b + c AS sum FROM ...
   <sect2 id="queries-distinct">
    <title><literal>DISTINCT</literal></title>
 
+   <indexterm zone="queries-distinct">
+    <primary>ALL</primary>
+    <secondary>SELECT ALL</secondary>
+   </indexterm>
    <indexterm zone="queries-distinct">
     <primary>DISTINCT</primary>
+    <secondary>SELECT DISTINCT</secondary>
    </indexterm>
 
    <indexterm zone="queries-distinct">
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index ab91105599..9c5cf50ef0 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -37,7 +37,7 @@ SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replac
     [ * | <replaceable class="parameter">expression</replaceable> [ [ AS ] <replaceable class="parameter">output_name</replaceable> ] [, ...] ]
     [ FROM <replaceable class="parameter">from_item</replaceable> [, ...] ]
     [ WHERE <replaceable class="parameter">condition</replaceable> ]
-    [ GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...] ]
+    [ GROUP BY [ ALL | DISTINCT ] <replaceable class="parameter">grouping_element</replaceable> [, ...] ]
     [ HAVING <replaceable class="parameter">condition</replaceable> ]
     [ WINDOW <replaceable class="parameter">window_name</replaceable> AS ( <replaceable class="parameter">window_definition</replaceable> ) [, ...] ]
     [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] <replaceable class="parameter">select</replaceable> ]
@@ -778,7 +778,7 @@ WHERE <replaceable class="parameter">condition</replaceable>
    <para>
     The optional <literal>GROUP BY</literal> clause has the general form
 <synopsis>
-GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...]
+GROUP BY [ ALL | DISTINCT ] <replaceable class="parameter">grouping_element</replaceable> [, ...]
 </synopsis>
    </para>
 
@@ -802,7 +802,10 @@ GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...]
     independent <replaceable>grouping sets</replaceable>.  The effect of this is
     equivalent to constructing a <literal>UNION ALL</literal> between
     subqueries with the individual grouping sets as their
-    <literal>GROUP BY</literal> clauses.  For further details on the handling
+    <literal>GROUP BY</literal> clauses.  The optional <literal>DISTINCT</literal>
+    clause removes duplicate sets before processing; it does <emphasis>not</emphasis>
+    transform the <literal>UNION ALL</literal> into a <literal>UNION DISTINCT</literal>.
+    For further details on the handling
     of grouping sets see <xref linkend="queries-grouping-sets"/>.
    </para>
 
diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt
index 32eed988ab..b7165404cd 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -482,7 +482,7 @@ T351	Bracketed SQL comments (/*...*/ comments)			YES
 T431	Extended grouping capabilities			YES	
 T432	Nested and concatenated GROUPING SETS			YES	
 T433	Multiargument GROUPING function			YES	
-T434	GROUP BY DISTINCT			NO	
+T434	GROUP BY DISTINCT			YES	
 T441	ABS and MOD functions			YES	
 T461	Symmetric BETWEEN predicate			YES	
 T471	Result sets return value			NO	
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index da91cbd2b1..bda379ba91 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3135,6 +3135,7 @@ _copyQuery(const Query *from)
 	COPY_NODE_FIELD(onConflict);
 	COPY_NODE_FIELD(returningList);
 	COPY_NODE_FIELD(groupClause);
+	COPY_SCALAR_FIELD(groupDistinct);
 	COPY_NODE_FIELD(groupingSets);
 	COPY_NODE_FIELD(havingQual);
 	COPY_NODE_FIELD(windowClause);
@@ -3221,6 +3222,7 @@ _copySelectStmt(const SelectStmt *from)
 	COPY_NODE_FIELD(fromClause);
 	COPY_NODE_FIELD(whereClause);
 	COPY_NODE_FIELD(groupClause);
+	COPY_SCALAR_FIELD(groupDistinct);
 	COPY_NODE_FIELD(havingClause);
 	COPY_NODE_FIELD(windowClause);
 	COPY_NODE_FIELD(valuesLists);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index c2d73626fc..bc5e9e52fe 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -977,6 +977,7 @@ _equalQuery(const Query *a, const Query *b)
 	COMPARE_NODE_FIELD(onConflict);
 	COMPARE_NODE_FIELD(returningList);
 	COMPARE_NODE_FIELD(groupClause);
+	COMPARE_SCALAR_FIELD(groupDistinct);
 	COMPARE_NODE_FIELD(groupingSets);
 	COMPARE_NODE_FIELD(havingQual);
 	COMPARE_NODE_FIELD(windowClause);
@@ -1053,6 +1054,7 @@ _equalSelectStmt(const SelectStmt *a, const SelectStmt *b)
 	COMPARE_NODE_FIELD(fromClause);
 	COMPARE_NODE_FIELD(whereClause);
 	COMPARE_NODE_FIELD(groupClause);
+	COMPARE_SCALAR_FIELD(groupDistinct);
 	COMPARE_NODE_FIELD(havingClause);
 	COMPARE_NODE_FIELD(windowClause);
 	COMPARE_NODE_FIELD(valuesLists);
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index dbf6b30233..94fb236daf 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1506,6 +1506,22 @@ list_sort(List *list, list_sort_comparator cmp)
 		qsort(list->elements, len, sizeof(ListCell), (qsort_comparator) cmp);
 }
 
+/*
+ * list_sort comparator for sorting a list into ascending int order.
+ */
+int
+list_int_cmp(const ListCell *p1, const ListCell *p2)
+{
+	int			v1 = lfirst_int(p1);
+	int			v2 = lfirst_int(p2);
+
+	if (v1 < v2)
+		return -1;
+	if (v1 > v2)
+		return 1;
+	return 0;
+}
+
 /*
  * list_sort comparator for sorting a list into ascending OID order.
  */
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 6493a03ff8..5054490c58 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2771,6 +2771,7 @@ _outSelectStmt(StringInfo str, const SelectStmt *node)
 	WRITE_NODE_FIELD(fromClause);
 	WRITE_NODE_FIELD(whereClause);
 	WRITE_NODE_FIELD(groupClause);
+	WRITE_BOOL_FIELD(groupDistinct);
 	WRITE_NODE_FIELD(havingClause);
 	WRITE_NODE_FIELD(windowClause);
 	WRITE_NODE_FIELD(valuesLists);
@@ -2996,6 +2997,7 @@ _outQuery(StringInfo str, const Query *node)
 	WRITE_NODE_FIELD(onConflict);
 	WRITE_NODE_FIELD(returningList);
 	WRITE_NODE_FIELD(groupClause);
+	WRITE_BOOL_FIELD(groupDistinct);
 	WRITE_NODE_FIELD(groupingSets);
 	WRITE_NODE_FIELD(havingQual);
 	WRITE_NODE_FIELD(windowClause);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index c5e136e9c3..9b8f81c523 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -271,6 +271,7 @@ _readQuery(void)
 	READ_NODE_FIELD(onConflict);
 	READ_NODE_FIELD(returningList);
 	READ_NODE_FIELD(groupClause);
+	READ_BOOL_FIELD(groupDistinct);
 	READ_NODE_FIELD(groupingSets);
 	READ_NODE_FIELD(havingQual);
 	READ_NODE_FIELD(windowClause);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 424d25cbd5..28b40dd905 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2442,7 +2442,7 @@ preprocess_grouping_sets(PlannerInfo *root)
 	ListCell   *lc_set;
 	grouping_sets_data *gd = palloc0(sizeof(grouping_sets_data));
 
-	parse->groupingSets = expand_grouping_sets(parse->groupingSets, -1);
+	parse->groupingSets = expand_grouping_sets(parse->groupingSets, parse->groupDistinct, -1);
 
 	gd->any_hashable = false;
 	gd->unhashable_refs = NULL;
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 0f3a70c49a..7149724953 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -1264,6 +1264,7 @@ transformSelectStmt(ParseState *pstate, SelectStmt *stmt)
 											qry->sortClause,
 											EXPR_KIND_GROUP_BY,
 											false /* allow SQL92 rules */ );
+	qry->groupDistinct = stmt->groupDistinct;
 
 	if (stmt->distinctClause == NIL)
 	{
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 652be0b96d..fd07e7107d 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -134,6 +134,13 @@ typedef struct SelectLimit
 	LimitOption limitOption;
 } SelectLimit;
 
+/* Private struct for the result of group_clause production */
+typedef struct GroupClause
+{
+	bool	distinct;
+	List   *list;
+} GroupClause;
+
 /* ConstraintAttributeSpec yields an integer bitmask of these flags: */
 #define CAS_NOT_DEFERRABLE			0x01
 #define CAS_DEFERRABLE				0x02
@@ -250,6 +257,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	PartitionBoundSpec	*partboundspec;
 	RoleSpec			*rolespec;
 	struct SelectLimit	*selectlimit;
+	SetQuantifier	 setquantifier;
+	struct GroupClause  *groupclause;
 }
 
 %type <node>	stmt schema_stmt
@@ -405,7 +414,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				target_list opt_target_list insert_column_list set_target_list
 				set_clause_list set_clause
 				def_list operator_def_list indirection opt_indirection
-				reloption_list group_clause TriggerFuncArgs opclass_item_list opclass_drop_list
+				reloption_list TriggerFuncArgs opclass_item_list opclass_drop_list
 				opclass_purpose opt_opfamily transaction_mode_list_or_empty
 				OptTableFuncElementList TableFuncElementList opt_type_modifiers
 				prep_type_clause
@@ -418,6 +427,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				vacuum_relation_list opt_vacuum_relation_list
 				drop_option_list
 
+%type <groupclause> group_clause
 %type <list>	group_by_list
 %type <node>	group_by_item empty_grouping_set rollup_clause cube_clause
 %type <node>	grouping_sets_clause
@@ -443,7 +453,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>	for_locking_item
 %type <list>	for_locking_clause opt_for_locking_clause for_locking_items
 %type <list>	locked_rels_list
-%type <boolean>	all_or_distinct
+%type <setquantifier> set_quantifier
 
 %type <node>	join_qual
 %type <jtype>	join_type
@@ -11294,7 +11304,8 @@ simple_select:
 					n->intoClause = $4;
 					n->fromClause = $5;
 					n->whereClause = $6;
-					n->groupClause = $7;
+					n->groupClause = ($7)->list;
+					n->groupDistinct = ($7)->distinct;
 					n->havingClause = $8;
 					n->windowClause = $9;
 					$$ = (Node *)n;
@@ -11309,7 +11320,8 @@ simple_select:
 					n->intoClause = $4;
 					n->fromClause = $5;
 					n->whereClause = $6;
-					n->groupClause = $7;
+					n->groupClause = ($7)->list;
+					n->groupDistinct = ($7)->distinct;
 					n->havingClause = $8;
 					n->windowClause = $9;
 					$$ = (Node *)n;
@@ -11334,17 +11346,17 @@ simple_select:
 					n->fromClause = list_make1($2);
 					$$ = (Node *)n;
 				}
-			| select_clause UNION all_or_distinct select_clause
+			| select_clause UNION set_quantifier select_clause
 				{
-					$$ = makeSetOp(SETOP_UNION, $3, $1, $4);
+					$$ = makeSetOp(SETOP_UNION, $3 == SET_QUANTIFIER_ALL, $1, $4);
 				}
-			| select_clause INTERSECT all_or_distinct select_clause
+			| select_clause INTERSECT set_quantifier select_clause
 				{
-					$$ = makeSetOp(SETOP_INTERSECT, $3, $1, $4);
+					$$ = makeSetOp(SETOP_INTERSECT, $3 == SET_QUANTIFIER_ALL, $1, $4);
 				}
-			| select_clause EXCEPT all_or_distinct select_clause
+			| select_clause EXCEPT set_quantifier select_clause
 				{
-					$$ = makeSetOp(SETOP_EXCEPT, $3, $1, $4);
+					$$ = makeSetOp(SETOP_EXCEPT, $3 == SET_QUANTIFIER_ALL, $1, $4);
 				}
 		;
 
@@ -11542,10 +11554,10 @@ opt_table:	TABLE
 			| /*EMPTY*/
 		;
 
-all_or_distinct:
-			ALL										{ $$ = true; }
-			| DISTINCT								{ $$ = false; }
-			| /*EMPTY*/								{ $$ = false; }
+set_quantifier:
+			ALL										{ $$ = SET_QUANTIFIER_ALL; }
+			| DISTINCT								{ $$ = SET_QUANTIFIER_DISTINCT; }
+			| /*EMPTY*/								{ $$ = SET_QUANTIFIER_DEFAULT; }
 		;
 
 /* We use (NIL) as a placeholder to indicate that all target expressions
@@ -11771,8 +11783,20 @@ first_or_next: FIRST_P								{ $$ = 0; }
  * GroupingSet node of some type.
  */
 group_clause:
-			GROUP_P BY group_by_list				{ $$ = $3; }
-			| /*EMPTY*/								{ $$ = NIL; }
+			GROUP_P BY set_quantifier group_by_list
+				{
+					GroupClause *n = (GroupClause *) palloc(sizeof(GroupClause));
+					n->distinct = $3 == SET_QUANTIFIER_DISTINCT;
+					n->list = $4;
+					$$ = n;
+				}
+			| /*EMPTY*/
+				{
+					GroupClause *n = (GroupClause *) palloc(sizeof(GroupClause));
+					n->distinct = false;
+					n->list = NIL;
+					$$ = n;
+				}
 		;
 
 group_by_list:
@@ -15145,7 +15169,8 @@ PLpgSQL_Expr: opt_distinct_clause opt_target_list
 					n->targetList = $2;
 					n->fromClause = $3;
 					n->whereClause = $4;
-					n->groupClause = $5;
+					n->groupClause = ($5)->list;
+					n->groupDistinct = ($5)->distinct;
 					n->havingClause = $6;
 					n->windowClause = $7;
 					n->sortClause = $8;
diff --git a/src/backend/parser/parse_agg.c b/src/backend/parser/parse_agg.c
index fd08b9eeff..899327aaf4 100644
--- a/src/backend/parser/parse_agg.c
+++ b/src/backend/parser/parse_agg.c
@@ -1071,7 +1071,7 @@ parseCheckAggregates(ParseState *pstate, Query *qry)
 		 * The limit of 4096 is arbitrary and exists simply to avoid resource
 		 * issues from pathological constructs.
 		 */
-		List	   *gsets = expand_grouping_sets(qry->groupingSets, 4096);
+		List	   *gsets = expand_grouping_sets(qry->groupingSets, qry->groupDistinct, 4096);
 
 		if (!gsets)
 			ereport(ERROR,
@@ -1735,6 +1735,33 @@ cmp_list_len_asc(const ListCell *a, const ListCell *b)
 	return (la > lb) ? 1 : (la == lb) ? 0 : -1;
 }
 
+/* list_sort comparator to sort sub-lists by length and contents */
+static int
+cmp_list_len_contents_asc(const ListCell *a, const ListCell *b)
+{
+	int		res = cmp_list_len_asc(a, b);
+
+	if (res == 0)
+	{
+		List		   *la = (List *) lfirst(a);
+		List		   *lb = (List *) lfirst(b);
+		ListCell	   *lca;
+		ListCell	   *lcb;
+
+		forboth(lca, la, lcb, lb)
+		{
+			int		va = intVal(lca);
+			int		vb = intVal(lcb);
+			if (va > vb)
+				return 1;
+			if (va < vb)
+				return -1;
+		}
+	}
+
+	return res;
+}
+
 /*
  * Expand a groupingSets clause to a flat list of grouping sets.
  * The returned list is sorted by length, shortest sets first.
@@ -1743,7 +1770,7 @@ cmp_list_len_asc(const ListCell *a, const ListCell *b)
  * some consistency checks.
  */
 List *
-expand_grouping_sets(List *groupingSets, int limit)
+expand_grouping_sets(List *groupingSets, bool groupDistinct, int limit)
 {
 	List	   *expanded_groups = NIL;
 	List	   *result = NIL;
@@ -1801,8 +1828,31 @@ expand_grouping_sets(List *groupingSets, int limit)
 		result = new_result;
 	}
 
-	/* Now sort the lists by length */
-	list_sort(result, cmp_list_len_asc);
+	/* Now sort the lists by length and deduplicate if necessary */
+	if (!groupDistinct || list_length(result) < 2)
+		list_sort(result, cmp_list_len_asc);
+	else
+	{
+		ListCell   *cell;
+		List	   *prev;
+
+		/* Sort each groupset individually */
+		foreach(cell, result)
+			list_sort(lfirst(cell), list_int_cmp);
+
+		/* Now sort the list of groupsets by length and contents */
+		list_sort(result, cmp_list_len_contents_asc);
+
+		/* Finally, remove duplicates */
+		prev = list_nth_node(List, result, 0);
+		for_each_from(cell, result, 1)
+		{
+			if (equal(lfirst(cell), prev))
+				foreach_delete_current(result, cell);
+			else
+				prev = lfirst(cell);
+		}
+	}
 
 	return result;
 }
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 879288c139..f0de2a25c9 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -5517,6 +5517,8 @@ get_basic_select_query(Query *query, deparse_context *context,
 
 		appendContextKeyword(context, " GROUP BY ",
 							 -PRETTYINDENT_STD, PRETTYINDENT_STD, 1);
+		if (query->groupDistinct)
+			appendStringInfoString(buf, "DISTINCT ");
 
 		save_exprkind = context->special_exprkind;
 		context->special_exprkind = EXPR_KIND_GROUP_BY;
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 236832a2ca..3a81d4f267 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -62,6 +62,14 @@ typedef enum SortByNulls
 	SORTBY_NULLS_LAST
 } SortByNulls;
 
+/* Options for [ ALL | DISTINCT ] */
+typedef enum SetQuantifier
+{
+	SET_QUANTIFIER_DEFAULT,
+	SET_QUANTIFIER_ALL,
+	SET_QUANTIFIER_DISTINCT
+} SetQuantifier;
+
 /*
  * Grantable rights are encoded so that we can OR them together in a bitmask.
  * The present representation of AclItem limits us to 16 distinct rights,
@@ -146,6 +154,7 @@ typedef struct Query
 	List	   *returningList;	/* return-values list (of TargetEntry) */
 
 	List	   *groupClause;	/* a list of SortGroupClause's */
+	bool		groupDistinct;	/* is the group by clause distinct? */
 
 	List	   *groupingSets;	/* a list of GroupingSet's if present */
 
@@ -1629,6 +1638,7 @@ typedef struct SelectStmt
 	List	   *fromClause;		/* the FROM clause */
 	Node	   *whereClause;	/* WHERE qualification */
 	List	   *groupClause;	/* GROUP BY clauses */
+	bool		groupDistinct;	/* Is this GROUP BY DISTINCT? */
 	Node	   *havingClause;	/* HAVING conditional-expression */
 	List	   *windowClause;	/* WINDOW window_name AS (...), ... */
 
diff --git a/src/include/nodes/pg_list.h b/src/include/nodes/pg_list.h
index 404e03f132..30f98c4595 100644
--- a/src/include/nodes/pg_list.h
+++ b/src/include/nodes/pg_list.h
@@ -604,6 +604,7 @@ extern pg_nodiscard List *list_copy_deep(const List *oldlist);
 typedef int (*list_sort_comparator) (const ListCell *a, const ListCell *b);
 extern void list_sort(List *list, list_sort_comparator cmp);
 
+extern int	list_int_cmp(const ListCell *p1, const ListCell *p2);
 extern int	list_oid_cmp(const ListCell *p1, const ListCell *p2);
 
 #endif							/* PG_LIST_H */
diff --git a/src/include/parser/parse_agg.h b/src/include/parser/parse_agg.h
index 0a2546c3ea..4dea01752a 100644
--- a/src/include/parser/parse_agg.h
+++ b/src/include/parser/parse_agg.h
@@ -26,7 +26,7 @@ extern void transformWindowFuncCall(ParseState *pstate, WindowFunc *wfunc,
 
 extern void parseCheckAggregates(ParseState *pstate, Query *qry);
 
-extern List *expand_grouping_sets(List *groupingSets, int limit);
+extern List *expand_grouping_sets(List *groupingSets, bool groupDistinct, int limit);
 
 extern int	get_aggregate_argtypes(Aggref *aggref, Oid *inputTypes);
 
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index 7c844c6e09..4c467c1b15 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1929,4 +1929,115 @@ set work_mem to default;
 
 drop table gs_group_1;
 drop table gs_hash_1;
+-- GROUP BY DISTINCT
+-- "normal" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by all rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   | 3
+ 1 |   |  
+ 1 |   |  
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   | 9
+ 7 |   |  
+ 7 |   |  
+ 7 |   |  
+   |   |  
+(25 rows)
+
+-- ...which is also the default
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   | 3
+ 1 |   |  
+ 1 |   |  
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   | 9
+ 7 |   |  
+ 7 |   |  
+ 7 |   |  
+   |   |  
+(25 rows)
+
+-- "group by distinct" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by distinct rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   |  
+ 4 |   | 6
+ 4 |   | 6
+ 4 |   |  
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   |  
+   |   |  
+(13 rows)
+
+-- ...which is not the same as "select distinct"
+select distinct a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 3
+ 1 | 2 |  
+ 1 |   | 3
+ 1 |   |  
+ 4 |   | 6
+ 4 |   |  
+ 7 | 8 | 9
+ 7 | 8 |  
+ 7 |   | 9
+ 7 |   |  
+   |   |  
+(11 rows)
+
 -- end
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 18ae803e9d..3944944704 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -529,4 +529,30 @@ set work_mem to default;
 drop table gs_group_1;
 drop table gs_hash_1;
 
+-- GROUP BY DISTINCT
+
+-- "normal" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by all rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- ...which is also the default
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- "group by distinct" behavior...
+select a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by distinct rollup(a, b), rollup(a, c)
+order by a, b, c;
+
+-- ...which is not the same as "select distinct"
+select distinct a, b, c
+from (values (1, 2, 3), (4, null, 6), (7, 8, 9)) as t (a, b, c)
+group by rollup(a, b), rollup(a, c)
+order by a, b, c;
+
 -- end
-- 
2.25.1

#10

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 5 years ago

In reply to: Vik Fearing (#9)

Re: GROUP BY DISTINCT

On 3/16/21 9:21 AM, Vik Fearing wrote:

On 3/13/21 12:33 AM, Tomas Vondra wrote:

Hi Vik,

The patch seems quite ready, I have just two comments.

Thanks for taking a look.

1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.

Good idea; I never think about the index.

2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
in order to stash it in the group lists is rather ugly, IMHO. It forces
all the places handling the list to be aware of this (there are not
many, but still ...). And there are no other places doing (bool) intVal
so it's not like there's a precedent for this.

There is kind of a precedent for it, I was copying off of TriggerEvents
and func_alias_clause.

I see. I was looking for "(bool) intVal" but you're right TriggerEvents
code does something similar.

I think the clean solution is to make group_clause produce a struct with
two fields, and just use that. Not sure how invasive that will be
outside gram.y, though.

I didn't want to create a whole new parse node for it, but Andrew Gierth
pointed me towards SelectLimit so I did it like that and I agree it is
much cleaner.

I agree, that's much cleaner.

Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
wonder if we can come up with some clearer names, describing the context
of those types.

I turned this into an enum for ALL/DISTINCT/default and the caller can
choose what it wants to do with default. I think that's a lot cleaner,
too. Maybe DISTINCT ON should be changed to fit in that? I left it
alone for now.

I think DISTINCT ON is a different kind of animal, because that is a
list of expressions, not just a simple enum state.

I also snuck in something that all of us overlooked which is outputting
the DISTINCT in ruleutils.c. I didn't add a test for it but that would
have been an unfortunate bug.

Oh!

New patch attached, rebased on 15639d5e8f.

Thanks. At this point it seems fine to me, no further comments.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 5 years ago

In reply to: Tomas Vondra (#10)

Re: GROUP BY DISTINCT

On 3/16/21 3:52 PM, Tomas Vondra wrote:

On 3/16/21 9:21 AM, Vik Fearing wrote:

On 3/13/21 12:33 AM, Tomas Vondra wrote:

Hi Vik,

The patch seems quite ready, I have just two comments.

Thanks for taking a look.

1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.

Good idea; I never think about the index.

2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
in order to stash it in the group lists is rather ugly, IMHO. It forces
all the places handling the list to be aware of this (there are not
many, but still ...). And there are no other places doing (bool) intVal
so it's not like there's a precedent for this.

There is kind of a precedent for it, I was copying off of TriggerEvents
and func_alias_clause.

I see. I was looking for "(bool) intVal" but you're right TriggerEvents
code does something similar.

I think the clean solution is to make group_clause produce a struct with
two fields, and just use that. Not sure how invasive that will be
outside gram.y, though.

I didn't want to create a whole new parse node for it, but Andrew Gierth
pointed me towards SelectLimit so I did it like that and I agree it is
much cleaner.

I agree, that's much cleaner.

Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
wonder if we can come up with some clearer names, describing the context
of those types.

I turned this into an enum for ALL/DISTINCT/default and the caller can
choose what it wants to do with default. I think that's a lot cleaner,
too. Maybe DISTINCT ON should be changed to fit in that? I left it
alone for now.

I think DISTINCT ON is a different kind of animal, because that is a
list of expressions, not just a simple enum state.

I also snuck in something that all of us overlooked which is outputting
the DISTINCT in ruleutils.c. I didn't add a test for it but that would
have been an unfortunate bug.

Oh!

New patch attached, rebased on 15639d5e8f.

Thanks. At this point it seems fine to me, no further comments.

Pushed. Thanks for the patch.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12

Pantelis Theodosiou

ypercube@gmail.com

almost 5 years ago

In reply to: Pantelis Theodosiou (#8)

DISTINCT term in aggregate function

Hi, I didn't think of including you in this suggestion.
Or the pdsql-docs was not the right list to post? I didn't want to mix it
with the GROUP BY DISTINCT patch.

Please check my suggestion.

Best regards
Pantelis Theodosiou

---------- Forwarded message ---------
From: Pantelis Theodosiou <ypercube@gmail.com>
Date: Sat, Mar 13, 2021 at 1:03 AM
Subject: Fwd: GROUP BY DISTINCT
To: <pgsql-docs@lists.postgresql.org>

Hi Vik,

The patch seems quite ready, I have just two comments.

1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.

.....

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Best regards

Pantelis Theodosiou

#13

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 5 years ago

In reply to: Pantelis Theodosiou (#12)

Re: DISTINCT term in aggregate function

Sorry, I'm not reading pgsql-docs very often, so I missed the post.
Yeah, we should probably add an indexterm to the other places too.

regards

On 3/18/21 7:03 PM, Pantelis Theodosiou wrote:

Hi, I didn't think of including you in this suggestion.
Or the pdsql-docs was not the right list to post? I didn't want to mix
it with the GROUP BY DISTINCT patch.

Please check my suggestion.

Best regards
Pantelis Theodosiou

---------- Forwarded message ---------
From: *Pantelis Theodosiou* <ypercube@gmail.com <mailto:ypercube@gmail.com>>
Date: Sat, Mar 13, 2021 at 1:03 AM
Subject: Fwd: GROUP BY DISTINCT
To: <pgsql-docs@lists.postgresql.org
<mailto:pgsql-docs@lists.postgresql.org>>

---------- Forwarded message ---------
From: *Tomas Vondra* <tomas.vondra@enterprisedb.com
<mailto:tomas.vondra@enterprisedb.com>>
Date: Fri, Mar 12, 2021 at 11:33 PM
Subject: Re: GROUP BY DISTINCT
To: Vik Fearing <vik@postgresfriends.org
<mailto:vik@postgresfriends.org>>, Georgios Kokolatos
<gkokolatos@protonmail.com <mailto:gkokolatos@protonmail.com>>,
<pgsql-hackers@lists.postgresql.org
<mailto:pgsql-hackers@lists.postgresql.org>>
Cc: Erik Rijkers <er@xs4all.nl <mailto:er@xs4all.nl>>

Hi Vik,

The patch seems quite ready, I have just two comments.

1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.

.....

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com <http://www.enterprisedb.com>
The Enterprise PostgreSQL Company

After reading the above thread in hackers, I noticed that the index does
not point to aggrgeate functions either and DISTINCT is not mentioned in
the aggregate functions page
either: https://www.postgresql.org/docs/current/functions-aggregate.html
<https://www.postgresql.org/docs/current/functions-aggregate.html>
Shouldn't it be mentioned with an example of COUNT(DISTINCT ...) or
aggregate_function(DISTINCT ...) in general ?

Best regards

Pantelis Theodosiou

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 5 years ago

In reply to: Tomas Vondra (#11)

Re: GROUP BY DISTINCT

On 3/18/21 6:25 PM, Tomas Vondra wrote:

On 3/16/21 3:52 PM, Tomas Vondra wrote:

On 3/16/21 9:21 AM, Vik Fearing wrote:

On 3/13/21 12:33 AM, Tomas Vondra wrote:

Hi Vik,

The patch seems quite ready, I have just two comments.

Thanks for taking a look.

1) Shouldn't this add another <indexterm> for DISTINCT, somewhere in the
documentation? Now the index points just to the SELECT DISTINCT part.

Good idea; I never think about the index.

2) The part in gram.y that wraps/unwraps the boolean flag as an integer,
in order to stash it in the group lists is rather ugly, IMHO. It forces
all the places handling the list to be aware of this (there are not
many, but still ...). And there are no other places doing (bool) intVal
so it's not like there's a precedent for this.

There is kind of a precedent for it, I was copying off of TriggerEvents
and func_alias_clause.

I see. I was looking for "(bool) intVal" but you're right TriggerEvents
code does something similar.

I think the clean solution is to make group_clause produce a struct with
two fields, and just use that. Not sure how invasive that will be
outside gram.y, though.

I didn't want to create a whole new parse node for it, but Andrew Gierth
pointed me towards SelectLimit so I did it like that and I agree it is
much cleaner.

I agree, that's much cleaner.

Also, the all_or_distinct vs. distinct_or_all seems a bit error-prone. I
wonder if we can come up with some clearer names, describing the context
of those types.

I turned this into an enum for ALL/DISTINCT/default and the caller can
choose what it wants to do with default. I think that's a lot cleaner,
too. Maybe DISTINCT ON should be changed to fit in that? I left it
alone for now.

I think DISTINCT ON is a different kind of animal, because that is a
list of expressions, not just a simple enum state.

I also snuck in something that all of us overlooked which is outputting
the DISTINCT in ruleutils.c. I didn't add a test for it but that would
have been an unfortunate bug.

Oh!

New patch attached, rebased on 15639d5e8f.

Thanks. At this point it seems fine to me, no further comments.

Pushed. Thanks for the patch.

Hmmm, this seems to fail on lapwing with this error:

parse_agg.c: In function 'expand_grouping_sets':
parse_agg.c:1851:23: error: value computed is not used
[-Werror=unused-value]
cc1: all warnings being treated as errors

That line is this:

foreach_delete_current(result, cell);

and I don't see how any of the values close by could be unused ...

The only possibility I can think of is some sort of issue in the old-ish
gcc release (4.7.2).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Tomas Vondra (#14)

Re: GROUP BY DISTINCT

On Fri, Mar 19, 2021 at 8:27 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hmmm, this seems to fail on lapwing with this error:

parse_agg.c: In function 'expand_grouping_sets':
parse_agg.c:1851:23: error: value computed is not used
[-Werror=unused-value]
cc1: all warnings being treated as errors

That line is this:

foreach_delete_current(result, cell);

and I don't see how any of the values close by could be unused ...

The only possibility I can think of is some sort of issue in the old-ish
gcc release (4.7.2).

No sure what's going on there, but data points: I tried a 32 bit build
here (that's the other special thing about lapwing) and didn't see the
warning. GCC 10. Also curculio (gcc 4.2) and snapper (gcc 4.7) are
also showing this warning, but they don't have -Werror so they don't
fail. sidewinder (gcc 4.8) is not showing the warning.

#16

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 5 years ago

In reply to: Thomas Munro (#15)

Re: GROUP BY DISTINCT

On 3/18/21 10:02 PM, Thomas Munro wrote:

On Fri, Mar 19, 2021 at 8:27 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hmmm, this seems to fail on lapwing with this error:

parse_agg.c: In function 'expand_grouping_sets':
parse_agg.c:1851:23: error: value computed is not used
[-Werror=unused-value]
cc1: all warnings being treated as errors

That line is this:

foreach_delete_current(result, cell);

and I don't see how any of the values close by could be unused ...

The only possibility I can think of is some sort of issue in the old-ish
gcc release (4.7.2).

No sure what's going on there, but data points: I tried a 32 bit build
here (that's the other special thing about lapwing) and didn't see the
warning. GCC 10. Also curculio (gcc 4.2) and snapper (gcc 4.7) are
also showing this warning, but they don't have -Werror so they don't
fail. sidewinder (gcc 4.8) is not showing the warning.

Thanks for the info. So it's likely related to older gcc releases. The
question is how to tweak the code to get rid of this ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Tomas Vondra (#16)

Re: GROUP BY DISTINCT

On Fri, Mar 19, 2021 at 10:14 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

The only possibility I can think of is some sort of issue in the old-ish
gcc release (4.7.2).

No sure what's going on there, but data points: I tried a 32 bit build
here (that's the other special thing about lapwing) and didn't see the
warning. GCC 10. Also curculio (gcc 4.2) and snapper (gcc 4.7) are
also showing this warning, but they don't have -Werror so they don't
fail. sidewinder (gcc 4.8) is not showing the warning.

Thanks for the info. So it's likely related to older gcc releases. The
question is how to tweak the code to get rid of this ...

It's frustrating to have to do press-ups to fix a problem because a
zombie Debian 7 system is running with -Werror (though it's always
possible that it's telling us something interesting...). Anyway, I
think someone with a GCC < 4.8 compiler would have to investigate. I
was hoping to help, but none of my systems have one in easy-to-install
format...

#18

Tom Lane

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Thomas Munro (#17)

Re: GROUP BY DISTINCT

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Mar 19, 2021 at 10:14 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Thanks for the info. So it's likely related to older gcc releases. The
question is how to tweak the code to get rid of this ...

It's frustrating to have to do press-ups to fix a problem because a
zombie Debian 7 system is running with -Werror (though it's always
possible that it's telling us something interesting...). Anyway, I
think someone with a GCC < 4.8 compiler would have to investigate. I
was hoping to help, but none of my systems have one in easy-to-install
format...

Hmm ... prairiedog isn't showing the warning, but maybe gaur will.
I can take a look if nobody else is stepping up.

regards, tom lane

#19

Tom Lane

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Tom Lane (#18)

Re: GROUP BY DISTINCT

I wrote:

Hmm ... prairiedog isn't showing the warning, but maybe gaur will.

Bingo:

parse_agg.c: In function 'expand_grouping_sets':
parse_agg.c:1851:5: warning: value computed is not used

This is gcc 4.5, but hopefully whatever shuts it up will also work on 4.7.
I'll work on figuring that out.

regards, tom lane

#20

Tom Lane

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Tom Lane (#19)

Re: GROUP BY DISTINCT

I wrote:

This is gcc 4.5, but hopefully whatever shuts it up will also work on 4.7.
I'll work on figuring that out.

Actually, the problem is pretty obvious after comparing this use
of foreach_delete_current() to every other one. I'm not sure why
the compiler warnings are phrased just as they are, but the fix
I just pushed does make 4.5 happy.

regards, tom lane

#21

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 5 years ago

In reply to: Tom Lane (#20)

Re: GROUP BY DISTINCT

On 3/19/21 12:26 AM, Tom Lane wrote:

I wrote:

This is gcc 4.5, but hopefully whatever shuts it up will also work on 4.7.
I'll work on figuring that out.

Actually, the problem is pretty obvious after comparing this use
of foreach_delete_current() to every other one. I'm not sure why
the compiler warnings are phrased just as they are, but the fix
I just pushed does make 4.5 happy.

Thanks! Yeah, that looks obvious. Funny the older compilers noticed
this, not the new fancy ones.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22

Vik Fearing

vik@postgresfriends.org

almost 5 years ago

In reply to: Tomas Vondra (#21)

Re: GROUP BY DISTINCT

On 3/19/21 12:52 AM, Tomas Vondra wrote:

On 3/19/21 12:26 AM, Tom Lane wrote:

I wrote:

This is gcc 4.5, but hopefully whatever shuts it up will also work on 4.7.
I'll work on figuring that out.

Actually, the problem is pretty obvious after comparing this use
of foreach_delete_current() to every other one. I'm not sure why
the compiler warnings are phrased just as they are, but the fix
I just pushed does make 4.5 happy.

Thanks! Yeah, that looks obvious. Funny the older compilers noticed
this, not the new fancy ones.

I'm glad the buildfarm is so diverse.
--
Vik Fearing