Remove useless GROUP BY columns considering unique index

Started by Zhang Mingli · about 2 years ago · 24 messages
#1 Zhang Mingli
zmlpostgres@gmail.com
1 attachment(s)

Hi,

This idea first came from the thread "remove_useless_groupby_columns does not need to record constraint dependencies" [0], which points out that
a unique index whose columns all have NOT NULL constraints can do the same work as a primary key when removing useless GROUP BY columns.
I studied it and implemented the idea.

Ex:

create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));

explain (costs off) select * from t2 group by a,b,c;
      QUERY PLAN
----------------------
 HashAggregate
   Group Key: c
   ->  Seq Scan on t2

The plan now drops columns a and b, since the patch does a little more than the PK-only logic.

For this query, t2 has primary key (a, b), so before this patch we could drop column c because {a, b} is the PK.
But we also have a unique index on (c) with a NOT NULL constraint, so we could instead drop columns {a, b} and keep just {c}.

So we have multiple choices: group by a, b (c is removed by the PK) or group by c (a, b are removed by the unique not-null index).
I implemented it to choose the candidate with the fewest columns, so that we can drop as many columns as possible.
I think that is good for the planner, e.g. it saves some cost by sorting fewer columns.
There may be a better choice for some reason, like trying to keep the PK for the planner?
I'm not sure about that, and it doesn't seem worth much extra complexity.

A NOT NULL guarantee may also be implied by a primary key, e.g.:
create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(a));
Primary key (a, b) ensures that a is NOT NULL and we have a unique index on (a), but recognizing that would make it more complex to check whether a unique index can be used.
I also doubt it is worth doing that.
So my patch keeps it simple: it checks the unique index's columns, and the index is a valid candidate only if all of them have an explicit NOT NULL constraint.
Among the candidates, get_min_unique_not_null_attnos() chooses the one with the fewest columns, the reasoning being that fewer key columns mean more GROUP BY columns can be removed.
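For illustration, the expected behavior under this rule (my expectation, not a regression test from the patch): with the t2 just defined,

explain (costs off) select * from t2 group by a,b,c;
-- expected Group Key: a, b: the PK is used, and unique(a) is skipped
-- because a carries no explicit NOT NULL constraint.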

create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
-- Test primary key beats unique not null index.
explain (costs off) select * from t3 group by a,b,c,d;
 QUERY PLAN
----------------------
 HashAggregate
 Group Key: a, b
 -> Seq Scan on t3
(3 rows)

create temp table t4 (a int, b int not null, c int not null, d int not null, primary key (a, b), unique(b, c), unique(d));
-- Test unique not null index with fewer columns wins.
explain (costs off) select * from t4 group by a,b,c,d;
 QUERY PLAN
----------------------
 HashAggregate
 Group Key: d
 -> Seq Scan on t4
(3 rows)

The unique indexes may overlap with the primary key and with each other.

create temp table t5 (a int not null, b int not null, c int not null, d int not null, unique (a, b), unique(b, c), unique(a, c, d));
-- Test unique not null indices have overlap.
explain (costs off) select * from t5 group by a,b,c,d;
 QUERY PLAN
----------------------
 HashAggregate
 Group Key: a, b
 -> Seq Scan on t5
(3 rows)

[0]: /messages/by-id/CAApHDvrdYa=VhOoMe4ZZjZ-G4ALnD-xuAeUNCRTL+PYMVN8OnQ@mail.gmail.com

Zhang Mingli
www.hashdata.xyz

Attachments:

v1-0001-Remove-useless-GROUP-BY-columns-considering-unique-i.patch (application/octet-stream)
From ff0920780902a9b334297d8272d9c13b11e6a5ae Mon Sep 17 00:00:00 2001
From: Zhang Mingli <avamingli@gmail.com>
Date: Fri, 29 Dec 2023 22:14:05 +0800
Subject: [PATCH] Remove useless GROUP BY columns considering unique index.

Before this commit, we try to remove any columns in the GROUP
BY clause that are redundant due to being functionally dependent
on other GROUP BY columns (the primary key).

But a unique index whose columns all have NOT NULL constraints
can also do this work.

create temp table t2 (a int, b int, c int not null,
 primary key (a, b), unique(c));

explain (costs off) select * from t2 group by a,b,c;
      QUERY PLAN
----------------------
 HashAggregate
   Group Key: c
   ->  Seq Scan on t2

There may be multiple unique index candidates; choose the one with
the fewest columns so that we can drop more useless columns.

Authored-by: Zhang Mingli avamingli@gmail.com
---
 src/backend/catalog/pg_constraint.c      | 113 ++++++++++++++++++++++-
 src/backend/commands/tablecmds.c         |   2 +-
 src/backend/optimizer/plan/planner.c     |  28 +++++-
 src/backend/parser/parse_utilcmd.c       |   4 +-
 src/include/catalog/pg_constraint.h      |   4 +-
 src/test/regress/expected/aggregates.out |  58 ++++++++++++
 src/test/regress/sql/aggregates.sql      |  30 ++++++
 7 files changed, 229 insertions(+), 10 deletions(-)

diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index e9d4d6006e..9c579d4214 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -788,11 +788,11 @@ AdjustNotNullInheritance(Oid relid, Bitmapset *columns, int count)
  *
  * This is seldom needed, so we just scan pg_constraint each time.
  *
- * XXX This is only used to create derived tables, so NO INHERIT constraints
- * are always skipped.
+ * XXX When used to create derived tables, set skip_no_inherit to true,
+ * so that NO INHERIT constraints will be skipped.
  */
 List *
-RelationGetNotNullConstraints(Oid relid, bool cooked)
+RelationGetNotNullConstraints(Oid relid, bool cooked, bool skip_no_inherit)
 {
 	List	   *notnulls = NIL;
 	Relation	constrRel;
@@ -815,7 +815,7 @@ RelationGetNotNullConstraints(Oid relid, bool cooked)
 
 		if (conForm->contype != CONSTRAINT_NOTNULL)
 			continue;
-		if (conForm->connoinherit)
+		if (skip_no_inherit && conForm->connoinherit)
 			continue;
 
 		colnum = extractNotNullColumn(htup);
@@ -1658,3 +1658,108 @@ check_functional_grouping(Oid relid,
 
 	return false;
 }
+
+/*
+ * get_min_unique_not_null_attnos
+ *
+ * Identify a unique index of the relation all of whose columns have
+ * the NOT NULL constraint, if any, and return its column attnos.
+ * If there is more than one such index, return the one with the fewest columns.
+ */
+Bitmapset*
+get_min_unique_not_null_attnos(Oid relid)
+{
+	List	   *not_null_attnos = NIL;
+	ListCell   *lc = NULL;
+	Bitmapset  *unique_notnull_attnos = NULL;
+	Relation	pg_constraint;
+	HeapTuple	tuple;
+	SysScanDesc scan;
+	ScanKeyData skey[1];
+	int 		min_numkeys = 0;
+
+	/* Use cooked to fetch attnos. */
+	List* not_null_cs = RelationGetNotNullConstraints(relid, true, false);
+	if (not_null_cs == NIL)
+		return unique_notnull_attnos;
+	/* Fill an attno list so we can look it up easily */
+	foreach (lc, not_null_cs)
+	{
+		CookedConstraint* cc = lfirst(lc);
+		not_null_attnos = list_append_unique_int(not_null_attnos, cc->attnum);
+	}
+
+	/* Scan pg_constraint for constraints of the target rel */
+	pg_constraint = table_open(ConstraintRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_constraint_conrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+
+	scan = systable_beginscan(pg_constraint, ConstraintRelidTypidNameIndexId, true,
+							  NULL, 1, skey);
+
+	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+	{
+		Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
+		Datum		adatum;
+		bool		isNull;
+		ArrayType  *arr;
+		int16	   *attnums;
+		int			numkeys;
+		int			i;
+		bool all_notnull;
+		Bitmapset  *candidate = NULL;
+
+		/* Skip constraints that are not UNIQUE */
+		if (con->contype != CONSTRAINT_UNIQUE)
+			continue;
+
+		if (con->condeferrable)
+			break;
+
+		/* Extract the conkey array, ie, attnums of UNIQUE's columns */
+		adatum = heap_getattr(tuple, Anum_pg_constraint_conkey,
+							  RelationGetDescr(pg_constraint), &isNull);
+		if (isNull)
+			elog(ERROR, "null conkey for constraint %u",
+				 ((Form_pg_constraint) GETSTRUCT(tuple))->oid);
+		arr = DatumGetArrayTypeP(adatum);	/* ensure not toasted */
+		numkeys = ARR_DIMS(arr)[0];
+		if (ARR_NDIM(arr) != 1 ||
+			numkeys < 0 ||
+			ARR_HASNULL(arr) ||
+			ARR_ELEMTYPE(arr) != INT2OID)
+			elog(ERROR, "conkey is not a 1-D smallint array");
+		attnums = (int16 *) ARR_DATA_PTR(arr);
+
+		/* Skip if it has more columns than the current best. */
+		if (min_numkeys > 0 && numkeys >= min_numkeys)
+			continue;
+
+		all_notnull = true;
+		/* Construct the result value */
+		for (i = 0; i < numkeys; i++)
+		{
+			/* Skip if unique key could be NULL. */
+			if (!list_member_int(not_null_attnos, attnums[i]))
+			{
+				all_notnull = false;
+				break;
+			}
+
+			candidate = bms_add_member(candidate,
+									  attnums[i] - FirstLowInvalidHeapAttributeNumber);
+		}
+		if(!all_notnull)
+			continue;
+		min_numkeys = numkeys;
+		unique_notnull_attnos = candidate;
+	}
+	systable_endscan(scan);
+
+	table_close(pg_constraint, AccessShareLock);
+
+	return unique_notnull_attnos;
+}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6b0a20010e..8797386326 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2682,7 +2682,7 @@ MergeAttributes(List *columns, const List *supers, char relpersistence,
 		 */
 		pkattrs = RelationGetIndexAttrBitmap(relation,
 											 INDEX_ATTR_BITMAP_PRIMARY_KEY);
-		nnconstrs = RelationGetNotNullConstraints(RelationGetRelid(relation), true);
+		nnconstrs = RelationGetNotNullConstraints(RelationGetRelid(relation), true, true);
 		foreach(lc1, nnconstrs)
 			nncols = bms_add_member(nncols,
 									((CookedConstraint *) lfirst(lc1))->attnum);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6f45efde21..d7929c9ef6 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2718,6 +2718,8 @@ remove_useless_groupby_columns(PlannerInfo *root)
 		Bitmapset  *relattnos;
 		Bitmapset  *pkattnos;
 		Oid			constraintOid;
+		Bitmapset  *unique_notnull_attnos;
+		bool 		use_pkattnos;
 
 		relid++;
 
@@ -2743,14 +2745,26 @@ remove_useless_groupby_columns(PlannerInfo *root)
 		 * (i.e., nondeferrable) primary key constraint.
 		 */
 		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-		if (pkattnos == NULL)
+		/* Check if there is a unique not-null index. */
+		unique_notnull_attnos = get_min_unique_not_null_attnos(rte->relid);
+		if ((pkattnos == NULL) && (unique_notnull_attnos == NULL))
 			continue;
 
+		/*
+		 * Choose the primary key or a unique index to remove useless GROUP BY columns.
+		 * If both a primary key and a unique not-null index exist, choose the one with fewer columns.
+		 * The primary key is preferred if their column counts are equal.
+		 */
+		if (pkattnos && unique_notnull_attnos)
+			use_pkattnos = bms_num_members(unique_notnull_attnos) >= bms_num_members(pkattnos);
+		else
+			use_pkattnos = (pkattnos != NULL);
+
 		/*
 		 * If the primary key is a proper subset of relattnos then we have
 		 * some items in the GROUP BY that can be removed.
 		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+		if (use_pkattnos && bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
 		{
 			/*
 			 * To easily remember whether we've found anything to do, we don't
@@ -2763,6 +2777,16 @@ remove_useless_groupby_columns(PlannerInfo *root)
 			/* Remember the attnos of the removable columns */
 			surplusvars[relid] = bms_difference(relattnos, pkattnos);
 		}
+
+		/* Find removable columns using unique not null index. */
+		if (!use_pkattnos && bms_subset_compare(unique_notnull_attnos, relattnos) == BMS_SUBSET1)
+		{
+			if (surplusvars == NULL)
+				surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
+													 (list_length(parse->rtable) + 1));
+
+			surplusvars[relid] = bms_difference(relattnos, unique_notnull_attnos);
+		}
 	}
 
 	/*
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index cf0d432ab1..fca88f6046 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -1195,7 +1195,7 @@ transformTableLikeClause(CreateStmtContext *cxt, TableLikeClause *table_like_cla
 		 * constraints in expandTableLikeClause, so that we skip this for
 		 * those.
 		 */
-		foreach(lc, RelationGetNotNullConstraints(RelationGetRelid(relation), true))
+		foreach(lc, RelationGetNotNullConstraints(RelationGetRelid(relation), true, true))
 		{
 			CookedConstraint *cooked = (CookedConstraint *) lfirst(lc);
 
@@ -1473,7 +1473,7 @@ expandTableLikeClause(RangeVar *heapRel, TableLikeClause *table_like_clause)
 	 * Copy not-null constraints, too (these do not require any option to have
 	 * been given).
 	 */
-	foreach(lc, RelationGetNotNullConstraints(RelationGetRelid(relation), false))
+	foreach(lc, RelationGetNotNullConstraints(RelationGetRelid(relation), false, true))
 	{
 		AlterTableCmd *atsubcmd;
 
diff --git a/src/include/catalog/pg_constraint.h b/src/include/catalog/pg_constraint.h
index a026b42515..45c86dae4d 100644
--- a/src/include/catalog/pg_constraint.h
+++ b/src/include/catalog/pg_constraint.h
@@ -251,7 +251,7 @@ extern AttrNumber extractNotNullColumn(HeapTuple constrTup);
 extern bool AdjustNotNullInheritance1(Oid relid, AttrNumber attnum, int count,
 									  bool is_no_inherit);
 extern void AdjustNotNullInheritance(Oid relid, Bitmapset *columns, int count);
-extern List *RelationGetNotNullConstraints(Oid relid, bool cooked);
+extern List *RelationGetNotNullConstraints(Oid relid, bool cooked, bool skip_no_inherit);
 
 extern void RemoveConstraintById(Oid conId);
 extern void RenameConstraintById(Oid conId, const char *newname);
@@ -269,6 +269,8 @@ extern Oid	get_relation_idx_constraint_oid(Oid relationId, Oid indexId);
 
 extern Bitmapset *get_primary_key_attnos(Oid relid, bool deferrableOk,
 										 Oid *constraintOid);
+extern Bitmapset *get_min_unique_not_null_attnos(Oid relid);
+
 extern void DeconstructFkConstraintRow(HeapTuple tuple, int *numfks,
 									   AttrNumber *conkey, AttrNumber *confkey,
 									   Oid *pf_eq_oprs, Oid *pp_eq_oprs, Oid *ff_eq_oprs,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index d8271da4d1..3186f2f7d7 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1398,6 +1398,64 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 --
+-- Test removal of redundant GROUP BY columns using unique not null index.
+-- materialized view
+--
+create temp table t1 (a int, b int, c int, primary key (a, b), unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int, b int not null, c int not null, d int not null, primary key (a, b), unique(b, c), unique(d));
+create temp table t5 (a int not null, b int not null, c int not null, d int not null, unique (a, b), unique(b, c), unique(a, c, d));
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t1
+(3 rows)
+
+-- Test unique not null index beats primary key.
+explain (costs off) select * from t2 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t2
+(3 rows)
+
+-- Test primary key beats unique not null index.
+explain (costs off) select * from t3 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Test unique not null index with fewer columns wins.
+explain (costs off) select * from t4 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: d
+   ->  Seq Scan on t4
+(3 rows)
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t5 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t5
+(3 rows)
+
+drop table t1; 
+drop table t2;
+drop table t3;
+drop table t4;
+--
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
 create temp table t1(f1 int, f2 int);
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 75c78be640..2337281566 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -496,6 +496,36 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+-- materialized view
+--
+create temp table t1 (a int, b int, c int, primary key (a, b), unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int, b int not null, c int not null, d int not null, primary key (a, b), unique(b, c), unique(d));
+create temp table t5 (a int not null, b int not null, c int not null, d int not null, unique (a, b), unique(b, c), unique(a, c, d));
+
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+
+-- Test unique not null index beats primary key.
+explain (costs off) select * from t2 group by a,b,c;
+
+-- Test primary key beats unique not null index.
+explain (costs off) select * from t3 group by a,b,c,d;
+
+-- Test unique not null index with fewer columns wins.
+explain (costs off) select * from t4 group by a,b,c,d;
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t5 group by a,b,c,d;
+
+drop table t1; 
+drop table t2;
+drop table t3;
+drop table t4;
+
 --
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
-- 
2.34.1

#2 jian he
jian.universality@gmail.com
In reply to: Zhang Mingli (#1)
Re: Remove useless GROUP BY columns considering unique index

On Fri, Dec 29, 2023 at 11:05 PM Zhang Mingli <zmlpostgres@gmail.com> wrote:

Hi,

This idea first came from the thread "remove_useless_groupby_columns does not need to record constraint dependencies" [0], which points out that
a unique index whose columns all have NOT NULL constraints can do the same work as a primary key when removing useless GROUP BY columns.
I studied it and implemented the idea.

[0]: /messages/by-id/CAApHDvrdYa=VhOoMe4ZZjZ-G4ALnD-xuAeUNCRTL+PYMVN8OnQ@mail.gmail.com

cosmetic issues:
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+-- materialized view
+--
+create temp table t1 (a int, b int, c int, primary key (a, b), unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int, b int not null, c int not null, d int not null, primary key (a, b), unique(b, c), unique(d));
+create temp table t5 (a int not null, b int not null, c int not null, d int not null, unique (a, b), unique(b, c), unique(a, c, d));
+
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+
+-- Test unique not null index beats primary key.
+explain (costs off) select * from t2 group by a,b,c;
+
+-- Test primary key beats unique not null index.
+explain (costs off) select * from t3 group by a,b,c,d;
+
+-- Test unique not null index with fewer columns wins.
+explain (costs off) select * from t4 group by a,b,c,d;
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t5 group by a,b,c,d;
+
+drop table t1;
+drop table t2;
+drop table t3;
+drop table t4;
+

the `-- materialized view` line is unnecessary?
you forgot to drop table t5.

create temp table p_t2 (
a int not null,
b int not null,
c int not null,
d int,
unique(a), unique(a,b),unique(a,b,c)
) partition by list(a);
create temp table p_t2_1 partition of p_t2 for values in(1);
create temp table p_t2_2 partition of p_t2 for values in(2);
explain (costs off, verbose) select * from p_t2 group by a,b,c,d;
your patch output:
                          QUERY PLAN
--------------------------------------------------------------
 HashAggregate
   Output: p_t2.a, p_t2.b, p_t2.c, p_t2.d
   Group Key: p_t2.a
   ->  Append
         ->  Seq Scan on pg_temp.p_t2_1
               Output: p_t2_1.a, p_t2_1.b, p_t2_1.c, p_t2_1.d
         ->  Seq Scan on pg_temp.p_t2_2
               Output: p_t2_2.a, p_t2_2.b, p_t2_2.c, p_t2_2.d
(8 rows)

so it seems to work with partitioned tables; maybe you should add some
test cases with partitioned tables.

- * XXX This is only used to create derived tables, so NO INHERIT constraints
- * are always skipped.
+ * XXX When used to create derived tables, set skip_no_inherit to true,
+ * so that NO INHERIT constraints will be skipped.
  */
 List *
-RelationGetNotNullConstraints(Oid relid, bool cooked)
+RelationGetNotNullConstraints(Oid relid, bool cooked, bool skip_no_inherit)
 {
  List   *notnulls = NIL;
  Relation constrRel;
@@ -815,7 +815,7 @@ RelationGetNotNullConstraints(Oid relid, bool cooked)
  if (conForm->contype != CONSTRAINT_NOTNULL)
  continue;
- if (conForm->connoinherit)
+ if (skip_no_inherit && conForm->connoinherit)
  continue;

I don't think you need to refactor RelationGetNotNullConstraints.
However, I found it hard to explain (which one is the parent and which
one is the child is very confusing for me),
so I used the following script to test it:
DROP TABLE ATACC1, ATACC2;
CREATE TABLE ATACC1 (a int);
CREATE TABLE ATACC2 (b int,c int, unique(c)) INHERITS (ATACC1);
ALTER TABLE ATACC1 ADD NOT NULL a;
-- ALTER TABLE ATACC1 ADD NOT NULL a NO INHERIT;
ALTER TABLE ATACC2 ADD unique(a);
explain (costs off, verbose) select * from ATACC2 group by a,b,c;
-------------------------

create table t_0_3 (a int not null, b int not null, c int not null,
d int not null, unique (a, b), unique(b, c) DEFERRABLE, unique(d));
explain (costs off, verbose) select * from t_0_3 group by a,b,c,d;
           QUERY PLAN
--------------------------------
 HashAggregate
   Output: a, b, c, d
   Group Key: t_0_3.a, t_0_3.b
   ->  Seq Scan on public.t_0_3
         Output: a, b, c, d
(5 rows)

The above part seems not right; it should be `Group Key: t_0_3.d`,
so I changed it to:

/* Skip constraints that are not UNIQUE */
+ if (con->contype != CONSTRAINT_UNIQUE)
+ continue;
+
+ /* Skip unique constraints that are condeferred */
+ if (con->condeferrable && con->condeferred)
+ continue;

I guess you have probably noticed that in the function
remove_useless_groupby_columns,
get_primary_key_attnos is followed by get_min_unique_not_null_attnos.
Also, get_min_unique_not_null_attnos's main code is very similar to
get_primary_key_attnos.

They both do `pg_constraint = table_open(ConstraintRelationId,
AccessShareLock);`.
You might need to explain why we must open pg_constraint twice:
either it's cheap to do, or there is another reason.

If scanning twice is expensive, we may need to refactor get_primary_key_attnos;
it is only called in check_functional_grouping and
remove_useless_groupby_columns.
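For instance, both kinds of entries can be fetched from pg_constraint
in a single pass (an illustrative catalog query, not code from the patch):

select conname, contype, conkey
from pg_constraint
where conrelid = 't3'::regclass and contype in ('p', 'u');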

I added the patch to commitfest:
https://commitfest.postgresql.org/46/4751/

#3 David Rowley
dgrowleyml@gmail.com
In reply to: Zhang Mingli (#1)
Re: Remove useless GROUP BY columns considering unique index

On Sat, 30 Dec 2023 at 04:05, Zhang Mingli <zmlpostgres@gmail.com> wrote:

So my patch keeps it simple: it checks the unique index's columns, and the index is a valid candidate only if all of them have an explicit NOT NULL constraint.
Among the candidates, get_min_unique_not_null_attnos() chooses the one with the fewest columns, the reasoning being that fewer key columns mean more GROUP BY columns can be removed.

This patch no longer applies. We no longer catalogue NOT NULL
constraints, which this patch is coded to rely upon. (Likely it could
just look at pg_attribute.attnotnull instead)
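For reference, that flag is directly visible per column (a standard
catalog query, not part of the patch):

select attname, attnotnull
from pg_attribute
where attrelid = 't2'::regclass and attnum > 0;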

Aside from that, I'm not sure about the logic to prefer removing
functionally dependent columns using the constraint with the least
columns. If you had the PRIMARY KEY on two int columns and a unique
index on a 1MB varlena column, I think using the primary key would be
a better choice to remove functionally dependent columns with. Maybe
it's ok to remove any functionally dependent columns using the primary
key first and try removing functionally dependent columns using unique
constraints as a 2nd step (or maybe only if none can be removed using
the PK?)
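A concrete shape of that case (my construction, not one of the patch's
tests):

create table wide_uk (id1 int, id2 int, payload text not null,
                      primary key (id1, id2), unique (payload));
-- unique(payload) has fewer columns than the PK, but hashing or
-- sorting a large text value per row will likely cost more than
-- grouping by the two ints.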

Also, why constraints rather than unique indexes? You'd need to check
for expression indexes and likely ignore those due to no ability to
detect NOT NULL for expressions.
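For example (hypothetical table t with a text column name):

create unique index t_lower_name_uq on t (lower(name));
-- the index key is an expression, so there is no column whose
-- NOT NULL status could be checked.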

Also, looking at how you've coded
get_min_unique_not_null_attnos(), it looks like you're very optimistic
about that constraint having columns mentioned in the GROUP
BY clause. It looks like it won't work if the UNIQUE constraint with
the least columns gets no mention in the GROUP BY clause. That'll
cause performance regressions from what we have today when the primary
key is mentioned and columns can be removed using it, but a UNIQUE
constraint exists which has no corresponding GROUP BY columns and
fewer unique columns than the PK. That's not good and it'll need to be
fixed.
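One way to construct that regression (my example, following v1's
use_pkattnos logic):

create table r (a int, b int, c int not null, d int,
                primary key (a, b), unique (c));
explain (costs off) select * from r group by a, b, d;
-- today: d is removed using the PK, leaving Group Key: a, b.
-- with v1: unique(c) wins on column count, but {c} is not a subset of
-- the GROUP BY columns, so nothing gets removed.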

Set to waiting on author.

David

#4 Peter Eisentraut
peter@eisentraut.org
In reply to: David Rowley (#3)
Re: Remove useless GROUP BY columns considering unique index

On 12.09.24 03:43, David Rowley wrote:

On Sat, 30 Dec 2023 at 04:05, Zhang Mingli <zmlpostgres@gmail.com> wrote:

So my patch keeps it simple: it checks the unique index's columns, and the index is a valid candidate only if all of them have an explicit NOT NULL constraint.
Among the candidates, get_min_unique_not_null_attnos() chooses the one with the fewest columns, the reasoning being that fewer key columns mean more GROUP BY columns can be removed.

This patch no longer applies. We no longer catalogue NOT NULL
constraints, which this patch is coded to rely upon.

Work is ongoing to revive the patch that catalogs not-null constraints:
<https://commitfest.postgresql.org/49/5224/>. This patch should
probably wait for that patch at the moment.

(Likely it could just look at pg_attribute.attnotnull instead)

That won't work because you can't record dependencies on that. (This is
one of the reasons for cataloging not-null constraints as real constraints.)

#5 David Rowley
dgrowleyml@gmail.com
In reply to: Peter Eisentraut (#4)
Re: Remove useless GROUP BY columns considering unique index

On Wed, 18 Sept 2024 at 19:28, Peter Eisentraut <peter@eisentraut.org> wrote:

On 12.09.24 03:43, David Rowley wrote:

(Likely it could just look at pg_attribute.attnotnull instead)

That won't work because you can't record dependencies on that. (This is
one of the reasons for cataloging not-null constraints as real constraints.)

I'm not seeing any need to record constraint dependencies for this
optimisation. It would be different for detecting functional
dependencies in a view using a unique constraint+not null constraints
for ungrouped columns, but that's not what this is. This is just a
planner optimisation. The plan can be invalidated by a relcache
invalidation, which will happen if someone does ALTER TABLE DROP NOT
NULL.
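For instance (a sketch of the mechanism, not a test from the patch):

alter table t2 alter column c drop not null;
-- fires a relcache invalidation for t2, so any cached plan that
-- dropped GROUP BY columns based on unique(c) plus NOT NULL gets
-- replanned.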

For reference, see 5b736e9cf.

David

#6 Zhang Mingli
zmlpostgres@gmail.com
In reply to: David Rowley (#5)
Re: Remove useless GROUP BY columns considering unique index

Hi, all

I haven't paid attention to this topic in a long time; thanks all for the advice, I will study it and then update.

Thanks again.

Zhang Mingli
www.hashdata.xyz


On Sep 18, 2024 at 15:50 +0800, David Rowley <dgrowleyml@gmail.com>, wrote:

On Wed, 18 Sept 2024 at 19:28, Peter Eisentraut <peter@eisentraut.org> wrote:

On 12.09.24 03:43, David Rowley wrote:

(Likely it could just look at pg_attribute.attnotnull instead)

That won't work because you can't record dependencies on that. (This is
one of the reasons for cataloging not-null constraints as real constraints.)

I'm not seeing any need to record constraint dependencies for this
optimisation. It would be different for detecting functional
dependencies in a view using a unique constraint+not null constraints
for ungrouped columns, but that's not what this is. This is just a
planner optimisation. The plan can be invalidated by a relcache
invalidation, which will happen if someone does ALTER TABLE DROP NOT
NULL.

For reference, see 5b736e9cf.

David

#7 jian he
jian.universality@gmail.com
In reply to: David Rowley (#3)
1 attachment(s)
Re: Remove useless GROUP BY columns considering unique index

On Thu, Sep 12, 2024 at 9:44 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Sat, 30 Dec 2023 at 04:05, Zhang Mingli <zmlpostgres@gmail.com> wrote:

So my patch keeps it simple: it checks the unique index's columns, and the index is a valid candidate only if all of them have an explicit NOT NULL constraint.
Among the candidates, get_min_unique_not_null_attnos() chooses the one with the fewest columns, the reasoning being that fewer key columns mean more GROUP BY columns can be removed.

This patch no longer applies. We no longer catalogue NOT NULL
constraints, which this patch is coded to rely upon. (Likely it could
just look at pg_attribute.attnotnull instead)

Aside from that, I'm not sure about the logic to prefer removing
functionally dependent columns using the constraint with the least
columns. If you had the PRIMARY KEY on two int columns and a unique
index on a 1MB varlena column, I think using the primary key would be
a better choice to remove functionally dependent columns with. Maybe
it's ok to remove any functionally dependent columns using the primary
key first and try removing functionally dependent columns using unique
constraints as a 2nd step (or maybe only if none can be removed using
the PK?)

Let's be conservative:
if the primary key is there, then use it to
remove other useless columns;
if it's not there, find the unique-index-not-null columns and
use them to remove other columns in the GROUP BY clause.

Also, why constraints rather than unique indexes? You'd need to check
for expression indexes and likely ignore those due to no ability to
detect NOT NULL for expressions.

Some unique indexes don't have a pg_constraint entry.
for example:
create table t(a int);
create unique index idx_t on t(a);

Overall I came up with the attached patch.
Instead of looping through pg_constraint, I loop over pg_index and extract indkey.
Multiple indkeys can match against the GROUP BY expression;
with some looping, we find the indkey that has the minimum number of columns.
For example, the unique-not-null column sets can be
(a,b)
(a,b,c)
and since (a,b) has only two columns, we use (a,b) to remove other columns
in the GROUP BY expression.

Some tests are copied from Zhang Mingli;
the code implementation is quite different.

Attachments:

v2-0001-remove-useless-group-by-columns-via-unique-index.patch (text/x-patch)
From 6b6689038de51fc7c56f710f9b613c1998ea2bd1 Mon Sep 17 00:00:00 2001
From: jian he <jian.universality@gmail.com>
Date: Fri, 22 Nov 2024 21:08:59 +0800
Subject: [PATCH v2 1/1] remove useless group by columns via unique index

in remove_useless_groupby_columns, if a primary key exists and is a subset of
the group by columns, then we remove the non-primary-key columns in the group-by
clause.  if a primary key is not available, then we try to use columns that are
unique and not-null to remove other useless columns.

discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0@Spark
---
 src/backend/catalog/index.c              | 99 ++++++++++++++++++++++++
 src/backend/optimizer/plan/planner.c     | 66 +++++++++++++++-
 src/include/catalog/index.h              |  1 +
 src/test/regress/expected/aggregates.out | 75 ++++++++++++++++++
 src/test/regress/sql/aggregates.sql      | 39 ++++++++++
 5 files changed, 278 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f9bb721c5f..ad97925278 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4257,3 +4257,102 @@ RestoreReindexState(const void *reindexstate)
 	/* Note the worker has its own transaction nesting level */
 	reindexingNestLevel = GetCurrentTransactionNestLevel();
 }
+
+/*
+ * given an input relation oid, return a list of attnums that have a unique
+ * index and a not-null constraint associated with them. note we do not check
+ * this relation's primary key.
+*/
+List *
+get_unique_not_null_attnos(Oid relid)
+{
+	List	   	*not_null_attnos = NIL;
+	Relation	pg_index;
+	HeapTuple	indexTuple;
+	SysScanDesc scan;
+	ScanKeyData skey;
+	List 		*not_null_cs;
+	List	   *tlist = NIL;
+
+	/*
+	 * Use cooked to fetch attnos. note: a primary key has a separate not-null
+	 * constraint
+	 */
+	not_null_cs = RelationGetNotNullConstraints(relid, true, false);
+	if (not_null_cs == NIL)
+		return NULL;
+
+	foreach_ptr(CookedConstraint, cc, not_null_cs)
+		not_null_attnos = lappend_int(not_null_attnos, cc->attnum);
+
+	/* Scan pg_index for unique index of the target rel */
+	pg_index = table_open(IndexRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey,
+				Anum_pg_index_indrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+	scan = systable_beginscan(pg_index, IndexIndrelidIndexId, true,
+							  NULL, 1, &skey);
+
+	while (HeapTupleIsValid(indexTuple = systable_getnext(scan)))
+	{
+		Bitmapset  	*unique_notnull_attnos = NULL;
+		Form_pg_index index = (Form_pg_index) GETSTRUCT(indexTuple);
+		Datum		adatum;
+		bool		isNull;
+		int			numkeys;
+		bool 		all_notnull;
+		Datum	   *keys;
+		int			nKeys;
+
+		/* we're only interested in unique indexes */
+		if (!index->indisunique || index->indisprimary)
+			continue;
+
+		/* Skip invalid, exclusion index or deferred index */
+		if (!index->indisvalid || index->indisexclusion || !index->indimmediate)
+			continue;
+
+		/* Skip expression index or predicate index */
+		if (!heap_attisnull(indexTuple, Anum_pg_index_indpred, NULL) ||
+			!heap_attisnull(indexTuple, Anum_pg_index_indexprs, NULL))
+			continue;
+
+		/* Extract the pg_index->indkey int2vector */
+		adatum = heap_getattr(indexTuple, Anum_pg_index_indkey,
+							  RelationGetDescr(pg_index), &isNull);
+		if (isNull)
+			elog(ERROR, "null int2vector for index %u", index->indexrelid);
+
+		deconstruct_array_builtin(DatumGetArrayTypeP(adatum), INT2OID,
+									&keys, NULL, &nKeys);
+		if(nKeys != index->indnatts)
+			elog(ERROR, "corrupted int2vector for index %u", index->indexrelid);
+
+		Assert(nKeys >= index->indnkeyatts);
+		numkeys = index->indnkeyatts;
+
+		all_notnull = true;
+		for (int i = 0; i < numkeys; i++)
+		{
+			/* Skip if unique key attnum don't have not-null */
+			if (!list_member_int(not_null_attnos, DatumGetInt16(keys[i])))
+			{
+				all_notnull = false;
+				break;
+			}
+			unique_notnull_attnos = bms_add_member(unique_notnull_attnos,
+									  DatumGetInt16(keys[i]) - FirstLowInvalidHeapAttributeNumber);
+		}
+		if (unique_notnull_attnos != NULL)
+			tlist = lappend(tlist, unique_notnull_attnos);
+
+		if(!all_notnull)
+			continue;
+	}
+	systable_endscan(scan);
+
+	table_close(pg_index, AccessShareLock);
+	return tlist;
+}
\ No newline at end of file
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1f78dc3d53..4341ac9aae 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "catalog/index.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
@@ -2797,6 +2798,9 @@ remove_useless_groupby_columns(PlannerInfo *root)
 		Bitmapset  *relattnos;
 		Bitmapset  *pkattnos;
 		Oid			constraintOid;
+		Bitmapset  	*attnums = NULL;
+		List  		*unique_notnull_attnos = NIL;
+		List	   	*matched = NIL;
 
 		relid++;
 
@@ -2819,17 +2823,67 @@ remove_useless_groupby_columns(PlannerInfo *root)
 
 		/*
 		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
+		 * (i.e., nondeferrable) primary key constraint. However, if there are
+		 * columns that have both a unique index and a not-null constraint, we
+		 * can also remove some columns. If a primary key is there, we won't
+		 * look for unique-not-null columns.
 		 */
 		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
 		if (pkattnos == NULL)
+		{
+			/* get unique not null index */
+			unique_notnull_attnos = get_unique_not_null_attnos(rte->relid);
+			if (list_length(unique_notnull_attnos) > 0)
+			{
+				int		out_num,
+						inner_num;
+
+				foreach_node(Bitmapset, attnos, unique_notnull_attnos)
+				{
+					Assert(attnos != NULL);
+
+					if (bms_subset_compare(attnos, relattnos) != BMS_SUBSET1)
+						continue;
+					else
+						matched = lappend(matched, attnos);
+				}
+
+				/* there can be many unique-index-not-null attnum sets that are
+				* subsets of groupbyattnos.  we choose the one with the fewest
+				* members; if several have the same number of members, we simply
+				* choose the first one.  we don't use foreach here, since we
+				* delete cells from the list
+				*/
+				for (int outerpos = 0; outerpos < list_length(matched); outerpos++)
+				{
+					Bitmapset *bt;
+					bt = (Bitmapset *) list_nth(matched, outerpos);
+
+					out_num = bms_num_members(bt);
+					for (int restpos = outerpos + 1; restpos < list_length(matched);)
+					{
+						Bitmapset *other;
+						other = (Bitmapset *) list_nth(matched, restpos);
+						inner_num =	bms_num_members(other);
+						if (inner_num > out_num)
+							matched = list_delete_nth_cell(matched, restpos);
+						else if (inner_num < out_num)
+							matched = list_delete_nth_cell(matched, outerpos);
+						restpos++;
+					}
+				}
+				if (list_length(matched) > 0)
+					attnums = (Bitmapset *) list_nth(matched, 0);
+			}
+		}
+		if ((pkattnos == NULL && attnums == NULL))
 			continue;
 
 		/*
 		 * If the primary key is a proper subset of relattnos then we have
 		 * some items in the GROUP BY that can be removed.
 		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+		if (pkattnos != NULL && bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
 		{
 			/*
 			 * To easily remember whether we've found anything to do, we don't
@@ -2842,6 +2896,14 @@ remove_useless_groupby_columns(PlannerInfo *root)
 			/* Remember the attnos of the removable columns */
 			surplusvars[relid] = bms_difference(relattnos, pkattnos);
 		}
+		else if(attnums != NULL)
+		{
+			if (surplusvars == NULL)
+				surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
+													 (list_length(parse->rtable) + 1));
+
+			surplusvars[relid] = bms_difference(relattnos, attnums);
+		}
 	}
 
 	/*
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..dacc0f6ebd 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,7 @@ extern void RestoreReindexState(const void *reindexstate);
 
 extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
 
+extern List *get_unique_not_null_attnos(Oid relid);
 
 /*
  * itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1e682565d1..9bd91c4f38 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1454,6 +1454,81 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 --
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t1 (a int, b int, c int, unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b), unique(b, c), unique(a, c, d));
+create table t5 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b)) partition by range(a, b);
+create table t5_0 (like t5 including all);
+create table t5_1 (like t5 including all);
+ALTER TABLE t5 ATTACH PARTITION t5_0 FOR VALUES FROM (0,0) TO (10, 10);
+ALTER TABLE t5 ATTACH PARTITION t5_1 FOR VALUES FROM (10,10) TO (21, 21);
+insert into t5 select g,g+1, g+2, g+3 from generate_series(1, 20) g;
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b, c
+   ->  Seq Scan on t1
+(3 rows)
+
+-- both unique not null index and primary key are there,
+-- use the primary key to remove useless group by columns
+explain (costs off) select * from t2 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t2
+(3 rows)
+
+-- Test primary key beats unique not null index.
+explain (costs off) select * from t3 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t4 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t4
+(3 rows)
+
+--can remove column d from groupby expression, since we have unique index(b,c)
+explain (costs off) select count(*) from t4 group by c, d, b;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c, b
+   ->  Seq Scan on t4
+(3 rows)
+
+--partition case, can reduce to a,b.
+explain (costs off) select count(*) from t5 group by a, d, c,b;
+            QUERY PLAN             
+-----------------------------------
+ HashAggregate
+   Group Key: t5.a, t5.b
+   ->  Append
+         ->  Seq Scan on t5_0 t5_1
+         ->  Seq Scan on t5_1 t5_2
+(5 rows)
+
+drop table t1, t2,t3,t4,t5;
+--
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
 create temp table t1(f1 int, f2 int);
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 4885daffe6..d07c599a04 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -512,6 +512,45 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t1 (a int, b int, c int, unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b), unique(b, c), unique(a, c, d));
+
+create table t5 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b)) partition by range(a, b);
+create table t5_0 (like t5 including all);
+create table t5_1 (like t5 including all);
+ALTER TABLE t5 ATTACH PARTITION t5_0 FOR VALUES FROM (0,0) TO (10, 10);
+ALTER TABLE t5 ATTACH PARTITION t5_1 FOR VALUES FROM (10,10) TO (21, 21);
+insert into t5 select g,g+1, g+2, g+3 from generate_series(1, 20) g;
+
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+
+-- both unique not null index and primary key are there,
+-- use the primary key to remove useless group by columns
+explain (costs off) select * from t2 group by a,b,c;
+
+-- Test primary key beats unique not null index.
+explain (costs off) select * from t3 group by a,b,c,d;
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t4 group by a,b,c,d;
+
+--can remove column d from groupby expression, since we have unique index(b,c)
+explain (costs off) select count(*) from t4 group by c, d, b;
+
+--partition case, can reduce to a,b.
+explain (costs off) select count(*) from t5 group by a, d, c,b;
+
+drop table t1, t2,t3,t4,t5;
 --
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
-- 
2.34.1

#8 jian he
jian.universality@gmail.com
In reply to: jian he (#7)
1 attachment(s)
Re: Remove useless GROUP BY columns considering unique index

On Fri, Nov 22, 2024 at 9:24 PM jian he <jian.universality@gmail.com> wrote:

overall i come up with the attached patch.

in v2:

create table t1 (a int, b int not null, c int, unique(b, c));
explain(costs off) select count(*) from t1 group by b,c,a;
      QUERY PLAN
----------------------
 HashAggregate
   Group Key: b
   ->  Seq Scan on t1

So the v2 implementation is wrong:
I overlooked cases where only part of the unique key is not-null.
In that situation, we cannot remove useless GROUP BY columns.

v3 attached. In v3:
      QUERY PLAN
----------------------
 HashAggregate
   Group Key: b, c, a
   ->  Seq Scan on t1

Attachments:

v3-0001-remove-useless-group-by-columns-via-unique-not-nu.patch (text/x-patch)
From 85a64d350f0282c4bfb79a0f70d57eca29a78308 Mon Sep 17 00:00:00 2001
From: jian he <jian.universality@gmail.com>
Date: Wed, 27 Nov 2024 14:45:03 +0800
Subject: [PATCH v3 1/1] remove useless group by columns via unique-not-null

in remove_useless_groupby_columns, if a primary key exists and is a subset of
the group by columns, then we remove the non-primary-key columns in the group-by clause.
if a primary key is not available, then we try to use columns that are unique and
not-null to remove other useless columns.

discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0@Spark
---
 src/backend/catalog/index.c              | 94 +++++++++++++++++++++++
 src/backend/optimizer/plan/planner.c     | 74 +++++++++++++++++-
 src/include/catalog/index.h              |  1 +
 src/test/regress/expected/aggregates.out | 96 ++++++++++++++++++++++++
 src/test/regress/sql/aggregates.sql      | 47 ++++++++++++
 5 files changed, 309 insertions(+), 3 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f9bb721c5f..cd5b1ff3e6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4257,3 +4257,97 @@ RestoreReindexState(const void *reindexstate)
 	/* Note the worker has its own transaction nesting level */
 	reindexingNestLevel = GetCurrentTransactionNestLevel();
 }
+
+/*
+ * given an input relation oid, return a list of attnum Bitmapsets (source:
+ * pg_index.indkey) that have a unique index *and* not-null constraint
+ * associated with them. note we do not check this relation's primary key.
+*/
+List *
+get_unique_not_null_attnos(Oid relid)
+{
+	List	   	*not_null_attnos = NIL;
+	Relation	pg_index;
+	HeapTuple	indexTuple;
+	SysScanDesc scan;
+	ScanKeyData skey;
+	List 		*not_null_cs;
+	List	   *tlist = NIL;
+
+	/* Use CookedConstraint to fetch not-null constraint. */
+	not_null_cs = RelationGetNotNullConstraints(relid, true, false);
+	if (not_null_cs == NIL)
+		return NULL;
+
+	foreach_ptr(CookedConstraint, cc, not_null_cs)
+		not_null_attnos = lappend_int(not_null_attnos, cc->attnum);
+
+	/* Scan pg_index for unique index of the target rel */
+	pg_index = table_open(IndexRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey,
+				Anum_pg_index_indrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+	scan = systable_beginscan(pg_index, IndexIndrelidIndexId, true,
+							  NULL, 1, &skey);
+
+	while (HeapTupleIsValid(indexTuple = systable_getnext(scan)))
+	{
+		Bitmapset  	*unique_notnull_attnos = NULL;
+		Form_pg_index index = (Form_pg_index) GETSTRUCT(indexTuple);
+		Datum		adatum;
+		bool		isNull;
+		bool		all_notnull = true;
+		int			numkeys;
+		Datum	   *keys;
+		int			nKeys;
+
+		/* we're only interested in unique indexes */
+		if (!index->indisunique || index->indisprimary)
+			continue;
+
+		/* Skip invalid, exclusion index or deferred index */
+		if (!index->indisvalid || index->indisexclusion || !index->indimmediate)
+			continue;
+
+		/* Skip expression index or predicate index */
+		if (!heap_attisnull(indexTuple, Anum_pg_index_indpred, NULL) ||
+			!heap_attisnull(indexTuple, Anum_pg_index_indexprs, NULL))
+			continue;
+
+		/* Extract the pg_index->indkey int2vector */
+		adatum = heap_getattr(indexTuple, Anum_pg_index_indkey,
+							  RelationGetDescr(pg_index), &isNull);
+		if (isNull)
+			elog(ERROR, "null int2vector for index %u", index->indexrelid);
+
+		deconstruct_array_builtin(DatumGetArrayTypeP(adatum), INT2OID,
+									&keys, NULL, &nKeys);
+
+		if(nKeys != index->indnatts)
+			elog(ERROR, "corrupted int2vector for index %u", index->indexrelid);
+
+		Assert(nKeys >= index->indnkeyatts);
+		numkeys = index->indnkeyatts;
+
+		for (int i = 0; i < numkeys; i++)
+		{
+			/* Skip if any of unique key's attnum don't have not-null */
+			if (!list_member_int(not_null_attnos, DatumGetInt16(keys[i])))
+			{
+				all_notnull = false;
+				break;
+			}
+
+			unique_notnull_attnos = bms_add_member(unique_notnull_attnos,
+									  DatumGetInt16(keys[i]) - FirstLowInvalidHeapAttributeNumber);
+		}
+		if (all_notnull)
+			tlist = lappend(tlist, unique_notnull_attnos);
+	}
+	systable_endscan(scan);
+
+	table_close(pg_index, AccessShareLock);
+	return tlist;
+}
\ No newline at end of file
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762e..39d92b2b53 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "catalog/index.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
@@ -2807,6 +2808,9 @@ remove_useless_groupby_columns(PlannerInfo *root)
 		Bitmapset  *relattnos;
 		Bitmapset  *pkattnos;
 		Oid			constraintOid;
+		Bitmapset  	*attnums = NULL;
+		List  		*unique_notnull_attnos = NIL;
+		List	   	*matched = NIL;
 
 		relid++;
 
@@ -2828,18 +2832,69 @@ remove_useless_groupby_columns(PlannerInfo *root)
 			continue;
 
 		/*
-		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
+		 * Can't remove any columns for this rel if there is no suitable (i.e.,
+		 * nondeferrable) primary key constraint. However, if there are columns
+		 * that have both a unique index and a not-null constraint, we can also
+		 * remove some columns. If a primary key is there, we won't look for
+		 * unique-not-null constraints.
 		 */
 		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
 		if (pkattnos == NULL)
+		{
+			unique_notnull_attnos = get_unique_not_null_attnos(rte->relid);
+			if (list_length(unique_notnull_attnos) > 0)
+			{
+				int		out_num,
+						inner_num;
+
+				foreach_node(Bitmapset, attnos, unique_notnull_attnos)
+				{
+					Assert(attnos != NULL);
+
+					if (bms_subset_compare(attnos, relattnos) != BMS_SUBSET1)
+						continue;
+					else
+						matched = lappend(matched, attnos);
+				}
+				/*
+				 * There may be multiple unique-index-not-null attnums that are
+				 * subsets of groupbyattnos.  We select the one with the fewest
+				 * members.  If there are several with the same number of
+				 * members, we choose the first one. We avoid using `foreach`
+				 * here because we delete items from the list.
+				*/
+				for (int outerpos = 0; outerpos < list_length(matched); outerpos++)
+				{
+					Bitmapset *bt;
+					bt = list_nth_node(Bitmapset, matched, outerpos);
+
+					out_num = bms_num_members(bt);
+					for (int restpos = outerpos + 1; restpos < list_length(matched);)
+					{
+						Bitmapset *other;
+						other = list_nth_node(Bitmapset, matched, restpos);
+
+						inner_num =	bms_num_members(other);
+						if (inner_num > out_num)
+							matched = list_delete_nth_cell(matched, restpos);
+						else if (inner_num < out_num)
+							matched = list_delete_nth_cell(matched, outerpos);
+						restpos++;
+					}
+				}
+				/* If there is at least one candidate, select the first one. */
+				if (list_length(matched) > 0)
+					attnums = list_nth_node(Bitmapset, matched, 0);
+			}
+		}
+		if ((pkattnos == NULL && attnums == NULL))
 			continue;
 
 		/*
 		 * If the primary key is a proper subset of relattnos then we have
 		 * some items in the GROUP BY that can be removed.
 		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+		if (pkattnos != NULL && bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
 		{
 			/*
 			 * To easily remember whether we've found anything to do, we don't
@@ -2852,6 +2907,19 @@ remove_useless_groupby_columns(PlannerInfo *root)
 			/* Remember the attnos of the removable columns */
 			surplusvars[relid] = bms_difference(relattnos, pkattnos);
 		}
+		/*
+		 * If attnums is not NULL then this attnum set (unique-index-not-null)
+		 * is a proper subset of relattnos; see above. we can remove some items
+		 * in the GROUP BY
+		 */
+		else if(attnums != NULL)
+		{
+			if (surplusvars == NULL)
+				surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
+													 (list_length(parse->rtable) + 1));
+
+			surplusvars[relid] = bms_difference(relattnos, attnums);
+		}
 	}
 
 	/*
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..dacc0f6ebd 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,7 @@ extern void RestoreReindexState(const void *reindexstate);
 
 extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
 
+extern List *get_unique_not_null_attnos(Oid relid);
 
 /*
  * itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1e682565d1..563b5c1c0d 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1454,6 +1454,102 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 --
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t1 (a int, b int, c int, unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b), unique(b, c), unique(a, c, d));
+create table t5 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b)) partition by range(a, b);
+create table t5_0 (like t5 including all);
+create table t5_1 (like t5 including all);
+ALTER TABLE t5 ATTACH PARTITION t5_0 FOR VALUES FROM (0,0) TO (10, 10);
+ALTER TABLE t5 ATTACH PARTITION t5_1 FOR VALUES FROM (10,10) TO (21, 21);
+insert into t5 select g,g+1, g+2, g+3 from generate_series(1, 20) g;
+create temp table t6 (a int, b int not null, c int, d int, unique(b, c));
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b, c
+   ->  Seq Scan on t1
+(3 rows)
+
+-- both unique not null index and primary key are there,
+-- use the primary key to remove useless group by columns
+explain (costs off) select * from t2 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t2
+(3 rows)
+
+-- Test primary key beats unique not null index.
+explain (costs off) select * from t3 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t4 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t4
+(3 rows)
+
+--can remove column d from groupby expression, since we have unique index(b,c)
+explain (costs off) select count(*) from t4 group by c, d, b;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c, b
+   ->  Seq Scan on t4
+(3 rows)
+
+--partition case, can reduce to a,b.
+explain (costs off) select count(*) from t5 group by a, d, c,b;
+            QUERY PLAN             
+-----------------------------------
+ HashAggregate
+   Group Key: t5.a, t5.b
+   ->  Append
+         ->  Seq Scan on t5_0 t5_1
+         ->  Seq Scan on t5_1 t5_2
+(5 rows)
+
+--can reduce to a,b, groupby order or having duplicates does not matter
+explain (costs off) select count(*) from t5 group by b,a,a, d, c;
+            QUERY PLAN             
+-----------------------------------
+ HashAggregate
+   Group Key: t5.b, t5.a
+   ->  Append
+         ->  Seq Scan on t5_0 t5_1
+         ->  Seq Scan on t5_1 t5_2
+(5 rows)
+
+---only part of unique key columns is not-null, cannot use it to remove columns.
+explain(costs off) select count(*) from t6 group by b,c,a,d;
+       QUERY PLAN        
+-------------------------
+ HashAggregate
+   Group Key: b, c, a, d
+   ->  Seq Scan on t6
+(3 rows)
+
+drop table t1, t2,t3,t4,t5,t6;
+--
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
 create temp table t1(f1 int, f2 int);
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 4885daffe6..66f4f45953 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -512,6 +512,53 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t1 (a int, b int, c int, unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b), unique(b, c), unique(a, c, d));
+
+create table t5 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b)) partition by range(a, b);
+create table t5_0 (like t5 including all);
+create table t5_1 (like t5 including all);
+ALTER TABLE t5 ATTACH PARTITION t5_0 FOR VALUES FROM (0,0) TO (10, 10);
+ALTER TABLE t5 ATTACH PARTITION t5_1 FOR VALUES FROM (10,10) TO (21, 21);
+insert into t5 select g,g+1, g+2, g+3 from generate_series(1, 20) g;
+
+create temp table t6 (a int, b int not null, c int, d int, unique(b, c));
+
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+
+-- both a unique not null index and the primary key are present;
+-- use the primary key to remove useless group by columns
+explain (costs off) select * from t2 group by a,b,c;
+
+-- Test primary key beats unique not null index.
+explain (costs off) select * from t3 group by a,b,c,d;
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t4 group by a,b,c,d;
+
+--can remove column d from groupby expression, since we have unique index(b,c)
+explain (costs off) select count(*) from t4 group by c, d, b;
+
+--partition case, can reduce to a,b.
+explain (costs off) select count(*) from t5 group by a, d, c,b;
+
+--can reduce to a,b, groupby order or having duplicates does not matter
+explain (costs off) select count(*) from t5 group by b,a,a, d, c;
+
+---only part of unique key columns is not-null, cannot use it to remove columns.
+explain(costs off) select count(*) from t6 group by b,c,a,d;
+
+drop table t1, t2,t3,t4,t5,t6;
 --
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
-- 
2.34.1

#9David Rowley
dgrowleyml@gmail.com
In reply to: jian he (#8)
Re: Remove useless GROUP BY columns considering unique index

On Wed, 27 Nov 2024 at 19:51, jian he <jian.universality@gmail.com> wrote:

v3 attached. in v3:

1. I find the following quite strange:

postgres=# create table abcd (a int primary key, b int not null, c int
not null, d int not null, unique(b));
CREATE TABLE
postgres=# explain select b,c from abcd group by b,c;
QUERY PLAN
--------------------------------------------------------------
HashAggregate (cost=37.75..56.25 rows=1850 width=8)
Group Key: b, c
-> Seq Scan on abcd (cost=0.00..28.50 rows=1850 width=8)
(3 rows)

postgres=# alter table abcd drop constraint abcd_pkey;
ALTER TABLE
postgres=# explain select b,c from abcd group by b,c;
QUERY PLAN
--------------------------------------------------------------
HashAggregate (cost=33.12..51.62 rows=1850 width=8)
Group Key: b
-> Seq Scan on abcd (cost=0.00..28.50 rows=1850 width=8)
(3 rows)

Why does the patch only try to remove GROUP BY columns using unique
indexes when there's no primary key?

I don't really see why the primary key should be treated specially. Why
does the patch not just unify the logic and collect up all unique
non-expression, non-partial indexes where all columns are NOT
NULL? You could maybe just add a special case to skip the NOT NULL
checking for indexes with indisprimary set.
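
For illustration, roughly the set of candidate indexes such unified
logic would collect can be sketched with a catalog query along these
lines (a sketch only; the real check happens in C, and must also limit
itself to the first indnkeyatts entries of indkey):

select i.indexrelid::regclass
from pg_index i
where i.indrelid = 'abcd'::regclass
  and i.indisunique and i.indisvalid and i.indimmediate
  and i.indexprs is null and i.indpred is null
  and (i.indisprimary
       or not exists (select 1 from pg_attribute a
                      where a.attrelid = i.indrelid
                        and a.attnum = any (i.indkey)
                        and not a.attnotnull));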

2. In get_unique_not_null_attnos(), not_null_attnos could be a
Bitmapset with attnums offset by FirstLowInvalidHeapAttributeNumber.
That'll allow you to do a bms_is_member() call rather than a (possibly
expensive) list_member_int() call.

3. The logic in remove_useless_groupby_columns() looks much more
complex than it needs to be. It would be much more simple to find the
matching unique index with the least number of columns and use that
one. Just have a counter to record the bms_num_members() of the
columns of the best match so far and replace it when you find an index
with fewer columns. Please see adjust_group_pathkeys_for_groupagg()
for inspiration.

We also need to think about if the unique index with the least columns
is always the best one to use. Perhaps the index with the least
columns is a varlena column and there's some index with, say, two
INT4s. It would be much cheaper to hash a couple of INT4s than some
long varlena column. Maybe it's ok just to leave an XXX comment
explaining that we might want to think about doing that one day.
Alternatively, summing up the pg_statistic.stawidth values could work.
Likely it's better to make the patch work correctly before thinking
about that part.
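
For what it's worth, the per-column average widths the planner has
gathered (pg_statistic.stawidth) are visible through the pg_stats view,
assuming the table has been analyzed:

select attname, avg_width
from pg_stats
where tablename = 'abcd'
order by avg_width;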

The patch could also use a spell check:

"a input" (an input)
"soure" source??
"GROUOP BY"

David

#10jian he
jian.universality@gmail.com
In reply to: David Rowley (#9)
1 attachment(s)
Re: Remove useless GROUP BY columns considering unique index

On Wed, Nov 27, 2024 at 5:30 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, 27 Nov 2024 at 19:51, jian he <jian.universality@gmail.com> wrote:

v3 attached. in v3:

1. I find the following quite strange:

postgres=# create table abcd (a int primary key, b int not null, c int
not null, d int not null, unique(b));
CREATE TABLE
postgres=# explain select b,c from abcd group by b,c;
QUERY PLAN
--------------------------------------------------------------
HashAggregate (cost=37.75..56.25 rows=1850 width=8)
Group Key: b, c
-> Seq Scan on abcd (cost=0.00..28.50 rows=1850 width=8)
(3 rows)

postgres=# alter table abcd drop constraint abcd_pkey;
ALTER TABLE
postgres=# explain select b,c from abcd group by b,c;
QUERY PLAN
--------------------------------------------------------------
HashAggregate (cost=33.12..51.62 rows=1850 width=8)
Group Key: b
-> Seq Scan on abcd (cost=0.00..28.50 rows=1850 width=8)
(3 rows)

Why does the patch only try to remove GROUP BY columns using unique
indexes when there's no primary key?

fixed. also have tests on it.

I don't really see why the primary key should be treated specially. Why
does the patch not just unify the logic and collect up all unique
non-expression, non-partial indexes where all columns are NOT
NULL? You could maybe just add a special case to skip the NOT NULL
checking for indexes with indisprimary set.

now there is no need to scan pg_constraint; a single scan of pg_index collects the info.
i also removed get_primary_key_attnos.

2. In get_unique_not_null_attnos(), not_null_attnos could be a
Bitmapset with attnums offset by FirstLowInvalidHeapAttributeNumber.
That'll allow you to do a bms_is_member() call rather than a (possibly
expensive) list_member_int() call.

fixed

3. The logic in remove_useless_groupby_columns() looks much more
complex than it needs to be. It would be much more simple to find the
matching unique index with the least number of columns and use that
one. Just have a counter to record the bms_num_members() of the
columns of the best match so far and replace it when you find an index
with fewer columns. Please see adjust_group_pathkeys_for_groupagg()
for inspiration.

now get_unique_not_null_attnos is:
List * get_unique_not_null_attnos(Oid relid, int *pk_pos)

the returned pk_pos is the primary key's (zero-based) index within the returned list.
pk_pos has no use currently, but i guess it could be useful later,
so we can quickly retrieve the pk attnums.

We also need to think about if the unique index with the least columns
is always the best one to use. Perhaps the index with the least
columns is a varlena column and there's some index with, say, two
INT4s. It would be much cheaper to hash a couple of INT4s than some
long varlena column. Maybe it's ok just to leave an XXX comment
explaining that we might want to think about doing that one day.
Alternatively, summing up the pg_statistic.stawidth values could work.
Likely it's better to make the patch work correctly before thinking
about that part.

this part has not been addressed yet.

v4's logic is to choose the one with the least number of columns.
if there is more than one with the least number of columns, it simply
chooses the first one in the matched list.
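
for example, with two equally-sized candidates (the table name "tie" is
just for illustration), the choice is simply whichever index the
pg_index scan happens to return first:

create temp table tie (a int not null, b int not null, unique(a), unique(b));
explain (costs off) select count(*) from tie group by a, b;
-- either "Group Key: a" or "Group Key: b" would be a valid reduction here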

i think i addressed most of your concerns; now the code seems pretty neat.

Attachments:

v4-0001-remove-useless-group-by-columns-via-unique-not-nu.patchtext/x-patch; charset=US-ASCII; name=v4-0001-remove-useless-group-by-columns-via-unique-not-nu.patchDownload
From d585cf0193a700501a153f75f653aa953e8a9d02 Mon Sep 17 00:00:00 2001
From: jian he <jian.universality@gmail.com>
Date: Thu, 28 Nov 2024 12:37:22 +0800
Subject: [PATCH v4 1/1] remove useless group by columns via unique-not-null

In the `remove_useless_groupby_columns` function,
use a unique index and a not-null constraint to eliminate unnecessary columns.
Here, the primary key is treated as a "unique index and not-null".
If there are multiple candidates,
use the unique index with the fewest columns to remove the other columns
in groupby expression.

discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0@Spar
---
 src/backend/catalog/index.c              | 101 +++++++++++++++++++++
 src/backend/optimizer/plan/planner.c     |  66 +++++++++-----
 src/include/catalog/index.h              |   1 +
 src/test/regress/expected/aggregates.out | 106 +++++++++++++++++++++++
 src/test/regress/sql/aggregates.sql      |  51 +++++++++++
 5 files changed, 304 insertions(+), 21 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f9bb721c5f..841522ab4a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4257,3 +4257,104 @@ RestoreReindexState(const void *reindexstate)
 	/* Note the worker has its own transaction nesting level */
 	reindexingNestLevel = GetCurrentTransactionNestLevel();
 }
+
+/*
+ * Given an input relation oid, return a list of Bitmapsets (source:
+ * pg_index.indkey) for indexes that are unique *and* have a not-null
+ * constraint on every column. Note this includes the primary key. The
+ * returned pk_pos is the primary key's index within the returned List.
+*/
+List *
+get_unique_not_null_attnos(Oid relid, int *pk_pos)
+{
+	Bitmapset	*not_null_attnos = NULL;
+	Relation	pg_index;
+	HeapTuple	indexTuple;
+	SysScanDesc scan;
+	ScanKeyData skey;
+	List 		*not_null_cs;
+	List	   *tlist = NIL;
+	int			attno;
+
+	/* Use CookedConstraint to fetch not-null constraint. */
+	not_null_cs = RelationGetNotNullConstraints(relid, true, false);
+	if (not_null_cs == NIL)
+		return NULL;
+
+	foreach_ptr(CookedConstraint, cc, not_null_cs)
+		not_null_attnos = bms_add_member(not_null_attnos, cc->attnum - FirstLowInvalidHeapAttributeNumber);
+
+	/* Scan pg_index for unique index of the target rel */
+	pg_index = table_open(IndexRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey,
+				Anum_pg_index_indrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+	scan = systable_beginscan(pg_index, IndexIndrelidIndexId, true,
+							  NULL, 1, &skey);
+
+	while (HeapTupleIsValid(indexTuple = systable_getnext(scan)))
+	{
+		Bitmapset  	*unique_notnull_attnos = NULL;
+		Form_pg_index index = (Form_pg_index) GETSTRUCT(indexTuple);
+		Datum		adatum;
+		bool		isNull;
+		bool		all_notnull = true;
+		int			numkeys;
+		Datum	   *keys;
+		int			nKeys,
+					pos = 0;
+
+		/* we're only interested in unique index */
+		if (!index->indisunique)
+			continue;
+
+		/* Skip invalid or deferred index */
+		if (!index->indisvalid || !index->indimmediate)
+			continue;
+
+		/* Skip expression index or predicate index */
+		if (!heap_attisnull(indexTuple, Anum_pg_index_indpred, NULL) ||
+			!heap_attisnull(indexTuple, Anum_pg_index_indexprs, NULL))
+			continue;
+
+		/* Extract the pg_index->indkey int2vector */
+		adatum = heap_getattr(indexTuple, Anum_pg_index_indkey,
+							  RelationGetDescr(pg_index), &isNull);
+		if (isNull)
+			elog(ERROR, "null int2vector for index %u", index->indexrelid);
+
+		deconstruct_array_builtin(DatumGetArrayTypeP(adatum), INT2OID,
+									&keys, NULL, &nKeys);
+
+		if (nKeys != index->indnatts)
+			elog(ERROR, "corrupted int2vector for index %u", index->indexrelid);
+
+		Assert(nKeys >= index->indnkeyatts);
+		numkeys = index->indnkeyatts;
+
+		for (int i = 0; i < numkeys; i++)
+		{
+			/* Skip if any of unique key's attnum don't have not-null */
+			attno = DatumGetInt16(keys[i]) - FirstLowInvalidHeapAttributeNumber;
+			if (!bms_is_member(attno, not_null_attnos))
+			{
+				all_notnull = false;
+				break;
+			}
+			unique_notnull_attnos = bms_add_member(unique_notnull_attnos, attno);
+		}
+		if (all_notnull)
+		{
+			if (index->indisprimary)
+				*pk_pos = pos;
+			pos++;
+			tlist = lappend(tlist, unique_notnull_attnos);
+		}
+	}
+	systable_endscan(scan);
+
+	table_close(pg_index, AccessShareLock);
+	return tlist;
+}
\ No newline at end of file
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762e..2990e0412f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "catalog/index.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
@@ -2805,8 +2806,10 @@ remove_useless_groupby_columns(PlannerInfo *root)
 	{
 		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
 		Bitmapset  *relattnos;
-		Bitmapset  *pkattnos;
-		Oid			constraintOid;
+		List  		*unique_notnull_attnos = NIL;
+		List	   	*matched = NIL;
+		Bitmapset 	*best_attnos = NULL;
+		int			pk_pos = -1;
 
 		relid++;
 
@@ -2828,30 +2831,51 @@ remove_useless_groupby_columns(PlannerInfo *root)
 			continue;
 
 		/*
-		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
+		 * Can't remove any columns for this rel without a suitable candidate.
+		 * We can remove some columns if the attnums of at least one unique
+		 * index whose columns all carry not-null constraints are a subset of
+		 * the GROUP BY columns.  Here, the primary key is treated as a unique
+		 * index with not-null constraints.
 		 */
-		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-		if (pkattnos == NULL)
-			continue;
+		unique_notnull_attnos = get_unique_not_null_attnos(rte->relid, &pk_pos);
 
-		/*
-		 * If the primary key is a proper subset of relattnos then we have
-		 * some items in the GROUP BY that can be removed.
-		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+		if (list_length(unique_notnull_attnos) > 0)
 		{
-			/*
-			 * To easily remember whether we've found anything to do, we don't
-			 * allocate the surplusvars[] array until we find something.
-			 */
-			if (surplusvars == NULL)
-				surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
-													 (list_length(parse->rtable) + 1));
+			foreach_node(Bitmapset, attnos, unique_notnull_attnos)
+			{
+				Assert(attnos != NULL);
+				if (bms_subset_compare(attnos, relattnos) != BMS_SUBSET1)
+					continue;
+				else
+					matched = lappend(matched, attnos);
+			}
 
-			/* Remember the attnos of the removable columns */
-			surplusvars[relid] = bms_difference(relattnos, pkattnos);
+			if (list_length(matched) == 0)
+				continue;
+
+			best_attnos = list_nth_node(Bitmapset, matched, 0);
+			for (int i = 0; i < list_length(matched); i++)
+			{
+				Bitmapset *bt = list_nth_node(Bitmapset, matched, i);
+
+				if (bms_num_members(bt) < bms_num_members(best_attnos))
+					best_attnos = bt;
+			}
 		}
+		else
+			continue;
+
+		Assert(best_attnos);
+		/*
+		 * To easily remember whether we've found anything to do, we don't
+		 * allocate the surplusvars[] array until we find something.
+		*/
+		if (surplusvars == NULL)
+			surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
+													(list_length(parse->rtable) + 1));
+
+		/* Remember the attnos of the removable columns */
+		surplusvars[relid] = bms_difference(relattnos, best_attnos);
 	}
 
 	/*
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..b171102c76 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,7 @@ extern void RestoreReindexState(const void *reindexstate);
 
 extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
 
+extern List *get_unique_not_null_attnos(Oid relid, int *pk_pos);
 
 /*
  * itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1e682565d1..c60615cfbd 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1454,6 +1454,112 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 --
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t1 (a int, b int, c int, unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b), unique(b, c), unique(a, c, d));
+create table t5 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b)) partition by range(a, b);
+create table t5_0 (like t5 including all);
+create table t5_1 (like t5 including all);
+alter table  t5 attach partition t5_0 for values from (0,0) to (10, 10);
+alter table  t5 attach partition t5_1 for values from (10,10) to (21, 21);
+insert into t5 select g,g+1, g+2, g+3 from generate_series(1, 20) g;
+create temp table t6 (a int, b int not null, c int, d int, unique(b, c));
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b, c
+   ->  Seq Scan on t1
+(3 rows)
+
+-- using the unique index to remove useless group by columns
+-- since it has fewer columns compared with the primary key
+explain (costs off) select * from t2 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t2
+(3 rows)
+
+--can remove column b, since c has unique index and is not-null
+explain (costs off) select count(*) from t2 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t2
+(3 rows)
+
+--more than one candidate can be used to remove other columns.
+--the optimizer will use the primary key or a unique index.
+explain (costs off) select * from t3 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t4 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t4
+(3 rows)
+
+--can remove column d from groupby expression, since we have unique index(b,c)
+explain (costs off) select count(*) from t4 group by c, d, b;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c, b
+   ->  Seq Scan on t4
+(3 rows)
+
+--partition case, can reduce to a,b.
+explain (costs off) select count(*) from t5 group by a, d, c,b;
+            QUERY PLAN             
+-----------------------------------
+ HashAggregate
+   Group Key: t5.a, t5.b
+   ->  Append
+         ->  Seq Scan on t5_0 t5_1
+         ->  Seq Scan on t5_1 t5_2
+(5 rows)
+
+--can reduce to a,b, groupby order or having duplicates does not matter
+explain (costs off) select count(*) from t5 group by b,a,a, d, c;
+            QUERY PLAN             
+-----------------------------------
+ HashAggregate
+   Group Key: t5.b, t5.a
+   ->  Append
+         ->  Seq Scan on t5_0 t5_1
+         ->  Seq Scan on t5_1 t5_2
+(5 rows)
+
+---only part of unique key columns is not-null, cannot use it to remove columns.
+explain(costs off) select count(*) from t6 group by b,c,a,d;
+       QUERY PLAN        
+-------------------------
+ HashAggregate
+   Group Key: b, c, a, d
+   ->  Seq Scan on t6
+(3 rows)
+
+drop table t1, t2,t3,t4,t5,t6;
+--
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
 create temp table t1(f1 int, f2 int);
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 4885daffe6..b4b16b29a8 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -512,6 +512,57 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t1 (a int, b int, c int, unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b), unique(b, c), unique(a, c, d));
+
+create table t5 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b)) partition by range(a, b);
+create table t5_0 (like t5 including all);
+create table t5_1 (like t5 including all);
+alter table  t5 attach partition t5_0 for values from (0,0) to (10, 10);
+alter table  t5 attach partition t5_1 for values from (10,10) to (21, 21);
+insert into t5 select g,g+1, g+2, g+3 from generate_series(1, 20) g;
+
+create temp table t6 (a int, b int not null, c int, d int, unique(b, c));
+
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+
+-- using the unique index to remove useless group by columns
+-- since it has fewer columns compared with the primary key
+explain (costs off) select * from t2 group by a,b,c;
+
+--can remove column b, since c has unique index and is not-null
+explain (costs off) select count(*) from t2 group by b,c;
+
+--more than one candidate can be used to remove other columns.
+--the optimizer will use the primary key or a unique index.
+explain (costs off) select * from t3 group by a,b,c,d;
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t4 group by a,b,c,d;
+
+--can remove column d from groupby expression, since we have unique index(b,c)
+explain (costs off) select count(*) from t4 group by c, d, b;
+
+--partition case, can reduce to a,b.
+explain (costs off) select count(*) from t5 group by a, d, c,b;
+
+--can reduce to a,b, groupby order or having duplicates does not matter
+explain (costs off) select count(*) from t5 group by b,a,a, d, c;
+
+---only part of unique key columns is not-null, cannot use it to remove columns.
+explain(costs off) select count(*) from t6 group by b,c,a,d;
+
+drop table t1, t2,t3,t4,t5,t6;
 --
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
-- 
2.34.1

#11David Rowley
dgrowleyml@gmail.com
In reply to: jian he (#10)
1 attachment(s)
Re: Remove useless GROUP BY columns considering unique index

On Thu, 28 Nov 2024 at 17:39, jian he <jian.universality@gmail.com> wrote:

v4 logic is to choose one with the least number of columns.
if there is more than one with the least number of columns, simply
choose the first one
in the matched list.

The new code inside remove_useless_groupby_columns() is still more
complex than what's needed. There's no need to build a 2nd list with
the matching indexes and loop over it. Just skip the non-matching
indexes and record the best match.

I imagined it would look a bit more like the attached patch. I also
removed the pk_pos column as I didn't see any point in it. You were
not using it. If some future code needs that, it can be added then. I
also fixed up a few comments that needed attention. I didn't look at
the regression tests being added. I think given that there's now no
special case for primary key indexes that we don't need a completely
new set of tests. I think it'll make more sense to adjust some of the
existing tests to use a unique constraint instead of a PK and then
adjust a column's NOT NULL property to check that part of the code is
working correctly.
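
Something along these lines, perhaps (just a sketch of the idea):

create temp table t3 (a int, b int, c int not null, unique (c));
-- with the patch applied, this should reduce the GROUP BY to just c
explain (costs off) select * from t3 group by a,b,c;
alter table t3 alter column c drop not null;
-- now the unique index no longer qualifies, so nothing is removed
explain (costs off) select * from t3 group by a,b,c;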

Another issue with this is that the list of indexes being used is not
sorted by Oid. If you look at RelationGetIndexList() you'll see that
we perform a sort. That gives us more consistency in the planner. I
think this patch should be doing that too, otherwise, you could end up
with a plan change after some operation that changes the order that
the indexes are stored in the pg_index table. It's probably fairly
unlikely, but it is the sort of issue that someone will eventually
discover and report.
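
To illustrate (a hypothetical session; actual OIDs will differ):

create table oidorder (a int not null, b int not null,
                       constraint ua unique (a), constraint ub unique (b));
alter table oidorder drop constraint ua;
alter table oidorder add constraint ua unique (a);
-- "ua" is now backed by a freshly created index with a higher OID, so a
-- scan of pg_index may return this table's indexes in a different order
-- than before, even though the schema looks identical.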

David

Attachments:

v5-0001-remove-useless-group-by-columns-via-unique-not-nu.patchapplication/octet-stream; name=v5-0001-remove-useless-group-by-columns-via-unique-not-nu.patchDownload
From 99d9072b2fb16e526a1478bee575943045bab786 Mon Sep 17 00:00:00 2001
From: jian he <jian.universality@gmail.com>
Date: Thu, 28 Nov 2024 12:37:22 +0800
Subject: [PATCH v5] remove useless group by columns via unique-not-null

In the `remove_useless_groupby_columns` function,
use a unique index and a not-null constraint to eliminate unnecessary columns.
Here, the primary key is treated as a "unique index and not-null".
If there are multiple candidates,
use the unique index with the fewest columns to remove the other columns
in groupby expression.

discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0@Spar
---
 src/backend/catalog/index.c              | 103 ++++++++++++++++++++++
 src/backend/optimizer/plan/planner.c     |  55 +++++++++---
 src/include/catalog/index.h              |   1 +
 src/test/regress/expected/aggregates.out | 106 +++++++++++++++++++++++
 src/test/regress/sql/aggregates.sql      |  51 +++++++++++
 5 files changed, 305 insertions(+), 11 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f9bb721c5f..80a1e44167 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4257,3 +4257,106 @@ RestoreReindexState(const void *reindexstate)
 	/* Note the worker has its own transaction nesting level */
 	reindexingNestLevel = GetCurrentTransactionNestLevel();
 }
+
+/*
+ * get_unique_not_null_attnos
+ *		Returns a List of Bitmapsets with the attribute numbers (offset by
+ *		FirstLowInvalidHeapAttributeNumber) of all valid non-partial unique
+ *		indexes which contain only non-nullable columns.
+ */
+List *
+get_unique_not_null_attnos(Oid relid)
+{
+	Bitmapset	*not_null_attnos = NULL;
+	Relation	pg_index;
+	HeapTuple	indexTuple;
+	SysScanDesc scan;
+	ScanKeyData skey;
+	List 		*not_null_cs;
+	List	   *indlist = NIL;
+
+	/* Use CookedConstraint to fetch not-null constraint. */
+	not_null_cs = RelationGetNotNullConstraints(relid, true, false);
+	if (not_null_cs == NIL)
+		return NULL;
+
+	foreach_ptr(CookedConstraint, cc, not_null_cs)
+		not_null_attnos = bms_add_member(not_null_attnos, cc->attnum - FirstLowInvalidHeapAttributeNumber);
+
+	/* Scan pg_index for unique index of the target rel */
+	pg_index = table_open(IndexRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey,
+				Anum_pg_index_indrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+	scan = systable_beginscan(pg_index, IndexIndrelidIndexId, true,
+							  NULL, 1, &skey);
+
+	while (HeapTupleIsValid(indexTuple = systable_getnext(scan)))
+	{
+		Bitmapset  	*unique_notnull_attnos = NULL;
+		Form_pg_index index = (Form_pg_index) GETSTRUCT(indexTuple);
+		Datum		adatum;
+		bool		isNull;
+		bool		all_notnull = true;
+		int			numkeys;
+		Datum	   *keys;
+		int			nelems;
+
+		/* we're only interested in unique index */
+		if (!index->indisunique)
+			continue;
+
+		/* Skip invalid or deferred index */
+		if (!index->indisvalid || !index->indimmediate)
+			continue;
+
+		/* Skip expression index or predicate index */
+		if (!heap_attisnull(indexTuple, Anum_pg_index_indpred, NULL) ||
+			!heap_attisnull(indexTuple, Anum_pg_index_indexprs, NULL))
+			continue;
+
+		/* Extract the pg_index->indkey int2vector */
+		adatum = heap_getattr(indexTuple, Anum_pg_index_indkey,
+							  RelationGetDescr(pg_index), &isNull);
+		if (isNull)
+			elog(ERROR, "null int2vector for index %u", index->indexrelid);
+
+		deconstruct_array_builtin(DatumGetArrayTypeP(adatum), INT2OID,
+								  &keys,
+								  NULL,
+								  &nelems);
+
+		if (nelems != index->indnatts)
+			elog(ERROR, "corrupted int2vector for index %u", index->indexrelid);
+
+		Assert(nelems >= index->indnkeyatts);
+		numkeys = index->indnkeyatts;
+
+		for (int i = 0; i < numkeys; i++)
+		{
+			int	attno = DatumGetInt16(keys[i]) - FirstLowInvalidHeapAttributeNumber;
+
+			/* ignore this index if any columns are NULLable */
+			if (!bms_is_member(attno, not_null_attnos))
+			{
+				all_notnull = false;
+				break;
+			}
+
+			unique_notnull_attnos = bms_add_member(unique_notnull_attnos, attno);
+		}
+
+		/* if the index has no NULLable columns, add it to the list */
+		if (all_notnull)
+			indlist = lappend(indlist, unique_notnull_attnos);
+	}
+
+	systable_endscan(scan);
+
+	/* TODO sort the indlist by index OID */
+
+	table_close(pg_index, AccessShareLock);
+	return indlist;
+}
\ No newline at end of file
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762e..1803e5556a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "catalog/index.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
@@ -2805,8 +2806,10 @@ remove_useless_groupby_columns(PlannerInfo *root)
 	{
 		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
 		Bitmapset  *relattnos;
-		Bitmapset  *pkattnos;
-		Oid			constraintOid;
+		Bitmapset  *best_attnos = NULL;
+		List	   *unique_notnull_attnos;
+		int32		best_num_attnos = PG_INT32_MAX;
+		ListCell   *lc2;
 
 		relid++;
 
@@ -2828,18 +2831,48 @@ remove_useless_groupby_columns(PlannerInfo *root)
 			continue;
 
 		/*
-		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
+		 * Fetch details about all suitable unique indexes with NOT NULL
+		 * constraints on all columns.
 		 */
-		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-		if (pkattnos == NULL)
-			continue;
+		unique_notnull_attnos = get_unique_not_null_attnos(rte->relid);
 
 		/*
-		 * If the primary key is a proper subset of relattnos then we have
-		 * some items in the GROUP BY that can be removed.
+		 * Now search each index to check if there are any indexes where the
+		 * indexed columns are a subset of the GROUP BY columns for this
+		 * relation.
 		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+		foreach(lc2, unique_notnull_attnos)
+		{
+			Bitmapset  *ind_attnos = lfirst_node(Bitmapset, lc2);
+			int			members;
+
+			/*
+			 * Skip any indexes where the indexed columns aren't a subset of
+			 * the GROUP BY.
+			 */
+			if (bms_subset_compare(ind_attnos, relattnos) != BMS_SUBSET1)
+				continue;
+
+			/*
+			 * Record the attribute numbers from the index with the fewest
+			 * columns.  This allows the largest number of columns to be
+			 * removed from the GROUP BY clause.  In the future, we may wish
+			 * to consider using the narrowest set of columns and looking at
+			 * pg_statistic.stawidth as it might be better to use an index
+			 * with, say two INT4s, rather than, say, one long varlena column.
+			 */
+			members = bms_num_members(ind_attnos);
+
+			if (members < best_num_attnos)
+			{
+				best_attnos = ind_attnos;
+				best_num_attnos = members;
+			}
+		}
+
+		list_free(unique_notnull_attnos);
+
+		if (best_attnos != NULL)
 		{
 			/*
 			 * To easily remember whether we've found anything to do, we don't
@@ -2850,7 +2883,7 @@ remove_useless_groupby_columns(PlannerInfo *root)
 													 (list_length(parse->rtable) + 1));
 
 			/* Remember the attnos of the removable columns */
-			surplusvars[relid] = bms_difference(relattnos, pkattnos);
+			surplusvars[relid] = bms_difference(relattnos, best_attnos);
 		}
 	}
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..dacc0f6ebd 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,7 @@ extern void RestoreReindexState(const void *reindexstate);
 
 extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
 
+extern List *get_unique_not_null_attnos(Oid relid);
 
 /*
  * itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1e682565d1..c60615cfbd 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1454,6 +1454,112 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 --
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t1 (a int, b int, c int, unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b), unique(b, c), unique(a, c, d));
+create table t5 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b)) partition by range(a, b);
+create table t5_0 (like t5 including all);
+create table t5_1 (like t5 including all);
+alter table  t5 attach partition t5_0 for values from (0,0) to (10, 10);
+alter table  t5 attach partition t5_1 for values from (10,10) to (21, 21);
+insert into t5 select g,g+1, g+2, g+3 from generate_series(1, 20) g;
+create temp table t6 (a int, b int not null, c int, d int, unique(b, c));
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b, c
+   ->  Seq Scan on t1
+(3 rows)
+
+-- using the unique index to remove useless group by columns
+-- since it has fewer columns compared with the primary key
+explain (costs off) select * from t2 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t2
+(3 rows)
+
+--can remove column b, since c has unique index and is not-null
+explain (costs off) select count(*) from t2 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t2
+(3 rows)
+
+--more than one candidate can be used to remove other columns.
+--the optimizer will use the primary key or a unique index.
+explain (costs off) select * from t3 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t4 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t4
+(3 rows)
+
+--can remove column d from groupby expression, since we have unique index(b,c)
+explain (costs off) select count(*) from t4 group by c, d, b;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c, b
+   ->  Seq Scan on t4
+(3 rows)
+
+--partition case, can reduce to a,b.
+explain (costs off) select count(*) from t5 group by a, d, c,b;
+            QUERY PLAN             
+-----------------------------------
+ HashAggregate
+   Group Key: t5.a, t5.b
+   ->  Append
+         ->  Seq Scan on t5_0 t5_1
+         ->  Seq Scan on t5_1 t5_2
+(5 rows)
+
+--can reduce to a,b, groupby order or having duplicates does not matter
+explain (costs off) select count(*) from t5 group by b,a,a, d, c;
+            QUERY PLAN             
+-----------------------------------
+ HashAggregate
+   Group Key: t5.b, t5.a
+   ->  Append
+         ->  Seq Scan on t5_0 t5_1
+         ->  Seq Scan on t5_1 t5_2
+(5 rows)
+
+---only part of unique key columns is not-null, cannot use it to remove columns.
+explain(costs off) select count(*) from t6 group by b,c,a,d;
+       QUERY PLAN        
+-------------------------
+ HashAggregate
+   Group Key: b, c, a, d
+   ->  Seq Scan on t6
+(3 rows)
+
+drop table t1, t2,t3,t4,t5,t6;
+--
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
 create temp table t1(f1 int, f2 int);
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 4885daffe6..b4b16b29a8 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -512,6 +512,57 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t1 (a int, b int, c int, unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
+create temp table t4 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b), unique(b, c), unique(a, c, d));
+
+create table t5 (a int not null, b int not null,
+                      c int not null, d int not null,
+                      unique (a, b)) partition by range(a, b);
+create table t5_0 (like t5 including all);
+create table t5_1 (like t5 including all);
+alter table  t5 attach partition t5_0 for values from (0,0) to (10, 10);
+alter table  t5 attach partition t5_1 for values from (10,10) to (21, 21);
+insert into t5 select g,g+1, g+2, g+3 from generate_series(1, 20) g;
+
+create temp table t6 (a int, b int not null, c int, d int, unique(b, c));
+
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+
+-- using the unique index to remove useless group by columns
+-- since it has fewer columns compared with the primary key
+explain (costs off) select * from t2 group by a,b,c;
+
+--can remove column b, since c has unique index and is not-null
+explain (costs off) select count(*) from t2 group by b,c;
+
+--more than one candidate can be used to remove other columns.
+--the optimizer will use the primary key or a unique index.
+explain (costs off) select * from t3 group by a,b,c,d;
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t4 group by a,b,c,d;
+
+--can remove column d from groupby expression, since we have unique index(b,c)
+explain (costs off) select count(*) from t4 group by c, d, b;
+
+--partition case, can reduce to a,b.
+explain (costs off) select count(*) from t5 group by a, d, c,b;
+
+--can reduce to a,b, groupby order or having duplicates does not matter
+explain (costs off) select count(*) from t5 group by b,a,a, d, c;
+
+---only part of unique key columns is not-null, cannot use it to remove columns.
+explain(costs off) select count(*) from t6 group by b,c,a,d;
+
+drop table t1, t2,t3,t4,t5,t6;
 --
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
-- 
2.34.1

#12jian he
jian.universality@gmail.com
In reply to: David Rowley (#11)
1 attachment(s)
Re: Remove useless GROUP BY columns considering unique index

On Thu, Nov 28, 2024 at 2:11 PM David Rowley <dgrowleyml@gmail.com> wrote:

I think it'll make more sense to adjust some of the
existing tests to use a unique constraint instead of a PK and then
adjust a column's NOT NULL property to check that part of the code is
working correctly.

looking around, i inserted some tests into indexing.sql,
create_index.sql and aggregates.sql.
also added tests for sorting by index oid.

Another issue with this is that the list of indexes being used is not
sorted by Oid. If you look at RelationGetIndexList() you'll see that
we perform a sort. That gives us more consistency in the planner. I
think this patch should be doing that too, otherwise, you could end up
with a plan change after some operation that changes the order that
the indexes are stored in the pg_index table. It's probably fairly
unlikely, but it is the sort of issue that someone will eventually
discover and report.

Thanks for the tip!
I have wondered about multiple matches; simply choosing the first one
may produce some surprising results.
i didn't know that in some cases we use list_sort to make the result
more deterministic.
When there are multiple matches, we need a deterministic way to choose which one.
so please check attached.

Attachments:

v6-0001-remove-useless-group-by-columns-via-unique-not-nu.patchtext/x-patch; charset=US-ASCII; name=v6-0001-remove-useless-group-by-columns-via-unique-not-nu.patchDownload
From ad5ceaa641fcf719815779da5a86d8b0cf7fd70d Mon Sep 17 00:00:00 2001
From: jian he <jian.universality@gmail.com>
Date: Thu, 28 Nov 2024 20:28:10 +0800
Subject: [PATCH v6 1/1] remove useless group by columns via unique-not-null

In the `remove_useless_groupby_columns` function,
use a unique index and a not-null constraint to eliminate unnecessary columns.
Here, the primary key is treated as a "unique index and not-null".
If there are multiple candidates,
use the unique index with the fewest columns to remove the other columns
in groupby expression.

discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0@Spar
---
 src/backend/catalog/index.c                | 147 +++++++++++++++++++++
 src/backend/optimizer/plan/planner.c       |  53 ++++++--
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/aggregates.out   |  59 ++++++++-
 src/test/regress/expected/create_index.out |  11 +-
 src/test/regress/expected/indexing.out     |  11 ++
 src/test/regress/sql/aggregates.sql        |  26 +++-
 src/test/regress/sql/create_index.sql      |   4 +-
 src/test/regress/sql/indexing.sql          |   3 +
 9 files changed, 300 insertions(+), 15 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f9bb721c5f..947dec0a2f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -56,6 +56,7 @@
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
+#include "common/int.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -97,6 +98,14 @@ typedef struct
 	Oid			pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
 } SerializedReindexState;
 
+/* Struct for sorting list of Bitmapsets in get_unique_not_null_attnos */
+typedef struct
+{
+	Bitmapset	*unique_notnull_attnos;	/* Bitmapset of attribute numbers
+										 * that are unique and not null */
+	Oid			indexoid;				/* the unique index's OID */
+} unique_nn_info;
+
 /* non-export function prototypes */
 static bool relationHasPrimaryKey(Relation rel);
 static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
@@ -131,7 +140,19 @@ static void SetReindexProcessing(Oid heapOid, Oid indexOid);
 static void ResetReindexProcessing(void);
 static void SetReindexPending(List *indexes);
 static void RemoveReindexPending(Oid indexOid);
+static int unique_notnull_sort_by_idxoid(const void *a, const void *b);
 
+/*
+ * Comparison function for sorting unique_nn_info in indexoid order.
+ */
+static int
+unique_notnull_sort_by_idxoid(const void *a, const void *b)
+{
+	const unique_nn_info *unique_a = (const unique_nn_info *) a;
+	const unique_nn_info *unique_b = (const unique_nn_info *) b;
+
+	return pg_cmp_u32(unique_a->indexoid, unique_b->indexoid);
+}
 
 /*
  * relationHasPrimaryKey
@@ -4257,3 +4278,129 @@ RestoreReindexState(const void *reindexstate)
 	/* Note the worker has its own transaction nesting level */
 	reindexingNestLevel = GetCurrentTransactionNestLevel();
 }
+
+/*
+ * get_unique_not_null_attnos
+ *		Returns a List of Bitmapsets with the attribute numbers (offset by
+ *		FirstLowInvalidHeapAttributeNumber) of all valid non-partial unique
+ *		indexes which contain only non-nullable columns.
+ */
+List *
+get_unique_not_null_attnos(Oid relid)
+{
+	Bitmapset	*not_null_attnos = NULL;
+	Relation	pg_index;
+	int			j = 0;
+	int			cnt = 0;
+	HeapTuple	indexTuple;
+	SysScanDesc scan;
+	ScanKeyData skey;
+	List 		*not_null_cs;
+	List	   *result = NIL;
+	List	   *indlist = NIL;
+	List	   *indexOidlist = NIL;
+	unique_nn_info *info;
+	ListCell   *lc,
+				*lc2;
+
+	/* Use CookedConstraint to fetch not-null constraint. */
+	not_null_cs = RelationGetNotNullConstraints(relid, true, false);
+	if (not_null_cs == NIL)
+		return NULL;
+
+	foreach_ptr(CookedConstraint, cc, not_null_cs)
+		not_null_attnos = bms_add_member(not_null_attnos, cc->attnum - FirstLowInvalidHeapAttributeNumber);
+
+	/* Scan pg_index for unique index of the target rel */
+	pg_index = table_open(IndexRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey,
+				Anum_pg_index_indrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+	scan = systable_beginscan(pg_index, IndexIndrelidIndexId, true,
+							  NULL, 1, &skey);
+
+
+	while (HeapTupleIsValid(indexTuple = systable_getnext(scan)))
+	{
+		Bitmapset  	*unique_notnull_attnos = NULL;
+		Form_pg_index index = (Form_pg_index) GETSTRUCT(indexTuple);
+		Datum		adatum;
+		bool		isNull;
+		bool		all_notnull = true;
+		int			numkeys;
+		Datum	   *keys;
+		int			nelems;
+
+		/* we're only interested in unique index */
+		if (!index->indisunique)
+			continue;
+
+		/* Skip invalid or deferred index */
+		if (!index->indisvalid || !index->indimmediate)
+			continue;
+
+		/* Skip expression index or predicate index */
+		if (!heap_attisnull(indexTuple, Anum_pg_index_indpred, NULL) ||
+			!heap_attisnull(indexTuple, Anum_pg_index_indexprs, NULL))
+			continue;
+
+		/* Extract the pg_index->indkey int2vector */
+		adatum = heap_getattr(indexTuple, Anum_pg_index_indkey,
+							  RelationGetDescr(pg_index), &isNull);
+		if (isNull)
+			elog(ERROR, "null int2vector for index %u", index->indexrelid);
+
+		deconstruct_array_builtin(DatumGetArrayTypeP(adatum), INT2OID,
+								  &keys,
+								  NULL,
+								  &nelems);
+
+		if (nelems != index->indnatts)
+			elog(ERROR, "corrupted int2vector for index %u", index->indexrelid);
+
+		Assert(nelems >= index->indnkeyatts);
+		numkeys = index->indnkeyatts;
+
+		for (int i = 0; i < numkeys; i++)
+		{
+			int	attno = DatumGetInt16(keys[i]) - FirstLowInvalidHeapAttributeNumber;
+
+			/* ignore this index if any columns are NULLable */
+			if (!bms_is_member(attno, not_null_attnos))
+			{
+				all_notnull = false;
+				break;
+			}
+
+			unique_notnull_attnos = bms_add_member(unique_notnull_attnos, attno);
+		}
+
+		/* if the index has no NULLable columns, add it to the list */
+		if (all_notnull)
+		{
+			indlist = lappend(indlist, unique_notnull_attnos);
+			indexOidlist = lappend_oid(indexOidlist, index->indexrelid);
+		}
+	}
+	systable_endscan(scan);
+
+	cnt = list_length(indlist);
+	info = (unique_nn_info *) palloc0(sizeof(unique_nn_info) * cnt);
+	forboth(lc, indlist, lc2, indexOidlist)
+	{
+		Bitmapset *bmt = lfirst_node(Bitmapset, lc);
+		Oid		idxoid = lfirst_oid(lc2);
+
+		info[j].unique_notnull_attnos = bmt;
+		info[j++].indexoid = idxoid;
+	}
+	qsort(info, cnt, sizeof(unique_nn_info), unique_notnull_sort_by_idxoid);
+
+	for (j = 0; j < cnt; j++)
+		result = lappend(result, info[j].unique_notnull_attnos);
+
+	table_close(pg_index, AccessShareLock);
+	return result;
+}
\ No newline at end of file
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762e..9bbe843ec9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "catalog/index.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
@@ -2805,8 +2806,9 @@ remove_useless_groupby_columns(PlannerInfo *root)
 	{
 		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
 		Bitmapset  *relattnos;
-		Bitmapset  *pkattnos;
-		Oid			constraintOid;
+		Bitmapset  *best_attnos = NULL;
+		List	   *unique_notnull_attnos;
+		int32		best_num_attnos = PG_INT32_MAX;
 
 		relid++;
 
@@ -2828,18 +2830,47 @@ remove_useless_groupby_columns(PlannerInfo *root)
 			continue;
 
 		/*
-		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
+		 * Fetch details about all suitable unique indexes with NOT NULL
+		 * constraints on all columns.
 		 */
-		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-		if (pkattnos == NULL)
-			continue;
+		unique_notnull_attnos = get_unique_not_null_attnos(rte->relid);
 
 		/*
-		 * If the primary key is a proper subset of relattnos then we have
-		 * some items in the GROUP BY that can be removed.
+		 * Now search each index to check if there are any indexes where the
+		 * indexed columns are a subset of the GROUP BY columns for this
+		 * relation.
 		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+		foreach_node(Bitmapset, ind_attnos, unique_notnull_attnos)
+		{
+			int			members;
+
+			/*
+			 * Skip any indexes where the indexed columns aren't a subset of
+			 * the GROUP BY.
+			 */
+			if (bms_subset_compare(ind_attnos, relattnos) != BMS_SUBSET1)
+				continue;
+
+			/*
+			 * Record the attribute numbers from the index with the fewest
+			 * columns.  This allows the largest number of columns to be
+			 * removed from the GROUP BY clause.  In the future, we may wish
+			 * to consider using the narrowest set of columns and looking at
+			 * pg_statistic.stawidth as it might be better to use an index
+			 * with, say two INT4s, rather than, say, one long varlena column.
+			 */
+			members = bms_num_members(ind_attnos);
+
+			if (members < best_num_attnos)
+			{
+				best_attnos = ind_attnos;
+				best_num_attnos = members;
+			}
+		}
+
+		list_free(unique_notnull_attnos);
+
+		if (best_attnos != NULL)
 		{
 			/*
 			 * To easily remember whether we've found anything to do, we don't
@@ -2850,7 +2881,7 @@ remove_useless_groupby_columns(PlannerInfo *root)
 													 (list_length(parse->rtable) + 1));
 
 			/* Remember the attnos of the removable columns */
-			surplusvars[relid] = bms_difference(relattnos, pkattnos);
+			surplusvars[relid] = bms_difference(relattnos, best_attnos);
 		}
 	}
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..dacc0f6ebd 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,7 @@ extern void RestoreReindexState(const void *reindexstate);
 
 extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
 
+extern List *get_unique_not_null_attnos(Oid relid);
 
 /*
  * itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1e682565d1..f6380438f6 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1349,7 +1349,7 @@ LINE 1: select avg((select avg(a1.col1 order by (select avg(a2.col2)...
 --
 create temp table t1 (a int, b int, c int, d int, primary key (a, b));
 create temp table t2 (x int, y int, z int, primary key (x, y));
-create temp table t3 (a int, b int, c int, primary key(a, b) deferrable);
+create temp table t3 (a int, b int, c int not null, primary key(a, b) deferrable);
 -- Non-primary-key columns can be removed from GROUP BY
 explain (costs off) select * from t1 group by a,b,c,d;
       QUERY PLAN      
@@ -1407,6 +1407,26 @@ explain (costs off) select * from t3 group by a,b,c;
    ->  Seq Scan on t3
 (3 rows)
 
+-- Cannot optimize when unique index is deferrable
+ALTER TABLE t3 ADD constraint t3_c_unique unique (c) deferrable initially deferred;
+explain (costs off) select count(*) from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b, c
+   ->  Seq Scan on t3
+(3 rows)
+
+--- a not-null constraint cannot be deferrable, so reducing columns is ok.
+ALTER TABLE t3 ADD constraint t3_b_unique unique (b);
+explain (costs off) select * from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b
+   ->  Seq Scan on t3
+(3 rows)
+
 create temp table t1c () inherits (t1);
 -- Ensure we don't remove any columns when t1 has a child table
 explain (costs off) select * from t1 group by a,b,c,d;
@@ -1454,6 +1474,43 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 --
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+-- using the unique index to remove useless group by columns
+-- since it has fewer columns compared with the primary key
+explain (costs off) select * from t2 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t2
+(3 rows)
+
+drop table t2;
+--more than one candidate; choose the one with the smaller indexoid value.
+create temp table t3(a int not null, b int not null, c int,
+                      constraint na unique(a), constraint nb unique(b));
+explain(costs off) select count(*) from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a
+   ->  Seq Scan on t3
+(3 rows)
+
+alter table t3 drop constraint na;
+alter table t3 add constraint na unique(a);
+explain(costs off) select count(*) from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b
+   ->  Seq Scan on t3
+(3 rows)
+
+drop table t3;
+--
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
 create temp table t1(f1 int, f2 int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 1b0a5f0e9e..27acc50007 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1526,10 +1526,19 @@ DROP TABLE concur_heap;
 --
 -- Test ADD CONSTRAINT USING INDEX
 --
-CREATE TABLE cwi_test( a int , b varchar(10), c char);
+CREATE TABLE cwi_test( a int not null, b varchar(10) not null, c char);
 -- add some data so that all tests have something to work with.
 INSERT INTO cwi_test VALUES(1, 2), (3, 4), (5, 6);
 CREATE UNIQUE INDEX cwi_uniq_idx ON cwi_test(a , b);
+--test that a unique index on not-null columns can be used to remove other groupby columns.
+explain (costs off) select count(*) from cwi_test group by a,b,c;
+         QUERY PLAN         
+----------------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on cwi_test
+(3 rows)
+
 ALTER TABLE cwi_test ADD primary key USING INDEX cwi_uniq_idx;
 \d cwi_test
                      Table "public.cwi_test"
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d7..3f67846480 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1321,6 +1321,17 @@ insert into idxpart2 (c, a, b) values (42, 572814, 'inserted first');
 alter table idxpart2 drop column c;
 create unique index on idxpart (a);
 alter table idxpart attach partition idxpart2 for values from (100000) to (1000000);
+-- test using "idxpart_a_idx" to remove useless GROUP BY columns.
+explain(costs off) select count(*) from idxpart group by a, b;
+                 QUERY PLAN                 
+--------------------------------------------
+ HashAggregate
+   Group Key: idxpart.a
+   ->  Append
+         ->  Seq Scan on idxpart1 idxpart_1
+         ->  Seq Scan on idxpart2 idxpart_2
+(5 rows)
+
 insert into idxpart values (0, 'zero'), (42, 'life'), (2^16, 'sixteen');
 insert into idxpart select 2^g, format('two to power of %s', g) from generate_series(15, 17) g;
 ERROR:  duplicate key value violates unique constraint "idxpart1_a_idx"
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 4885daffe6..27b57a46c7 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -465,7 +465,7 @@ from tenk1 a2(col2);
 
 create temp table t1 (a int, b int, c int, d int, primary key (a, b));
 create temp table t2 (x int, y int, z int, primary key (x, y));
-create temp table t3 (a int, b int, c int, primary key(a, b) deferrable);
+create temp table t3 (a int, b int, c int not null, primary key(a, b) deferrable);
 
 -- Non-primary-key columns can be removed from GROUP BY
 explain (costs off) select * from t1 group by a,b,c,d;
@@ -486,6 +486,12 @@ group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
 -- Cannot optimize when PK is deferrable
 explain (costs off) select * from t3 group by a,b,c;
 
+-- Cannot optimize when unique index is deferrable
+ALTER TABLE t3 ADD constraint t3_c_unique unique (c) deferrable initially deferred;
+explain (costs off) select count(*) from t3 group by b,c;
+-- not-null constraints cannot be deferrable, so removing columns is OK.
+ALTER TABLE t3 ADD constraint t3_b_unique unique (b);
+explain (costs off) select * from t3 group by a,b,c;
 create temp table t1c () inherits (t1);
 
 -- Ensure we don't remove any columns when t1 has a child table
@@ -512,6 +518,24 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+-- use the unique index to remove useless GROUP BY columns
+-- since it has fewer columns compared with the primary key
+explain (costs off) select * from t2 group by a,b,c;
+drop table t2;
+
+-- more than one candidate: choose the one with the smaller indexoid value.
+create temp table t3(a int not null, b int not null, c int,
+                      constraint na unique(a), constraint nb unique(b));
+explain(costs off) select count(*) from t3 group by a,b,c;
+alter table t3 drop constraint na;
+alter table t3 add constraint na unique(a);
+explain(costs off) select count(*) from t3 group by a,b,c;
+drop table t3;
+
 --
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index ddd0d9ad39..7d53e5fab2 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -581,13 +581,15 @@ DROP TABLE concur_heap;
 -- Test ADD CONSTRAINT USING INDEX
 --
 
-CREATE TABLE cwi_test( a int , b varchar(10), c char);
+CREATE TABLE cwi_test( a int not null, b varchar(10) not null, c char);
 
 -- add some data so that all tests have something to work with.
 
 INSERT INTO cwi_test VALUES(1, 2), (3, 4), (5, 6);
 
 CREATE UNIQUE INDEX cwi_uniq_idx ON cwi_test(a , b);
+-- test that a unique index on not-null columns can be used to remove other GROUP BY columns.
+explain (costs off) select count(*) from cwi_test group by a,b,c;
 ALTER TABLE cwi_test ADD primary key USING INDEX cwi_uniq_idx;
 
 \d cwi_test
diff --git a/src/test/regress/sql/indexing.sql b/src/test/regress/sql/indexing.sql
index b5cb01c2d7..3d7233feb0 100644
--- a/src/test/regress/sql/indexing.sql
+++ b/src/test/regress/sql/indexing.sql
@@ -705,6 +705,9 @@ insert into idxpart2 (c, a, b) values (42, 572814, 'inserted first');
 alter table idxpart2 drop column c;
 create unique index on idxpart (a);
 alter table idxpart attach partition idxpart2 for values from (100000) to (1000000);
+-- test using "idxpart_a_idx" to remove useless GROUP BY columns.
+explain(costs off) select count(*) from idxpart group by a, b;
+
 insert into idxpart values (0, 'zero'), (42, 'life'), (2^16, 'sixteen');
 insert into idxpart select 2^g, format('two to power of %s', g) from generate_series(15, 17) g;
 insert into idxpart values (16, 'sixteen');
-- 
2.34.1

#13jian he
jian.universality@gmail.com
In reply to: jian he (#12)
1 attachment(s)
Re: Remove useless GROUP BY columns considering unique index

minor changes in get_unique_not_null_attnos:

* cosmetic changes
* if the returned list length is less than 2, there is no need to sort.
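
The shape of that change is just an early exit before the qsort (the
exact hunk is in the attached patch):

	/* no need to sort if we have fewer than two candidates; return as is */
	if (list_length(indlist) <= 1)
		return indlist;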

Attachments:

v7-0001-remove-useless-group-by-columns-via-unique-not-nu.patchtext/x-patch; charset=US-ASCII; name=v7-0001-remove-useless-group-by-columns-via-unique-not-nu.patchDownload
From 58094504364d8a9eb51cbc77a75626aef39b17f0 Mon Sep 17 00:00:00 2001
From: jian he <jian.universality@gmail.com>
Date: Fri, 29 Nov 2024 09:23:27 +0800
Subject: [PATCH v7 1/1] remove useless group by columns via unique-not-null

In the `remove_useless_groupby_columns` function,
use a unique index and a not-null constraint to eliminate unnecessary columns.
Here, the primary key is treated as a "unique index and not-null".
If there are multiple candidates,
use the unique index with the fewest columns to remove the other columns
in the GROUP BY expression.

discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0@Spar
---
 src/backend/catalog/index.c                | 149 +++++++++++++++++++++
 src/backend/optimizer/plan/planner.c       |  53 ++++++--
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/aggregates.out   |  59 +++++++-
 src/test/regress/expected/create_index.out |  11 +-
 src/test/regress/expected/indexing.out     |  11 ++
 src/test/regress/sql/aggregates.sql        |  26 +++-
 src/test/regress/sql/create_index.sql      |   4 +-
 src/test/regress/sql/indexing.sql          |   3 +
 src/tools/pgindent/typedefs.list           |   1 +
 10 files changed, 303 insertions(+), 15 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f9bb721c5f..2555997c3c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -56,6 +56,7 @@
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
+#include "common/int.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -97,6 +98,14 @@ typedef struct
 	Oid			pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
 } SerializedReindexState;
 
+/* Struct for sorting list of Bitmapsets in get_unique_not_null_attnos */
+typedef struct
+{
+	Bitmapset  *unique_notnull_attnos;	/* attribute numbers of the unique
+										 * index's not-null columns */
+	Oid			indexoid;				/* the unique index's OID */
+} unique_nn_info;
+
 /* non-export function prototypes */
 static bool relationHasPrimaryKey(Relation rel);
 static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
@@ -131,7 +140,19 @@ static void SetReindexProcessing(Oid heapOid, Oid indexOid);
 static void ResetReindexProcessing(void);
 static void SetReindexPending(List *indexes);
 static void RemoveReindexPending(Oid indexOid);
+static int unique_notnull_sort_by_idxoid(const void *a, const void *b);
 
+/*
+ * Comparison function for sorting unique_nn_info in indexoid order.
+ */
+static int
+unique_notnull_sort_by_idxoid(const void *a, const void *b)
+{
+	const unique_nn_info *unique_a = (const unique_nn_info *) a;
+	const unique_nn_info *unique_b = (const unique_nn_info *) b;
+
+	return pg_cmp_u32(unique_a->indexoid, unique_b->indexoid);
+}
 
 /*
  * relationHasPrimaryKey
@@ -4257,3 +4278,131 @@ RestoreReindexState(const void *reindexstate)
 	/* Note the worker has its own transaction nesting level */
 	reindexingNestLevel = GetCurrentTransactionNestLevel();
 }
+
+/*
+ * get_unique_not_null_attnos
+ *		Returns a sorted List of Bitmapsets with the attribute numbers (offset
+ *		by FirstLowInvalidHeapAttributeNumber) of all valid non-partial unique
+ *		indexes which contain only non-nullable columns.
+ */
+List *
+get_unique_not_null_attnos(Oid relid)
+{
+	Bitmapset	*not_null_attnos = NULL;
+	Relation	pg_index;
+	int			j = 0;
+	HeapTuple	indexTuple;
+	SysScanDesc scan;
+	ScanKeyData skey;
+	List 		*not_null_cs;
+	List	   *result = NIL;
+	List	   *indlist = NIL;
+	List	   *indexOidlist = NIL;
+	unique_nn_info *info	= NULL;
+	ListCell   *lc,
+				*lc2;
+
+	/* Use CookedConstraint to fetch the not-null constraints. */
+	not_null_cs = RelationGetNotNullConstraints(relid, true, false);
+	if (not_null_cs == NIL)
+		return NULL;
+
+	foreach_ptr(CookedConstraint, cc, not_null_cs)
+		not_null_attnos = bms_add_member(not_null_attnos, cc->attnum - FirstLowInvalidHeapAttributeNumber);
+
+	/* Scan pg_index for unique index of the target rel */
+	pg_index = table_open(IndexRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey,
+				Anum_pg_index_indrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+	scan = systable_beginscan(pg_index, IndexIndrelidIndexId, true,
+							  NULL, 1, &skey);
+
+	while (HeapTupleIsValid(indexTuple = systable_getnext(scan)))
+	{
+		Bitmapset  	*unique_notnull_attnos = NULL;
+		Form_pg_index index = (Form_pg_index) GETSTRUCT(indexTuple);
+		Datum		adatum;
+		bool		isNull;
+		bool		all_notnull = true;
+		int			numkeys;
+		Datum	   *keys;
+		int			nelems;
+
+		/* we're only interested in unique indexes */
+		if (!index->indisunique)
+			continue;
+
+		/* Skip invalid or deferrable indexes */
+		if (!index->indisvalid || !index->indimmediate)
+			continue;
+
+		/* Skip expression indexes and predicate indexes */
+		if (!heap_attisnull(indexTuple, Anum_pg_index_indpred, NULL) ||
+			!heap_attisnull(indexTuple, Anum_pg_index_indexprs, NULL))
+			continue;
+
+		/* Extract the pg_index->indkey int2vector */
+		adatum = heap_getattr(indexTuple, Anum_pg_index_indkey,
+							  RelationGetDescr(pg_index), &isNull);
+		if (isNull)
+			elog(ERROR, "null int2vector for index %u", index->indexrelid);
+
+		deconstruct_array_builtin(DatumGetArrayTypeP(adatum), INT2OID,
+								  &keys,
+								  NULL,
+								  &nelems);
+
+		if (nelems != index->indnatts)
+			elog(ERROR, "corrupted int2vector for index %u", index->indexrelid);
+
+		Assert(nelems >= index->indnkeyatts);
+		numkeys = index->indnkeyatts;
+
+		for (int i = 0; i < numkeys; i++)
+		{
+			int	attno = DatumGetInt16(keys[i]) - FirstLowInvalidHeapAttributeNumber;
+
+			/* ignore this index if any columns are NULLable */
+			if (!bms_is_member(attno, not_null_attnos))
+			{
+				all_notnull = false;
+				break;
+			}
+
+			unique_notnull_attnos = bms_add_member(unique_notnull_attnos, attno);
+		}
+
+		/* if the index has no NULLable columns, add it to the list */
+		if (all_notnull)
+		{
+			indlist = lappend(indlist, unique_notnull_attnos);
+			indexOidlist = lappend_oid(indexOidlist, index->indexrelid);
+		}
+	}
+	systable_endscan(scan);
+
+	table_close(pg_index, AccessShareLock);
+
+	/* no need to sort if we have fewer than two candidates; return as is */
+	if (list_length(indlist) <= 1)
+		return indlist;
+
+	info = (unique_nn_info *) palloc0(sizeof(unique_nn_info) * list_length(indlist));
+	forboth(lc, indlist, lc2, indexOidlist)
+	{
+		Bitmapset *bmt = lfirst_node(Bitmapset, lc);
+		Oid		idxoid = lfirst_oid(lc2);
+
+		info[j].unique_notnull_attnos = bmt;
+		info[j++].indexoid = idxoid;
+	}
+	qsort(info, list_length(indlist), sizeof(unique_nn_info), unique_notnull_sort_by_idxoid);
+
+	for (j = 0; j < list_length(indlist); j++)
+		result = lappend(result, info[j].unique_notnull_attnos);
+
+	return result;
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762e..9bbe843ec9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "catalog/index.h"
 #include "executor/executor.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
@@ -2805,8 +2806,9 @@ remove_useless_groupby_columns(PlannerInfo *root)
 	{
 		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
 		Bitmapset  *relattnos;
-		Bitmapset  *pkattnos;
-		Oid			constraintOid;
+		Bitmapset  *best_attnos = NULL;
+		List	   *unique_notnull_attnos;
+		int32		best_num_attnos = PG_INT32_MAX;
 
 		relid++;
 
@@ -2828,18 +2830,47 @@ remove_useless_groupby_columns(PlannerInfo *root)
 			continue;
 
 		/*
-		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
+		 * Fetch details about all suitable unique indexes with NOT NULL
+		 * constraints on all columns.
 		 */
-		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-		if (pkattnos == NULL)
-			continue;
+		unique_notnull_attnos = get_unique_not_null_attnos(rte->relid);
 
 		/*
-		 * If the primary key is a proper subset of relattnos then we have
-		 * some items in the GROUP BY that can be removed.
+		 * Now search each index to check if there are any indexes where the
+		 * indexed columns are a subset of the GROUP BY columns for this
+		 * relation.
 		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+		foreach_node(Bitmapset, ind_attnos, unique_notnull_attnos)
+		{
+			int			members;
+
+			/*
+			 * Skip any indexes where the indexed columns aren't a subset of
+			 * the GROUP BY.
+			 */
+			if (bms_subset_compare(ind_attnos, relattnos) != BMS_SUBSET1)
+				continue;
+
+			/*
+			 * Record the attribute numbers from the index with the fewest
+			 * columns.  This allows the largest number of columns to be
+			 * removed from the GROUP BY clause.  In the future, we may wish
+			 * to consider using the narrowest set of columns and looking at
+			 * pg_statistic.stawidth as it might be better to use an index
+			 * with, say two INT4s, rather than, say, one long varlena column.
+			 */
+			members = bms_num_members(ind_attnos);
+
+			if (members < best_num_attnos)
+			{
+				best_attnos = ind_attnos;
+				best_num_attnos = members;
+			}
+		}
+
+		list_free(unique_notnull_attnos);
+
+		if (best_attnos != NULL)
 		{
 			/*
 			 * To easily remember whether we've found anything to do, we don't
@@ -2850,7 +2881,7 @@ remove_useless_groupby_columns(PlannerInfo *root)
 													 (list_length(parse->rtable) + 1));
 
 			/* Remember the attnos of the removable columns */
-			surplusvars[relid] = bms_difference(relattnos, pkattnos);
+			surplusvars[relid] = bms_difference(relattnos, best_attnos);
 		}
 	}
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..dacc0f6ebd 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,7 @@ extern void RestoreReindexState(const void *reindexstate);
 
 extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
 
+extern List *get_unique_not_null_attnos(Oid relid);
 
 /*
  * itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1e682565d1..f6380438f6 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1349,7 +1349,7 @@ LINE 1: select avg((select avg(a1.col1 order by (select avg(a2.col2)...
 --
 create temp table t1 (a int, b int, c int, d int, primary key (a, b));
 create temp table t2 (x int, y int, z int, primary key (x, y));
-create temp table t3 (a int, b int, c int, primary key(a, b) deferrable);
+create temp table t3 (a int, b int, c int not null, primary key(a, b) deferrable);
 -- Non-primary-key columns can be removed from GROUP BY
 explain (costs off) select * from t1 group by a,b,c,d;
       QUERY PLAN      
@@ -1407,6 +1407,26 @@ explain (costs off) select * from t3 group by a,b,c;
    ->  Seq Scan on t3
 (3 rows)
 
+-- Cannot optimize when unique index is deferrable
+ALTER TABLE t3 ADD constraint t3_c_unique unique (c) deferrable initially deferred;
+explain (costs off) select count(*) from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b, c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- not-null constraints cannot be deferrable, so removing columns is OK.
+ALTER TABLE t3 ADD constraint t3_b_unique unique (b);
+explain (costs off) select * from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b
+   ->  Seq Scan on t3
+(3 rows)
+
 create temp table t1c () inherits (t1);
 -- Ensure we don't remove any columns when t1 has a child table
 explain (costs off) select * from t1 group by a,b,c,d;
@@ -1454,6 +1474,43 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 --
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+-- use the unique index to remove useless GROUP BY columns
+-- since it has fewer columns compared with the primary key
+explain (costs off) select * from t2 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t2
+(3 rows)
+
+drop table t2;
+-- more than one candidate: choose the one with the smaller indexoid value.
+create temp table t3(a int not null, b int not null, c int,
+                      constraint na unique(a), constraint nb unique(b));
+explain(costs off) select count(*) from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a
+   ->  Seq Scan on t3
+(3 rows)
+
+alter table t3 drop constraint na;
+alter table t3 add constraint na unique(a);
+explain(costs off) select count(*) from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b
+   ->  Seq Scan on t3
+(3 rows)
+
+drop table t3;
+--
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
 create temp table t1(f1 int, f2 int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 1b0a5f0e9e..27acc50007 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1526,10 +1526,19 @@ DROP TABLE concur_heap;
 --
 -- Test ADD CONSTRAINT USING INDEX
 --
-CREATE TABLE cwi_test( a int , b varchar(10), c char);
+CREATE TABLE cwi_test( a int not null, b varchar(10) not null, c char);
 -- add some data so that all tests have something to work with.
 INSERT INTO cwi_test VALUES(1, 2), (3, 4), (5, 6);
 CREATE UNIQUE INDEX cwi_uniq_idx ON cwi_test(a , b);
+-- test that a unique index on not-null columns can be used to remove other GROUP BY columns.
+explain (costs off) select count(*) from cwi_test group by a,b,c;
+         QUERY PLAN         
+----------------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on cwi_test
+(3 rows)
+
 ALTER TABLE cwi_test ADD primary key USING INDEX cwi_uniq_idx;
 \d cwi_test
                      Table "public.cwi_test"
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d7..3f67846480 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1321,6 +1321,17 @@ insert into idxpart2 (c, a, b) values (42, 572814, 'inserted first');
 alter table idxpart2 drop column c;
 create unique index on idxpart (a);
 alter table idxpart attach partition idxpart2 for values from (100000) to (1000000);
+-- test using "idxpart_a_idx" to remove useless GROUP BY columns.
+explain(costs off) select count(*) from idxpart group by a, b;
+                 QUERY PLAN                 
+--------------------------------------------
+ HashAggregate
+   Group Key: idxpart.a
+   ->  Append
+         ->  Seq Scan on idxpart1 idxpart_1
+         ->  Seq Scan on idxpart2 idxpart_2
+(5 rows)
+
 insert into idxpart values (0, 'zero'), (42, 'life'), (2^16, 'sixteen');
 insert into idxpart select 2^g, format('two to power of %s', g) from generate_series(15, 17) g;
 ERROR:  duplicate key value violates unique constraint "idxpart1_a_idx"
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 4885daffe6..27b57a46c7 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -465,7 +465,7 @@ from tenk1 a2(col2);
 
 create temp table t1 (a int, b int, c int, d int, primary key (a, b));
 create temp table t2 (x int, y int, z int, primary key (x, y));
-create temp table t3 (a int, b int, c int, primary key(a, b) deferrable);
+create temp table t3 (a int, b int, c int not null, primary key(a, b) deferrable);
 
 -- Non-primary-key columns can be removed from GROUP BY
 explain (costs off) select * from t1 group by a,b,c,d;
@@ -486,6 +486,12 @@ group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
 -- Cannot optimize when PK is deferrable
 explain (costs off) select * from t3 group by a,b,c;
 
+-- Cannot optimize when unique index is deferrable
+ALTER TABLE t3 ADD constraint t3_c_unique unique (c) deferrable initially deferred;
+explain (costs off) select count(*) from t3 group by b,c;
+-- not-null constraints cannot be deferrable, so removing columns is OK.
+ALTER TABLE t3 ADD constraint t3_b_unique unique (b);
+explain (costs off) select * from t3 group by a,b,c;
 create temp table t1c () inherits (t1);
 
 -- Ensure we don't remove any columns when t1 has a child table
@@ -512,6 +518,24 @@ drop table t2;
 drop table t3;
 drop table p_t1;
 
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+--
+create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
+-- use the unique index to remove useless GROUP BY columns
+-- since it has fewer columns compared with the primary key
+explain (costs off) select * from t2 group by a,b,c;
+drop table t2;
+
+-- more than one candidate: choose the one with the smaller indexoid value.
+create temp table t3(a int not null, b int not null, c int,
+                      constraint na unique(a), constraint nb unique(b));
+explain(costs off) select count(*) from t3 group by a,b,c;
+alter table t3 drop constraint na;
+alter table t3 add constraint na unique(a);
+explain(costs off) select count(*) from t3 group by a,b,c;
+drop table t3;
+
 --
 -- Test GROUP BY matching of join columns that are type-coerced due to USING
 --
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index ddd0d9ad39..7d53e5fab2 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -581,13 +581,15 @@ DROP TABLE concur_heap;
 -- Test ADD CONSTRAINT USING INDEX
 --
 
-CREATE TABLE cwi_test( a int , b varchar(10), c char);
+CREATE TABLE cwi_test( a int not null, b varchar(10) not null, c char);
 
 -- add some data so that all tests have something to work with.
 
 INSERT INTO cwi_test VALUES(1, 2), (3, 4), (5, 6);
 
 CREATE UNIQUE INDEX cwi_uniq_idx ON cwi_test(a , b);
+-- test that a unique index on not-null columns can be used to remove other GROUP BY columns.
+explain (costs off) select count(*) from cwi_test group by a,b,c;
 ALTER TABLE cwi_test ADD primary key USING INDEX cwi_uniq_idx;
 
 \d cwi_test
diff --git a/src/test/regress/sql/indexing.sql b/src/test/regress/sql/indexing.sql
index b5cb01c2d7..3d7233feb0 100644
--- a/src/test/regress/sql/indexing.sql
+++ b/src/test/regress/sql/indexing.sql
@@ -705,6 +705,9 @@ insert into idxpart2 (c, a, b) values (42, 572814, 'inserted first');
 alter table idxpart2 drop column c;
 create unique index on idxpart (a);
 alter table idxpart attach partition idxpart2 for values from (100000) to (1000000);
+-- test using "idxpart_a_idx" to remove useless GROUP BY columns.
+explain(costs off) select count(*) from idxpart group by a, b;
+
 insert into idxpart values (0, 'zero'), (42, 'life'), (2^16, 'sixteen');
 insert into idxpart select 2^g, format('two to power of %s', g) from generate_series(15, 17) g;
 insert into idxpart values (16, 'sixteen');
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b54428b38c..9632bbeefe 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4026,6 +4026,7 @@ unicodeStyleColumnFormat
 unicodeStyleFormat
 unicodeStyleRowFormat
 unicode_linestyle
+unique_nn_info
 unit_conversion
 unlogged_relation_entry
 utf_local_conversion_func
-- 
2.34.1

#14David Rowley
dgrowleyml@gmail.com
In reply to: jian he (#12)
1 attachment(s)
Re: Remove useless GROUP BY columns considering unique index

On Fri, 29 Nov 2024 at 01:33, jian he <jian.universality@gmail.com> wrote:

On Thu, Nov 28, 2024 at 2:11 PM David Rowley <dgrowleyml@gmail.com> wrote:

I think it'll make more sense to adjust some of the
existing tests to use a unique constraint instead of a PK and then
adjust a column's NOT NULL property to check that part of the code is
working correctly.

looking around, i inserted some tests to indexing.sql,
create_index.sql, aggregates.sql
also added tests for sort by index oid.

I meant the tests added by d4c3a156c. I don't think there's any need
to have tests in more than one file.

Another issue with this is that the list of indexes being used is not
sorted by Oid.

When there are multiple matches, we need a deterministic way to choose which one.
So please check the attached.

Looking closer at get_relation_info(), it looks like the
RelOptInfo.indexlist will be in descending Oid order, per:

/*
* We've historically used lcons() here. It'd make more sense to
* use lappend(), but that causes the planner to change behavior
* in cases where two indexes seem equally attractive. For now,
* stick with lcons() --- few tables should have so many indexes
* that the O(N^2) behavior of lcons() is really a problem.
*/
indexinfos = lcons(info, indexinfos);

I'm starting to feel like we might be trying to include too much logic
to mimic this behaviour. The main problem is that the RelOptInfos
have not yet been created for the given relation when
remove_useless_groupby_columns() is called. That's why we're having to
query the catalogue tables like this. I do see a comment in
preprocess_groupclauses() that mentions:

* Note: we return a fresh List, but its elements are the same
* SortGroupClauses appearing in parse->groupClause. This is important
* because later processing may modify the processed_groupClause list.

So there's already a warning that there may still be future changes to
the processed_groupClause. It'd be nice if we could delay calling
remove_useless_groupby_columns() until the RelOptInfos are built as
we'd not have to query the catalogue tables to make this work. The
soonest point is just after the call to add_base_rels_to_query() in
query_planner(). That would mean doing more work on the
processed_groupClause in query_planner() rather than in
grouping_planner(), which is a little strange. It looks like the first
time we look at processed_groupClause and make a decision based on the
actual value of it is in standard_qp_callback(), i.e. after
add_base_rels_to_query().
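
Concretely, that would mean query_planner() doing something like the
following (this is what the attached patch does):

	add_base_rels_to_query(root, (Node *) parse->jointree);

	/* Remove any redundant GROUP BY columns */
	remove_useless_groupby_columns(root);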

I've attached an updated patch that gets rid of the
get_unique_not_null_attnos() function completely and uses
RelOptInfo.indexlist and RelOptInfo.notnullattnums instead. I also
realised that there's no need to check for NOT NULL constraints with
unique indexes defined with NULLS NOT DISTINCT: since NULLs are unique
in such an index, at most one NULL row can exist, so we can still
determine the functional dependency. One other possibly good outcome
of this is that it's much easier to do the pg_statistic.stawidth thing
I talked about, as RelOptInfo stores those widths in attr_widths.
However, looking at table_block_relation_estimate_size(), I see we
might not always populate those.
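
To make the NULLS NOT DISTINCT part concrete, the per-column check in
the attached patch is just (long comment trimmed):

	if (!index->nullsnotdistinct &&
		!bms_is_member(index->indexkeys[i],
					   rel->notnullattnums))
	{
		nulls_check_ok = false;	/* NULLable column; can't use this index */
		break;
	}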

I ended up re-homing remove_useless_groupby_columns() in initsplan.c.
I couldn't see anywhere else it fitted.

David

Attachments:

v8-0001-remove-useless-group-by-columns-via-unique-not-nu.patchapplication/octet-stream; name=v8-0001-remove-useless-group-by-columns-via-unique-not-nu.patchDownload
From 03bf71b1a794f1f6824da591876dae720797bcc5 Mon Sep 17 00:00:00 2001
From: jian he <jian.universality@gmail.com>
Date: Thu, 28 Nov 2024 20:28:10 +0800
Subject: [PATCH v8] remove useless group by columns via unique-not-null

In the `remove_useless_groupby_columns` function,
use a unique index and a not-null constraint to eliminate unnecessary columns.
Here, the primary key is treated as a "unique index and not-null".
If there are multiple candidates,
use the unique index with the fewest columns to remove the other columns
in the GROUP BY expression.

discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0@Spar
---
 src/backend/optimizer/plan/initsplan.c   | 231 ++++++++++++++++++++++-
 src/backend/optimizer/plan/planmain.c    |   3 +
 src/backend/optimizer/plan/planner.c     | 163 ----------------
 src/backend/optimizer/util/plancat.c     |   1 +
 src/include/nodes/pathnodes.h            |   2 +
 src/include/optimizer/planmain.h         |   1 +
 src/test/regress/expected/aggregates.out |  35 ++++
 src/test/regress/sql/aggregates.sql      |  18 ++
 8 files changed, 290 insertions(+), 164 deletions(-)

diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 903c397d40..d5cd6b909c 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * initsplan.c
- *	  Target list, qualification, joininfo initialization routines
+ *	  Target list, group by, qualification, joininfo initialization routines
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -387,6 +387,235 @@ add_vars_to_attr_needed(PlannerInfo *root, List *vars,
 }
 
 
+/*****************************************************************************
+ *
+ *	  GROUP BY
+ *
+ *****************************************************************************/
+
+/*
+ * remove_useless_groupby_columns
+ *		Remove any columns in the GROUP BY clause that are redundant due to
+ *		being functionally dependent on other GROUP BY columns.
+ *
+ * Since some other DBMSes do not allow references to ungrouped columns, it's
+ * not unusual to find all columns listed in GROUP BY even though listing the
+ * primary-key columns, or the columns of a unique constraint, would be
+ * Deleting such excess columns avoids redundant sorting or hashing work, so
+ * it's worth doing.
+ *
+ * Relcache invalidations will ensure that cached plans become invalidated
+ * when the underlying supporting indexes are dropped or if a column's NOT
+ * NULL attribute is removed.
+ */
+void
+remove_useless_groupby_columns(PlannerInfo *root)
+{
+	Query	   *parse = root->parse;
+	Bitmapset **groupbyattnos;
+	Bitmapset **surplusvars;
+	ListCell   *lc;
+	int			relid;
+
+	/* No chance to do anything if there are less than two GROUP BY items */
+	if (list_length(root->processed_groupClause) < 2)
+		return;
+
+	/* Don't fiddle with the GROUP BY clause if the query has grouping sets */
+	if (parse->groupingSets)
+		return;
+
+	/*
+	 * Scan the GROUP BY clause to find GROUP BY items that are simple Vars.
+	 * Fill groupbyattnos[k] with a bitmapset of the column attnos of RTE k
+	 * that are GROUP BY items.
+	 */
+	groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
+										   (list_length(parse->rtable) + 1));
+	foreach(lc, root->processed_groupClause)
+	{
+		SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
+		TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
+		Var		   *var = (Var *) tle->expr;
+
+		/*
+		 * Ignore non-Vars and Vars from other query levels.
+		 *
+		 * XXX in principle, stable expressions containing Vars could also be
+		 * removed, if all the Vars are functionally dependent on other GROUP
+		 * BY items.  But it's not clear that such cases occur often enough to
+		 * be worth troubling over.
+		 */
+		if (!IsA(var, Var) ||
+			var->varlevelsup > 0)
+			continue;
+
+		/* OK, remember we have this Var */
+		relid = var->varno;
+		Assert(relid <= list_length(parse->rtable));
+		groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
+											  var->varattno - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/*
+	 * Consider each relation and see if it is possible to remove some of its
+	 * Vars from GROUP BY.  For simplicity and speed, we do the actual removal
+	 * in a separate pass.  Here, we just fill surplusvars[k] with a bitmapset
+	 * of the column attnos of RTE k that are removable GROUP BY items.
+	 */
+	surplusvars = NULL;			/* don't allocate array unless required */
+	relid = 0;
+	foreach(lc, parse->rtable)
+	{
+		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
+		RelOptInfo *rel;
+		Bitmapset  *relattnos;
+		Bitmapset  *best_keycolumns = NULL;
+		int32		best_nkeycolumns = PG_INT32_MAX;
+
+		relid++;
+
+		/* Only plain relations could have primary-key constraints */
+		if (rte->rtekind != RTE_RELATION)
+			continue;
+
+		/*
+		 * We must skip inheritance parent tables as some of the child rels
+		 * may cause duplicate rows.  This cannot happen with partitioned
+		 * tables, however.
+		 */
+		if (rte->inh && rte->relkind != RELKIND_PARTITIONED_TABLE)
+			continue;
+
+		/* Nothing to do unless this rel has multiple Vars in GROUP BY */
+		relattnos = groupbyattnos[relid];
+		if (bms_membership(relattnos) != BMS_MULTIPLE)
+			continue;
+
+		rel = root->simple_rel_array[relid];
+
+		/*
+		 * Now search each index to check if there are any indexes where the
+		 * indexed columns are a subset of the GROUP BY columns for this
+		 * relation.
+		 */
+		foreach_node(IndexOptInfo, index, rel->indexlist)
+		{
+			Bitmapset  *ind_attnos;
+			bool		nulls_check_ok;
+
+			/*
+			 * Skip any non-unique or deferrable indexes.  Predicate indexes
+			 * have not been checked yet, so we must skip those too as the
+			 * predOK check might fail.
+			 */
+			if (!index->unique || !index->immediate || index->indpred != NIL)
+				continue;
+
+			/* For simplicity we currently don't support expression indexes */
+			if (index->indexprs != NIL)
+				continue;
+
+			ind_attnos = NULL;
+			nulls_check_ok = true;
+			for (int i = 0; i < index->nkeycolumns; i++)
+			{
+				/*
+				 * We must insist that the index columns are all defined NOT
+				 * NULL, otherwise duplicate NULLs could exist.  However, we
+				 * can relax this check when the index is defined with NULLS
+				 * NOT DISTINCT as there can only be 1 NULL row, therefore
+				 * functional dependency on the unique columns is maintained,
+				 * despite the NULL.
+				 */
+				if (!index->nullsnotdistinct &&
+					!bms_is_member(index->indexkeys[i],
+								   rel->notnullattnums))
+				{
+					nulls_check_ok = false;
+					break;
+				}
+
+				ind_attnos =
+					bms_add_member(ind_attnos,
+								   index->indexkeys[i] -
+								   FirstLowInvalidHeapAttributeNumber);
+			}
+
+			if (!nulls_check_ok)
+				continue;
+
+			/*
+			 * Skip any indexes where the indexed columns aren't a subset of
+			 * the GROUP BY.
+			 */
+			if (bms_subset_compare(ind_attnos, relattnos) != BMS_SUBSET1)
+				continue;
+
+			/*
+			 * Record the attribute numbers from the index with the fewest
+			 * columns.  This allows the largest number of columns to be
+			 * removed from the GROUP BY clause.  In the future, we may wish
+			 * to consider using the narrowest set of columns and looking at
+			 * pg_statistic.stawidth as it might be better to use an index
+			 * with, say two INT4s, rather than, say, one long varlena column.
+			 */
+			if (index->nkeycolumns < best_nkeycolumns)
+			{
+				best_keycolumns = ind_attnos;
+				best_nkeycolumns = index->nkeycolumns;
+			}
+		}
+
+		/* Did we find a suitable index? */
+		if (best_keycolumns != NULL)
+		{
+			/*
+			 * To easily remember whether we've found anything to do, we don't
+			 * allocate the surplusvars[] array until we find something.
+			 */
+			if (surplusvars == NULL)
+				surplusvars =
+					(Bitmapset **) palloc0(sizeof(Bitmapset *) *
+										   (list_length(parse->rtable) +
+											1));
+
+			/* Remember the attnos of the removable columns */
+			surplusvars[relid] = bms_difference(relattnos, best_keycolumns);
+		}
+	}
+
+	/*
+	 * If we found any surplus Vars, build a new GROUP BY clause without them.
+	 * (Note: this may leave some TLEs with unreferenced ressortgroupref
+	 * markings, but that's harmless.)
+	 */
+	if (surplusvars != NULL)
+	{
+		List	   *new_groupby = NIL;
+
+		foreach(lc, root->processed_groupClause)
+		{
+			SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
+			TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
+			Var		   *var = (Var *) tle->expr;
+
+			/*
+			 * New list must include non-Vars, outer Vars, and anything not
+			 * marked as surplus.
+			 */
+			if (!IsA(var, Var) || var->varlevelsup > 0 ||
+				!bms_is_member(var->varattno -
+							   FirstLowInvalidHeapAttributeNumber,
+							   surplusvars[var->varno]))
+				new_groupby = lappend(new_groupby, sgc);
+		}
+
+		root->processed_groupClause = new_groupby;
+	}
+}
+
+
 /*****************************************************************************
  *
  *	  LATERAL REFERENCES
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index e17d31a5c3..735560e8ca 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -169,6 +169,9 @@ query_planner(PlannerInfo *root,
 	 */
 	add_base_rels_to_query(root, (Node *) parse->jointree);
 
+	/* Remove any redundant GROUP BY columns */
+	remove_useless_groupby_columns(root);
+
 	/*
 	 * Examine the targetlist and join tree, adding entries to baserel
 	 * targetlists for all referenced Vars, and generating PlaceHolderInfo
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762e..352c4edf3a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -139,7 +139,6 @@ static void preprocess_rowmarks(PlannerInfo *root);
 static double preprocess_limit(PlannerInfo *root,
 							   double tuple_fraction,
 							   int64 *offset_est, int64 *count_est);
-static void remove_useless_groupby_columns(PlannerInfo *root);
 static List *preprocess_groupclause(PlannerInfo *root, List *force);
 static List *extract_rollup_sets(List *groupingSets);
 static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
@@ -1487,8 +1486,6 @@ grouping_planner(PlannerInfo *root, double tuple_fraction,
 		{
 			/* Preprocess regular GROUP BY clause, if any */
 			root->processed_groupClause = preprocess_groupclause(root, NIL);
-			/* Remove any redundant GROUP BY columns */
-			remove_useless_groupby_columns(root);
 		}
 
 		/*
@@ -2724,166 +2721,6 @@ limit_needed(Query *parse)
 	return false;				/* don't need a Limit plan node */
 }
 
-
-/*
- * remove_useless_groupby_columns
- *		Remove any columns in the GROUP BY clause that are redundant due to
- *		being functionally dependent on other GROUP BY columns.
- *
- * Since some other DBMSes do not allow references to ungrouped columns, it's
- * not unusual to find all columns listed in GROUP BY even though listing the
- * primary-key columns would be sufficient.  Deleting such excess columns
- * avoids redundant sorting work, so it's worth doing.
- *
- * Relcache invalidations will ensure that cached plans become invalidated
- * when the underlying index of the pkey constraint is dropped.
- *
- * Currently, we only make use of pkey constraints for this, however, we may
- * wish to take this further in the future and also use unique constraints
- * which have NOT NULL columns.  In that case, plan invalidation will still
- * work since relations will receive a relcache invalidation when a NOT NULL
- * constraint is dropped.
- */
-static void
-remove_useless_groupby_columns(PlannerInfo *root)
-{
-	Query	   *parse = root->parse;
-	Bitmapset **groupbyattnos;
-	Bitmapset **surplusvars;
-	ListCell   *lc;
-	int			relid;
-
-	/* No chance to do anything if there are less than two GROUP BY items */
-	if (list_length(root->processed_groupClause) < 2)
-		return;
-
-	/* Don't fiddle with the GROUP BY clause if the query has grouping sets */
-	if (parse->groupingSets)
-		return;
-
-	/*
-	 * Scan the GROUP BY clause to find GROUP BY items that are simple Vars.
-	 * Fill groupbyattnos[k] with a bitmapset of the column attnos of RTE k
-	 * that are GROUP BY items.
-	 */
-	groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
-										   (list_length(parse->rtable) + 1));
-	foreach(lc, root->processed_groupClause)
-	{
-		SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
-		TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
-		Var		   *var = (Var *) tle->expr;
-
-		/*
-		 * Ignore non-Vars and Vars from other query levels.
-		 *
-		 * XXX in principle, stable expressions containing Vars could also be
-		 * removed, if all the Vars are functionally dependent on other GROUP
-		 * BY items.  But it's not clear that such cases occur often enough to
-		 * be worth troubling over.
-		 */
-		if (!IsA(var, Var) ||
-			var->varlevelsup > 0)
-			continue;
-
-		/* OK, remember we have this Var */
-		relid = var->varno;
-		Assert(relid <= list_length(parse->rtable));
-		groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
-											  var->varattno - FirstLowInvalidHeapAttributeNumber);
-	}
-
-	/*
-	 * Consider each relation and see if it is possible to remove some of its
-	 * Vars from GROUP BY.  For simplicity and speed, we do the actual removal
-	 * in a separate pass.  Here, we just fill surplusvars[k] with a bitmapset
-	 * of the column attnos of RTE k that are removable GROUP BY items.
-	 */
-	surplusvars = NULL;			/* don't allocate array unless required */
-	relid = 0;
-	foreach(lc, parse->rtable)
-	{
-		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
-		Bitmapset  *relattnos;
-		Bitmapset  *pkattnos;
-		Oid			constraintOid;
-
-		relid++;
-
-		/* Only plain relations could have primary-key constraints */
-		if (rte->rtekind != RTE_RELATION)
-			continue;
-
-		/*
-		 * We must skip inheritance parent tables as some of the child rels
-		 * may cause duplicate rows.  This cannot happen with partitioned
-		 * tables, however.
-		 */
-		if (rte->inh && rte->relkind != RELKIND_PARTITIONED_TABLE)
-			continue;
-
-		/* Nothing to do unless this rel has multiple Vars in GROUP BY */
-		relattnos = groupbyattnos[relid];
-		if (bms_membership(relattnos) != BMS_MULTIPLE)
-			continue;
-
-		/*
-		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
-		 */
-		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-		if (pkattnos == NULL)
-			continue;
-
-		/*
-		 * If the primary key is a proper subset of relattnos then we have
-		 * some items in the GROUP BY that can be removed.
-		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
-		{
-			/*
-			 * To easily remember whether we've found anything to do, we don't
-			 * allocate the surplusvars[] array until we find something.
-			 */
-			if (surplusvars == NULL)
-				surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
-													 (list_length(parse->rtable) + 1));
-
-			/* Remember the attnos of the removable columns */
-			surplusvars[relid] = bms_difference(relattnos, pkattnos);
-		}
-	}
-
-	/*
-	 * If we found any surplus Vars, build a new GROUP BY clause without them.
-	 * (Note: this may leave some TLEs with unreferenced ressortgroupref
-	 * markings, but that's harmless.)
-	 */
-	if (surplusvars != NULL)
-	{
-		List	   *new_groupby = NIL;
-
-		foreach(lc, root->processed_groupClause)
-		{
-			SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
-			TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
-			Var		   *var = (Var *) tle->expr;
-
-			/*
-			 * New list must include non-Vars, outer Vars, and anything not
-			 * marked as surplus.
-			 */
-			if (!IsA(var, Var) ||
-				var->varlevelsup > 0 ||
-				!bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber,
-							   surplusvars[var->varno]))
-				new_groupby = lappend(new_groupby, sgc);
-		}
-
-		root->processed_groupClause = new_groupby;
-	}
-}
-
 /*
  * preprocess_groupclause - do preparatory work on GROUP BY clause
  *
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 37b0ca2e43..153390f2dc 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -457,6 +457,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->indrestrictinfo = NIL;	/* set later, in indxpath.c */
 			info->predOK = false;	/* set later, in indxpath.c */
 			info->unique = index->indisunique;
+			info->nullsnotdistinct = index->indnullsnotdistinct;
 			info->immediate = index->indimmediate;
 			info->hypothetical = false;
 
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index add0f9e45f..0759e00e96 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1187,6 +1187,8 @@ struct IndexOptInfo
 	bool		predOK;
 	/* true if a unique index */
 	bool		unique;
+	/* true if the index was defined with NULLS NOT DISTINCT */
+	bool		nullsnotdistinct;
 	/* is uniqueness enforced immediately? */
 	bool		immediate;
 	/* true if index doesn't really exist */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 93137261e4..0b6f0f7969 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -74,6 +74,7 @@ extern void add_vars_to_targetlist(PlannerInfo *root, List *vars,
 								   Relids where_needed);
 extern void add_vars_to_attr_needed(PlannerInfo *root, List *vars,
 									Relids where_needed);
+extern void remove_useless_groupby_columns(PlannerInfo *root);
 extern void find_lateral_references(PlannerInfo *root);
 extern void rebuild_lateral_attr_needed(PlannerInfo *root);
 extern void create_lateral_join_info(PlannerInfo *root);
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1e682565d1..4088f418ff 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1448,6 +1448,41 @@ explain (costs off) select * from p_t1 group by a,b,c,d;
          ->  Seq Scan on p_t1_2
 (5 rows)
 
+create unique index t3_c_uidx on t3(c);
+-- Ensure we don't remove any columns from the GROUP BY for a unique
+-- index on a NULLable column.
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b, c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Make the column NOT NULL and ensure we remove the redundant column
+alter table t3 alter column c set not null;
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- A unique index defined as NULLS NOT DISTINCT does not need a supporting NOT
+-- NULL constraint on the indexed columns.  Ensure the redundant columns are
+-- removed from the GROUP BY for such a table.
+drop index t3_c_uidx;
+alter table t3 alter column c drop not null;
+create unique index t3_c_uidx on t3(c) nulls not distinct;
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
 drop table t1 cascade;
 NOTICE:  drop cascades to table t1c
 drop table t2;
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 4885daffe6..e39b20f450 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -507,6 +507,24 @@ create temp table p_t1_2 partition of p_t1 for values in(2);
 -- Ensure we can remove non-PK columns for partitioned tables.
 explain (costs off) select * from p_t1 group by a,b,c,d;
 
+create unique index t3_c_uidx on t3(c);
+
+-- Ensure we don't remove any columns from the GROUP BY for a unique
+-- index on a NULLable column.
+explain (costs off) select b,c from t3 group by b,c;
+
+-- Make the column NOT NULL and ensure we remove the redundant column
+alter table t3 alter column c set not null;
+explain (costs off) select b,c from t3 group by b,c;
+
+-- A unique index defined as NULLS NOT DISTINCT does not need a supporting NOT
+-- NULL constraint on the indexed columns.  Ensure the redundant columns are
+-- removed from the GROUP BY for such a table.
+drop index t3_c_uidx;
+alter table t3 alter column c drop not null;
+create unique index t3_c_uidx on t3(c) nulls not distinct;
+explain (costs off) select b,c from t3 group by b,c;
+
 drop table t1 cascade;
 drop table t2;
 drop table t3;
-- 
2.34.1

#15David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#14)
1 attachment(s)
Re: Remove useless GROUP BY columns considering unique index

On Fri, 29 Nov 2024 at 15:02, David Rowley <dgrowleyml@gmail.com> wrote:

I've attached an updated patch that gets rid of the
get_unique_not_null_attnos() function completely and uses the
RelOptInfo.indexlist and RelOptInfo.notnullattnums.

I forgot to do a local commit before sending v8. Fixed in the attached v9.

David

Attachments:

v9-0001-remove-useless-group-by-columns-via-unique-not-nu.patchapplication/octet-stream; name=v9-0001-remove-useless-group-by-columns-via-unique-not-nu.patchDownload
From bb772626cc0f86376a62ccf7b18a0d3922b982e7 Mon Sep 17 00:00:00 2001
From: jian he <jian.universality@gmail.com>
Date: Thu, 28 Nov 2024 20:28:10 +0800
Subject: [PATCH v9] remove useless group by columns via unique-not-null

In the `remove_useless_groupby_columns` function,
use a unique index and a not-null constraint to eliminate unnecessary columns.
Here, the primary key is treated as a "unique index and not-null".
If there are multiple candidates,
use the unique index with the fewest columns to remove the other columns
in the GROUP BY expression.

discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0@Spar
---
 src/backend/optimizer/plan/initsplan.c   | 231 ++++++++++++++++++++++-
 src/backend/optimizer/plan/planmain.c    |   3 +
 src/backend/optimizer/plan/planner.c     | 163 ----------------
 src/backend/optimizer/util/plancat.c     |   1 +
 src/include/nodes/pathnodes.h            |   2 +
 src/include/optimizer/planmain.h         |   1 +
 src/test/regress/expected/aggregates.out |  56 ++++++
 src/test/regress/sql/aggregates.sql      |  27 +++
 8 files changed, 320 insertions(+), 164 deletions(-)

diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 903c397d40..d5cd6b909c 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * initsplan.c
- *	  Target list, qualification, joininfo initialization routines
+ *	  Target list, group by, qualification, joininfo initialization routines
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -387,6 +387,235 @@ add_vars_to_attr_needed(PlannerInfo *root, List *vars,
 }
 
 
+/*****************************************************************************
+ *
+ *	  GROUP BY
+ *
+ *****************************************************************************/
+
+/*
+ * remove_useless_groupby_columns
+ *		Remove any columns in the GROUP BY clause that are redundant due to
+ *		being functionally dependent on other GROUP BY columns.
+ *
+ * Since some other DBMSes do not allow references to ungrouped columns, it's
+ * not unusual to find all columns listed in GROUP BY even though listing the
+ * primary-key columns, or the columns of a unique constraint, would be
+ * Deleting such excess columns avoids redundant sorting or hashing work, so
+ * it's worth doing.
+ *
+ * Relcache invalidations will ensure that cached plans become invalidated
+ * when the underlying supporting indexes are dropped or if a column's NOT
+ * NULL attribute is removed.
+ */
+void
+remove_useless_groupby_columns(PlannerInfo *root)
+{
+	Query	   *parse = root->parse;
+	Bitmapset **groupbyattnos;
+	Bitmapset **surplusvars;
+	ListCell   *lc;
+	int			relid;
+
+	/* No chance to do anything if there are less than two GROUP BY items */
+	if (list_length(root->processed_groupClause) < 2)
+		return;
+
+	/* Don't fiddle with the GROUP BY clause if the query has grouping sets */
+	if (parse->groupingSets)
+		return;
+
+	/*
+	 * Scan the GROUP BY clause to find GROUP BY items that are simple Vars.
+	 * Fill groupbyattnos[k] with a bitmapset of the column attnos of RTE k
+	 * that are GROUP BY items.
+	 */
+	groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
+										   (list_length(parse->rtable) + 1));
+	foreach(lc, root->processed_groupClause)
+	{
+		SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
+		TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
+		Var		   *var = (Var *) tle->expr;
+
+		/*
+		 * Ignore non-Vars and Vars from other query levels.
+		 *
+		 * XXX in principle, stable expressions containing Vars could also be
+		 * removed, if all the Vars are functionally dependent on other GROUP
+		 * BY items.  But it's not clear that such cases occur often enough to
+		 * be worth troubling over.
+		 */
+		if (!IsA(var, Var) ||
+			var->varlevelsup > 0)
+			continue;
+
+		/* OK, remember we have this Var */
+		relid = var->varno;
+		Assert(relid <= list_length(parse->rtable));
+		groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
+											  var->varattno - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/*
+	 * Consider each relation and see if it is possible to remove some of its
+	 * Vars from GROUP BY.  For simplicity and speed, we do the actual removal
+	 * in a separate pass.  Here, we just fill surplusvars[k] with a bitmapset
+	 * of the column attnos of RTE k that are removable GROUP BY items.
+	 */
+	surplusvars = NULL;			/* don't allocate array unless required */
+	relid = 0;
+	foreach(lc, parse->rtable)
+	{
+		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
+		RelOptInfo *rel;
+		Bitmapset  *relattnos;
+		Bitmapset  *best_keycolumns = NULL;
+		int32		best_nkeycolumns = PG_INT32_MAX;
+
+		relid++;
+
+		/* Only plain relations could have primary-key constraints */
+		if (rte->rtekind != RTE_RELATION)
+			continue;
+
+		/*
+		 * We must skip inheritance parent tables as some of the child rels
+		 * may cause duplicate rows.  This cannot happen with partitioned
+		 * tables, however.
+		 */
+		if (rte->inh && rte->relkind != RELKIND_PARTITIONED_TABLE)
+			continue;
+
+		/* Nothing to do unless this rel has multiple Vars in GROUP BY */
+		relattnos = groupbyattnos[relid];
+		if (bms_membership(relattnos) != BMS_MULTIPLE)
+			continue;
+
+		rel = root->simple_rel_array[relid];
+
+		/*
+		 * Now search each index to check if there are any indexes where the
+		 * indexed columns are a subset of the GROUP BY columns for this
+		 * relation.
+		 */
+		foreach_node(IndexOptInfo, index, rel->indexlist)
+		{
+			Bitmapset  *ind_attnos;
+			bool		nulls_check_ok;
+
+			/*
+			 * Skip any non-unique or deferrable indexes.  Predicate indexes
+			 * have not been checked yet, so we must skip those too as the
+			 * predOK check might fail.
+			 */
+			if (!index->unique || !index->immediate || index->indpred != NIL)
+				continue;
+
+			/* For simplicity we currently don't support expression indexes */
+			if (index->indexprs != NIL)
+				continue;
+
+			ind_attnos = NULL;
+			nulls_check_ok = true;
+			for (int i = 0; i < index->nkeycolumns; i++)
+			{
+				/*
+				 * We must insist that the index columns are all defined NOT
+				 * NULL, otherwise duplicate NULLs could exist.  However, we
+				 * can relax this check when the index is defined with NULLS
+				 * NOT DISTINCT as there can only be 1 NULL row, therefore
+				 * functional dependency on the unique columns is maintained,
+				 * despite the NULL.
+				 */
+				if (!index->nullsnotdistinct &&
+					!bms_is_member(index->indexkeys[i],
+								   rel->notnullattnums))
+				{
+					nulls_check_ok = false;
+					break;
+				}
+
+				ind_attnos =
+					bms_add_member(ind_attnos,
+								   index->indexkeys[i] -
+								   FirstLowInvalidHeapAttributeNumber);
+			}
+
+			if (!nulls_check_ok)
+				continue;
+
+			/*
+			 * Skip any indexes where the indexed columns aren't a subset of
+			 * the GROUP BY.
+			 */
+			if (bms_subset_compare(ind_attnos, relattnos) != BMS_SUBSET1)
+				continue;
+
+			/*
+			 * Record the attribute numbers from the index with the fewest
+			 * columns.  This allows the largest number of columns to be
+			 * removed from the GROUP BY clause.  In the future, we may wish
+			 * to consider using the narrowest set of columns and looking at
+			 * pg_statistic.stawidth as it might be better to use an index
+			 * with, say, two INT4s rather than, say, one long varlena column.
+			 */
+			if (index->nkeycolumns < best_nkeycolumns)
+			{
+				best_keycolumns = ind_attnos;
+				best_nkeycolumns = index->nkeycolumns;
+			}
+		}
+
+		/* Did we find a suitable index? */
+		if (best_keycolumns != NULL)
+		{
+			/*
+			 * To easily remember whether we've found anything to do, we don't
+			 * allocate the surplusvars[] array until we find something.
+			 */
+			if (surplusvars == NULL)
+				surplusvars =
+					(Bitmapset **) palloc0(sizeof(Bitmapset *) *
+										   (list_length(parse->rtable) +
+											1));
+
+			/* Remember the attnos of the removable columns */
+			surplusvars[relid] = bms_difference(relattnos, best_keycolumns);
+		}
+	}
+
+	/*
+	 * If we found any surplus Vars, build a new GROUP BY clause without them.
+	 * (Note: this may leave some TLEs with unreferenced ressortgroupref
+	 * markings, but that's harmless.)
+	 */
+	if (surplusvars != NULL)
+	{
+		List	   *new_groupby = NIL;
+
+		foreach(lc, root->processed_groupClause)
+		{
+			SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
+			TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
+			Var		   *var = (Var *) tle->expr;
+
+			/*
+			 * New list must include non-Vars, outer Vars, and anything not
+			 * marked as surplus.
+			 */
+			if (!IsA(var, Var) || var->varlevelsup > 0 ||
+				!bms_is_member(var->varattno -
+							   FirstLowInvalidHeapAttributeNumber,
+							   surplusvars[var->varno]))
+				new_groupby = lappend(new_groupby, sgc);
+		}
+
+		root->processed_groupClause = new_groupby;
+	}
+}
+
+
 /*****************************************************************************
  *
  *	  LATERAL REFERENCES
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index e17d31a5c3..735560e8ca 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -169,6 +169,9 @@ query_planner(PlannerInfo *root,
 	 */
 	add_base_rels_to_query(root, (Node *) parse->jointree);
 
+	/* Remove any redundant GROUP BY columns */
+	remove_useless_groupby_columns(root);
+
 	/*
 	 * Examine the targetlist and join tree, adding entries to baserel
 	 * targetlists for all referenced Vars, and generating PlaceHolderInfo
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762e..352c4edf3a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -139,7 +139,6 @@ static void preprocess_rowmarks(PlannerInfo *root);
 static double preprocess_limit(PlannerInfo *root,
 							   double tuple_fraction,
 							   int64 *offset_est, int64 *count_est);
-static void remove_useless_groupby_columns(PlannerInfo *root);
 static List *preprocess_groupclause(PlannerInfo *root, List *force);
 static List *extract_rollup_sets(List *groupingSets);
 static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
@@ -1487,8 +1486,6 @@ grouping_planner(PlannerInfo *root, double tuple_fraction,
 		{
 			/* Preprocess regular GROUP BY clause, if any */
 			root->processed_groupClause = preprocess_groupclause(root, NIL);
-			/* Remove any redundant GROUP BY columns */
-			remove_useless_groupby_columns(root);
 		}
 
 		/*
@@ -2724,166 +2721,6 @@ limit_needed(Query *parse)
 	return false;				/* don't need a Limit plan node */
 }
 
-
-/*
- * remove_useless_groupby_columns
- *		Remove any columns in the GROUP BY clause that are redundant due to
- *		being functionally dependent on other GROUP BY columns.
- *
- * Since some other DBMSes do not allow references to ungrouped columns, it's
- * not unusual to find all columns listed in GROUP BY even though listing the
- * primary-key columns would be sufficient.  Deleting such excess columns
- * avoids redundant sorting work, so it's worth doing.
- *
- * Relcache invalidations will ensure that cached plans become invalidated
- * when the underlying index of the pkey constraint is dropped.
- *
- * Currently, we only make use of pkey constraints for this, however, we may
- * wish to take this further in the future and also use unique constraints
- * which have NOT NULL columns.  In that case, plan invalidation will still
- * work since relations will receive a relcache invalidation when a NOT NULL
- * constraint is dropped.
- */
-static void
-remove_useless_groupby_columns(PlannerInfo *root)
-{
-	Query	   *parse = root->parse;
-	Bitmapset **groupbyattnos;
-	Bitmapset **surplusvars;
-	ListCell   *lc;
-	int			relid;
-
-	/* No chance to do anything if there are less than two GROUP BY items */
-	if (list_length(root->processed_groupClause) < 2)
-		return;
-
-	/* Don't fiddle with the GROUP BY clause if the query has grouping sets */
-	if (parse->groupingSets)
-		return;
-
-	/*
-	 * Scan the GROUP BY clause to find GROUP BY items that are simple Vars.
-	 * Fill groupbyattnos[k] with a bitmapset of the column attnos of RTE k
-	 * that are GROUP BY items.
-	 */
-	groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
-										   (list_length(parse->rtable) + 1));
-	foreach(lc, root->processed_groupClause)
-	{
-		SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
-		TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
-		Var		   *var = (Var *) tle->expr;
-
-		/*
-		 * Ignore non-Vars and Vars from other query levels.
-		 *
-		 * XXX in principle, stable expressions containing Vars could also be
-		 * removed, if all the Vars are functionally dependent on other GROUP
-		 * BY items.  But it's not clear that such cases occur often enough to
-		 * be worth troubling over.
-		 */
-		if (!IsA(var, Var) ||
-			var->varlevelsup > 0)
-			continue;
-
-		/* OK, remember we have this Var */
-		relid = var->varno;
-		Assert(relid <= list_length(parse->rtable));
-		groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
-											  var->varattno - FirstLowInvalidHeapAttributeNumber);
-	}
-
-	/*
-	 * Consider each relation and see if it is possible to remove some of its
-	 * Vars from GROUP BY.  For simplicity and speed, we do the actual removal
-	 * in a separate pass.  Here, we just fill surplusvars[k] with a bitmapset
-	 * of the column attnos of RTE k that are removable GROUP BY items.
-	 */
-	surplusvars = NULL;			/* don't allocate array unless required */
-	relid = 0;
-	foreach(lc, parse->rtable)
-	{
-		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
-		Bitmapset  *relattnos;
-		Bitmapset  *pkattnos;
-		Oid			constraintOid;
-
-		relid++;
-
-		/* Only plain relations could have primary-key constraints */
-		if (rte->rtekind != RTE_RELATION)
-			continue;
-
-		/*
-		 * We must skip inheritance parent tables as some of the child rels
-		 * may cause duplicate rows.  This cannot happen with partitioned
-		 * tables, however.
-		 */
-		if (rte->inh && rte->relkind != RELKIND_PARTITIONED_TABLE)
-			continue;
-
-		/* Nothing to do unless this rel has multiple Vars in GROUP BY */
-		relattnos = groupbyattnos[relid];
-		if (bms_membership(relattnos) != BMS_MULTIPLE)
-			continue;
-
-		/*
-		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
-		 */
-		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-		if (pkattnos == NULL)
-			continue;
-
-		/*
-		 * If the primary key is a proper subset of relattnos then we have
-		 * some items in the GROUP BY that can be removed.
-		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
-		{
-			/*
-			 * To easily remember whether we've found anything to do, we don't
-			 * allocate the surplusvars[] array until we find something.
-			 */
-			if (surplusvars == NULL)
-				surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
-													 (list_length(parse->rtable) + 1));
-
-			/* Remember the attnos of the removable columns */
-			surplusvars[relid] = bms_difference(relattnos, pkattnos);
-		}
-	}
-
-	/*
-	 * If we found any surplus Vars, build a new GROUP BY clause without them.
-	 * (Note: this may leave some TLEs with unreferenced ressortgroupref
-	 * markings, but that's harmless.)
-	 */
-	if (surplusvars != NULL)
-	{
-		List	   *new_groupby = NIL;
-
-		foreach(lc, root->processed_groupClause)
-		{
-			SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
-			TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
-			Var		   *var = (Var *) tle->expr;
-
-			/*
-			 * New list must include non-Vars, outer Vars, and anything not
-			 * marked as surplus.
-			 */
-			if (!IsA(var, Var) ||
-				var->varlevelsup > 0 ||
-				!bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber,
-							   surplusvars[var->varno]))
-				new_groupby = lappend(new_groupby, sgc);
-		}
-
-		root->processed_groupClause = new_groupby;
-	}
-}
-
 /*
  * preprocess_groupclause - do preparatory work on GROUP BY clause
  *
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 37b0ca2e43..153390f2dc 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -457,6 +457,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->indrestrictinfo = NIL;	/* set later, in indxpath.c */
 			info->predOK = false;	/* set later, in indxpath.c */
 			info->unique = index->indisunique;
+			info->nullsnotdistinct = index->indnullsnotdistinct;
 			info->immediate = index->indimmediate;
 			info->hypothetical = false;
 
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index add0f9e45f..0759e00e96 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1187,6 +1187,8 @@ struct IndexOptInfo
 	bool		predOK;
 	/* true if a unique index */
 	bool		unique;
+	/* true if the index was defined with NULLS NOT DISTINCT */
+	bool		nullsnotdistinct;
 	/* is uniqueness enforced immediately? */
 	bool		immediate;
 	/* true if index doesn't really exist */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 93137261e4..0b6f0f7969 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -74,6 +74,7 @@ extern void add_vars_to_targetlist(PlannerInfo *root, List *vars,
 								   Relids where_needed);
 extern void add_vars_to_attr_needed(PlannerInfo *root, List *vars,
 									Relids where_needed);
+extern void remove_useless_groupby_columns(PlannerInfo *root);
 extern void find_lateral_references(PlannerInfo *root);
 extern void rebuild_lateral_attr_needed(PlannerInfo *root);
 extern void create_lateral_join_info(PlannerInfo *root);
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1e682565d1..27ea09732f 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1448,6 +1448,62 @@ explain (costs off) select * from p_t1 group by a,b,c,d;
          ->  Seq Scan on p_t1_2
 (5 rows)
 
+create unique index t3_c_uidx on t3(c);
+-- Ensure we don't remove any columns from the GROUP BY for a unique
+-- index on a NULLable column.
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b, c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Make the column NOT NULL and ensure we remove the redundant column
+alter table t3 alter column c set not null;
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- When there are multiple supporting unique indexes and the GROUP BY contains
+-- columns to cover all of those, ensure we pick the index with the least
+-- number of columns so that we can remove more columns from the GROUP BY.
+explain (costs off) select a,b,c from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- As above but try ordering the columns differently to ensure we get the
+-- same result.
+explain (costs off) select a,b,c from t3 group by c,a,b;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- A unique index defined as NULLS NOT DISTINCT does not need a supporting NOT
+-- NULL constraint on the indexed columns.  Ensure the redundant columns are
+-- removed from the GROUP BY for such a table.
+drop index t3_c_uidx;
+alter table t3 alter column c drop not null;
+create unique index t3_c_uidx on t3(c) nulls not distinct;
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
 drop table t1 cascade;
 NOTICE:  drop cascades to table t1c
 drop table t2;
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 4885daffe6..4e23fb4f38 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -507,6 +507,33 @@ create temp table p_t1_2 partition of p_t1 for values in(2);
 -- Ensure we can remove non-PK columns for partitioned tables.
 explain (costs off) select * from p_t1 group by a,b,c,d;
 
+create unique index t3_c_uidx on t3(c);
+
+-- Ensure we don't remove any columns from the GROUP BY for a unique
+-- index on a NULLable column.
+explain (costs off) select b,c from t3 group by b,c;
+
+-- Make the column NOT NULL and ensure we remove the redundant column
+alter table t3 alter column c set not null;
+explain (costs off) select b,c from t3 group by b,c;
+
+-- When there are multiple supporting unique indexes and the GROUP BY contains
+-- columns to cover all of those, ensure we pick the index with the least
+-- number of columns so that we can remove more columns from the GROUP BY.
+explain (costs off) select a,b,c from t3 group by a,b,c;
+
+-- As above but try ordering the columns differently to ensure we get the
+-- same result.
+explain (costs off) select a,b,c from t3 group by c,a,b;
+
+-- A unique index defined as NULLS NOT DISTINCT does not need a supporting NOT
+-- NULL constraint on the indexed columns.  Ensure the redundant columns are
+-- removed from the GROUP BY for such a table.
+drop index t3_c_uidx;
+alter table t3 alter column c drop not null;
+create unique index t3_c_uidx on t3(c) nulls not distinct;
+explain (costs off) select b,c from t3 group by b,c;
+
 drop table t1 cascade;
 drop table t2;
 drop table t3;
-- 
2.34.1

#16Andrei Lepikhov
lepihov@gmail.com
In reply to: David Rowley (#15)
Re: Remove useless GROUP BY columns considering unique index

On 11/29/24 09:39, David Rowley wrote:

On Fri, 29 Nov 2024 at 15:02, David Rowley <dgrowleyml@gmail.com> wrote:

I've attached an updated patch that gets rid of the
get_unique_not_null_attnos() function completely and uses the
RelOptInfo.indexlist and RelOptInfo.notnullattnums.

I forgot to do a local commit before sending v8. Fixed in the attached v9.

1. Thread reference in the patch comment doesn't work.
2. Would it be possible to move remove_useless_groupby_columns into a
second commit? That would make the differences easier to review.

--
regards, Andrei Lepikhov

#17jian he
jian.universality@gmail.com
In reply to: Andrei Lepikhov (#16)
1 attachment(s)
Re: Remove useless GROUP BY columns considering unique index

On Fri, Nov 29, 2024 at 2:36 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

I forgot to do a local commit before sending v8. Fixed in the attached v9.

1. Thread reference in the patch comment doesn't work.

Fixed.

I found that a unique expression index can also be used for GROUP BY
column removal, so I implemented it and added two tests for it.

remove_useless_groupby_columns now looks like this:
if (index->indexprs == NIL)
{
	/* handle plain unique index */
}
else
{
	/* handle unique expression index */
}
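
For example, from the regression tests added in the attached patch: given a
unique expression index on (c+b), with the referenced columns NOT NULL, the
planner can now drop column a from the GROUP BY:

create unique index t3_bc on t3((c+b));
explain (costs off) select a,b,c from t3 group by a,b,c;
      QUERY PLAN
----------------------
 HashAggregate
   Group Key: b, c
   ->  Seq Scan on t3
(3 rows)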

What do you think?

Attachments:

v10-0001-remove-useless-group-by-columns-via-unique-not-n.patchtext/x-patch; charset=US-ASCII; name=v10-0001-remove-useless-group-by-columns-via-unique-not-n.patchDownload
From ca9302a8c98f2ad4a8f9837ec42ec2da6588e25d Mon Sep 17 00:00:00 2001
From: jian he <jian.universality@gmail.com>
Date: Fri, 29 Nov 2024 14:54:23 +0800
Subject: [PATCH v10 1/1] remove useless group by columns via unique-not-null

In `remove_useless_groupby_columns`, use a unique index whose columns all
carry NOT NULL constraints to eliminate unnecessary GROUP BY columns.
Here, the primary key is treated as just another unique-and-not-null
index.  If there are multiple candidates, use the unique index with the
fewest columns, so that as many other GROUP BY columns as possible can
be removed.
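
For example (adapted from the regression tests added below), when more than
one unique index could justify removal, the index with the fewest columns
is chosen:

    explain (costs off) select a,b,c from t3 group by a,b,c;
          QUERY PLAN
    ----------------------
     HashAggregate
       Group Key: c
       ->  Seq Scan on t3
    (3 rows)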

Discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0@Spark
---
 src/backend/optimizer/plan/initsplan.c   | 258 ++++++++++++++++++++++-
 src/backend/optimizer/plan/planmain.c    |   3 +
 src/backend/optimizer/plan/planner.c     | 163 --------------
 src/backend/optimizer/util/plancat.c     |   1 +
 src/include/nodes/pathnodes.h            |   2 +
 src/include/optimizer/planmain.h         |   1 +
 src/test/regress/expected/aggregates.out |  79 +++++++
 src/test/regress/sql/aggregates.sql      |  39 ++++
 8 files changed, 382 insertions(+), 164 deletions(-)

diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 903c397d40..bd82745f1e 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * initsplan.c
- *	  Target list, qualification, joininfo initialization routines
+ *	  Target list, group by, qualification, joininfo initialization routines
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -387,6 +387,262 @@ add_vars_to_attr_needed(PlannerInfo *root, List *vars,
 }
 
 
+/*****************************************************************************
+ *
+ *	  GROUP BY
+ *
+ *****************************************************************************/
+
+/*
+ * remove_useless_groupby_columns
+ *		Remove any columns in the GROUP BY clause that are redundant due to
+ *		being functionally dependent on other GROUP BY columns.
+ *
+ * Since some other DBMSes do not allow references to ungrouped columns, it's
+ * not unusual to find all columns listed in GROUP BY even though listing the
+ * primary-key columns, or columns of a unique constraint would be sufficient.
+ * Deleting such excess columns avoids redundant sorting or hashing work, so
+ * it's worth doing.
+ *
+ * Relcache invalidations will ensure that cached plans become invalidated
+ * when the underlying supporting indexes are dropped or if a column's NOT
+ * NULL attribute is removed.
+ */
+void
+remove_useless_groupby_columns(PlannerInfo *root)
+{
+	Query	   *parse = root->parse;
+	Bitmapset **groupbyattnos;
+	Bitmapset **surplusvars;
+	ListCell   *lc;
+	int			relid;
+
+	/* No chance to do anything if there are less than two GROUP BY items */
+	if (list_length(root->processed_groupClause) < 2)
+		return;
+
+	/* Don't fiddle with the GROUP BY clause if the query has grouping sets */
+	if (parse->groupingSets)
+		return;
+
+	/*
+	 * Scan the GROUP BY clause to find GROUP BY items that are simple Vars.
+	 * Fill groupbyattnos[k] with a bitmapset of the column attnos of RTE k
+	 * that are GROUP BY items.
+	 */
+	groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
+										   (list_length(parse->rtable) + 1));
+	foreach(lc, root->processed_groupClause)
+	{
+		SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
+		TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
+		Var		   *var = (Var *) tle->expr;
+
+		/*
+		 * Ignore non-Vars and Vars from other query levels.
+		 *
+		 * XXX in principle, stable expressions containing Vars could also be
+		 * removed, if all the Vars are functionally dependent on other GROUP
+		 * BY items.  But it's not clear that such cases occur often enough to
+		 * be worth troubling over.
+		 */
+		if (!IsA(var, Var) ||
+			var->varlevelsup > 0)
+			continue;
+
+		/* OK, remember we have this Var */
+		relid = var->varno;
+		Assert(relid <= list_length(parse->rtable));
+		groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
+											  var->varattno - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/*
+	 * Consider each relation and see if it is possible to remove some of its
+	 * Vars from GROUP BY.  For simplicity and speed, we do the actual removal
+	 * in a separate pass.  Here, we just fill surplusvars[k] with a bitmapset
+	 * of the column attnos of RTE k that are removable GROUP BY items.
+	 */
+	surplusvars = NULL;			/* don't allocate array unless required */
+	relid = 0;
+	foreach(lc, parse->rtable)
+	{
+		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
+		RelOptInfo *rel;
+		Bitmapset  *relattnos;
+		Bitmapset  *best_keycolumns = NULL;
+		int32		best_nkeycolumns = PG_INT32_MAX;
+
+		relid++;
+
+		/* Only plain relations could have primary-key constraints */
+		if (rte->rtekind != RTE_RELATION)
+			continue;
+
+		/*
+		 * We must skip inheritance parent tables as some of the child rels
+		 * may cause duplicate rows.  This cannot happen with partitioned
+		 * tables, however.
+		 */
+		if (rte->inh && rte->relkind != RELKIND_PARTITIONED_TABLE)
+			continue;
+
+		/* Nothing to do unless this rel has multiple Vars in GROUP BY */
+		relattnos = groupbyattnos[relid];
+		if (bms_membership(relattnos) != BMS_MULTIPLE)
+			continue;
+
+		rel = root->simple_rel_array[relid];
+
+		/*
+		 * Now search each index to check if there are any indexes where the
+		 * indexed columns are a subset of the GROUP BY columns for this
+		 * relation.
+		 */
+		foreach_node(IndexOptInfo, index, rel->indexlist)
+		{
+			Bitmapset  *ind_attnos;
+			bool		nulls_check_ok;
+
+			/*
+			 * Skip any non-unique or deferrable indexes.  Predicate indexes
+			 * have not been checked yet, so we must skip those too as the
+			 * predOK check might fail.
+			 */
+			if (!index->unique || !index->immediate || index->indpred != NIL)
+				continue;
+
+			ind_attnos = NULL;
+			nulls_check_ok = true;
+			if (index->indexprs == NIL)
+			{
+				for (int i = 0; i < index->nkeycolumns; i++)
+				{
+					/*
+					 * We must insist that the index columns are all defined
+					 * NOT NULL, otherwise duplicate NULLs could exist.
+					 * However, we can relax this check when the index is
+					 * defined with NULLS NOT DISTINCT as there can only be 1
+					 * NULL row, therefore functional dependency on the unique
+					 * columns is maintained, despite the NULL.
+					 */
+					if (!index->nullsnotdistinct &&
+						!bms_is_member(index->indexkeys[i],
+									rel->notnullattnums))
+					{
+						nulls_check_ok = false;
+						break;
+					}
+
+					ind_attnos =
+						bms_add_member(ind_attnos,
+									index->indexkeys[i] -
+									FirstLowInvalidHeapAttributeNumber);
+				}
+			}
+			else
+			{
+				int			attrnos = -1;
+
+				Bitmapset  *expr_attrs = NULL;
+				ListCell   *indexpr_item = NULL;
+
+				indexpr_item = list_head(index->indexprs);
+				pull_varattnos((Node *) lfirst(indexpr_item), 1, &expr_attrs);
+
+				/*
+				 * The varattnos returned by pull_varattnos are offset by
+				 * FirstLowInvalidHeapAttributeNumber, so we must add that
+				 * offset back before comparing against
+				 * rel->notnullattnums.
+				 */
+				while ((attrnos = bms_next_member(expr_attrs, attrnos)) >= 0)
+				{
+					if (!index->nullsnotdistinct &&
+						!bms_is_member(attrnos + FirstLowInvalidHeapAttributeNumber, rel->notnullattnums))
+					{
+						nulls_check_ok = false;
+						break;
+					}
+				}
+
+				ind_attnos = expr_attrs;
+			}
+
+			if (!nulls_check_ok)
+				continue;
+
+			/*
+			 * Skip any indexes where the indexed columns aren't a subset of
+			 * the GROUP BY.
+			 */
+			if (bms_subset_compare(ind_attnos, relattnos) != BMS_SUBSET1)
+				continue;
+
+			/*
+			 * Record the attribute numbers from the index with the fewest
+			 * columns.  This allows the largest number of columns to be
+			 * removed from the GROUP BY clause.  In the future, we may wish
+			 * to consider using the narrowest set of columns and looking at
+			 * pg_statistic.stawidth as it might be better to use an index
+			 * with, say, two INT4s rather than, say, one long varlena column.
+			 */
+			if (index->nkeycolumns < best_nkeycolumns)
+			{
+				best_keycolumns = ind_attnos;
+				best_nkeycolumns = index->nkeycolumns;
+			}
+		}
+
+		/* Did we find a suitable index? */
+		if (best_keycolumns != NULL)
+		{
+			/*
+			 * To easily remember whether we've found anything to do, we don't
+			 * allocate the surplusvars[] array until we find something.
+			 */
+			if (surplusvars == NULL)
+				surplusvars =
+					(Bitmapset **) palloc0(sizeof(Bitmapset *) *
+										   (list_length(parse->rtable) +
+											1));
+
+			/* Remember the attnos of the removable columns */
+			surplusvars[relid] = bms_difference(relattnos, best_keycolumns);
+		}
+	}
+
+	/*
+	 * If we found any surplus Vars, build a new GROUP BY clause without them.
+	 * (Note: this may leave some TLEs with unreferenced ressortgroupref
+	 * markings, but that's harmless.)
+	 */
+	if (surplusvars != NULL)
+	{
+		List	   *new_groupby = NIL;
+
+		foreach(lc, root->processed_groupClause)
+		{
+			SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
+			TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
+			Var		   *var = (Var *) tle->expr;
+
+			/*
+			 * New list must include non-Vars, outer Vars, and anything not
+			 * marked as surplus.
+			 */
+			if (!IsA(var, Var) || var->varlevelsup > 0 ||
+				!bms_is_member(var->varattno -
+							   FirstLowInvalidHeapAttributeNumber,
+							   surplusvars[var->varno]))
+				new_groupby = lappend(new_groupby, sgc);
+		}
+
+		root->processed_groupClause = new_groupby;
+	}
+}
+
+
 /*****************************************************************************
  *
  *	  LATERAL REFERENCES
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index e17d31a5c3..735560e8ca 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -169,6 +169,9 @@ query_planner(PlannerInfo *root,
 	 */
 	add_base_rels_to_query(root, (Node *) parse->jointree);
 
+	/* Remove any redundant GROUP BY columns */
+	remove_useless_groupby_columns(root);
+
 	/*
 	 * Examine the targetlist and join tree, adding entries to baserel
 	 * targetlists for all referenced Vars, and generating PlaceHolderInfo
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762e..352c4edf3a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -139,7 +139,6 @@ static void preprocess_rowmarks(PlannerInfo *root);
 static double preprocess_limit(PlannerInfo *root,
 							   double tuple_fraction,
 							   int64 *offset_est, int64 *count_est);
-static void remove_useless_groupby_columns(PlannerInfo *root);
 static List *preprocess_groupclause(PlannerInfo *root, List *force);
 static List *extract_rollup_sets(List *groupingSets);
 static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
@@ -1487,8 +1486,6 @@ grouping_planner(PlannerInfo *root, double tuple_fraction,
 		{
 			/* Preprocess regular GROUP BY clause, if any */
 			root->processed_groupClause = preprocess_groupclause(root, NIL);
-			/* Remove any redundant GROUP BY columns */
-			remove_useless_groupby_columns(root);
 		}
 
 		/*
@@ -2724,166 +2721,6 @@ limit_needed(Query *parse)
 	return false;				/* don't need a Limit plan node */
 }
 
-
-/*
- * remove_useless_groupby_columns
- *		Remove any columns in the GROUP BY clause that are redundant due to
- *		being functionally dependent on other GROUP BY columns.
- *
- * Since some other DBMSes do not allow references to ungrouped columns, it's
- * not unusual to find all columns listed in GROUP BY even though listing the
- * primary-key columns would be sufficient.  Deleting such excess columns
- * avoids redundant sorting work, so it's worth doing.
- *
- * Relcache invalidations will ensure that cached plans become invalidated
- * when the underlying index of the pkey constraint is dropped.
- *
- * Currently, we only make use of pkey constraints for this, however, we may
- * wish to take this further in the future and also use unique constraints
- * which have NOT NULL columns.  In that case, plan invalidation will still
- * work since relations will receive a relcache invalidation when a NOT NULL
- * constraint is dropped.
- */
-static void
-remove_useless_groupby_columns(PlannerInfo *root)
-{
-	Query	   *parse = root->parse;
-	Bitmapset **groupbyattnos;
-	Bitmapset **surplusvars;
-	ListCell   *lc;
-	int			relid;
-
-	/* No chance to do anything if there are less than two GROUP BY items */
-	if (list_length(root->processed_groupClause) < 2)
-		return;
-
-	/* Don't fiddle with the GROUP BY clause if the query has grouping sets */
-	if (parse->groupingSets)
-		return;
-
-	/*
-	 * Scan the GROUP BY clause to find GROUP BY items that are simple Vars.
-	 * Fill groupbyattnos[k] with a bitmapset of the column attnos of RTE k
-	 * that are GROUP BY items.
-	 */
-	groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
-										   (list_length(parse->rtable) + 1));
-	foreach(lc, root->processed_groupClause)
-	{
-		SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
-		TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
-		Var		   *var = (Var *) tle->expr;
-
-		/*
-		 * Ignore non-Vars and Vars from other query levels.
-		 *
-		 * XXX in principle, stable expressions containing Vars could also be
-		 * removed, if all the Vars are functionally dependent on other GROUP
-		 * BY items.  But it's not clear that such cases occur often enough to
-		 * be worth troubling over.
-		 */
-		if (!IsA(var, Var) ||
-			var->varlevelsup > 0)
-			continue;
-
-		/* OK, remember we have this Var */
-		relid = var->varno;
-		Assert(relid <= list_length(parse->rtable));
-		groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
-											  var->varattno - FirstLowInvalidHeapAttributeNumber);
-	}
-
-	/*
-	 * Consider each relation and see if it is possible to remove some of its
-	 * Vars from GROUP BY.  For simplicity and speed, we do the actual removal
-	 * in a separate pass.  Here, we just fill surplusvars[k] with a bitmapset
-	 * of the column attnos of RTE k that are removable GROUP BY items.
-	 */
-	surplusvars = NULL;			/* don't allocate array unless required */
-	relid = 0;
-	foreach(lc, parse->rtable)
-	{
-		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
-		Bitmapset  *relattnos;
-		Bitmapset  *pkattnos;
-		Oid			constraintOid;
-
-		relid++;
-
-		/* Only plain relations could have primary-key constraints */
-		if (rte->rtekind != RTE_RELATION)
-			continue;
-
-		/*
-		 * We must skip inheritance parent tables as some of the child rels
-		 * may cause duplicate rows.  This cannot happen with partitioned
-		 * tables, however.
-		 */
-		if (rte->inh && rte->relkind != RELKIND_PARTITIONED_TABLE)
-			continue;
-
-		/* Nothing to do unless this rel has multiple Vars in GROUP BY */
-		relattnos = groupbyattnos[relid];
-		if (bms_membership(relattnos) != BMS_MULTIPLE)
-			continue;
-
-		/*
-		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
-		 */
-		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-		if (pkattnos == NULL)
-			continue;
-
-		/*
-		 * If the primary key is a proper subset of relattnos then we have
-		 * some items in the GROUP BY that can be removed.
-		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
-		{
-			/*
-			 * To easily remember whether we've found anything to do, we don't
-			 * allocate the surplusvars[] array until we find something.
-			 */
-			if (surplusvars == NULL)
-				surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
-													 (list_length(parse->rtable) + 1));
-
-			/* Remember the attnos of the removable columns */
-			surplusvars[relid] = bms_difference(relattnos, pkattnos);
-		}
-	}
-
-	/*
-	 * If we found any surplus Vars, build a new GROUP BY clause without them.
-	 * (Note: this may leave some TLEs with unreferenced ressortgroupref
-	 * markings, but that's harmless.)
-	 */
-	if (surplusvars != NULL)
-	{
-		List	   *new_groupby = NIL;
-
-		foreach(lc, root->processed_groupClause)
-		{
-			SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
-			TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
-			Var		   *var = (Var *) tle->expr;
-
-			/*
-			 * New list must include non-Vars, outer Vars, and anything not
-			 * marked as surplus.
-			 */
-			if (!IsA(var, Var) ||
-				var->varlevelsup > 0 ||
-				!bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber,
-							   surplusvars[var->varno]))
-				new_groupby = lappend(new_groupby, sgc);
-		}
-
-		root->processed_groupClause = new_groupby;
-	}
-}
-
 /*
  * preprocess_groupclause - do preparatory work on GROUP BY clause
  *
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 37b0ca2e43..153390f2dc 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -457,6 +457,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->indrestrictinfo = NIL;	/* set later, in indxpath.c */
 			info->predOK = false;	/* set later, in indxpath.c */
 			info->unique = index->indisunique;
+			info->nullsnotdistinct = index->indnullsnotdistinct;
 			info->immediate = index->indimmediate;
 			info->hypothetical = false;
 
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index add0f9e45f..0759e00e96 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1187,6 +1187,8 @@ struct IndexOptInfo
 	bool		predOK;
 	/* true if a unique index */
 	bool		unique;
+	/* true if the index was defined with NULLS NOT DISTINCT */
+	bool		nullsnotdistinct;
 	/* is uniqueness enforced immediately? */
 	bool		immediate;
 	/* true if index doesn't really exist */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 93137261e4..0b6f0f7969 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -74,6 +74,7 @@ extern void add_vars_to_targetlist(PlannerInfo *root, List *vars,
 								   Relids where_needed);
 extern void add_vars_to_attr_needed(PlannerInfo *root, List *vars,
 									Relids where_needed);
+extern void remove_useless_groupby_columns(PlannerInfo *root);
 extern void find_lateral_references(PlannerInfo *root);
 extern void rebuild_lateral_attr_needed(PlannerInfo *root);
 extern void create_lateral_join_info(PlannerInfo *root);
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1e682565d1..23890c6e1f 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1448,6 +1448,85 @@ explain (costs off) select * from p_t1 group by a,b,c,d;
          ->  Seq Scan on p_t1_2
 (5 rows)
 
+create unique index t3_c_uidx on t3(c);
+-- Ensure we don't remove any columns from the GROUP BY for a unique
+-- index on a NULLable column.
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b, c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Make the column NOT NULL and ensure we remove the redundant column
+alter table t3 alter column c set not null;
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- When there are multiple supporting unique indexes and the GROUP BY contains
+-- columns to cover all of those, ensure we pick the index with the least
+-- number of columns so that we can remove more columns from the GROUP BY.
+explain (costs off) select a,b,c from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- As above but try ordering the columns differently to ensure we get the
+-- same result.
+explain (costs off) select a,b,c from t3 group by c,a,b;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- A unique index defined as NULLS NOT DISTINCT does not need a supporting NOT
+-- NULL constraint on the indexed columns.  Ensure the redundant columns are
+-- removed from the GROUP BY for such a table.
+drop index t3_c_uidx;
+alter table t3 alter column c drop not null;
+create unique index t3_c_uidx on t3(c) nulls not distinct;
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Test that a unique expression index can be used to remove redundant columns.
+drop index t3_c_uidx;
+create unique index t3_bc on t3((c+b));
+alter table t3 alter column c set not null;
+explain(costs off) select a,b,c from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b, c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Test NULLS NOT DISTINCT with a unique expression index.
+create unique index t3_bc_uidx on t3((c+b)) nulls not distinct;
+alter table t3 alter column c drop not null;
+explain(costs off) select a,b,c from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b, c
+   ->  Seq Scan on t3
+(3 rows)
+
 drop table t1 cascade;
 NOTICE:  drop cascades to table t1c
 drop table t2;
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 4885daffe6..b4d1b9b97a 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -507,6 +507,45 @@ create temp table p_t1_2 partition of p_t1 for values in(2);
 -- Ensure we can remove non-PK columns for partitioned tables.
 explain (costs off) select * from p_t1 group by a,b,c,d;
 
+create unique index t3_c_uidx on t3(c);
+
+-- Ensure we don't remove any columns from the GROUP BY for a unique
+-- index on a NULLable column.
+explain (costs off) select b,c from t3 group by b,c;
+
+-- Make the column NOT NULL and ensure we remove the redundant column
+alter table t3 alter column c set not null;
+explain (costs off) select b,c from t3 group by b,c;
+
+-- When there are multiple supporting unique indexes and the GROUP BY contains
+-- columns to cover all of those, ensure we pick the index with the least
+-- number of columns so that we can remove more columns from the GROUP BY.
+explain (costs off) select a,b,c from t3 group by a,b,c;
+
+-- As above but try ordering the columns differently to ensure we get the
+-- same result.
+explain (costs off) select a,b,c from t3 group by c,a,b;
+
+-- A unique index defined as NULLS NOT DISTINCT does not need a supporting NOT
+-- NULL constraint on the indexed columns.  Ensure the redundant columns are
+-- removed from the GROUP BY for such a table.
+drop index t3_c_uidx;
+alter table t3 alter column c drop not null;
+create unique index t3_c_uidx on t3(c) nulls not distinct;
+explain (costs off) select b,c from t3 group by b,c;
+
+
+-- Test that a unique expression index can be used to remove redundant columns.
+drop index t3_c_uidx;
+create unique index t3_bc on t3((c+b));
+alter table t3 alter column c set not null;
+explain(costs off) select a,b,c from t3 group by a,b,c;
+
+-- Test NULLS NOT DISTINCT with a unique expression index.
+create unique index t3_bc_uidx on t3((c+b)) nulls not distinct;
+alter table t3 alter column c drop not null;
+explain(costs off) select a,b,c from t3 group by a,b,c;
+
 drop table t1 cascade;
 drop table t2;
 drop table t3;
-- 
2.34.1

#18David Rowley
dgrowleyml@gmail.com
In reply to: Andrei Lepikhov (#16)
2 attachment(s)
Re: Remove useless GROUP BY columns considering unique index

On Fri, 29 Nov 2024 at 19:36, Andrei Lepikhov <lepihov@gmail.com> wrote:

1. Thread reference in the patch comment doesn't work.
2. Would it be possible to move remove_useless_groupby_columns into a
second commit? That would make the differences easier to review.

For #1, I rewrote the commit message.

I've split into two patches for ease of review. See attached.

David

Attachments:

v10-0001-Move-remove_useless_groupby_columns-to-initsplan.patchapplication/octet-stream; name=v10-0001-Move-remove_useless_groupby_columns-to-initsplan.patchDownload
From c76118af5911058b2ba35ff507709a12cef9f15e Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Fri, 29 Nov 2024 19:56:06 +1300
Subject: [PATCH v10 1/2] Move remove_useless_groupby_columns to initsplan.c

---
 src/backend/optimizer/plan/initsplan.c | 167 ++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c   | 162 ------------------------
 src/include/optimizer/planmain.h       |   1 +
 3 files changed, 167 insertions(+), 163 deletions(-)

diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 903c397d40..c1c4f9f8b9 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * initsplan.c
- *	  Target list, qualification, joininfo initialization routines
+ *	  Target list, group by, qualification, joininfo initialization routines
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -14,6 +14,7 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_constraint.h"
 #include "catalog/pg_type.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -386,6 +387,170 @@ add_vars_to_attr_needed(PlannerInfo *root, List *vars,
 	}
 }
 
+/*****************************************************************************
+ *
+ *	  GROUP BY
+ *
+ *****************************************************************************/
+
+/*
+ * remove_useless_groupby_columns
+ *		Remove any columns in the GROUP BY clause that are redundant due to
+ *		being functionally dependent on other GROUP BY columns.
+ *
+ * Since some other DBMSes do not allow references to ungrouped columns, it's
+ * not unusual to find all columns listed in GROUP BY even though listing the
+ * primary-key columns would be sufficient.  Deleting such excess columns
+ * avoids redundant sorting work, so it's worth doing.
+ *
+ * Relcache invalidations will ensure that cached plans become invalidated
+ * when the underlying index of the pkey constraint is dropped.
+ *
+ * Currently, we only make use of pkey constraints for this, however, we may
+ * wish to take this further in the future and also use unique constraints
+ * which have NOT NULL columns.  In that case, plan invalidation will still
+ * work since relations will receive a relcache invalidation when a NOT NULL
+ * constraint is dropped.
+ */
+void
+remove_useless_groupby_columns(PlannerInfo *root)
+{
+	Query	   *parse = root->parse;
+	Bitmapset **groupbyattnos;
+	Bitmapset **surplusvars;
+	ListCell   *lc;
+	int			relid;
+
+	/* No chance to do anything if there are less than two GROUP BY items */
+	if (list_length(root->processed_groupClause) < 2)
+		return;
+
+	/* Don't fiddle with the GROUP BY clause if the query has grouping sets */
+	if (parse->groupingSets)
+		return;
+
+	/*
+	 * Scan the GROUP BY clause to find GROUP BY items that are simple Vars.
+	 * Fill groupbyattnos[k] with a bitmapset of the column attnos of RTE k
+	 * that are GROUP BY items.
+	 */
+	groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
+										   (list_length(parse->rtable) + 1));
+	foreach(lc, root->processed_groupClause)
+	{
+		SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
+		TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
+		Var		   *var = (Var *) tle->expr;
+
+		/*
+		 * Ignore non-Vars and Vars from other query levels.
+		 *
+		 * XXX in principle, stable expressions containing Vars could also be
+		 * removed, if all the Vars are functionally dependent on other GROUP
+		 * BY items.  But it's not clear that such cases occur often enough to
+		 * be worth troubling over.
+		 */
+		if (!IsA(var, Var) ||
+			var->varlevelsup > 0)
+			continue;
+
+		/* OK, remember we have this Var */
+		relid = var->varno;
+		Assert(relid <= list_length(parse->rtable));
+		groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
+											  var->varattno - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/*
+	 * Consider each relation and see if it is possible to remove some of its
+	 * Vars from GROUP BY.  For simplicity and speed, we do the actual removal
+	 * in a separate pass.  Here, we just fill surplusvars[k] with a bitmapset
+	 * of the column attnos of RTE k that are removable GROUP BY items.
+	 */
+	surplusvars = NULL;			/* don't allocate array unless required */
+	relid = 0;
+	foreach(lc, parse->rtable)
+	{
+		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
+		Bitmapset  *relattnos;
+		Bitmapset  *pkattnos;
+		Oid			constraintOid;
+
+		relid++;
+
+		/* Only plain relations could have primary-key constraints */
+		if (rte->rtekind != RTE_RELATION)
+			continue;
+
+		/*
+		 * We must skip inheritance parent tables as some of the child rels
+		 * may cause duplicate rows.  This cannot happen with partitioned
+		 * tables, however.
+		 */
+		if (rte->inh && rte->relkind != RELKIND_PARTITIONED_TABLE)
+			continue;
+
+		/* Nothing to do unless this rel has multiple Vars in GROUP BY */
+		relattnos = groupbyattnos[relid];
+		if (bms_membership(relattnos) != BMS_MULTIPLE)
+			continue;
+
+		/*
+		 * Can't remove any columns for this rel if there is no suitable
+		 * (i.e., nondeferrable) primary key constraint.
+		 */
+		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
+		if (pkattnos == NULL)
+			continue;
+
+		/*
+		 * If the primary key is a proper subset of relattnos then we have
+		 * some items in the GROUP BY that can be removed.
+		 */
+		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+		{
+			/*
+			 * To easily remember whether we've found anything to do, we don't
+			 * allocate the surplusvars[] array until we find something.
+			 */
+			if (surplusvars == NULL)
+				surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
+													 (list_length(parse->rtable) + 1));
+
+			/* Remember the attnos of the removable columns */
+			surplusvars[relid] = bms_difference(relattnos, pkattnos);
+		}
+	}
+
+	/*
+	 * If we found any surplus Vars, build a new GROUP BY clause without them.
+	 * (Note: this may leave some TLEs with unreferenced ressortgroupref
+	 * markings, but that's harmless.)
+	 */
+	if (surplusvars != NULL)
+	{
+		List	   *new_groupby = NIL;
+
+		foreach(lc, root->processed_groupClause)
+		{
+			SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
+			TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
+			Var		   *var = (Var *) tle->expr;
+
+			/*
+			 * New list must include non-Vars, outer Vars, and anything not
+			 * marked as surplus.
+			 */
+			if (!IsA(var, Var) ||
+				var->varlevelsup > 0 ||
+				!bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber,
+							   surplusvars[var->varno]))
+				new_groupby = lappend(new_groupby, sgc);
+		}
+
+		root->processed_groupClause = new_groupby;
+	}
+}
 
 /*****************************************************************************
  *
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762e..84d9e99b0d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -23,7 +23,6 @@
 #include "access/sysattr.h"
 #include "access/table.h"
 #include "catalog/pg_aggregate.h"
-#include "catalog/pg_constraint.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
@@ -139,7 +138,6 @@ static void preprocess_rowmarks(PlannerInfo *root);
 static double preprocess_limit(PlannerInfo *root,
 							   double tuple_fraction,
 							   int64 *offset_est, int64 *count_est);
-static void remove_useless_groupby_columns(PlannerInfo *root);
 static List *preprocess_groupclause(PlannerInfo *root, List *force);
 static List *extract_rollup_sets(List *groupingSets);
 static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
@@ -2724,166 +2722,6 @@ limit_needed(Query *parse)
 	return false;				/* don't need a Limit plan node */
 }
 
-
-/*
- * remove_useless_groupby_columns
- *		Remove any columns in the GROUP BY clause that are redundant due to
- *		being functionally dependent on other GROUP BY columns.
- *
- * Since some other DBMSes do not allow references to ungrouped columns, it's
- * not unusual to find all columns listed in GROUP BY even though listing the
- * primary-key columns would be sufficient.  Deleting such excess columns
- * avoids redundant sorting work, so it's worth doing.
- *
- * Relcache invalidations will ensure that cached plans become invalidated
- * when the underlying index of the pkey constraint is dropped.
- *
- * Currently, we only make use of pkey constraints for this, however, we may
- * wish to take this further in the future and also use unique constraints
- * which have NOT NULL columns.  In that case, plan invalidation will still
- * work since relations will receive a relcache invalidation when a NOT NULL
- * constraint is dropped.
- */
-static void
-remove_useless_groupby_columns(PlannerInfo *root)
-{
-	Query	   *parse = root->parse;
-	Bitmapset **groupbyattnos;
-	Bitmapset **surplusvars;
-	ListCell   *lc;
-	int			relid;
-
-	/* No chance to do anything if there are less than two GROUP BY items */
-	if (list_length(root->processed_groupClause) < 2)
-		return;
-
-	/* Don't fiddle with the GROUP BY clause if the query has grouping sets */
-	if (parse->groupingSets)
-		return;
-
-	/*
-	 * Scan the GROUP BY clause to find GROUP BY items that are simple Vars.
-	 * Fill groupbyattnos[k] with a bitmapset of the column attnos of RTE k
-	 * that are GROUP BY items.
-	 */
-	groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
-										   (list_length(parse->rtable) + 1));
-	foreach(lc, root->processed_groupClause)
-	{
-		SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
-		TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
-		Var		   *var = (Var *) tle->expr;
-
-		/*
-		 * Ignore non-Vars and Vars from other query levels.
-		 *
-		 * XXX in principle, stable expressions containing Vars could also be
-		 * removed, if all the Vars are functionally dependent on other GROUP
-		 * BY items.  But it's not clear that such cases occur often enough to
-		 * be worth troubling over.
-		 */
-		if (!IsA(var, Var) ||
-			var->varlevelsup > 0)
-			continue;
-
-		/* OK, remember we have this Var */
-		relid = var->varno;
-		Assert(relid <= list_length(parse->rtable));
-		groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
-											  var->varattno - FirstLowInvalidHeapAttributeNumber);
-	}
-
-	/*
-	 * Consider each relation and see if it is possible to remove some of its
-	 * Vars from GROUP BY.  For simplicity and speed, we do the actual removal
-	 * in a separate pass.  Here, we just fill surplusvars[k] with a bitmapset
-	 * of the column attnos of RTE k that are removable GROUP BY items.
-	 */
-	surplusvars = NULL;			/* don't allocate array unless required */
-	relid = 0;
-	foreach(lc, parse->rtable)
-	{
-		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
-		Bitmapset  *relattnos;
-		Bitmapset  *pkattnos;
-		Oid			constraintOid;
-
-		relid++;
-
-		/* Only plain relations could have primary-key constraints */
-		if (rte->rtekind != RTE_RELATION)
-			continue;
-
-		/*
-		 * We must skip inheritance parent tables as some of the child rels
-		 * may cause duplicate rows.  This cannot happen with partitioned
-		 * tables, however.
-		 */
-		if (rte->inh && rte->relkind != RELKIND_PARTITIONED_TABLE)
-			continue;
-
-		/* Nothing to do unless this rel has multiple Vars in GROUP BY */
-		relattnos = groupbyattnos[relid];
-		if (bms_membership(relattnos) != BMS_MULTIPLE)
-			continue;
-
-		/*
-		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
-		 */
-		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-		if (pkattnos == NULL)
-			continue;
-
-		/*
-		 * If the primary key is a proper subset of relattnos then we have
-		 * some items in the GROUP BY that can be removed.
-		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
-		{
-			/*
-			 * To easily remember whether we've found anything to do, we don't
-			 * allocate the surplusvars[] array until we find something.
-			 */
-			if (surplusvars == NULL)
-				surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
-													 (list_length(parse->rtable) + 1));
-
-			/* Remember the attnos of the removable columns */
-			surplusvars[relid] = bms_difference(relattnos, pkattnos);
-		}
-	}
-
-	/*
-	 * If we found any surplus Vars, build a new GROUP BY clause without them.
-	 * (Note: this may leave some TLEs with unreferenced ressortgroupref
-	 * markings, but that's harmless.)
-	 */
-	if (surplusvars != NULL)
-	{
-		List	   *new_groupby = NIL;
-
-		foreach(lc, root->processed_groupClause)
-		{
-			SortGroupClause *sgc = lfirst_node(SortGroupClause, lc);
-			TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList);
-			Var		   *var = (Var *) tle->expr;
-
-			/*
-			 * New list must include non-Vars, outer Vars, and anything not
-			 * marked as surplus.
-			 */
-			if (!IsA(var, Var) ||
-				var->varlevelsup > 0 ||
-				!bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber,
-							   surplusvars[var->varno]))
-				new_groupby = lappend(new_groupby, sgc);
-		}
-
-		root->processed_groupClause = new_groupby;
-	}
-}
-
 /*
  * preprocess_groupclause - do preparatory work on GROUP BY clause
  *
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 93137261e4..0b6f0f7969 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -74,6 +74,7 @@ extern void add_vars_to_targetlist(PlannerInfo *root, List *vars,
 								   Relids where_needed);
 extern void add_vars_to_attr_needed(PlannerInfo *root, List *vars,
 									Relids where_needed);
+extern void remove_useless_groupby_columns(PlannerInfo *root);
 extern void find_lateral_references(PlannerInfo *root);
 extern void rebuild_lateral_attr_needed(PlannerInfo *root);
 extern void create_lateral_join_info(PlannerInfo *root);
-- 
2.34.1

v10-0002-Remove-redundant-GROUP-BY-columns-using-UNIQUE-i.patchapplication/octet-stream; name=v10-0002-Remove-redundant-GROUP-BY-columns-using-UNIQUE-i.patchDownload
From f72c0846a1be1d544f904bde7bae1863436f1556 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Fri, 29 Nov 2024 20:05:07 +1300
Subject: [PATCH v10 2/2] Remove redundant GROUP BY columns using UNIQUE
 indexes

d4c3a156c added support so that when the GROUP BY contained all of the
columns belonging to a relation's PRIMARY KEY, all other columns
belonging to that relation could be removed from the GROUP BY clause.
That's possible because all other columns are functionally dependent on
the PRIMARY KEY.

Here we expand on that optimization and allow it to work for any UNIQUE
index on the table rather than just the PRIMARY KEY.  This normally
requires that all columns in the index are defined NOT NULL; however, we
can relax that requirement when the index is defined with NULLS NOT
DISTINCT.
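
For example (from the regression tests added here), a single-column
unique index defined with NULLS NOT DISTINCT lets the other GROUP BY
columns be removed even though the indexed column is nullable:

    create unique index t3_c_uidx on t3(c) nulls not distinct;
    explain (costs off) select b,c from t3 group by b,c;
          QUERY PLAN
    ----------------------
     HashAggregate
       Group Key: c
       ->  Seq Scan on t3
    (3 rows)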

Previously remove_useless_groupby_columns() was being called from
grouping_planner() long before RelOptInfos were built for each
RangeTblEntry.  Here we delay calling remove_useless_groupby_columns()
until just after add_base_rels_to_query(), where RelOptInfos are created
in query_planner().  Doing this makes the feature much easier to
implement, as all the relation metadata is then available in RelOptInfo
form and we no longer have to query the catalog tables.

Patch originally by Zhang Mingli and later worked on by jian he, but after
I (David) worked on it, there was very little of the original left.

Author: David Rowley, Zhang Mingli, jian he
Discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0%40Spark
---
 src/backend/optimizer/plan/initsplan.c   | 117 +++++++++++++++++------
 src/backend/optimizer/plan/planmain.c    |   3 +
 src/backend/optimizer/plan/planner.c     |   2 -
 src/backend/optimizer/util/plancat.c     |   1 +
 src/include/nodes/pathnodes.h            |   2 +
 src/test/regress/expected/aggregates.out |  56 +++++++++++
 src/test/regress/sql/aggregates.sql      |  27 ++++++
 7 files changed, 179 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index c1c4f9f8b9..ac36b70746 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -400,17 +400,13 @@ add_vars_to_attr_needed(PlannerInfo *root, List *vars,
  *
  * Since some other DBMSes do not allow references to ungrouped columns, it's
  * not unusual to find all columns listed in GROUP BY even though listing the
- * primary-key columns would be sufficient.  Deleting such excess columns
- * avoids redundant sorting work, so it's worth doing.
+ * primary-key columns, or the columns of a unique constraint, would be sufficient.
+ * Deleting such excess columns avoids redundant sorting or hashing work, so
+ * it's worth doing.
  *
  * Relcache invalidations will ensure that cached plans become invalidated
- * when the underlying index of the pkey constraint is dropped.
- *
- * Currently, we only make use of pkey constraints for this, however, we may
- * wish to take this further in the future and also use unique constraints
- * which have NOT NULL columns.  In that case, plan invalidation will still
- * work since relations will receive a relcache invalidation when a NOT NULL
- * constraint is dropped.
+ * when the underlying supporting indexes are dropped or if a column's NOT
+ * NULL attribute is removed.
  */
 void
 remove_useless_groupby_columns(PlannerInfo *root)
@@ -472,9 +468,10 @@ remove_useless_groupby_columns(PlannerInfo *root)
 	foreach(lc, parse->rtable)
 	{
 		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
+		RelOptInfo *rel;
 		Bitmapset  *relattnos;
-		Bitmapset  *pkattnos;
-		Oid			constraintOid;
+		Bitmapset  *best_keycolumns = NULL;
+		int32		best_nkeycolumns = PG_INT32_MAX;
 
 		relid++;
 
@@ -495,30 +492,96 @@ remove_useless_groupby_columns(PlannerInfo *root)
 		if (bms_membership(relattnos) != BMS_MULTIPLE)
 			continue;
 
-		/*
-		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
-		 */
-		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-		if (pkattnos == NULL)
-			continue;
+		rel = root->simple_rel_array[relid];
 
 		/*
-		 * If the primary key is a proper subset of relattnos then we have
-		 * some items in the GROUP BY that can be removed.
+		 * Now search each index to check if there are any indexes where the
+		 * indexed columns are a subset of the GROUP BY columns for this
+		 * relation.
 		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+		foreach_node(IndexOptInfo, index, rel->indexlist)
+		{
+			Bitmapset  *ind_attnos;
+			bool		nulls_check_ok;
+
+			/*
+			 * Skip any non-unique or deferrable indexes.  Predicate indexes
+			 * have not been checked yet, so we must skip those too as the
+			 * predOK check might fail.
+			 */
+			if (!index->unique || !index->immediate || index->indpred != NIL)
+				continue;
+
+			/* For simplicity we currently don't support expression indexes */
+			if (index->indexprs != NIL)
+				continue;
+
+			ind_attnos = NULL;
+			nulls_check_ok = true;
+			for (int i = 0; i < index->nkeycolumns; i++)
+			{
+				/*
+				 * We must insist that the index columns are all defined NOT
+				 * NULL, otherwise duplicate NULLs could exist.  However, we
+				 * can relax this check when the index is defined with NULLS
+				 * NOT DISTINCT as there can only be 1 NULL row, therefore
+				 * functional dependency on the unique columns is maintained,
+				 * despite the NULL.
+				 */
+				if (!index->nullsnotdistinct &&
+					!bms_is_member(index->indexkeys[i],
+								   rel->notnullattnums))
+				{
+					nulls_check_ok = false;
+					break;
+				}
+
+				ind_attnos =
+					bms_add_member(ind_attnos,
+								   index->indexkeys[i] -
+								   FirstLowInvalidHeapAttributeNumber);
+			}
+
+			if (!nulls_check_ok)
+				continue;
+
+			/*
+			 * Skip any indexes where the indexed columns aren't a subset of
+			 * the GROUP BY.
+			 */
+			if (bms_subset_compare(ind_attnos, relattnos) != BMS_SUBSET1)
+				continue;
+
+			/*
+			 * Record the attribute numbers from the index with the fewest
+			 * columns.  This allows the largest number of columns to be
+			 * removed from the GROUP BY clause.  In the future, we may wish
+			 * to consider using the narrowest set of columns and looking at
+			 * pg_statistic.stawidth as it might be better to use an index
+			 * with, say, two INT4s rather than one long varlena column.
+			 */
+			if (index->nkeycolumns < best_nkeycolumns)
+			{
+				best_keycolumns = ind_attnos;
+				best_nkeycolumns = index->nkeycolumns;
+			}
+		}
+
+		/* Did we find a suitable index? */
+		if (best_keycolumns != NULL)
 		{
 			/*
 			 * To easily remember whether we've found anything to do, we don't
 			 * allocate the surplusvars[] array until we find something.
 			 */
 			if (surplusvars == NULL)
-				surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
-													 (list_length(parse->rtable) + 1));
+				surplusvars =
+					(Bitmapset **) palloc0(sizeof(Bitmapset *) *
+										   (list_length(parse->rtable) +
+											1));
 
 			/* Remember the attnos of the removable columns */
-			surplusvars[relid] = bms_difference(relattnos, pkattnos);
+			surplusvars[relid] = bms_difference(relattnos, best_keycolumns);
 		}
 	}
 
@@ -541,9 +604,9 @@ remove_useless_groupby_columns(PlannerInfo *root)
 			 * New list must include non-Vars, outer Vars, and anything not
 			 * marked as surplus.
 			 */
-			if (!IsA(var, Var) ||
-				var->varlevelsup > 0 ||
-				!bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber,
+			if (!IsA(var, Var) || var->varlevelsup > 0 ||
+				!bms_is_member(var->varattno -
+							   FirstLowInvalidHeapAttributeNumber,
 							   surplusvars[var->varno]))
 				new_groupby = lappend(new_groupby, sgc);
 		}
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index e17d31a5c3..735560e8ca 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -169,6 +169,9 @@ query_planner(PlannerInfo *root,
 	 */
 	add_base_rels_to_query(root, (Node *) parse->jointree);
 
+	/* Remove any redundant GROUP BY columns */
+	remove_useless_groupby_columns(root);
+
 	/*
 	 * Examine the targetlist and join tree, adding entries to baserel
 	 * targetlists for all referenced Vars, and generating PlaceHolderInfo
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 84d9e99b0d..a0a2de7ee4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1485,8 +1485,6 @@ grouping_planner(PlannerInfo *root, double tuple_fraction,
 		{
 			/* Preprocess regular GROUP BY clause, if any */
 			root->processed_groupClause = preprocess_groupclause(root, NIL);
-			/* Remove any redundant GROUP BY columns */
-			remove_useless_groupby_columns(root);
 		}
 
 		/*
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 37b0ca2e43..153390f2dc 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -457,6 +457,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->indrestrictinfo = NIL;	/* set later, in indxpath.c */
 			info->predOK = false;	/* set later, in indxpath.c */
 			info->unique = index->indisunique;
+			info->nullsnotdistinct = index->indnullsnotdistinct;
 			info->immediate = index->indimmediate;
 			info->hypothetical = false;
 
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index add0f9e45f..0759e00e96 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1187,6 +1187,8 @@ struct IndexOptInfo
 	bool		predOK;
 	/* true if a unique index */
 	bool		unique;
+	/* true if the index was defined with NULLS NOT DISTINCT */
+	bool		nullsnotdistinct;
 	/* is uniqueness enforced immediately? */
 	bool		immediate;
 	/* true if index doesn't really exist */
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1e682565d1..27ea09732f 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1448,6 +1448,62 @@ explain (costs off) select * from p_t1 group by a,b,c,d;
          ->  Seq Scan on p_t1_2
 (5 rows)
 
+create unique index t3_c_uidx on t3(c);
+-- Ensure we don't remove any columns from the GROUP BY for a unique
+-- index on a NULLable column.
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: b, c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Make the column NOT NULL and ensure we remove the redundant column
+alter table t3 alter column c set not null;
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- When there are multiple supporting unique indexes and the GROUP BY contains
+-- columns to cover all of those, ensure we pick the index with the least
+-- number of columns so that we can remove more columns from the GROUP BY.
+explain (costs off) select a,b,c from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- As above but try ordering the columns differently to ensure we get the
+-- same result.
+explain (costs off) select a,b,c from t3 group by c,a,b;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- A unique index defined as NULLS NOT DISTINCT does not need a supporting NOT
+-- NULL constraint on the indexed columns.  Ensure the redundant columns are
+-- removed from the GROUP BY for such a table.
+drop index t3_c_uidx;
+alter table t3 alter column c drop not null;
+create unique index t3_c_uidx on t3(c) nulls not distinct;
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
 drop table t1 cascade;
 NOTICE:  drop cascades to table t1c
 drop table t2;
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 4885daffe6..4e23fb4f38 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -507,6 +507,33 @@ create temp table p_t1_2 partition of p_t1 for values in(2);
 -- Ensure we can remove non-PK columns for partitioned tables.
 explain (costs off) select * from p_t1 group by a,b,c,d;
 
+create unique index t3_c_uidx on t3(c);
+
+-- Ensure we don't remove any columns from the GROUP BY for a unique
+-- index on a NULLable column.
+explain (costs off) select b,c from t3 group by b,c;
+
+-- Make the column NOT NULL and ensure we remove the redundant column
+alter table t3 alter column c set not null;
+explain (costs off) select b,c from t3 group by b,c;
+
+-- When there are multiple supporting unique indexes and the GROUP BY contains
+-- columns to cover all of those, ensure we pick the index with the least
+-- number of columns so that we can remove more columns from the GROUP BY.
+explain (costs off) select a,b,c from t3 group by a,b,c;
+
+-- As above but try ordering the columns differently to ensure we get the
+-- same result.
+explain (costs off) select a,b,c from t3 group by c,a,b;
+
+-- A unique index defined as NULLS NOT DISTINCT does not need a supporting NOT
+-- NULL constraint on the indexed columns.  Ensure the redundant columns are
+-- removed from the GROUP BY for such a table.
+drop index t3_c_uidx;
+alter table t3 alter column c drop not null;
+create unique index t3_c_uidx on t3(c) nulls not distinct;
+explain (costs off) select b,c from t3 group by b,c;
+
 drop table t1 cascade;
 drop table t2;
 drop table t3;
-- 
2.34.1

#19David Rowley
dgrowleyml@gmail.com
In reply to: jian he (#17)
Re: Remove useless GROUP BY columns considering unique index

On Fri, 29 Nov 2024 at 20:04, jian he <jian.universality@gmail.com> wrote:

I found that a unique expression index can also be used for GROUP BY
column removal, so I implemented it and added two tests for it.

What do you think?

I think it's likely just not common enough to be worthwhile, plus,
unfortunately, I don't think this optimisation is possible with what
we have today. Also, if it were possible you'd need to check the GROUP
BY expressions match the indexed expressions. You've not done that.

The reason I don't think it's possible is that we have no infrastructure
that allows us to tag functions or operators so that non-null input(s)
mean non-null outputs. We only have strict, which means null input
means null output. That's the opposite of what we'd need. It might
only be possible with a NULLS NOT DISTINCT index.
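
For illustration, here's a sketch of the problem (the t6 table and the
expression are made up for the example). Even over a NOT NULL column,
the indexed expression itself can be NULL, and the default NULLS
DISTINCT behaviour then permits duplicates:

create temp table t6 (a int not null, b int);
create unique index t6_expr_uidx on t6 ((nullif(a, 0)));
insert into t6 values (0, 1), (0, 2); -- both index entries are NULL

Grouping by nullif(a, 0) puts those two rows into a single group despite
the unique index, so b isn't functionally dependent on the indexed
expression and couldn't be removed from the GROUP BY.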

David

#20Andrei Lepikhov
lepihov@gmail.com
In reply to: David Rowley (#18)
Re: Remove useless GROUP BY columns considering unique index

On 29/11/2024 14:28, David Rowley wrote:

On Fri, 29 Nov 2024 at 19:36, Andrei Lepikhov <lepihov@gmail.com> wrote:

1. Thread reference in the patch comment doesn't work.
2. May it be possible to move remove_useless_groupby_columns in the
second commit? It would be easier to review the differences.

For #1, I rewrote the commit message.

I've split into two patches for ease of review. See attached.

Thanks!
Patch 0001 seems OK. The distinction between 'planmain' and 'initsplan'
is not clear to me, but the code movement seems correct. I should say
that the comment in initsplan.c:
'... initialization routines'
looks strange. Is this module still about initialisation?

Patch 0002 looks helpful and performant. I propose to check 'relid > 0'
to avoid diving into 'foreach(lc, parse->rtable)' at all if nothing has
been found.
It may restrict future optimisations like [1], where the upper query
block can employ the 'uniqueness' of grouped clauses in selectivity
estimations. But I think it is OK for now.

NOTES:
1. Uniqueness is proved by a UNIQUE index. It might be worth being a bit
more sophisticated and probing not only the grouping quals but also
equivalence classes. In that case, we could handle something like this:

CREATE TABLE test (
x int NOT NULL, y int NOT NULL, z int NOT NULL, w int);
CREATE UNIQUE INDEX ON test(y,z);
SELECT t2.z FROM test t2, test t1 WHERE t1.y=t2.y
GROUP BY t1.y,t2.z,t2.w;

Here t1.y = t2.y places both Vars in one equivalence class, so the
grouping keys effectively cover the unique index on (y,z), and t2.w
could be removed.

2. The same idea could be used with the DISTINCT statement. Right now we have:
SELECT DISTINCT y,z,w FROM test;

 HashAggregate
   Output: y, z, w
   Group Key: test.y, test.z, test.w
   ->  Seq Scan on public.test
         Output: x, y, z, w
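
With the unique index on (y,z) and the NOT NULL constraints, w is
functionally dependent on (y,z), so the Group Key could in principle
shrink to just y and z. Hypothetically, if the optimization were taught
about DISTINCT, we could get something like:

 HashAggregate
   Output: y, z, w
   Group Key: test.y, test.z
   ->  Seq Scan on public.test
         Output: x, y, z, w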

[1]: /messages/by-id/50fe6779-ee2d-4256-bc64-cd661bc4029a@gmail.com

--
regards, Andrei Lepikhov

#21jian he
jian.universality@gmail.com
In reply to: David Rowley (#19)
Re: Remove useless GROUP BY columns considering unique index

On Fri, Nov 29, 2024 at 5:20 PM David Rowley <dgrowleyml@gmail.com> wrote:

The reason I don't think is possible is that we have no infrastructure
that allows us to tag functions or operators so that non-null input(s)
mean non-null outputs. We only have strict, which means null input
means null output. That's the opposite of what we'd need. It might
only be possible with a NULLS NOT DISTINCT index.

Thank you for pointing out "non-null input(s) mean non-null outputs".
I hadn't thought about that at all.

Regarding v10: placing remove_useless_groupby_columns right after
add_base_rels_to_query makes a lot of sense, since we can then safely
use the cached RelOptInfo->indexlist and RelOptInfo->notnullattnums.
Overall, I didn't find any issues.

#22David Rowley
dgrowleyml@gmail.com
In reply to: jian he (#21)
Re: Remove useless GROUP BY columns considering unique index

On Mon, 2 Dec 2024 at 17:22, jian he <jian.universality@gmail.com> wrote:

regarding v10.
you placed remove_useless_groupby_columns right after add_base_rels_to_query
makes so much sense.
so we can be safely use cached RelOptInfo->indexlist, RelOptInfo->notnullattnums
overall it didn't find any issue.

Thanks for looking.

After a bit more adjustment, I've pushed both patches.

I felt it was best to have the commit that moved
remove_useless_groupby_columns also change the call site of that
function.

David

#23David Rowley
dgrowleyml@gmail.com
In reply to: Andrei Lepikhov (#20)
Re: Remove useless GROUP BY columns considering unique index

On Mon, 2 Dec 2024 at 17:18, Andrei Lepikhov <lepihov@gmail.com> wrote:

Patch 0002 looks helpful and performant. I propose to check 'relid > 0'
to avoid diving into 'foreach(lc, parse->rtable)' at all if nothing has
been found.

I did end up adding another fast path there, but I felt like checking
relid > 0 wasn't as good as it could be as that would have only
short-circuited when we don't see any Vars of level 0 in the GROUP BY.
It seemed cheap enough to short-circuit when none of the relations
mentioned in the GROUP BY have multiple columns mentioned.

NOTES:
1. Uniqueness is proved by a UNIQUE index. It might be a bit more
sophisticated to probe not only grouping qual but also employ
equivalence classes. In that case, we could process something like that:

CREATE TABLE test (
x int NOT NULL, y int NOT NULL, z int NOT NULL, w int);
CREATE UNIQUE INDEX ON test(y,z);
SELECT t2.z FROM test t2, test t1 WHERE t1.y=t2.y
GROUP BY t1.y,t2.z,t2.w;

It might be worth doing something like that. It looks like we could
delay remove_useless_groupby_columns() until standard_qp_callback
anyway. Further modifications to the GROUP BY can occur there. It
might make sense to replace the call to
make_pathkeys_for_sortclauses_extended() in standard_qp_callback()
with a version of remove_useless_groupby_columns() which does both
tasks plus the one you mention.

However, I don't know what the logic would be exactly for the case you
mention as it seems possible if we start swapping columns out for
another EquivalenceMember that we might cause some regression. For
example, if you had:

create table t1(a int, b int, primary key(a,b));
create table t2(x int, y int);

select ... from t1 inner join t2 on t1.a=t2.x and t1.b = t2.y group by
t2.x,t1.b;

then how do you decide if the GROUP BY should become t1.a,t1.b or
t2.x,t2.y? It's not clear to me that using t1's columns is always
better than using t2's. I imagine using a mix is never better, but I'm
unsure how you'd decide which ones to use.

David

#24Andrei Lepikhov
lepihov@gmail.com
In reply to: David Rowley (#23)
Re: Remove useless GROUP BY columns considering unique index

On 12/12/24 10:09, David Rowley wrote:

On Mon, 2 Dec 2024 at 17:18, Andrei Lepikhov <lepihov@gmail.com> wrote:

Patch 0002 looks helpful and performant. I propose to check 'relid > 0'
to avoid diving into 'foreach(lc, parse->rtable)' at all if nothing has
been found.

I did end up adding another fast path there, but I felt like checking
relid > 0 wasn't as good as it could be as that would have only
short-circuited when we don't see any Vars of level 0 in the GROUP BY.
It seemed cheap enough to short-circuit when none of the relations
mentioned in the GROUP BY have multiple columns mentioned.

Your solution seems much better than my proposal. Thanks for applying it!

when how do you decide if the GROUP BY should become t1.a,t1.b or
t2.x,t2.y? It's not clear to me that using t1's columns is always
better than using t2's. I imagine using a mix is never better, but I'm
unsure how you'd decide which ones to use.

It depends on how you calculate that 'better'. Right now, GROUP BY
employs two strategies to reduce path cost: 1) match the ORDER BY clause
(avoid a final sort); 2) fit the incoming subtree pathkeys (avoid
presorting before grouping).
My idea is close to [1], where the cost depends on the estimated number
of groups in the first grouping column, because cost_sort predicts the
number of comparison operator calls based on statistics. In that case,
the choice between (x,y) and (a,b) would depend on the ndistinct of 'x'
and 'a'.
In general, this was an idea to debate, more for further development
than for the patch in this thread.
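
For example (a sketch, assuming the t1/t2 tables from your earlier
example have been populated), the per-column ndistinct estimates such a
cost model would consult are visible in pg_stats:

analyze t1;
analyze t2;
select tablename, attname, n_distinct
from pg_stats
where tablename in ('t1', 't2') and attname in ('a', 'x');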

[1]: Consider the number of columns in the sort cost model, /messages/by-id/8742aaa8-9519-4a1f-91bd-364aec65f5cf@gmail.com

--
regards, Andrei Lepikhov