Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

Started by Andrei Lepikhov over 1 year ago · 9 messages
#1 Andrei Lepikhov
lepihov@gmail.com
1 attachment(s)

Hi,
A database application developer complained that he constantly has problems
with row estimations in joins over groupings, and he provided this demo example:

CREATE TABLE line (id int PRIMARY KEY, docId int, amount numeric);
CREATE INDEX line_doc ON line (docid);
INSERT INTO line (id, docId, amount)
(SELECT docId*100 + id AS id, docId, random() AS amount
FROM generate_series(1, 10) AS id,
generate_series(1, 25000) AS docid);
INSERT INTO line (id, docId, amount)
(SELECT docId*100 + id AS id, docId, random() AS amount
FROM generate_series(1, 20) AS id,
generate_series(25001, 50000) AS docid);
INSERT INTO line (id, docId, amount)
(SELECT docId*100 + id AS id, docId, random() AS amount
FROM generate_series(1, 50) AS id,
generate_series(50001, 75000) AS docid);
INSERT INTO line (id, docId, amount)
(SELECT docId*100 + id AS id, docId, random() AS amount
FROM generate_series(1, 100) AS id,
generate_series(75001, 100000) AS docid);
CREATE TABLE tmp (id int PRIMARY KEY);
INSERT INTO tmp (id) SELECT * FROM generate_series(1, 50);
ANALYZE line, tmp;

EXPLAIN
SELECT tmp.id, sq.amount FROM tmp
LEFT JOIN
(SELECT docid, SUM(amount) AS amount FROM line
JOIN tmp ON tmp.id = docid GROUP BY 1) sq
ON sq.docid = tmp.id;

With this query, we get a bad estimation of the top JOIN:

Hash Right Join  (rows=855)
  Hash Cond: (line.docid = tmp.id)
  ->  GroupAggregate  (cost=3.49..117.25 rows=3420 width=36)
        Group Key: line.docid
        ->  Merge Join  (cost=3.49..57.40 rows=3420 width=15)
              Merge Cond: (line.docid = tmp_1.id)
...

This wrong prediction makes things much worse if the query has more
upper query blocks.
His question was: Why not consider the grouping column unique in the
upper query block? It could improve estimations.
After a thorough investigation, I discovered that in commit 4767bc8ff2
most of the work was already done for DISTINCT clauses. So, why not do
the same for grouping? A sketch of the patch is attached.
As I see it, grouping in this sense works quite similarly to DISTINCT,
and we have no reason to ignore it. After applying the patch, you can
see that prediction has been improved:

Hash Right Join (cost=5.62..162.56 rows=50 width=36)

--
regards, Andrei Lepikhov

Attachments:

0001-Improve-statistics-estimation-considering-GROUP-BY-a.patch (text/plain)
From 0c440c2dd4832c155849c347915a64b9878c6f28 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Wed, 18 Sep 2024 19:25:02 +0200
Subject: [PATCH] Improve statistics estimation considering GROUP-BY as a
 'uniqueiser'.

This commit follows the idea of the 4767bc8ff2. GROUP-BY makes the grouped
column 'highly likely unique'. We can employ this fact in the statistics to
make more precise estimations in the upper query block.
---
 src/backend/utils/adt/selfuncs.c | 37 ++++++++++++++++++++------------
 src/include/utils/selfuncs.h     |  2 +-
 2 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 03d7fb5f48..b6ce8978c1 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -321,8 +321,8 @@ var_eq_const(VariableStatData *vardata, Oid oproid, Oid collation,
 	}
 
 	/*
-	 * If we matched the var to a unique index or DISTINCT clause, assume
-	 * there is exactly one match regardless of anything else.  (This is
+	 * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+	 * assume there is exactly one match regardless of anything else.  (This is
 	 * slightly bogus, since the index or clause's equality operator might be
 	 * different from ours, but it's much more likely to be right than
 	 * ignoring the information.)
@@ -483,8 +483,8 @@ var_eq_non_const(VariableStatData *vardata, Oid oproid, Oid collation,
 	}
 
 	/*
-	 * If we matched the var to a unique index or DISTINCT clause, assume
-	 * there is exactly one match regardless of anything else.  (This is
+	 * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+	 * assume there is exactly one match regardless of anything else.  (This is
 	 * slightly bogus, since the index or clause's equality operator might be
 	 * different from ours, but it's much more likely to be right than
 	 * ignoring the information.)
@@ -5005,9 +5005,9 @@ ReleaseDummy(HeapTuple tuple)
  *	atttype, atttypmod: actual type/typmod of the "var" expression.  This is
  *		commonly the same as the exposed type of the variable argument,
  *		but can be different in binary-compatible-type cases.
- *	isunique: true if we were able to match the var to a unique index or a
- *		single-column DISTINCT clause, implying its values are unique for
- *		this query.  (Caution: this should be trusted for statistical
+ *	isunique: true if we were able to match the var to a unique index, a
+ *		single-column DISTINCT or GROUP-BY clause, implying its values are
+ *		unique for this query.  (Caution: this should be trusted for statistical
  *		purposes only, since we do not check indimmediate nor verify that
  *		the exact same definition of equality applies.)
  *	acl_ok: true if current user has permission to read the column(s)
@@ -5654,7 +5654,7 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 		Assert(IsA(subquery, Query));
 
 		/*
-		 * Punt if subquery uses set operations or GROUP BY, as these will
+		 * Punt if subquery uses set operations or grouping sets, as these will
 		 * mash underlying columns' stats beyond recognition.  (Set ops are
 		 * particularly nasty; if we forged ahead, we would return stats
 		 * relevant to only the leftmost subselect...)	DISTINCT is also
@@ -5662,7 +5662,6 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 		 * of learning something even with it.
 		 */
 		if (subquery->setOperations ||
-			subquery->groupClause ||
 			subquery->groupingSets)
 			return;
 
@@ -5692,6 +5691,16 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 			return;
 		}
 
+		/* The same idea as with DISTINCT clause works for a GROUP-BY too */
+		if (subquery->groupClause)
+		{
+			if (list_length(subquery->groupClause) == 1 &&
+				targetIsInSortList(ste, InvalidOid, subquery->groupClause))
+				vardata->isunique = true;
+			/* cannot go further */
+			return;
+		}
+
 		/*
 		 * If the sub-query originated from a view with the security_barrier
 		 * attribute, we must not look at the variable's statistics, though it
@@ -5843,11 +5852,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 	}
 
 	/*
-	 * If there is a unique index or DISTINCT clause for the variable, assume
-	 * it is unique no matter what pg_statistic says; the statistics could be
-	 * out of date, or we might have found a partial unique index that proves
-	 * the var is unique for this query.  However, we'd better still believe
-	 * the null-fraction statistic.
+	 * If there is a unique index, DISTINCT or GROUP-BY clause for the variable,
+	 * assume it is unique no matter what pg_statistic says; the statistics
+	 * could be out of date, or we might have found a partial unique index that
+	 * proves the var is unique for this query.  However, we'd better still
+	 * believe the null-fraction statistic.
 	 */
 	if (vardata->isunique)
 		stadistinct = -1.0 * (1.0 - stanullfrac);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..b1ac81977b 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -94,7 +94,7 @@ typedef struct VariableStatData
 	Oid			vartype;		/* exposed type of expression */
 	Oid			atttype;		/* actual type (after stripping relabel) */
 	int32		atttypmod;		/* actual typmod (after stripping relabel) */
-	bool		isunique;		/* matches unique index or DISTINCT clause */
+	bool		isunique;		/* matches unique index, DISTINCT or GROUP-BY clause */
 	bool		acl_ok;			/* result of ACL check on table or column */
 } VariableStatData;
 
-- 
2.46.1

#2 Andrei Lepikhov
lepihov@gmail.com
In reply to: Andrei Lepikhov (#1)
1 attachment(s)
Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

On 19/9/2024 09:55, Andrei Lepikhov wrote:

This wrong prediction makes things much worse if the query has more
upper query blocks.
His question was: Why not consider the grouping column unique in the
upper query block? It could improve estimations.
After a thorough investigation, I discovered that in commit  4767bc8ff2
most of the work was already done for DISTINCT clauses. So, why not do
the same for grouping? A sketch of the patch is attached.
As I see it, grouping in this sense works quite similarly to DISTINCT,
and we have no reason to ignore it. After applying the patch, you can
see that prediction has been improved:

Hash Right Join  (cost=5.62..162.56 rows=50 width=36)

A regression test is added in the new version.
The code is tiny, simple, and non-invasive; it will be easy to commit or
reject. So I have added it to the next commitfest.

--
regards, Andrei Lepikhov

Attachments:

v2-0001-Improve-statistics-estimation-considering-GROUP-B.patch (text/plain)
From 22b572903b8c1a2459b051f91e0902a2fc243648 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Wed, 18 Sep 2024 19:25:02 +0200
Subject: [PATCH v2] Improve statistics estimation considering GROUP-BY as a
 'uniqueiser'.

This commit follows the idea of the 4767bc8ff2. GROUP-BY makes the grouped
column 'highly likely unique'. We can employ this fact in the statistics to
make more precise estimations in the upper query block.
---
 src/backend/utils/adt/selfuncs.c        | 37 +++++++++++++++----------
 src/include/utils/selfuncs.h            |  2 +-
 src/test/regress/expected/stats_ext.out | 28 +++++++++++++++++++
 src/test/regress/sql/stats_ext.sql      | 19 +++++++++++++
 4 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 03d7fb5f48..b6ce8978c1 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -321,8 +321,8 @@ var_eq_const(VariableStatData *vardata, Oid oproid, Oid collation,
 	}
 
 	/*
-	 * If we matched the var to a unique index or DISTINCT clause, assume
-	 * there is exactly one match regardless of anything else.  (This is
+	 * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+	 * assume there is exactly one match regardless of anything else.  (This is
 	 * slightly bogus, since the index or clause's equality operator might be
 	 * different from ours, but it's much more likely to be right than
 	 * ignoring the information.)
@@ -483,8 +483,8 @@ var_eq_non_const(VariableStatData *vardata, Oid oproid, Oid collation,
 	}
 
 	/*
-	 * If we matched the var to a unique index or DISTINCT clause, assume
-	 * there is exactly one match regardless of anything else.  (This is
+	 * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+	 * assume there is exactly one match regardless of anything else.  (This is
 	 * slightly bogus, since the index or clause's equality operator might be
 	 * different from ours, but it's much more likely to be right than
 	 * ignoring the information.)
@@ -5005,9 +5005,9 @@ ReleaseDummy(HeapTuple tuple)
  *	atttype, atttypmod: actual type/typmod of the "var" expression.  This is
  *		commonly the same as the exposed type of the variable argument,
  *		but can be different in binary-compatible-type cases.
- *	isunique: true if we were able to match the var to a unique index or a
- *		single-column DISTINCT clause, implying its values are unique for
- *		this query.  (Caution: this should be trusted for statistical
+ *	isunique: true if we were able to match the var to a unique index, a
+ *		single-column DISTINCT or GROUP-BY clause, implying its values are
+ *		unique for this query.  (Caution: this should be trusted for statistical
  *		purposes only, since we do not check indimmediate nor verify that
  *		the exact same definition of equality applies.)
  *	acl_ok: true if current user has permission to read the column(s)
@@ -5654,7 +5654,7 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 		Assert(IsA(subquery, Query));
 
 		/*
-		 * Punt if subquery uses set operations or GROUP BY, as these will
+		 * Punt if subquery uses set operations or grouping sets, as these will
 		 * mash underlying columns' stats beyond recognition.  (Set ops are
 		 * particularly nasty; if we forged ahead, we would return stats
 		 * relevant to only the leftmost subselect...)	DISTINCT is also
@@ -5662,7 +5662,6 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 		 * of learning something even with it.
 		 */
 		if (subquery->setOperations ||
-			subquery->groupClause ||
 			subquery->groupingSets)
 			return;
 
@@ -5692,6 +5691,16 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 			return;
 		}
 
+		/* The same idea as with DISTINCT clause works for a GROUP-BY too */
+		if (subquery->groupClause)
+		{
+			if (list_length(subquery->groupClause) == 1 &&
+				targetIsInSortList(ste, InvalidOid, subquery->groupClause))
+				vardata->isunique = true;
+			/* cannot go further */
+			return;
+		}
+
 		/*
 		 * If the sub-query originated from a view with the security_barrier
 		 * attribute, we must not look at the variable's statistics, though it
@@ -5843,11 +5852,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 	}
 
 	/*
-	 * If there is a unique index or DISTINCT clause for the variable, assume
-	 * it is unique no matter what pg_statistic says; the statistics could be
-	 * out of date, or we might have found a partial unique index that proves
-	 * the var is unique for this query.  However, we'd better still believe
-	 * the null-fraction statistic.
+	 * If there is a unique index, DISTINCT or GROUP-BY clause for the variable,
+	 * assume it is unique no matter what pg_statistic says; the statistics
+	 * could be out of date, or we might have found a partial unique index that
+	 * proves the var is unique for this query.  However, we'd better still
+	 * believe the null-fraction statistic.
 	 */
 	if (vardata->isunique)
 		stadistinct = -1.0 * (1.0 - stanullfrac);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index f2563ad1cb..b1ac81977b 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -94,7 +94,7 @@ typedef struct VariableStatData
 	Oid			vartype;		/* exposed type of expression */
 	Oid			atttype;		/* actual type (after stripping relabel) */
 	int32		atttypmod;		/* actual typmod (after stripping relabel) */
-	bool		isunique;		/* matches unique index or DISTINCT clause */
+	bool		isunique;		/* matches unique index, DISTINCT or GROUP-BY clause */
 	bool		acl_ok;			/* result of ACL check on table or column */
 } VariableStatData;
 
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 8c4da95508..dfe9baf4b2 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -3074,6 +3074,34 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 (0 rows)
 
 DROP TABLE expr_stats_incompatible_test;
+-- Use the 'GROUP BY x' operator to claim 'x' as a unique column upstream.
+CREATE TABLE gu_1 (x real);
+CREATE TABLE gu_2 (x real);
+CREATE TABLE gu_3 (x numeric);
+INSERT INTO gu_1 (x) VALUES (1);
+INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
+INSERT INTO gu_3 VALUES ('1.'), ('1.0'), ('1.00'), ('1.000'), ('1.0000');
+VACUUM ANALYZE gu_1,gu_2,gu_3;
+SELECT * FROM check_estimated_rows('
+  SELECT * FROM gu_1 LEFT JOIN (
+    SELECT x FROM gu_2 GROUP BY x) AS q1
+    ON gu_1.x=q1.x;');
+ estimated | actual 
+-----------+--------
+         1 |      1
+(1 row)
+
+-- Special case when GROUP BY comparison operator differs from the upper one
+SELECT * FROM check_estimated_rows('
+SELECT * FROM gu_1 LEFT JOIN (
+select x::real from (SELECT x::text FROM gu_3) group by x) AS q1
+  ON gu_1.x=q1.x;');
+ estimated | actual 
+-----------+--------
+         1 |      5
+(1 row)
+
+DROP TABLE gu_1,gu_2,gu_3;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 0c08a6cc42..5e30007c37 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1547,6 +1547,25 @@ SELECT c0 FROM ONLY expr_stats_incompatible_test WHERE
 
 DROP TABLE expr_stats_incompatible_test;
 
+-- Use the 'GROUP BY x' operator to claim 'x' as a unique column upstream.
+CREATE TABLE gu_1 (x real);
+CREATE TABLE gu_2 (x real);
+CREATE TABLE gu_3 (x numeric);
+INSERT INTO gu_1 (x) VALUES (1);
+INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
+INSERT INTO gu_3 VALUES ('1.'), ('1.0'), ('1.00'), ('1.000'), ('1.0000');
+VACUUM ANALYZE gu_1,gu_2,gu_3;
+SELECT * FROM check_estimated_rows('
+  SELECT * FROM gu_1 LEFT JOIN (
+    SELECT x FROM gu_2 GROUP BY x) AS q1
+    ON gu_1.x=q1.x;');
+-- Special case when GROUP BY comparison operator differs from the upper one
+SELECT * FROM check_estimated_rows('
+SELECT * FROM gu_1 LEFT JOIN (
+select x::real from (SELECT x::text FROM gu_3) group by x) AS q1
+  ON gu_1.x=q1.x;');
+DROP TABLE gu_1,gu_2,gu_3;
+
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
-- 
2.46.1

#3 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrei Lepikhov (#2)
Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

On 24/09/2024 08:08, Andrei Lepikhov wrote:

On 19/9/2024 09:55, Andrei Lepikhov wrote:

This wrong prediction makes things much worse if the query has more
upper query blocks.
His question was: Why not consider the grouping column unique in the
upper query block? It could improve estimations.
After a thorough investigation, I discovered that in commit
4767bc8ff2 most of the work was already done for DISTINCT clauses. So,
why not do the same for grouping? A sketch of the patch is attached.
As I see it, grouping in this sense works quite similarly to DISTINCT,
and we have no reason to ignore it. After applying the patch, you can
see that prediction has been improved:

Hash Right Join  (cost=5.62..162.56 rows=50 width=36)

A regression test is added in the new version.
The code is tiny, simple, and non-invasive; it will be easy to commit or
reject. So I have added it to the next commitfest.

Looks good at a quick glance.

@@ -5843,11 +5852,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
}

/*
-	 * If there is a unique index or DISTINCT clause for the variable, assume
-	 * it is unique no matter what pg_statistic says; the statistics could be
-	 * out of date, or we might have found a partial unique index that proves
-	 * the var is unique for this query.  However, we'd better still believe
-	 * the null-fraction statistic.
+	 * If there is a unique index, DISTINCT or GROUP-BY clause for the variable,
+	 * assume it is unique no matter what pg_statistic says; the statistics
+	 * could be out of date, or we might have found a partial unique index that
+	 * proves the var is unique for this query.  However, we'd better still
+	 * believe the null-fraction statistic.
*/
if (vardata->isunique)
stadistinct = -1.0 * (1.0 - stanullfrac);

I wonder about the "we'd better still believe the null-fraction
statistic" part. It makes sense for a unique index, but a DISTINCT or
GROUP BY collapses all the NULLs to a single row. So I think there's
some more work to be done here.
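To make the concern concrete, here is a rough arithmetic sketch (Python used
purely for illustration; both function names are invented for this example and
are not PostgreSQL APIs) of how the two notions of "unique" treat NULLs
differently:

```python
def ndistinct_unique_index(ntuples, stanullfrac):
    # A unique index guarantees distinct non-NULL values, but NULLs may
    # repeat freely, so only the non-NULL fraction counts as distinct.
    return ntuples * (1.0 - stanullfrac)

def ndistinct_after_group_by(ntuples, stanullfrac):
    # GROUP BY (or DISTINCT) collapses all NULLs into a single group, so
    # the output has every non-NULL value once, plus one row for NULL.
    groups = ntuples * (1.0 - stanullfrac)
    if stanullfrac > 0.0:
        groups += 1
    return groups

print(ndistinct_unique_index(1000, 0.5))    # 500.0
print(ndistinct_after_group_by(1000, 0.5))  # 501.0
```

The two estimates differ by at most one group, which is why the discrepancy
matters less for the row count than for the NULL fraction the upper block
assumes.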

--
Heikki Linnakangas
Neon (https://neon.tech)

#4 Andrei Lepikhov
lepihov@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

Thanks for taking a look!

On 11/25/24 23:45, Heikki Linnakangas wrote:

On 24/09/2024 08:08, Andrei Lepikhov wrote:

+     * proves the var is unique for this query.  However, we'd better still
+     * believe the null-fraction statistic.
      */
     if (vardata->isunique)
         stadistinct = -1.0 * (1.0 - stanullfrac);

I wonder about the "we'd better still believe the null-fraction
statistic" part. It makes sense for a unique index, but a DISTINCT or
GROUP BY collapses all the NULLs to a single row. So I think there's
some more work to be done here.

IMO, in that particular case it is not an issue: with GROUP BY, we set the
vardata->isunique field and do not recurse into the Var's statistics inside
the subquery, just as DISTINCT already does. So we have stanullfrac == 0,
which means the optimiser does not count the number of NULLs. In the case of
a UNIQUE index, the optimiser will have the stanullfrac statistic and will
count NULLs.
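In other words, the arithmetic quoted above degenerates as follows (a Python
paraphrase for illustration only; the real logic is the C code in
get_variable_numdistinct(), and this helper name is invented):

```python
def numdistinct_from_isunique(stanullfrac, ntuples):
    # pg_statistic convention: a negative stadistinct is a fraction of
    # the row count. isunique sets it to -(1 - stanullfrac), i.e. one
    # distinct value per non-NULL row.
    stadistinct = -1.0 * (1.0 - stanullfrac)
    return -stadistinct * ntuples

# Coming through a GROUP BY subquery, stanullfrac is unknown and taken
# as 0, so the estimate is simply the row count of the grouped output:
print(numdistinct_from_isunique(0.0, 50))     # 50.0

# With a UNIQUE index, the real stanullfrac is available and discounted:
print(numdistinct_from_isunique(0.25, 1000))  # 750.0
```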

But your question raises another one: could we add some vardata_extra to a
node, which could track specific conditions and let upper nodes consider
them along with the Var statistics?
For example, we could keep the 'unique set of columns' knowledge in such a
structure for the Aggregate node. It could also be a solution to the problem
of counting NULLs generated by the RHS of OUTER JOINs in the query tree.
What's more, look at the query:

CREATE TABLE gu_2 (x real);
INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
VACUUM ANALYZE gu_2;

HashAggregate  (cost=20.91..22.35 rows=144 width=4) (actual rows=50 loops=1)
  Group Key: gu_2.x
  Batches: 1  Memory Usage: 40kB
  ->  HashAggregate  (cost=19.11..20.55 rows=144 width=4) (actual rows=50 loops=1)
        Group Key: gu_2.x
        Batches: 1  Memory Usage: 40kB
        ->  Seq Scan on gu_2  (cost=0.00..18.75 rows=145 width=4) (actual rows=149 loops=1)
              Filter: ((x < '50'::double precision) OR (x IS NULL))
              Rows Removed by Filter: 951

Here we could also count the number of scanned NULLs separately in
vardata_extra and use it in the upper GROUP BY estimation.
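A hypothetical sketch of that idea (this is not existing PostgreSQL code;
vardata_extra and both helper functions are invented for illustration): if
the scan reported its NULL count separately, the group-count estimate could
treat all NULLs as exactly one group:

```python
def groups_without_null_tracking(input_rows, ndistinct_est):
    # Roughly today's behaviour: the group-count estimate is the
    # column's distinct estimate, clamped by the input row count;
    # NULL rows are not special-cased.
    return min(ndistinct_est, input_rows)

def groups_with_null_tracking(nonnull_rows, nonnull_ndistinct, null_rows):
    # Hypothetical vardata_extra idea: NULL input rows collapse into
    # a single extra group, however many there are.
    groups = min(nonnull_ndistinct, nonnull_rows)
    if null_rows > 0:
        groups += 1
    return groups

# Numbers mirroring the plan above: 49 distinct non-NULL values pass
# the filter (x < 50), plus 100 NULL rows; the true group count is 50.
print(groups_with_null_tracking(49, 49, 100))  # 50
```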

--
regards, Andrei Lepikhov

#5 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrei Lepikhov (#4)
1 attachment(s)
Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

On Thu, Nov 28, 2024 at 4:39 AM Andrei Lepikhov <lepihov@gmail.com> wrote:

Thanks for taking a look!

On 11/25/24 23:45, Heikki Linnakangas wrote:

On 24/09/2024 08:08, Andrei Lepikhov wrote:

+     * proves the var is unique for this query.  However, we'd better still
+     * believe the null-fraction statistic.
*/
if (vardata->isunique)
stadistinct = -1.0 * (1.0 - stanullfrac);

I wonder about the "we'd better still believe the null-fraction
statistic" part. It makes sense for a unique index, but a DISTINCT or
GROUP BY collapses all the NULLs to a single row. So I think there's
some more work to be done here.

IMO, in that particular case it is not an issue: with GROUP BY, we set the
vardata->isunique field and do not recurse into the Var's statistics inside
the subquery, just as DISTINCT already does. So we have stanullfrac == 0,
which means the optimiser does not count the number of NULLs. In the case of
a UNIQUE index, the optimiser will have the stanullfrac statistic and will
count NULLs.

This sounds convincing.

But your question raises another one: could we add some vardata_extra to a
node, which could track specific conditions and let upper nodes consider
them along with the Var statistics?
For example, we could keep the 'unique set of columns' knowledge in such a
structure for the Aggregate node. It could also be a solution to the problem
of counting NULLs generated by the RHS of OUTER JOINs in the query tree.
What's more, look at the query:

CREATE TABLE gu_2 (x real);
INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
VACUUM ANALYZE gu_2;

HashAggregate  (cost=20.91..22.35 rows=144 width=4) (actual rows=50 loops=1)
  Group Key: gu_2.x
  Batches: 1  Memory Usage: 40kB
  ->  HashAggregate  (cost=19.11..20.55 rows=144 width=4) (actual rows=50 loops=1)
        Group Key: gu_2.x
        Batches: 1  Memory Usage: 40kB
        ->  Seq Scan on gu_2  (cost=0.00..18.75 rows=145 width=4) (actual rows=149 loops=1)
              Filter: ((x < '50'::double precision) OR (x IS NULL))
              Rows Removed by Filter: 951

Here we could also count the number of scanned NULLs separately in
vardata_extra and use it in the upper GROUP BY estimation.

What would the type of vardata_extra be? And what information would it
store? It still seems too sketchy for me to understand.

But I think for now we should go with the original patch. It is a quite
straightforward extension of what 4767bc8ff2 does. I've revised the commit
message and applied pgindent to the sources. I'm going to push this if there
are no objections.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v2-0001-Improve-statistics-estimation-for-single-column-G.patch (application/octet-stream)
From e7358483edd981e2b32416c7bc4d7899d8e44437 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Wed, 18 Sep 2024 19:25:02 +0200
Subject: [PATCH v2] Improve statistics estimation for single-column GROUP BY
 in sub-queries

This commit follows the idea of the 4767bc8ff2.  If sub-query has only one
GROUP BY column, we can consider its output variable as being unique. We can
employ this fact in the statistics to make more precise estimations in the
upper query block.

Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
---
 src/backend/utils/adt/selfuncs.c | 49 +++++++++++++++++++-------------
 src/include/utils/selfuncs.h     |  3 +-
 2 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d3d1e485bb2..40712ab718c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -322,10 +322,10 @@ var_eq_const(VariableStatData *vardata, Oid oproid, Oid collation,
 	}
 
 	/*
-	 * If we matched the var to a unique index or DISTINCT clause, assume
-	 * there is exactly one match regardless of anything else.  (This is
-	 * slightly bogus, since the index or clause's equality operator might be
-	 * different from ours, but it's much more likely to be right than
+	 * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+	 * assume there is exactly one match regardless of anything else.  (This
+	 * is slightly bogus, since the index or clause's equality operator might
+	 * be different from ours, but it's much more likely to be right than
 	 * ignoring the information.)
 	 */
 	if (vardata->isunique && vardata->rel && vardata->rel->tuples >= 1.0)
@@ -484,10 +484,10 @@ var_eq_non_const(VariableStatData *vardata, Oid oproid, Oid collation,
 	}
 
 	/*
-	 * If we matched the var to a unique index or DISTINCT clause, assume
-	 * there is exactly one match regardless of anything else.  (This is
-	 * slightly bogus, since the index or clause's equality operator might be
-	 * different from ours, but it's much more likely to be right than
+	 * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+	 * assume there is exactly one match regardless of anything else.  (This
+	 * is slightly bogus, since the index or clause's equality operator might
+	 * be different from ours, but it's much more likely to be right than
 	 * ignoring the information.)
 	 */
 	if (vardata->isunique && vardata->rel && vardata->rel->tuples >= 1.0)
@@ -5018,9 +5018,9 @@ ReleaseDummy(HeapTuple tuple)
  *	atttype, atttypmod: actual type/typmod of the "var" expression.  This is
  *		commonly the same as the exposed type of the variable argument,
  *		but can be different in binary-compatible-type cases.
- *	isunique: true if we were able to match the var to a unique index or a
- *		single-column DISTINCT clause, implying its values are unique for
- *		this query.  (Caution: this should be trusted for statistical
+ *	isunique: true if we were able to match the var to a unique index, a
+ *		single-column DISTINCT or GROUP-BY clause, implying its values are
+ *		unique for this query.  (Caution: this should be trusted for statistical
  *		purposes only, since we do not check indimmediate nor verify that
  *		the exact same definition of equality applies.)
  *	acl_ok: true if current user has permission to read the column(s)
@@ -5680,15 +5680,14 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 		Assert(IsA(subquery, Query));
 
 		/*
-		 * Punt if subquery uses set operations or GROUP BY, as these will
-		 * mash underlying columns' stats beyond recognition.  (Set ops are
-		 * particularly nasty; if we forged ahead, we would return stats
+		 * Punt if subquery uses set operations or grouping sets, as these
+		 * will mash underlying columns' stats beyond recognition.  (Set ops
+		 * are particularly nasty; if we forged ahead, we would return stats
 		 * relevant to only the leftmost subselect...)	DISTINCT is also
 		 * problematic, but we check that later because there is a possibility
 		 * of learning something even with it.
 		 */
 		if (subquery->setOperations ||
-			subquery->groupClause ||
 			subquery->groupingSets)
 			return;
 
@@ -5718,6 +5717,16 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 			return;
 		}
 
+		/* The same idea as with DISTINCT clause works for a GROUP-BY too */
+		if (subquery->groupClause)
+		{
+			if (list_length(subquery->groupClause) == 1 &&
+				targetIsInSortList(ste, InvalidOid, subquery->groupClause))
+				vardata->isunique = true;
+			/* cannot go further */
+			return;
+		}
+
 		/*
 		 * If the sub-query originated from a view with the security_barrier
 		 * attribute, we must not look at the variable's statistics, though it
@@ -5869,11 +5878,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 	}
 
 	/*
-	 * If there is a unique index or DISTINCT clause for the variable, assume
-	 * it is unique no matter what pg_statistic says; the statistics could be
-	 * out of date, or we might have found a partial unique index that proves
-	 * the var is unique for this query.  However, we'd better still believe
-	 * the null-fraction statistic.
+	 * If there is a unique index, DISTINCT or GROUP-BY clause for the
+	 * variable, assume it is unique no matter what pg_statistic says; the
+	 * statistics could be out of date, or we might have found a partial
+	 * unique index that proves the var is unique for this query.  However,
+	 * we'd better still believe the null-fraction statistic.
 	 */
 	if (vardata->isunique)
 		stadistinct = -1.0 * (1.0 - stanullfrac);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index b5e95c0ff61..d35939651f3 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -94,7 +94,8 @@ typedef struct VariableStatData
 	Oid			vartype;		/* exposed type of expression */
 	Oid			atttype;		/* actual type (after stripping relabel) */
 	int32		atttypmod;		/* actual typmod (after stripping relabel) */
-	bool		isunique;		/* matches unique index or DISTINCT clause */
+	bool		isunique;		/* matches unique index, DISTINCT or GROUP-BY
+								 * clause */
 	bool		acl_ok;			/* result of ACL check on table or column */
 } VariableStatData;
 
-- 
2.39.5 (Apple Git-154)

#6 Andrei Lepikhov
lepihov@gmail.com
In reply to: Alexander Korotkov (#5)
1 attachment(s)
Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

On 17/2/2025 02:06, Alexander Korotkov wrote:

On Thu, Nov 28, 2024 at 4:39 AM Andrei Lepikhov <lepihov@gmail.com> wrote:

Here we could also count the number of scanned NULLs separately in
vardata_extra and use it in the upper GROUP BY estimation.

What could be the type of vardata_extra? And what information could
it store? It still seems too sketchy for me to understand.

It is indeed sketchy. Our estimation routines have no information
about intermediate modifications of the data; NULLs generated by a
left join are a good example. So, my vague idea is to maintain that
info and adjust the statistical estimations accordingly.
Of course, that is out of scope here.

But, I think for now we should go with the original patch. It seems
to be a quite straightforward extension of what 4767bc8ff2 does. I've
revised the commit message and applied pgindent to the sources. I'm
going to push this if there are no objections.

Ok, I added one regression test to check that the feature works properly.

--
regards, Andrei Lepikhov

Attachments:

v3-0001-Improve-statistics-estimation-for-single-column-G.patch (text/plain)
From d2e70f544b077de4f5b85838d6071ab0243460ce Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Wed, 18 Sep 2024 19:25:02 +0200
Subject: [PATCH v3] Improve statistics estimation for single-column GROUP BY
 in sub-queries

This commit follows the idea of commit 4767bc8ff2.  If a sub-query has only
one GROUP BY column, we can consider its output variable as being unique.  We
can employ this fact in the statistics to make more precise estimations in the
upper query block.

Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
---
 src/backend/utils/adt/selfuncs.c        | 49 +++++++++++++++----------
 src/include/utils/selfuncs.h            |  3 +-
 src/test/regress/expected/stats_ext.out | 17 +++++++++
 src/test/regress/sql/stats_ext.sql      | 14 +++++++
 4 files changed, 62 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d3d1e485bb..40712ab718 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -322,10 +322,10 @@ var_eq_const(VariableStatData *vardata, Oid oproid, Oid collation,
 	}
 
 	/*
-	 * If we matched the var to a unique index or DISTINCT clause, assume
-	 * there is exactly one match regardless of anything else.  (This is
-	 * slightly bogus, since the index or clause's equality operator might be
-	 * different from ours, but it's much more likely to be right than
+	 * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+	 * assume there is exactly one match regardless of anything else.  (This
+	 * is slightly bogus, since the index or clause's equality operator might
+	 * be different from ours, but it's much more likely to be right than
 	 * ignoring the information.)
 	 */
 	if (vardata->isunique && vardata->rel && vardata->rel->tuples >= 1.0)
@@ -484,10 +484,10 @@ var_eq_non_const(VariableStatData *vardata, Oid oproid, Oid collation,
 	}
 
 	/*
-	 * If we matched the var to a unique index or DISTINCT clause, assume
-	 * there is exactly one match regardless of anything else.  (This is
-	 * slightly bogus, since the index or clause's equality operator might be
-	 * different from ours, but it's much more likely to be right than
+	 * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+	 * assume there is exactly one match regardless of anything else.  (This
+	 * is slightly bogus, since the index or clause's equality operator might
+	 * be different from ours, but it's much more likely to be right than
 	 * ignoring the information.)
 	 */
 	if (vardata->isunique && vardata->rel && vardata->rel->tuples >= 1.0)
@@ -5018,9 +5018,9 @@ ReleaseDummy(HeapTuple tuple)
  *	atttype, atttypmod: actual type/typmod of the "var" expression.  This is
  *		commonly the same as the exposed type of the variable argument,
  *		but can be different in binary-compatible-type cases.
- *	isunique: true if we were able to match the var to a unique index or a
- *		single-column DISTINCT clause, implying its values are unique for
- *		this query.  (Caution: this should be trusted for statistical
+ *	isunique: true if we were able to match the var to a unique index, a
+ *		single-column DISTINCT or GROUP-BY clause, implying its values are
+ *		unique for this query.  (Caution: this should be trusted for statistical
  *		purposes only, since we do not check indimmediate nor verify that
  *		the exact same definition of equality applies.)
  *	acl_ok: true if current user has permission to read the column(s)
@@ -5680,15 +5680,14 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 		Assert(IsA(subquery, Query));
 
 		/*
-		 * Punt if subquery uses set operations or GROUP BY, as these will
-		 * mash underlying columns' stats beyond recognition.  (Set ops are
-		 * particularly nasty; if we forged ahead, we would return stats
+		 * Punt if subquery uses set operations or grouping sets, as these
+		 * will mash underlying columns' stats beyond recognition.  (Set ops
+		 * are particularly nasty; if we forged ahead, we would return stats
 		 * relevant to only the leftmost subselect...)	DISTINCT is also
 		 * problematic, but we check that later because there is a possibility
 		 * of learning something even with it.
 		 */
 		if (subquery->setOperations ||
-			subquery->groupClause ||
 			subquery->groupingSets)
 			return;
 
@@ -5718,6 +5717,16 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 			return;
 		}
 
+		/* The same idea as with the DISTINCT clause works for GROUP BY too */
+		if (subquery->groupClause)
+		{
+			if (list_length(subquery->groupClause) == 1 &&
+				targetIsInSortList(ste, InvalidOid, subquery->groupClause))
+				vardata->isunique = true;
+			/* cannot go further */
+			return;
+		}
+
 		/*
 		 * If the sub-query originated from a view with the security_barrier
 		 * attribute, we must not look at the variable's statistics, though it
@@ -5869,11 +5878,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 	}
 
 	/*
-	 * If there is a unique index or DISTINCT clause for the variable, assume
-	 * it is unique no matter what pg_statistic says; the statistics could be
-	 * out of date, or we might have found a partial unique index that proves
-	 * the var is unique for this query.  However, we'd better still believe
-	 * the null-fraction statistic.
+	 * If there is a unique index, DISTINCT or GROUP-BY clause for the
+	 * variable, assume it is unique no matter what pg_statistic says; the
+	 * statistics could be out of date, or we might have found a partial
+	 * unique index that proves the var is unique for this query.  However,
+	 * we'd better still believe the null-fraction statistic.
 	 */
 	if (vardata->isunique)
 		stadistinct = -1.0 * (1.0 - stanullfrac);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index b5e95c0ff6..d35939651f 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -94,7 +94,8 @@ typedef struct VariableStatData
 	Oid			vartype;		/* exposed type of expression */
 	Oid			atttype;		/* actual type (after stripping relabel) */
 	int32		atttypmod;		/* actual typmod (after stripping relabel) */
-	bool		isunique;		/* matches unique index or DISTINCT clause */
+	bool		isunique;		/* matches unique index, DISTINCT or GROUP-BY
+								 * clause */
 	bool		acl_ok;			/* result of ACL check on table or column */
 } VariableStatData;
 
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 9a820404d3..7125665a94 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -3368,3 +3368,20 @@ NOTICE:  drop cascades to 2 other objects
 DETAIL:  drop cascades to table tststats.priv_test_tbl
 drop cascades to view tststats.priv_test_view
 DROP USER regress_stats_user1;
+CREATE TABLE grouping_unique_1 (x integer);
+CREATE TABLE grouping_unique_2 (x integer);
+INSERT INTO grouping_unique_1 (x) VALUES (1);
+INSERT INTO grouping_unique_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
+ANALYZE grouping_unique_1,grouping_unique_2;
+-- Optimiser treats the GROUP-BY operator as a 'uniqueiser' of the input
+SELECT * FROM check_estimated_rows('
+  SELECT * FROM grouping_unique_1 t1 LEFT JOIN (
+    SELECT x FROM grouping_unique_2 t2 GROUP BY x) AS q1
+  ON t1.x=q1.x;
+');
+ estimated | actual 
+-----------+--------
+         1 |      1
+(1 row)
+
+DROP TABLE grouping_unique_1, grouping_unique_2;
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 75b04e5a13..ca327f69b3 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -1707,3 +1707,17 @@ RESET SESSION AUTHORIZATION;
 DROP TABLE stats_ext_tbl;
 DROP SCHEMA tststats CASCADE;
 DROP USER regress_stats_user1;
+
+CREATE TABLE grouping_unique_1 (x integer);
+CREATE TABLE grouping_unique_2 (x integer);
+INSERT INTO grouping_unique_1 (x) VALUES (1);
+INSERT INTO grouping_unique_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
+ANALYZE grouping_unique_1,grouping_unique_2;
+
+-- Optimiser treats the GROUP-BY operator as a 'uniqueiser' of the input
+SELECT * FROM check_estimated_rows('
+  SELECT * FROM grouping_unique_1 t1 LEFT JOIN (
+    SELECT x FROM grouping_unique_2 t2 GROUP BY x) AS q1
+  ON t1.x=q1.x;
+');
+DROP TABLE grouping_unique_1, grouping_unique_2;
-- 
2.48.1

#7Vlada Pogozhelskaya
v.pogozhelskaya@postgrespro.ru
In reply to: Alexander Korotkov (#5)
1 attachment(s)
Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

Hi all,

Following the discussion on improving statistics estimation by
considering GROUP BY as a unique constraint, I’ve prepared a patch that
integrates GROUP BY into cardinality estimation in a similar way to
DISTINCT.

This patch ensures that when a subquery contains a GROUP BY clause, the
optimizer recognizes the grouped columns as unique. The logic follows a
straightforward approach, comparing the GROUP BY columns with the target
list to determine uniqueness.

I’d appreciate any feedback or suggestions for further improvements.

---
regards,
Vlada Pogozhelskaya

On 17.02.2025 08:06, Alexander Korotkov wrote:

On Thu, Nov 28, 2024 at 4:39 AM Andrei Lepikhov<lepihov@gmail.com> wrote:

Thanks to take a look!

On 11/25/24 23:45, Heikki Linnakangas wrote:

On 24/09/2024 08:08, Andrei Lepikhov wrote:

+     * proves the var is unique for this query.  However, we'd better still
+     * believe the null-fraction statistic.
      */
     if (vardata->isunique)
         stadistinct = -1.0 * (1.0 - stanullfrac);

I wonder about the "we'd better still believe the null-fraction
statistic" part. It makes sense for a unique index, but a DISTINCT or
GROUP BY collapses all the NULLs to a single row. So I think there's
some more work to be done here.

IMO, in that particular case it is not an issue: with GROUP BY we set
the vardata->isunique field and disallow recursing into the Var
statistics inside the subquery - just as DISTINCT already does. So we
have stanullfrac == 0, which means the optimiser doesn't count the
number of NULLs. In the case of a UNIQUE index, the optimiser will
have the stanullfrac statistic and will count NULLs.
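
The distinction can be checked directly in the pg_stats view; a minimal sketch (table name illustrative, and the reported values are approximate):

```sql
CREATE TABLE uniq_demo (id int UNIQUE);
INSERT INTO uniq_demo SELECT g FROM generate_series(1, 900) g;
INSERT INTO uniq_demo SELECT NULL::int FROM generate_series(1, 100);
ANALYZE uniq_demo;

-- For the unique-index case the planner sees null_frac ~ 0.1 and can
-- apply stadistinct = -1.0 * (1.0 - stanullfrac); a grouped sub-query
-- output, by contrast, has already collapsed its NULLs to one row.
SELECT null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'uniq_demo' AND attname = 'id';
```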

This sounds convincing.

But your question raised one another. May we add to a node some
vardata_extra, which could count specific conditions, and let upper
nodes consider it using the Var statistic?
For example, we can separate the 'unique set of columns' knowledge in
such a structure for the Aggregate node. Also, it could be a solution to
problem of counting nulls, generated by RHS of OUTER JOINs in query tree.
What's more, look at the query:

CREATE TABLE gu_2 (x real);
INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
VACUUM ANALYZE gu_2;

HashAggregate  (cost=20.91..22.35 rows=144 width=4) (actual rows=50 loops=1)
  Group Key: gu_2.x
  Batches: 1  Memory Usage: 40kB
  ->  HashAggregate  (cost=19.11..20.55 rows=144 width=4) (actual rows=50 loops=1)
        Group Key: gu_2.x
        Batches: 1  Memory Usage: 40kB
        ->  Seq Scan on gu_2  (cost=0.00..18.75 rows=145 width=4) (actual rows=149 loops=1)
              Filter: ((x < '50'::double precision) OR (x IS NULL))
              Rows Removed by Filter: 951

Here we also could count number of scanned NULLs separately in
vardata_extra and use it in upper GROUP-BY estimation.

What could be the type of vardata_extra? And what information could
it store? Yet seems too sketchy for me to understand.

But, I think for now we should go with the original patch. It seems
to be a quite straightforward extension of what 4767bc8ff2 does. I've
revised the commit message and applied pgindent to the sources. I'm
going to push this if there are no objections.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v3-0001-Improve-statistics-estimation-for-GROUP-BY.patch (text/plain)
From 7e8a7c74c4e7312c38ccb7b9c4ceb8c129d411c1 Mon Sep 17 00:00:00 2001
From: Vlada Pogozhelskaya <pogozhelskaya@gmail.com>
Date: Fri, 31 Jan 2025 11:16:28 +0700
Subject: [PATCH] Use uniqueness from group by for statistics
 estimation

Tags: optimizer
---
 src/backend/utils/adt/selfuncs.c | 38 +++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index f48d61010c5..e85409c1ff8 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5803,7 +5803,7 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 		 * of learning something even with it.
 		 */
 		if (subquery->setOperations ||
-			subquery->groupClause ||
+			// subquery->groupClause ||
 			subquery->groupingSets)
 			return;
 
@@ -5839,6 +5839,42 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 				 rte->eref->aliasname, var->varattno);
 		var = (Var *) ste->expr;
 
+	if (subquery->groupClause)
+	{
+		List *groupVars = NIL;
+		List *targetVars = NIL;
+		ListCell *lc;
+
+		/* Collect unique expressions from GROUP BY */
+		foreach (lc, subquery->groupClause)
+		{
+			SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+			TargetEntry *tle = get_sortgroupref_tle(sgc->tleSortGroupRef, subquery->targetList);
+
+			if (tle && tle->expr)
+				groupVars = list_append_unique(groupVars, tle->expr);
+		}
+
+		/* Collect unique expressions from the target list */
+		foreach (lc, subquery->targetList)
+		{
+			TargetEntry *targetTle = (TargetEntry *) lfirst(lc);
+			if (targetTle && targetTle->expr)
+				targetVars = list_append_unique(targetVars, targetTle->expr);
+		}
+
+		if (equal(groupVars, targetVars))
+		{
+			vardata->isunique = true;
+		}
+
+		list_free(groupVars);
+		list_free(targetVars);
+	}
+
+
+
+
 		/*
 		 * If subquery uses DISTINCT, we can't make use of any stats for the
 		 * variable ... but, if it's the only DISTINCT column, we are entitled
-- 
2.46.0

#8Alexander Korotkov
aekorotkov@gmail.com
In reply to: Vlada Pogozhelskaya (#7)
Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

Hi, Vlada.

On Tue, Feb 18, 2025 at 6:56 PM Vlada Pogozhelskaya
<v.pogozhelskaya@postgrespro.ru> wrote:

Following the discussion on improving statistics estimation by considering GROUP BY as a unique constraint, I’ve prepared a patch that integrates GROUP BY into cardinality estimation in a similar way to DISTINCT.

This patch ensures that when a subquery contains a GROUP BY clause, the optimizer recognizes the grouped columns as unique. The logic follows a straightforward approach, comparing the GROUP BY columns with the target list to determine uniqueness.

I’d appreciate any feedback or suggestions for further improvements.

Thank you for your patch, but your message lacks an explanation of
your approach and of how it differs from the previously published
patches on this thread.  As I understand the code, you check whether
the GROUP BY clauses are the same as the target list, and if so you
assume every column to be unique.  But it just doesn't work this way:
if values are unique over some set of columns, individual columns
might still have repeats.  See the example.

# select x, y from generate_series(1,3) x, generate_series(1, 3) y
group by x, y;
 x | y
---+---
 3 | 2
 2 | 2
 3 | 1
 2 | 1
 1 | 3
 1 | 2
 1 | 1
 2 | 3
 3 | 3
(9 rows)

x and y are unique here as a pair. But individual x and y values have repeats.

------
Regards,
Alexander Korotkov
Supabase

#9Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrei Lepikhov (#6)
Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

On Tue, Feb 18, 2025 at 2:52 PM Andrei Lepikhov <lepihov@gmail.com> wrote:

On 17/2/2025 02:06, Alexander Korotkov wrote:

On Thu, Nov 28, 2024 at 4:39 AM Andrei Lepikhov <lepihov@gmail.com> wrote:

Here we could also count the number of scanned NULLs separately in
vardata_extra and use it in the upper GROUP-BY estimation.

What could be the type of vardata_extra? And what information could
it store? It still seems too sketchy for me to understand.

It is indeed sketchy. Our estimation routines have no information
about intermediate modifications of the data; NULLs generated by a
left join are a good example. So, my vague idea is to maintain that
info and adjust the statistical estimations accordingly.
Of course, that is out of scope here.

But, I think for now we should go with the original patch. It seems
to be a quite straightforward extension of what 4767bc8ff2 does. I've
revised the commit message and applied pgindent to the sources. I'm
going to push this if there are no objections.

Ok, I added one regression test to check that the feature works properly.

Andrei, thank you. I've pushed the patch applying some simplification
of regression test.

------
Regards,
Alexander Korotkov
Supabase