Removing Functionally Dependent GROUP BY Columns

Started by David Rowley, about 10 years ago, 30 messages
#1 David Rowley
david.rowley@2ndquadrant.com
1 attachment(s)

We already allow a SELECT's target list to contain non-aggregated columns
in a GROUP BY query in cases where the non-aggregated column is
functionally dependent on the GROUP BY clause.

For example a query such as;

SELECT p.product_id,p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;

is perfectly fine in PostgreSQL, as p.description is functionally dependent
on p.product_id (assuming product_id is the PRIMARY KEY of product).

It seems that there's no shortage of relational databases in existence
today which don't support this. These databases would require the GROUP BY
clause to include the p.description column too.

It seems rather unfortunate that people who migrate applications to
PostgreSQL may not be aware that we support this: currently, if a query
needlessly includes the p.description column, PostgreSQL naively includes
it while grouping. Such people could well be incurring a performance
penalty because our planner does not remove the useless items from the
list. If the primary key is present, including any other columns cannot
split the groups any further, so all other columns from the *same relation*
can simply be removed from the GROUP BY clause.
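To illustrate with the product/sale example above, the transformation would
effectively turn the needlessly wide query

SELECT p.product_id, p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id, p.description;

into the cheaper form which groups only on the key:

SELECT p.product_id, p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;

Both produce the same groups, since two rows sharing a product_id cannot
differ in description.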

There are in fact also two queries in TPC-H (Q10 and Q18) which are written
to include all of the non-aggregated columns in the GROUP BY list. During a
recent test I witnessed a 50% gain in performance in Q10 by removing the
unneeded columns from the GROUP BY clause.

I've attached a patch which implements this in PostgreSQL.

The patch may need a little more work in order to adjust the targetlist's
tleSortGroupRefs to remove invalid ones and perhaps also remove the gaps.

I'm posting this now so that I can gauge the community interest in this.

Is it something that we'd like to have in PostgreSQL?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

prune_group_by_clause_2027f512_2015-12-01.patch (application/octet-stream)
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index 17f5bce..2bd0df9 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -871,6 +871,128 @@ get_domain_constraint_oid(Oid typid, const char *conname, bool missing_ok)
 }
 
 /*
+ * identify_key_vars
+ *		Returns a Bitmapset with bits set to mark the 0-based index in varlist
+ *		where PRIMARY KEY Vars were found in the list. All columns of the
+ *		primary key constraint must be present in varlist.
+ *
+ *	On successful identification of primary key columns 'constraintOid' is set
+ *	to the OID of the primary key constraint. On failure to match all primary
+ *	key columns, NULL is returned and 'constraintOid' is left unchanged.
+ */
+Bitmapset *
+identify_key_vars(Oid relid, Index varno, Index varlevelsup, List *varlist,
+				  Oid *constraintOid)
+{
+	Relation	pg_constraint;
+	HeapTuple	tuple;
+	SysScanDesc scan;
+	ScanKeyData skey[1];
+	Bitmapset	*varlistmatches;
+
+	/* Scan pg_constraint for constraints of the target rel */
+	pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_constraint_conrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+
+	scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
+							  NULL, 1, skey);
+
+	varlistmatches = NULL;
+
+	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+	{
+		Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
+		Datum		adatum;
+		bool		isNull;
+		ArrayType  *arr;
+		int16	   *attnums;
+		int			numkeys;
+		int			i;
+
+		/* Only PK constraints are of interest for now, see comment above */
+		if (con->contype != CONSTRAINT_PRIMARY)
+			continue;
+
+		/*
+		 * Exit if the constraint is deferrable, there's no point in looking
+		 * for another PK constriant, as there can only be one.
+		 */
+		if (con->condeferrable)
+			break;
+
+		/* Extract the conkey array, ie, attnums of PK's columns */
+		adatum = heap_getattr(tuple, Anum_pg_constraint_conkey,
+							  RelationGetDescr(pg_constraint), &isNull);
+		if (isNull)
+			elog(ERROR, "null conkey for constraint %u",
+				 HeapTupleGetOid(tuple));
+		arr = DatumGetArrayTypeP(adatum);		/* ensure not toasted */
+		numkeys = ARR_DIMS(arr)[0];
+		if (ARR_NDIM(arr) != 1 ||
+			numkeys < 0 ||
+			ARR_HASNULL(arr) ||
+			ARR_ELEMTYPE(arr) != INT2OID)
+			elog(ERROR, "conkey is not a 1-D smallint array");
+		attnums = (int16 *) ARR_DATA_PTR(arr);
+
+		for (i = 0; i < numkeys; i++)
+		{
+			AttrNumber	attnum = attnums[i];
+			ListCell   *lc;
+			bool found_col = false;
+			int varlistidx = 0;
+
+			foreach(lc, varlist)
+			{
+				Var		   *gvar = (Var *) lfirst(lc);
+
+				if (IsA(gvar, Var) &&
+					gvar->varno == varno &&
+					gvar->varlevelsup == varlevelsup &&
+					gvar->varattno == attnum)
+				{
+					varlistmatches = bms_add_member(varlistmatches, varlistidx);
+					found_col = true;
+					/* don't break, there might be dupicates */
+				}
+				varlistidx++;
+			}
+
+			if (!found_col)
+			{
+				/*
+				 * No match found for this index key. Since the relation has
+				 * only one PRIMARY KEY we can skip looking through any other
+				 * constraints.
+				 */
+				varlistmatches = NULL;
+				break;
+			}
+		}
+
+		/*
+		 * Did we match all PRIMARY KEY Vars? If so we can return this match.
+		 * There's no need to look for a better match as a relation can only
+		 * have one PRIMARY KEY constraint.
+		 */
+		if (i == numkeys)
+			*constraintOid = HeapTupleGetOid(tuple);
+
+		break;
+	}
+
+	systable_endscan(scan);
+
+	heap_close(pg_constraint, AccessShareLock);
+
+	return varlistmatches;
+}
+
+/*
  * Determine whether a relation can be proven functionally dependent on
  * a set of grouping columns.  If so, return TRUE and add the pg_constraint
  * OIDs of the constraints needed for the proof to the *constraintDeps list.
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index a9cccee..9be0dbe 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -21,6 +21,7 @@
 #include "access/htup_details.h"
 #include "access/parallel.h"
 #include "access/xact.h"
+#include "catalog/pg_constraint.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
 #include "foreign/fdwapi.h"
@@ -3147,6 +3148,152 @@ limit_needed(Query *parse)
 	return false;				/* don't need a Limit plan node */
 }
 
+/*
+ * remove_useless_groupby_columns
+ *		Removes columns from the GROUP BY clause which are redundant due to
+ *		being functionally dependent on one or more other columns which are
+ *		present.
+ *
+ * We do support non-aggregated columns in the query targetlist which are
+ * functionally dependent on the primary key, but not all relational databases
+ * allow this, so it can be fairly common that queries are written to have the
+ * GROUP BY clause include all of the columns which are not aggregated in the
+ * query's targetlist. These columns have no need to be there as long as the
+ * columns which make up the table's PRIMARY KEY are present in the GROUP BY
+ * clause. Here we test to see if the primary key columns are present for each
+ * relation which make up the columns of the GROUP BY clause, and if so we
+ * remove any which are not part of the primary key.
+ *
+ * In reality we could do more than just the primary key columns here as UNIQUE
+ * indexes with NOT NULL constraints on the columns allow us to also determine
+ * functional dependencies. However we mustn't go any further than what's done
+ * by check_functional_grouping().
+ */
+static void
+remove_useless_groupby_columns(PlannerInfo *root)
+{
+	List		  **relvarlist;
+	Bitmapset	  **surplusvars;
+	ListCell	   *lc;
+	int				relid;
+	bool			anysurplus = false;
+	Query		   *parse = root->parse;
+
+	if (parse->groupingSets)
+		return;
+
+	/* Nothing to do when there's only 1 group by expression */
+	if (list_length(parse->groupClause) < 2)
+		return;
+
+	relvarlist = (List **) palloc0(sizeof(List *) * (list_length(parse->rtable) + 1));
+	surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) * (list_length(parse->rtable) + 1));
+
+	/*
+	 * Pre-process the GROUP BY clause Vars from each relation into
+	 * 'relvarlist' indexed by the relid
+	 */
+	foreach (lc, parse->groupClause)
+	{
+		SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+		TargetEntry *tle;
+		Var *var;
+
+		tle = get_sortgroupclause_tle(sgc, parse->targetList);
+		var = (Var *) tle->expr;
+
+		if (!IsA(var, Var))
+			continue;
+
+		Assert(var->varno <= list_length(parse->rtable));
+		relvarlist[var->varno] = lappend(relvarlist[var->varno], var);
+	}
+
+	relid = 1;
+	foreach(lc, parse->rtable)
+	{
+		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
+		List *varlist = relvarlist[relid];
+		Bitmapset *varlistmatches;
+		Oid		  constraintOid;
+		ListCell *lc2;
+
+		/*
+		 * Skip if there were no Vars found in the GROUP BY clause which belong
+		 * to this relation. We can also skip if there's only 1 Var, as there
+		 * needs to be more than one Var for there to be any columns which are
+		 * functionally dependent on another column.
+		 */
+		if (varlist == NULL || list_length(varlist) < 2)
+			continue;
+
+		varlistmatches = identify_key_vars(rte->relid, relid, 0, varlist,
+										   &constraintOid);
+
+		/*
+		 * If we have some varlist matches and varlist has additional columns,
+		 * then each of these additional column may be removed from the
+		 * GROUP BY clause. We'll mark the varattno of each variable that's not
+		 * needed.
+		 */
+		if (varlistmatches != NULL &&
+			bms_num_members(varlistmatches) < list_length(varlist))
+		{
+			int varidx = 0;
+			foreach(lc2, varlist)
+			{
+				Var *var = (Var *) lfirst(lc2);
+
+				if (!bms_is_member(varidx, varlistmatches))
+				{
+					surplusvars[relid] = bms_add_member(surplusvars[relid],
+														var->varattno);
+				}
+
+				varidx++;
+			}
+
+			/*
+			 * Mark that the GROUP BY clause needs to be rebuilt. We'll do this
+			 * just once after we've completely processed any other relation's
+			 * columns which are present in the GROUP BY clause.
+			 */
+			anysurplus = true;
+			parse->constraintDeps = lappend_oid(parse->constraintDeps,
+												constraintOid);
+		}
+		relid++;
+	}
+
+	/*
+	 * If we found any surplus Vars in the GROUP BY clause, then we'll build
+	 * a new GROUP BY clause without these surplus Vars.
+	 */
+	if (anysurplus)
+	{
+		List *new_groupby = NIL;
+
+		foreach (lc, root->parse->groupClause)
+		{
+			SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+			TargetEntry *tle;
+			Var *var;
+
+			tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+			var = (Var *) tle->expr;
+
+			if (!IsA(var, Var))
+				continue;
+
+			if (!bms_is_member(var->varattno, surplusvars[var->varno]))
+				new_groupby = lappend(new_groupby, sgc);
+
+			/* XXX do we to alter tleSortGroupRef to remove gaps? */
+		}
+		root->parse->groupClause = new_groupby;
+	}
+}
+
 
 /*
  * preprocess_groupclause - do preparatory work on GROUP BY clause
@@ -3192,6 +3339,12 @@ preprocess_groupclause(PlannerInfo *root, List *force)
 		return new_groupclause;
 	}
 
+	/*
+	 * Remove any GROUP BY clause items which are functionally dependent on
+	 * other clause items
+	 */
+	remove_useless_groupby_columns(root);
+
 	/* If no ORDER BY, nothing useful to do here */
 	if (parse->sortClause == NIL)
 		return parse->groupClause;
diff --git a/src/include/catalog/pg_constraint.h b/src/include/catalog/pg_constraint.h
index feafc35..087f046 100644
--- a/src/include/catalog/pg_constraint.h
+++ b/src/include/catalog/pg_constraint.h
@@ -250,6 +250,9 @@ extern void AlterConstraintNamespaces(Oid ownerId, Oid oldNspId,
 extern Oid	get_relation_constraint_oid(Oid relid, const char *conname, bool missing_ok);
 extern Oid	get_domain_constraint_oid(Oid typid, const char *conname, bool missing_ok);
 
+extern Bitmapset *identify_key_vars(Oid relid, Index varno, Index varlevelsup,
+									List *varlist, Oid *constraintOid);
+
 extern bool check_functional_grouping(Oid relid,
 						  Index varno, Index varlevelsup,
 						  List *grouping_columns,
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index f8a9f3f..48fda52 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3724,19 +3724,19 @@ select d.* from d left join (select distinct * from b) s
 -- not in the join condition
 explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
-  on d.a = s.id;
-                 QUERY PLAN                  
----------------------------------------------
+  on d.a = s.c_id;
+              QUERY PLAN               
+---------------------------------------
  Merge Left Join
-   Merge Cond: (d.a = s.id)
+   Merge Cond: (d.a = s.c_id)
    ->  Sort
          Sort Key: d.a
          ->  Seq Scan on d
    ->  Sort
-         Sort Key: s.id
+         Sort Key: s.c_id
          ->  Subquery Scan on s
                ->  HashAggregate
-                     Group Key: b.id, b.c_id
+                     Group Key: b.id
                      ->  Seq Scan on b
 (11 rows)
 
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 528e1ef..f1c8050 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1188,7 +1188,7 @@ select d.* from d left join (select distinct * from b) s
 -- not in the join condition
 explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
-  on d.a = s.id;
+  on d.a = s.c_id;
 
 -- similarly, but keying off a DISTINCT clause
 explain (costs off)
#2 Marko Tiikkaja
marko@joh.to
In reply to: David Rowley (#1)
Re: Removing Functionally Dependent GROUP BY Columns

On 2015-12-01 05:00, David Rowley wrote:

We already allow a SELECT's target list to contain non-aggregated columns
in a GROUP BY query in cases where the non-aggregated column is
functionally dependent on the GROUP BY clause.

For example a query such as;

SELECT p.product_id,p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;

is perfectly fine in PostgreSQL, as p.description is functionally dependent
on p.product_id (assuming product_id is the PRIMARY KEY of product).

This has come up before (on other forums, at least), and my main concern
has been that unlike the case where we go from throwing an error to
allowing a query, this has a chance to make the planning of currently
legal queries slower. Have you tried to measure the impact of this on
queries where there's no runtime gains to be had?

.m

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 David Rowley
david.rowley@2ndquadrant.com
In reply to: Marko Tiikkaja (#2)
2 attachment(s)
Re: Removing Functionally Dependent GROUP BY Columns

On 1 December 2015 at 17:09, Marko Tiikkaja <marko@joh.to> wrote:

On 2015-12-01 05:00, David Rowley wrote:

We already allow a SELECT's target list to contain non-aggregated columns
in a GROUP BY query in cases where the non-aggregated column is
functionally dependent on the GROUP BY clause.

For example a query such as;

SELECT p.product_id,p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;

is perfectly fine in PostgreSQL, as p.description is functionally
dependent
on p.product_id (assuming product_id is the PRIMARY KEY of product).

This has come up before (on other forums, at least), and my main concern
has been that unlike the case where we go from throwing an error to
allowing a query, this has a chance to make the planning of currently legal
queries slower. Have you tried to measure the impact of this on queries
where there's no runtime gains to be had?

I've performed a series of benchmarks on the following queries:

Test1: explain select id1,id2 from t1 group by id1,id2;
Test2: explain select id from t2 group by id;
Test3: explain select t1.id1,t1.id2 from t2 inner join t1 on t1.id1=t2.id
group by t1.id1,t1.id2;

I ran each of these with pgbench for 60 seconds, 3 runs per query. In each
case below I've converted the TPS into seconds using the average TPS over
the 3 runs.

In summary:

Test1 is the worst case test. It's a very simple query, so the planning
overhead of join searching is non-existent. The fact that there are 2
columns in the GROUP BY means that the fast path cannot be used. I added
this test because, if there's only 1 column in the GROUP BY, there's no
point in searching for something to remove.

Average (Sec)
Master 0.0001043117
Patched 0.0001118961
Performance 93.22%
Microseconds of planning overhead 7.5844326722

Test2 is a simple query with a GROUP BY which can fast path due to there
being only 1 GROUP BY column.

Average (Sec)
Master 0.000099374448
Patched 0.000099670124
Performance 99.70%
Microseconds of planning overhead 0.2956763193

Test3 is a slightly more complex and is aimed to show that the percentage
of planning overhead is smaller when joins exist and overall planning cost
becomes higher

Average (Sec)
Master 0.0001797165
Patched 0.0001798406
Performance 99.93%
Microseconds of planning overhead 0.1240776236

Test3 results seem a bit strange; I would have expected more of a slowdown.
I ran the test again to make sure, and it came back with the same results
the 2nd time.

I've attached the spreadsheet I used to collect the results, and also the
raw pgbench output.

It seems that the worst case test adds about 7.6 microseconds onto planning
time. To get this worst case result I had to add two GROUP BY columns, as
having only 1 triggers a fast path, since the code knows it can't remove
any columns when there's only 1. A similar fast path also exists which will
only look up the PRIMARY KEY details if there's more than 1 column per
relation in the GROUP BY, so for example GROUP BY rel1.col1, rel2.col1
won't look up any PRIMARY KEY constraint.
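For instance (with hypothetical table names), queries shaped like these
would take those fast paths and never scan pg_constraint at all:

SELECT col1 FROM rel1 GROUP BY col1;              -- only 1 GROUP BY column

SELECT rel1.col1, rel2.col1
FROM rel1, rel2
GROUP BY rel1.col1, rel2.col1;                    -- only 1 column per relation

whereas a query with 2 or more columns from the same relation, such as
GROUP BY rel1.col1, rel1.col2, is what triggers the PRIMARY KEY lookup.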

Given that the extra code really only does anything when the GROUP BY has 2
or more expressions, are you worried that this will negatively affect too
many short, fast-to-execute queries?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

raw_pgbench_output.txt (text/plain; charset=US-ASCII)
Benchmark.ods (application/vnd.oasis.opendocument.spreadsheet)
#4 Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#1)
Re: Removing Functionally Dependent GROUP BY Columns

On Mon, Nov 30, 2015 at 11:00 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

There are in fact also two queries in TPC-H (Q10 and Q18) which are written
to include all of the non-aggregated columns in the GROUP BY list. During a
recent test I witnessed a 50% gain in performance in Q10 by removing the
unneeded columns from the GROUP BY clause.

Wow, that's pretty impressive. +1 for trying to do something about this.

(Full disclosure: I have not read your patch.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#5 Peter Eisentraut
peter_e@gmx.net
In reply to: David Rowley (#1)
Re: Removing Functionally Dependent GROUP BY Columns

On 11/30/15 11:00 PM, David Rowley wrote:

It seems that there's no shortage of relational databases in existence
today which don't support this. These databases would require the GROUP
BY clause to include the p.description column too.

Well, actually, we implemented this because other databases had it and
users kept complaining to us.


#6 David Rowley
david.rowley@2ndquadrant.com
In reply to: Peter Eisentraut (#5)
Re: Removing Functionally Dependent GROUP BY Columns

On 3 December 2015 at 14:28, Peter Eisentraut <peter_e@gmx.net> wrote:

On 11/30/15 11:00 PM, David Rowley wrote:

It seems that there's no shortage of relational databases in existence
today which don't support this. These databases would require the GROUP
BY clause to include the p.description column too.

Well, actually, we implemented this because other databases had it and
users kept complaining to us.

Most likely you mean MySQL? I believe there was a bit of an influx of
people emigrating away from that database in the not-too-distant past. It's
not too surprising that we picked up some of those people, and some of the
features they were used to, at the same time... IF NOT EXISTS is perhaps
a good example to back that up.

However, I think your comment makes it quite clear that we're not talking
about the same database management systems.

For me, I tested the two most popular commercial ones and found that
neither supported columns in the SELECT list which were functionally
dependent on the GROUP BY clause, unless the columns themselves were in the
GROUP BY clause. I admit my "no shortage" comment was based on these two
only. I also admit that I don't possess any statistics showing which
RDBMSs users are migrating from. I merely thought that testing the two most
popular commercial ones was enough justification for what I wrote.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7 Julien Rouhaud
julien.rouhaud@dalibo.com
In reply to: David Rowley (#3)
Re: Removing Functionally Dependent GROUP BY Columns

On 01/12/2015 12:07, David Rowley wrote:

On 1 December 2015 at 17:09, Marko Tiikkaja <marko@joh.to> wrote:

On 2015-12-01 05:00, David Rowley wrote:

We already allow a SELECT's target list to contain
non-aggregated columns
in a GROUP BY query in cases where the non-aggregated column is
functionally dependent on the GROUP BY clause.

For example a query such as;

SELECT p.product_id,p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;

is perfectly fine in PostgreSQL, as p.description is
functionally dependent
on p.product_id (assuming product_id is the PRIMARY KEY of product).

This has come up before (on other forums, at least), and my main
concern has been that unlike the case where we go from throwing an
error to allowing a query, this has a chance to make the planning of
currently legal queries slower. Have you tried to measure the
impact of this on queries where there's no runtime gains to be had?

I've performed a series of benchmarks on the following queries:

Test1: explain select id1,id2 from t1 group by id1,id2;
Test2: explain select id from t2 group by id;
Test3: explain select t1.id1,t1.id2 from t2 inner join t1 on
t1.id1=t2.id group by t1.id1,t1.id2;

I ran each of these with pgbench for 60 seconds, 3 runs per query. In
each case below I've converted the TPS into seconds using the average
TPS over the 3 runs.

In summary:

Test1 is the worst case test. It's a very simple query so planning
overhead of join searching is non-existent. The fact that there's 2
columns in the GROUP BY means that the fast path cannot be used. I added
this as if there's only 1 column in the GROUP BY then there's no point
in searching for something to remove.

[...]

It seems that the worst case test adds about 7.6 microseconds onto
planning time. To get this worse case result I had to add two GROUP BY
columns, as having only 1 triggers a fast path as the code knows it
can't remove any columns, since there's only 1. A similar fast path also
exists which will only lookup the PRIMARY KEY details if there's more
than 1 column per relation in the GROUP BY, so for example GROUP BY
rel1.col1, rel2.col1 won't lookup any PRIMARY KEY constraint.

Given that the extra code really only does anything if the GROUP BY has
2 or more expressions, are you worried that this will affect too many
short and fast to execute queries negatively?

In identify_key_vars()

+       /* Only PK constraints are of interest for now, see comment above */
+       if (con->contype != CONSTRAINT_PRIMARY)
+           continue;

I think the comment should better refer to
remove_useless_groupby_columns() (don't optimize further than what
check_functional_grouping() does).

+       /*
+        * Exit if the constraint is deferrable, there's no point in looking
+        * for another PK constriant, as there can only be one.
+        */

small typo on "constriant"

+               {
+                   varlistmatches = bms_add_member(varlistmatches,
varlistidx);
+                   found_col = true;
+                   /* don't break, there might be dupicates */
+               }

small typo on "dupicates". Also, I thought transformGroupClauseExpr(),
called during parse analysis, makes sure that there isn't any duplicate in
the GROUP BY clause. Am I missing something?

in remove_useless_groupby_columns()

+       /*
+        * Skip if there were no Vars found in the GROUP BY clause which
belong
+        * to this relation. We can also skip if there's only 1 Var, as
there
+        * needs to be more than one Var for there to be any columns
which are
+        * functionally dependent on another column.
+        */

This comment doesn't sound very clear. I'm not a native English speaker,
so maybe it's just me.

+   /*
+    * If we found any surplus Vars in the GROUP BY clause, then we'll build
+    * a new GROUP BY clause without these surplus Vars.
+    */
+   if (anysurplus)
+   {
+       List *new_groupby = NIL;
+
+       foreach (lc, root->parse->groupClause)
+       {
+           SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+           TargetEntry *tle;
+           Var *var;
+
+           tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+           var = (Var *) tle->expr;
+
+           if (!IsA(var, Var))
+               continue;
+ [...]

if var isn't a Var, it needs to be added to new_groupby.

+ /* XXX do we to alter tleSortGroupRef to remove gaps? */

no idea on that :/

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org


#8 David Rowley
david.rowley@2ndquadrant.com
In reply to: Julien Rouhaud (#7)
1 attachment(s)
Re: Removing Functionally Dependent GROUP BY Columns

Many thanks for the thorough review!

On 12 January 2016 at 23:37, Julien Rouhaud <julien.rouhaud@dalibo.com>
wrote:

In identify_key_vars()

+       /* Only PK constraints are of interest for now, see comment above
*/
+       if (con->contype != CONSTRAINT_PRIMARY)
+           continue;

I think the comment should better refer to
remove_useless_groupby_columns() (don't optimize further than what
check_functional_grouping() does).

I've improved this by explaining it more clearly.

+       /*
+        * Exit if the constraint is deferrable, there's no point in
looking
+        * for another PK constriant, as there can only be one.
+        */

small typo on "constriant"

Fixed.

+               {
+                   varlistmatches = bms_add_member(varlistmatches,
varlistidx);
+                   found_col = true;
+                   /* don't break, there might be dupicates */
+               }

small typo on "dupicates". Also, I thought transformGroupClauseExpr(),
called during parse analysis, makes sure that there isn't any duplicate in
the GROUP BY clause. Am I missing something?

No, I missed something :) You are right. I've put a break; here and a
comment to explain why.

in remove_useless_groupby_columns()

+       /*
+        * Skip if there were no Vars found in the GROUP BY clause which
belong
+        * to this relation. We can also skip if there's only 1 Var, as
there
+        * needs to be more than one Var for there to be any columns
which are
+        * functionally dependent on another column.
+        */

This comment doesn't sound very clear. I'm not a native English speaker,
so maybe it's just me.

Yes it was badly worded. I've fixed in the attached.

+   /*
+    * If we found any surplus Vars in the GROUP BY clause, then we'll
build
+    * a new GROUP BY clause without these surplus Vars.
+    */
+   if (anysurplus)
+   {
+       List *new_groupby = NIL;
+
+       foreach (lc, root->parse->groupClause)
+       {
+           SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+           TargetEntry *tle;
+           Var *var;
+
+           tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+           var = (Var *) tle->expr;
+
+           if (!IsA(var, Var))
+               continue;
+ [...]

if var isn't a Var, it needs to be added to new_groupby.

Yes, you're right; at least I've written enough code to ensure that it's
not needed.
Technically we could look inside non-Vars and check if the Expr is made up
only of Vars from a single relation, and then we could also mark that
expression as surplus if we find other Vars which make up the table's
primary key. I didn't make these changes as I thought it was a less likely
scenario, though it wouldn't be too much extra code to add. I've added an
XXX comment to explain that there may be future optimisation opportunities
around this.
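As a hypothetical example of that future opportunity (reusing the
product/sale tables from the first mail):

SELECT p.product_id, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id, lower(p.description);

Here lower(p.description) is an expression rather than a plain Var, but it
uses only columns of product, so once p.product_id (the primary key) is
present in the GROUP BY it could in principle be removed too. The attached
patch leaves such expressions alone.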

I've attached an updated patch.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

prune_group_by_clause_90b5f3c_2016-01-14.patch (application/octet-stream)
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index 28e3aed..50e7765 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -871,6 +871,143 @@ get_domain_constraint_oid(Oid typid, const char *conname, bool missing_ok)
 }
 
 /*
+ * identify_key_vars
+ *		Returns a Bitmapset with bits set to mark which 0-based varlist items
+ *		belong to 'relid's PRIMARY KEY.
+ *
+ *	On successful identification of primary key columns 'constraintOid' is set
+ *	to the OID of the primary key constraint. On failure to match to *all*
+ *	primary key columns, NULL is returned and 'constraintOid' is left
+ *	unchanged.
+ */
+Bitmapset *
+identify_key_vars(Oid relid, Index varno, Index varlevelsup, List *varlist,
+				  Oid *constraintOid)
+{
+	Relation	pg_constraint;
+	HeapTuple	tuple;
+	SysScanDesc scan;
+	ScanKeyData skey[1];
+	Bitmapset	*varlistmatches;
+
+	/* Scan pg_constraint for constraints of the target rel */
+	pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_constraint_conrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+
+	scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
+							  NULL, 1, skey);
+
+	varlistmatches = NULL;
+
+	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+	{
+		Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
+		Datum		adatum;
+		bool		isNull;
+		ArrayType  *arr;
+		int16	   *attnums;
+		int			numkeys;
+		int			i;
+
+		/*
+		 * For now we must ignore anything apart from PRIMARY KEY constraints.
+		 * Technically we could look at UNIQUE indexes too, but we'd also have
+		 * to ensure that each column of the unique index had a NOT NULL
+		 * constraint. However, since NOT NULL constraints currently don't have
+		 * a pg_constraint entry, we're unable to track the dependency on the
+		 * NOT NULL in order to invalidate any cached query plans in the event
+		 * that someone drops the NOT NULL constraint. In any case it seems
+		 * strange to go to more lengths here than what is done by
+		 * check_functional_grouping().
+		 */
+		if (con->contype != CONSTRAINT_PRIMARY)
+			continue;
+
+		/*
+		 * Constraint must be non-deferrable, but no point in looking for
+		 * another primary key, as there can only be one, so break;
+		 */
+		if (con->condeferrable)
+			break;
+
+		/* Extract the conkey array, ie, attnums of PK's columns */
+		adatum = heap_getattr(tuple, Anum_pg_constraint_conkey,
+							  RelationGetDescr(pg_constraint), &isNull);
+		if (isNull)
+			elog(ERROR, "null conkey for constraint %u",
+				 HeapTupleGetOid(tuple));
+		arr = DatumGetArrayTypeP(adatum);		/* ensure not toasted */
+		numkeys = ARR_DIMS(arr)[0];
+		if (ARR_NDIM(arr) != 1 ||
+			numkeys < 0 ||
+			ARR_HASNULL(arr) ||
+			ARR_ELEMTYPE(arr) != INT2OID)
+			elog(ERROR, "conkey is not a 1-D smallint array");
+		attnums = (int16 *) ARR_DATA_PTR(arr);
+
+		for (i = 0; i < numkeys; i++)
+		{
+			AttrNumber	attnum = attnums[i];
+			ListCell   *lc;
+			bool found_col = false;
+			int varlistidx = 0;
+
+			foreach(lc, varlist)
+			{
+				Var		   *gvar = (Var *) lfirst(lc);
+
+				if (IsA(gvar, Var) &&
+					gvar->varno == varno &&
+					gvar->varlevelsup == varlevelsup &&
+					gvar->varattno == attnum)
+				{
+					varlistmatches = bms_add_member(varlistmatches, varlistidx);
+					found_col = true;
+					/*
+					 * We can break out of the loop here, as no duplicate Vars
+					 * should exist; these will have been removed during
+					 * parse analysis.
+					 */
+					break;
+				}
+				varlistidx++;
+			}
+
+			if (!found_col)
+			{
+				/*
+				 * No match found for this index key. Since the relation has
+				 * only one PRIMARY KEY we can skip looking through any other
+				 * constraints.
+				 */
+				varlistmatches = NULL;
+				break;
+			}
+		}
+
+		/*
+		 * Did we match all PRIMARY KEY Vars? If so we can return this match.
+		 * There's no need to look for a better match as a relation can only
+		 * have one PRIMARY KEY constraint.
+		 */
+		if (i == numkeys)
+			*constraintOid = HeapTupleGetOid(tuple);
+
+		break;
+	}
+
+	systable_endscan(scan);
+
+	heap_close(pg_constraint, AccessShareLock);
+
+	return varlistmatches;
+}
+
+/*
  * Determine whether a relation can be proven functionally dependent on
  * a set of grouping columns.  If so, return TRUE and add the pg_constraint
  * OIDs of the constraints needed for the proof to the *constraintDeps list.
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index fed1acc..c29ef62 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -21,6 +21,7 @@
 #include "access/htup_details.h"
 #include "access/parallel.h"
 #include "access/xact.h"
+#include "catalog/pg_constraint.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
 #include "foreign/fdwapi.h"
@@ -3160,6 +3161,162 @@ limit_needed(Query *parse)
 	return false;				/* don't need a Limit plan node */
 }
 
+/*
+ * remove_useless_groupby_columns
+ *		Removes columns from the GROUP BY clause which are redundant due to
+ *		being functionally dependent on one or more other columns which are
+ *		present.
+ *
+ * We do support non-aggregated columns in the query targetlist which are
+ * functionally dependent on the primary key, but not all relational databases
+ * allow this, so it can be fairly common that queries are written to have the
+ * GROUP BY clause include all of the columns which are not aggregated in the
+ * query's targetlist. These columns have no need to be there as long as the
+ * columns which make up the table's PRIMARY KEY are present in the GROUP BY
+ * clause. Here we test to see if the primary key columns are present for each
+ * relation which make up the columns of the GROUP BY clause, and if so we
+ * remove any which are not part of the primary key.
+ *
+ * In reality we could do more than just the primary key columns here as UNIQUE
+ * indexes with NOT NULL constraints on the columns allow us to also determine
+ * functional dependencies. However we mustn't go any further than what's done
+ * by check_functional_grouping().
+ */
+static void
+remove_useless_groupby_columns(PlannerInfo *root)
+{
+	List		  **relvarlist;
+	Bitmapset	  **surplusvars;
+	ListCell	   *lc;
+	int				relid;
+	bool			anysurplus = false;
+	Query		   *parse = root->parse;
+
+	if (parse->groupingSets)
+		return;
+
+	/* Nothing to do when there's only 1 GROUP BY expression */
+	if (list_length(parse->groupClause) < 2)
+		return;
+
+	relvarlist = (List **) palloc0(sizeof(List *) * (list_length(parse->rtable) + 1));
+	surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) * (list_length(parse->rtable) + 1));
+
+	/*
+	 * Pre-process the GROUP BY clause Vars from each relation into
+	 * 'relvarlist' indexed by the relid
+	 */
+	foreach (lc, parse->groupClause)
+	{
+		SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+		TargetEntry *tle;
+		Var *var;
+
+		tle = get_sortgroupclause_tle(sgc, parse->targetList);
+		var = (Var *) tle->expr;
+
+		/*
+		 * XXX it may be useful to also look at non-Vars here. If the clause
+		 * is an Expr containing Vars from a single relation then we may also
+		 * be able to remove it, providing that we find all PRIMARY KEY columns
+		 * elsewhere in the GROUP BY clause. Perhaps we can improve on this
+		 * later, but for now it seems better to keep it simple.
+		 */
+		if (!IsA(var, Var))
+			continue;
+
+		Assert(var->varno <= list_length(parse->rtable));
+		relvarlist[var->varno] = lappend(relvarlist[var->varno], var);
+	}
+
+	/*
+	 * Look at each rtable relation and check if any Vars were found to belong
+	 * to this relation by the logic above. Note that since identify_key_vars()
+	 * only concerns itself with PRIMARY KEY constraints to prove uniqueness,
+	 * and that since PRIMARY KEYs cannot be based on expressions, then we need
+	 * only concern ourselves here with plain Vars. If there are no Vars then
+	 * nothing need be done for this rel. If there's less than two Vars then
+	 * there's no room to remove any, as the PRIMARY KEY must have at least one
+	 * Var, so we can safely also do nothing if there's less than two Vars.
+	 */
+	relid = 1;
+	foreach(lc, parse->rtable)
+	{
+		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
+		List *varlist = relvarlist[relid];
+		Bitmapset *varlistmatches;
+		Oid		  constraintOid;
+		ListCell *lc2;
+
+		/* don't try anything unless there's two Vars */
+		if (varlist == NULL || list_length(varlist) < 2)
+			{ relid++; continue; }
+
+		varlistmatches = identify_key_vars(rte->relid, relid, 0, varlist,
+										   &constraintOid);
+
+		/*
+		 * If we have some varlist matches and varlist has additional columns,
+		 * then each of these additional column may be removed from the
+		 * GROUP BY clause. We'll mark the varattno of each variable that's not
+		 * needed.
+		 */
+		if (varlistmatches != NULL &&
+			bms_num_members(varlistmatches) < list_length(varlist))
+		{
+			int varidx = 0;
+			foreach(lc2, varlist)
+			{
+				Var *var = (Var *) lfirst(lc2);
+
+				if (!bms_is_member(varidx, varlistmatches))
+				{
+					surplusvars[relid] = bms_add_member(surplusvars[relid],
+														var->varattno);
+				}
+
+				varidx++;
+			}
+
+			/*
+			 * Mark that the GROUP BY clause needs to be rebuilt. We'll do this
+			 * just once when we've finished processing every other relation's
+			 * columns which are present in the GROUP BY clause.
+			 */
+			anysurplus = true;
+			parse->constraintDeps = lappend_oid(parse->constraintDeps,
+												constraintOid);
+		}
+		relid++;
+	}
+
+	/*
+	 * If we found any surplus Vars in the GROUP BY clause, then we'll build
+	 * a new GROUP BY clause without these surplus Vars.
+	 */
+	if (anysurplus)
+	{
+		List *new_groupby = NIL;
+
+		foreach (lc, root->parse->groupClause)
+		{
+			SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+			TargetEntry *tle;
+			Var *var;
+
+			tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+			var = (Var *) tle->expr;
+
+			if (!IsA(var, Var) ||
+				!bms_is_member(var->varattno, surplusvars[var->varno]))
+				new_groupby = lappend(new_groupby, sgc);
+
+			/* XXX do we need to alter tleSortGroupRef to remove gaps? */
+		}
+		root->parse->groupClause = new_groupby;
+	}
+}
+
 
 /*
  * preprocess_groupclause - do preparatory work on GROUP BY clause
@@ -3205,6 +3362,12 @@ preprocess_groupclause(PlannerInfo *root, List *force)
 		return new_groupclause;
 	}
 
+	/*
+	 * Remove any GROUP BY clause items which are functionally dependent on
+	 * other clause items
+	 */
+	remove_useless_groupby_columns(root);
+
 	/* If no ORDER BY, nothing useful to do here */
 	if (parse->sortClause == NIL)
 		return parse->groupClause;
diff --git a/src/include/catalog/pg_constraint.h b/src/include/catalog/pg_constraint.h
index 07dbcea..934e8db 100644
--- a/src/include/catalog/pg_constraint.h
+++ b/src/include/catalog/pg_constraint.h
@@ -250,6 +250,9 @@ extern void AlterConstraintNamespaces(Oid ownerId, Oid oldNspId,
 extern Oid	get_relation_constraint_oid(Oid relid, const char *conname, bool missing_ok);
 extern Oid	get_domain_constraint_oid(Oid typid, const char *conname, bool missing_ok);
 
+extern Bitmapset *identify_key_vars(Oid relid, Index varno, Index varlevelsup,
+									List *varlist, Oid *constraintOid);
+
 extern bool check_functional_grouping(Oid relid,
 						  Index varno, Index varlevelsup,
 						  List *grouping_columns,
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 64f046e..1c6c8e6 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3918,19 +3918,19 @@ select d.* from d left join (select distinct * from b) s
 -- not in the join condition
 explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
-  on d.a = s.id;
-                 QUERY PLAN                  
----------------------------------------------
+  on d.a = s.c_id;
+              QUERY PLAN               
+---------------------------------------
  Merge Left Join
-   Merge Cond: (d.a = s.id)
+   Merge Cond: (d.a = s.c_id)
    ->  Sort
          Sort Key: d.a
          ->  Seq Scan on d
    ->  Sort
-         Sort Key: s.id
+         Sort Key: s.c_id
          ->  Subquery Scan on s
                ->  HashAggregate
-                     Group Key: b.id, b.c_id
+                     Group Key: b.id
                      ->  Seq Scan on b
 (11 rows)
 
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 0358d00..e11e0a5 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1266,7 +1266,7 @@ select d.* from d left join (select distinct * from b) s
 -- not in the join condition
 explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
-  on d.a = s.id;
+  on d.a = s.c_id;
 
 -- similarly, but keying off a DISTINCT clause
 explain (costs off)
#9Julien Rouhaud
julien.rouhaud@dalibo.com
In reply to: David Rowley (#8)
Re: Removing Functionally Dependent GROUP BY Columns

On 14/01/2016 01:30, David Rowley wrote:

Many thanks for the thorough review!

And thanks for the patch and fast answer!

On 12 January 2016 at 23:37, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:

In identify_key_vars()

+       /* Only PK constraints are of interest for now, see comment
above */
+       if (con->contype != CONSTRAINT_PRIMARY)
+           continue;

I think the comment should better refer to
remove_useless_groupby_columns() (don't optimize further than what
check_functional_grouping() does).

I've improved this by explaining it more clearly.

Very nice.

Small typo though:

+		 * Technically we could look at UNIQUE indexes too, however we'd also
+		 * have to ensure that each column of the unique index had a NOT NULL

s/had/has/

+ * constraint, however since NOT NULL constraints currently are don't

s/are //

[...]

+               {
+                   varlistmatches = bms_add_member(varlistmatches,
varlistidx);
+                   found_col = true;
+                   /* don't break, there might be dupicates */
+               }

small typo on "dupicates". Also, I thought transformGroupClauseExpr()
called during parse analysis makes sure that there isn't any duplicate in
the GROUP BY clause. Did I miss something?

No I missed something :) You are right. I've put a break; here and a
comment to explain why.

ok :)

[...]

+   /*
+    * If we found any surplus Vars in the GROUP BY clause, then
we'll build
+    * a new GROUP BY clause without these surplus Vars.
+    */
+   if (anysurplus)
+   {
+       List *new_groupby = NIL;
+
+       foreach (lc, root->parse->groupClause)
+       {
+           SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+           TargetEntry *tle;
+           Var *var;
+
+           tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+           var = (Var *) tle->expr;
+
+           if (!IsA(var, Var))
+               continue;
+ [...]

if var isn't a Var, it needs to be added in new_groupby.

Yes you're right, well, at least I've written enough code to ensure that
it's not needed.

Oh yes, I realize that now.

Technically we could look inside non-Vars and check if the Expr is just
made up of Vars from a single relation, and then we could also make that
surplus if we find other Vars which make up the table's primary key. I
didn't make these changes as I thought it was a less likely scenario. It
wouldn't be too much extra code to add however. I've gone and added an
XXX comment to explain that there might be future optimisation
opportunities around this.

Agreed.
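The extension hinted at above — treating a whole GROUP BY expression as surplus when all of its Vars come from one relation whose primary key is grouped on directly — could be sketched like this (illustrative Python; the names and the flat var-set representation are mine, not the patch's):

```python
def expr_is_removable(expr_vars, plain_groupby, pk_by_rel):
    # expr_vars:     {(relid, attnum)} Vars referenced by the expression
    # plain_groupby: {(relid, attnum)} plain-Var GROUP BY items
    # pk_by_rel:     {relid: [attnums of that rel's PRIMARY KEY]}
    relids = {relid for relid, _ in expr_vars}
    if len(relids) != 1:
        return False               # Vars span relations: keep the expression
    (relid,) = relids
    pk = pk_by_rel.get(relid)
    # Removable only if the relation's whole PRIMARY KEY is grouped on directly.
    return bool(pk) and all((relid, a) in plain_groupby for a in pk)

# e.g. lower(p.description) is surplus when p.product_id (the PK) is grouped on:
print(expr_is_removable({(1, 2)}, {(1, 1)}, {1: [1]}))  # → True
```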

I've attached an updated patch.

+		/* don't try anything unless there's two Vars */
+		if (varlist == NULL || list_length(varlist) < 2)
+			continue;

To be perfectly correct, the comment should say "at least two Vars".

Except the small remaining typos, this patch looks very fine to me. I
flag it as ready for committer.

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Geoff Winkless
pgsqladmin@geoff.dj
In reply to: Julien Rouhaud (#9)
Re: Removing Functionally Dependent GROUP BY Columns

On 14 January 2016 at 11:19, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:

+               /* don't try anything unless there's two Vars */
+               if (varlist == NULL || list_length(varlist) < 2)
+                       continue;

To be perfectly correct, the comment should say "at least two Vars".

Apologies for butting in and I appreciate I don't have any ownership
over this codebase or right to suggest any changes, but this just
caught my eye before I could hit "delete".

My mantra tends to be "why, not what" for inline comments; in this
case you can get the same information from the next line of code as
you get from the comment.

Perhaps something like

/* it's clearly impossible to remove duplicates if there are fewer
than two GROUPBY columns */

might be more helpful?

(also sorry if I've misunderstood what it _actually_ does, I just made
an assumption based on reading this thread!)

Geoff


#11Julien Rouhaud
julien.rouhaud@dalibo.com
In reply to: Geoff Winkless (#10)
Re: Removing Functionally Dependent GROUP BY Columns

On 14/01/2016 14:04, Geoff Winkless wrote:

On 14 January 2016 at 11:19, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:

+               /* don't try anything unless there's two Vars */
+               if (varlist == NULL || list_length(varlist) < 2)
+                       continue;

To be perfectly correct, the comment should say "at least two Vars".

Apologies for butting in and I appreciate I don't have any ownership
over this codebase or right to suggest any changes, but this just
caught my eye before I could hit "delete".

My mantra tends to be "why, not what" for inline comments; in this
case you can get the same information from the next line of code as
you get from the comment.

You're absolutely right, but in this case the comment is more like a
reminder of a bigger comment a few lines earlier that wasn't quoted in my mail:

+	 *[...] If there are no Vars then
+	 * nothing need be done for this rel. If there's less than two Vars then
+	 * there's no room to remove any, as the PRIMARY KEY must have at least one
+	 * Var, so we can safely also do nothing if there's less than two Vars.

so I assume it's ok to keep it this way.

Perhaps something like

/* it's clearly impossible to remove duplicates if there are fewer
than two GROUPBY columns */

might be more helpful?

(also sorry if I've misunderstood what it _actually_ does, I just made
an assumption based on reading this thread!)

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org


#12Geoff Winkless
pgsqladmin@geoff.dj
In reply to: Julien Rouhaud (#11)
Re: Removing Functionally Dependent GROUP BY Columns

On 14 January 2016 at 13:16, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:

You're absolutely right, but in this case the comment is more like a
reminder of a bigger comment a few lines earlier that wasn't quoted in my mail

Fair enough, although I have two niggles with that:

a) the second comment could become physically separated from the first
by later additions of extra code, or by refactoring;
b) if you don't need the comment because the explanation for it is
local anyway and the comment tells you nothing that the code doesn't,
why have it at all?

so I assume it's ok to keep it this way.

Of course it's ok to do whatever you decide is best: as I said
previously, I fully appreciate that I have no ownership over any of
the code.

Geoff


#13Julien Rouhaud
julien.rouhaud@dalibo.com
In reply to: Geoff Winkless (#12)
Re: Removing Functionally Dependent GROUP BY Columns

On 14/01/2016 14:29, Geoff Winkless wrote:

On 14 January 2016 at 13:16, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:

You're absolutely right, but in this case the comment is more like a
reminder of a bigger comment few lines before that wasn't quoted in my mail

Fair enough, although I have two niggles with that:

a) the second comment could become physically separated from the first
by later additions of extra code, or by refactoring;
b) if you don't need the comment because the explanation for it is
local anyway and the comment tells you nothing that the code doesn't,
why have it at all?

I agree. If I had to choose, I'd vote for removing it.

so I assume it's ok to keep it this way.

Of course it's ok to do whatever you decide is best: as I said
previously, I fully appreciate that I have no ownership over any of
the code.

Neither do I, I'm just a reviewer here, like you :)

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org


#14David Rowley
david.rowley@2ndquadrant.com
In reply to: Julien Rouhaud (#9)
1 attachment(s)
Re: Removing Functionally Dependent GROUP BY Columns

On 15 January 2016 at 00:19, Julien Rouhaud <julien.rouhaud@dalibo.com>
wrote:

+                * Technically we could look at UNIQUE indexes too,
however we'd also
+                * have to ensure that each column of the unique index had
a NOT NULL

s/had/has/

+ * constraint, however since NOT NULL constraints
currently are don't

s/are //

Both fixed. Thanks.

+   /*
+    * If we found any surplus Vars in the GROUP BY clause, then
we'll build
+    * a new GROUP BY clause without these surplus Vars.
+    */
+   if (anysurplus)
+   {
+       List *new_groupby = NIL;
+
+       foreach (lc, root->parse->groupClause)
+       {
+           SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+           TargetEntry *tle;
+           Var *var;
+
+           tle = get_sortgroupclause_tle(sgc,

root->parse->targetList);

+           var = (Var *) tle->expr;
+
+           if (!IsA(var, Var))
+               continue;
+ [...]

if var isn't a Var, it needs to be added in new_groupby.

Yes you're right, well, at least I've written enough code to ensure that
it's not needed.

Oh yes, I realize that now.

I meant to say "I've not written enough code" ...

Technically we could look inside non-Vars and check if the Expr is just
made up of Vars from a single relation, and then we could also make that
surplus if we find other Vars which make up the table's primary key. I
didn't make these changes as I thought it was a less likely scenario. It
wouldn't be too much extra code to add however. I've went and added an
XXX comment to explain that there might be future optimisation
opportunities in the future around this.

Agreed.

I've attached an updated patch.

+               /* don't try anything unless there's two Vars */
+               if (varlist == NULL || list_length(varlist) < 2)
+                       continue;

To be perfectly correct, the comment should say "at least two Vars".

Changed per the discussion between you and Geoff.

Except the small remaining typos, this patch looks very fine to me. I
flag it as ready for committer.

Many thanks for your followup review.

I've attached an updated patch to address what you highlighted above.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

prune_group_by_clause_59f15cf_2016-01-15.patch (application/octet-stream)
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index 28e3aed..274eaf8 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -871,6 +871,143 @@ get_domain_constraint_oid(Oid typid, const char *conname, bool missing_ok)
 }
 
 /*
+ * identify_key_vars
+ *		Returns a Bitmapset with bits set to mark which 0-based varlist items
+ *		belong to 'relid's PRIMARY KEY.
+ *
+ *	On successful identification of primary key columns 'constraintOid' is set
+ *	to the OID of the primary key constraint. On failure to match to *all*
+ *	primary key columns, NULL is returned and 'constraintOid' is left
+ *	unchanged.
+ */
+Bitmapset *
+identify_key_vars(Oid relid, Index varno, Index varlevelsup, List *varlist,
+				  Oid *constraintOid)
+{
+	Relation	pg_constraint;
+	HeapTuple	tuple;
+	SysScanDesc scan;
+	ScanKeyData skey[1];
+	Bitmapset	*varlistmatches;
+
+	/* Scan pg_constraint for constraints of the target rel */
+	pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_constraint_conrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+
+	scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
+							  NULL, 1, skey);
+
+	varlistmatches = NULL;
+
+	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+	{
+		Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
+		Datum		adatum;
+		bool		isNull;
+		ArrayType  *arr;
+		int16	   *attnums;
+		int			numkeys;
+		int			i;
+
+		/*
+		 * For now we must ignore anything apart from PRIMARY KEY constraints.
+		 * Technically we could look at UNIQUE indexes too, however we'd also
+		 * have to ensure that each column of the unique index has a NOT NULL
+		 * constraint, however since NOT NULL constraints currently don't have
+		 * a pg_constraint entry this means that we're unable to track the
+		 * dependency on the NOT NULL in order to invalidate any cached query
+		 * plans in the event that someone drops the NOT NULL constraint. In
+		 * any case it seems strange to go to more lengths here than what is
+		 * done by check_functional_grouping().
+		 */
+		if (con->contype != CONSTRAINT_PRIMARY)
+			continue;
+
+		/*
+		 * Constraint must be non-deferrable, but no point in looking for
+		 * another primary key, as there can only be one, so break;
+		 */
+		if (con->condeferrable)
+			break;
+
+		/* Extract the conkey array, ie, attnums of PK's columns */
+		adatum = heap_getattr(tuple, Anum_pg_constraint_conkey,
+							  RelationGetDescr(pg_constraint), &isNull);
+		if (isNull)
+			elog(ERROR, "null conkey for constraint %u",
+				 HeapTupleGetOid(tuple));
+		arr = DatumGetArrayTypeP(adatum);		/* ensure not toasted */
+		numkeys = ARR_DIMS(arr)[0];
+		if (ARR_NDIM(arr) != 1 ||
+			numkeys < 0 ||
+			ARR_HASNULL(arr) ||
+			ARR_ELEMTYPE(arr) != INT2OID)
+			elog(ERROR, "conkey is not a 1-D smallint array");
+		attnums = (int16 *) ARR_DATA_PTR(arr);
+
+		for (i = 0; i < numkeys; i++)
+		{
+			AttrNumber	attnum = attnums[i];
+			ListCell   *lc;
+			bool found_col = false;
+			int varlistidx = 0;
+
+			foreach(lc, varlist)
+			{
+				Var		   *gvar = (Var *) lfirst(lc);
+
+				if (IsA(gvar, Var) &&
+					gvar->varno == varno &&
+					gvar->varlevelsup == varlevelsup &&
+					gvar->varattno == attnum)
+				{
+					varlistmatches = bms_add_member(varlistmatches, varlistidx);
+					found_col = true;
+					/*
+					 * We can break out of the loop here, as no duplicate Vars
+					 * should exist; these will have been removed during
+					 * parse analysis.
+					 */
+					break;
+				}
+				varlistidx++;
+			}
+
+			if (!found_col)
+			{
+				/*
+				 * No match found for this index key. Since the relation has
+				 * only one PRIMARY KEY we can skip looking through any other
+				 * constraints.
+				 */
+				varlistmatches = NULL;
+				break;
+			}
+		}
+
+		/*
+		 * Did we match all PRIMARY KEY Vars? If so we can return this match.
+		 * There's no need to look for a better match as a relation can only
+		 * have one PRIMARY KEY constraint.
+		 */
+		if (i == numkeys)
+			*constraintOid = HeapTupleGetOid(tuple);
+
+		break;
+	}
+
+	systable_endscan(scan);
+
+	heap_close(pg_constraint, AccessShareLock);
+
+	return varlistmatches;
+}
+
+/*
  * Determine whether a relation can be proven functionally dependent on
  * a set of grouping columns.  If so, return TRUE and add the pg_constraint
  * OIDs of the constraints needed for the proof to the *constraintDeps list.
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index fed1acc..4684ba2 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -21,6 +21,7 @@
 #include "access/htup_details.h"
 #include "access/parallel.h"
 #include "access/xact.h"
+#include "catalog/pg_constraint.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
 #include "foreign/fdwapi.h"
@@ -3160,6 +3161,162 @@ limit_needed(Query *parse)
 	return false;				/* don't need a Limit plan node */
 }
 
+/*
+ * remove_useless_groupby_columns
+ *		Removes columns from the GROUP BY clause which are redundant due to
+ *		being functionally dependent on one or more other columns which are
+ *		present.
+ *
+ * We do support non-aggregated columns in the query targetlist which are
+ * functionally dependent on the primary key, but not all relational databases
+ * allow this, so it can be fairly common that queries are written to have the
+ * GROUP BY clause include all of the columns which are not aggregated in the
+ * query's targetlist. These columns have no need to be there as long as the
+ * columns which make up the table's PRIMARY KEY are present in the GROUP BY
+ * clause. Here we test to see if the primary key columns are present for each
+ * relation which make up the columns of the GROUP BY clause, and if so we
+ * remove any which are not part of the primary key.
+ *
+ * In reality we could do more than just the primary key columns here as UNIQUE
+ * indexes with NOT NULL constraints on the columns allow us to also determine
+ * functional dependencies. However we mustn't go any further than what's done
+ * by check_functional_grouping().
+ */
+static void
+remove_useless_groupby_columns(PlannerInfo *root)
+{
+	List		  **relvarlist;
+	Bitmapset	  **surplusvars;
+	ListCell	   *lc;
+	int				relid;
+	bool			anysurplus = false;
+	Query		   *parse = root->parse;
+
+	if (parse->groupingSets)
+		return;
+
+	/* Nothing do to when there's only 1 group by expression */
+	/* Nothing to do when there's only 1 GROUP BY expression */
+		return;
+
+	relvarlist = (List **) palloc0(sizeof(List *) * (list_length(parse->rtable) + 1));
+	surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) * (list_length(parse->rtable) + 1));
+
+	/*
+	 * Pre-process the GROUP BY clause Vars from each relation into
+	 * 'relvarlist' indexed by the relid
+	 */
+	foreach (lc, parse->groupClause)
+	{
+		SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+		TargetEntry *tle;
+		Var *var;
+
+		tle = get_sortgroupclause_tle(sgc, parse->targetList);
+		var = (Var *) tle->expr;
+
+		/*
+		 * XXX it may be useful to also look at non-Vars here. If the clause
+		 * is an Expr containing Vars from a single relation then we may also
+		 * be able to remove it, providing that we find all PRIMARY KEY columns
+		 * elsewhere in the GROUP BY clause. Perhaps we can improve on this
+		 * later, but for now it seems better to keep it simple.
+		 */
+		if (!IsA(var, Var))
+			continue;
+
+		Assert(var->varno <= list_length(parse->rtable));
+		relvarlist[var->varno] = lappend(relvarlist[var->varno], var);
+	}
+
+	/*
+	 * Look at each rtable relation and check if any Vars were found to belong
+	 * to this relation by the logic above. Note that since identify_key_vars()
+	 * only concerns itself with PRIMARY KEY constraints to prove uniqueness,
+	 * and that since PRIMARY KEYs cannot be based on expressions, then we need
+	 * only concern ourselves here with plain Vars.
+	 */
+	relid = 1;
+	foreach(lc, parse->rtable)
+	{
+		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
+		List *varlist = relvarlist[relid];
+		Bitmapset *varlistmatches;
+		Oid		  constraintOid;
+		ListCell *lc2;
+
+		/*
+		 * Clearly there's no room for removing any items from the GROUP BY
+		 * clause if there are fewer than two Vars belonging to this rel.
+		 */
+		if (varlist == NULL || list_length(varlist) < 2)
+			{ relid++; continue; }
+
+		varlistmatches = identify_key_vars(rte->relid, relid, 0, varlist,
+										   &constraintOid);
+
+		/*
+		 * If we have some varlist matches and varlist has additional columns,
+		 * then each of these additional column may be removed from the
+		 * GROUP BY clause. We'll mark the varattno of each variable that's not
+		 * needed.
+		 */
+		if (varlistmatches != NULL &&
+			bms_num_members(varlistmatches) < list_length(varlist))
+		{
+			int varidx = 0;
+			foreach(lc2, varlist)
+			{
+				Var *var = (Var *) lfirst(lc2);
+
+				if (!bms_is_member(varidx, varlistmatches))
+				{
+					surplusvars[relid] = bms_add_member(surplusvars[relid],
+														var->varattno);
+				}
+
+				varidx++;
+			}
+
+			/*
+			 * Mark that the GROUP BY clause needs to be rebuilt. We'll do this
+			 * just once when we've finished processing every other relation's
+			 * columns which are present in the GROUP BY clause.
+			 */
+			anysurplus = true;
+			parse->constraintDeps = lappend_oid(parse->constraintDeps,
+												constraintOid);
+		}
+		relid++;
+	}
+
+	/*
+	 * If we found any surplus Vars in the GROUP BY clause, then we'll build
+	 * a new GROUP BY clause without these surplus Vars.
+	 */
+	if (anysurplus)
+	{
+		List *new_groupby = NIL;
+
+		foreach (lc, root->parse->groupClause)
+		{
+			SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+			TargetEntry *tle;
+			Var *var;
+
+			tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+			var = (Var *) tle->expr;
+
+			if (!IsA(var, Var) ||
+				!bms_is_member(var->varattno, surplusvars[var->varno]))
+				new_groupby = lappend(new_groupby, sgc);
+
+			/* XXX do we need to alter tleSortGroupRef to remove gaps? */
+		}
+		root->parse->groupClause = new_groupby;
+	}
+}
+
 
 /*
  * preprocess_groupclause - do preparatory work on GROUP BY clause
@@ -3205,6 +3362,12 @@ preprocess_groupclause(PlannerInfo *root, List *force)
 		return new_groupclause;
 	}
 
+	/*
+	 * Remove any GROUP BY clause items which are functionally dependent on
+	 * other clause items
+	 */
+	remove_useless_groupby_columns(root);
+
 	/* If no ORDER BY, nothing useful to do here */
 	if (parse->sortClause == NIL)
 		return parse->groupClause;
diff --git a/src/include/catalog/pg_constraint.h b/src/include/catalog/pg_constraint.h
index 07dbcea..934e8db 100644
--- a/src/include/catalog/pg_constraint.h
+++ b/src/include/catalog/pg_constraint.h
@@ -250,6 +250,9 @@ extern void AlterConstraintNamespaces(Oid ownerId, Oid oldNspId,
 extern Oid	get_relation_constraint_oid(Oid relid, const char *conname, bool missing_ok);
 extern Oid	get_domain_constraint_oid(Oid typid, const char *conname, bool missing_ok);
 
+extern Bitmapset *identify_key_vars(Oid relid, Index varno, Index varlevelsup,
+									List *varlist, Oid *constraintOid);
+
 extern bool check_functional_grouping(Oid relid,
 						  Index varno, Index varlevelsup,
 						  List *grouping_columns,
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 64f046e..1c6c8e6 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3918,19 +3918,19 @@ select d.* from d left join (select distinct * from b) s
 -- not in the join condition
 explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
-  on d.a = s.id;
-                 QUERY PLAN                  
----------------------------------------------
+  on d.a = s.c_id;
+              QUERY PLAN               
+---------------------------------------
  Merge Left Join
-   Merge Cond: (d.a = s.id)
+   Merge Cond: (d.a = s.c_id)
    ->  Sort
          Sort Key: d.a
          ->  Seq Scan on d
    ->  Sort
-         Sort Key: s.id
+         Sort Key: s.c_id
          ->  Subquery Scan on s
                ->  HashAggregate
-                     Group Key: b.id, b.c_id
+                     Group Key: b.id
                      ->  Seq Scan on b
 (11 rows)
 
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 0358d00..e11e0a5 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1266,7 +1266,7 @@ select d.* from d left join (select distinct * from b) s
 -- not in the join condition
 explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
-  on d.a = s.id;
+  on d.a = s.c_id;
 
 -- similarly, but keying off a DISTINCT clause
 explain (costs off)
#15Julien Rouhaud
julien.rouhaud@dalibo.com
In reply to: David Rowley (#14)
Re: Removing Functionally Dependent GROUP BY Columns

On 14/01/2016 23:05, David Rowley wrote:

On 15 January 2016 at 00:19, Julien Rouhaud
<julien.rouhaud@dalibo.com> wrote:

+ * Technically we could look at UNIQUE indexes too, however we'd also
+ * have to ensure that each column of the unique index had a NOT NULL

s/had/has/

+ * constraint, however since NOT NULL constraints currently are don't

s/are //

Both fixed. Thanks.

+	/*
+	 * If we found any surplus Vars in the GROUP BY clause, then we'll build
+	 * a new GROUP BY clause without these surplus Vars.
+	 */
+	if (anysurplus)
+	{
+		List *new_groupby = NIL;
+
+		foreach (lc, root->parse->groupClause)
+		{
+			SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+			TargetEntry *tle;
+			Var *var;
+
+			tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+			var = (Var *) tle->expr;
+
+			if (!IsA(var, Var))
+				continue;
+ [...]

If var isn't a Var, it needs to be added to new_groupby.

Yes you're right, well, at least I've written enough code to
ensure that it's not needed.

Oh yes, I realize that now.

I meant to say "I've not written enough code" ...

Yes, that makes sense with the explanation you wrote just after :)

Technically we could look inside non-Vars and check if the Expr
is just made up of Vars from a single relation, and then we could
also make that surplus if we find other Vars which make up the
table's primary key. I didn't make these changes as I thought it
was a less likely scenario. It wouldn't be too much extra code to
add, however. I've gone and added an XXX comment to explain that
there might be optimisation opportunities in the future around this.

Agreed.

I've attached an updated patch.

+		/* don't try anything unless there's two Vars */
+		if (varlist == NULL || list_length(varlist) < 2)
+			continue;

To be perfectly correct, the comment should say "at least two
Vars".

Changed per discussion from you and Geoff

Except for the small remaining typos, this patch looks very fine to me.
I flag it as ready for committer.

Many thanks for your followup review.

I've attached an updated patch to address what you highlighted
above.

Thanks!

-- David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#14)
Re: Removing Functionally Dependent GROUP BY Columns

David Rowley <david.rowley@2ndquadrant.com> writes:

[ prune_group_by_clause_59f15cf_2016-01-15.patch ]

I spent some time looking through this but decided that it needs some work
to be committable.

The main thing I didn't like was that identify_key_vars seems to have a
rather poorly chosen API: it mixes up identifying a rel's pkey with
sorting through the GROUP BY list. I think it would be better to create
a function that, given a relid, hands back the OID of the pkey constraint
and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
no pkey or it's deferrable). The column numbers should be offset by
FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
represented correctly --- compare RelationGetIndexAttrBitmap().

The reason this seems like a more attractive solution is that the output
of the pg_constraint lookup becomes independent of the particular query
and thus is potentially cacheable (in the relcache, say). I do not think
we need to cache it right now but I'd like to have the option to do so.

(I wonder BTW whether check_functional_grouping couldn't be refactored
to use the output of such a function, too.)

Some lesser points:

* I did not like where you plugged in the call to
remove_useless_groupby_columns; there are weird interactions with grouping
sets as to whether it will get called or not, and also that whole chunk
of code is due for refactoring. I'd be inclined to call it from
subquery_planner instead, maybe near where the HAVING clause preprocessing
happens.

* You've got it iterating over every RTE, even those that aren't
RTE_RELATION RTEs. It's pure luck that identify_key_vars doesn't fail
outright when passed a bogus relid, and it's certainly wasteful.

* Both of the loops iterating over the groupClause neglect to check
varlevelsup, thus leading to assert failures or worse if an outer-level
Var is present in the GROUP BY list. (I'm pretty sure outer Vars can
still be present at this point, though I might be wrong.)

* I'd be inclined to see if the relvarlists couldn't be converted into
bitmapsets too. Then the test to see if the pkey is a subset of the
grouping columns becomes a simple and cheap bms_is_subset test. You don't
really need surplusvars either, because you can instead test Vars for
membership in the pkey bitmapsets that pg_constraint.c will be handing
back.

* What you did to join.sql/join.out seems a bit bizarre. The existing
test case that you modified was meant to test something else, and
conflating this behavior with the pre-existing one (and not documenting
it) is confusing as can be. A bit more work on regression test cases
seems indicated.

I'm going to set this back to "Waiting on Author". I think the basic
idea is good but it needs a rewrite.

regards, tom lane


#17David Rowley
david.rowley@2ndquadrant.com
In reply to: Tom Lane (#16)
2 attachment(s)
Re: Removing Functionally Dependent GROUP BY Columns

On 23 January 2016 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I spent some time looking through this but decided that it needs some work
to be committable.

The main thing I didn't like was that identify_key_vars seems to have a
rather poorly chosen API: it mixes up identifying a rel's pkey with
sorting through the GROUP BY list. I think it would be better to create
a function that, given a relid, hands back the OID of the pkey constraint
and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
no pkey or it's deferrable). The column numbers should be offset by
FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
represented correctly --- compare RelationGetIndexAttrBitmap().

That seems like a much better design to me too. I've made changes and
attached the updated patch.

The reason this seems like a more attractive solution is that the output
of the pg_constraint lookup becomes independent of the particular query
and thus is potentially cacheable (in the relcache, say). I do not think
we need to cache it right now but I'd like to have the option to do so.

I've not touched that yet, but good idea.

(I wonder BTW whether check_functional_grouping couldn't be refactored
to use the output of such a function, too.)

I've attached a separate patch for that too. It applies after the
prunegroupby patch.

Some lesser points:

* I did not like where you plugged in the call to
remove_useless_groupby_columns; there are weird interactions with grouping
sets as to whether it will get called or not, and also that whole chunk
of code is due for refactoring. I'd be inclined to call it from
subquery_planner instead, maybe near where the HAVING clause preprocessing
happens.

I've moved the call to subquery_planner()

* You've got it iterating over every RTE, even those that aren't
RTE_RELATION RTEs. It's pure luck that identify_key_vars doesn't fail
outright when passed a bogus relid, and it's certainly wasteful.

:-( I've fixed that now.

* Both of the loops iterating over the groupClause neglect to check
varlevelsup, thus leading to assert failures or worse if an outer-level
Var is present in the GROUP BY list. (I'm pretty sure outer Vars can
still be present at this point, though I might be wrong.)

Fixed in the first loop, and the way I've rewritten the code to use
bms_difference, there's no need to check again in the 2nd loop.

* I'd be inclined to see if the relvarlists couldn't be converted into
bitmapsets too. Then the test to see if the pkey is a subset of the
grouping columns becomes a simple and cheap bms_is_subset test. You don't
really need surplusvars either, because you can instead test Vars for
membership in the pkey bitmapsets that pg_constraint.c will be handing
back.

I've changed it to use a bitmapset now, but I've kept surplusvars;
having it allows a single pass over the GROUP BY clause to remove the
surplus Vars, rather than one pass per relation. I've also found a
better way so that the array is only allocated if some surplus Vars
are found. I understand what you mean, and yes, it is possible to get
rid of it, but I'd still need to add something else to mark that this
rel's extra Vars are okay to be removed. I could do that by adding
another bitmapset and marking the relid in it, but I quite like how
I've changed it now, as it saves having to check varlevelsup again on
the Vars in the GROUP BY clause: I just use bms_difference() on the
originally found Vars, subtracting off the PK vars, and assume all of
those are surplus.

* What you did to join.sql/join.out seems a bit bizarre. The existing
test case that you modified was meant to test something else, and
conflating this behavior with the pre-existing one (and not documenting
it) is confusing as can be. A bit more work on regression test cases
seems indicated.

The history behind that is that at one point during developing the
patch that test had started failing due to the group by item being
removed therefore allowing the join removal conditions to be met. On
testing again with the old test query I see this no longer happens, so
I've removed the change, although the expected output still differs
due to the group by item being removed.

I'm going to set this back to "Waiting on Author". I think the basic
idea is good but it needs a rewrite.

Thanks for the review and the good ideas.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

prune_group_by_clause_ab4f491_2016-01-23.patchapplication/octet-stream; name=prune_group_by_clause_ab4f491_2016-01-23.patchDownload
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index 28e3aed..3e70d07 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
+#include "access/sysattr.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
 #include "catalog/objectaccess.h"
@@ -871,6 +872,90 @@ get_domain_constraint_oid(Oid typid, const char *conname, bool missing_ok)
 }
 
 /*
+ * get_primary_key_attnos
+ *		Returns a Bitmapset of the attribute numbers (offset by
+ *		FirstLowInvalidHeapAttributeNumber) of the columns which make up the
+ *		PRIMARY KEY of 'relid'. If no PRIMARY KEY exists on 'relid' then NULL
+ *		is returned.
+ *
+ * When 'deferrableOk' is false we will return NULL if the PRIMARY KEY
+ * constraint is set to deferrable.
+ * 'constraintOid' will be set to the OID of the PRIMARY KEY constraint,
+ * although this is only set when the function does not return NULL.
+ */
+Bitmapset *
+get_primary_key_attnos(Oid relid, bool deferrableOk, Oid *constraintOid)
+{
+	Relation		pg_constraint;
+	HeapTuple		tuple;
+	SysScanDesc		scan;
+	ScanKeyData		skey[1];
+	Bitmapset	   *pkattnos = NULL;
+
+	/* Scan pg_constraint for constraints of the target rel */
+	pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[0],
+				Anum_pg_constraint_conrelid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(relid));
+
+	scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
+							  NULL, 1, skey);
+
+	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+	{
+		Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
+		Datum		adatum;
+		bool		isNull;
+		ArrayType  *arr;
+		int16	   *attnums;
+		int			numkeys;
+		int			i;
+
+		/* Skip constraints which are not PRIMARY KEYs */
+		if (con->contype != CONSTRAINT_PRIMARY)
+			continue;
+
+		/*
+		 * If the primary key is deferrable, but we've been instructed to
+		 * ignore deferrable constraints, then we'll simply break and exit
+		 * returning NULL. There can only be a single primary key on a table,
+		 * so it's senseless to look for another.
+		 */
+		if (con->condeferrable && !deferrableOk)
+			break;
+
+		/* Extract the conkey array, ie, attnums of PK's columns */
+		adatum = heap_getattr(tuple, Anum_pg_constraint_conkey,
+							  RelationGetDescr(pg_constraint), &isNull);
+		if (isNull)
+			elog(ERROR, "null conkey for constraint %u",
+				 HeapTupleGetOid(tuple));
+		arr = DatumGetArrayTypeP(adatum);		/* ensure not toasted */
+		numkeys = ARR_DIMS(arr)[0];
+		if (ARR_NDIM(arr) != 1 ||
+			numkeys < 0 ||
+			ARR_HASNULL(arr) ||
+			ARR_ELEMTYPE(arr) != INT2OID)
+			elog(ERROR, "conkey is not a 1-D smallint array");
+		attnums = (int16 *) ARR_DATA_PTR(arr);
+
+		for (i = 0; i < numkeys; i++)
+			pkattnos = bms_add_member(pkattnos,
+							attnums[i] - FirstLowInvalidHeapAttributeNumber);
+
+		*constraintOid = HeapTupleGetOid(tuple);
+		break;
+	}
+
+	systable_endscan(scan);
+
+	heap_close(pg_constraint, AccessShareLock);
+
+	return pkattnos;
+}
+
+/*
  * Determine whether a relation can be proven functionally dependent on
  * a set of grouping columns.  If so, return TRUE and add the pg_constraint
  * OIDs of the constraints needed for the proof to the *constraintDeps list.
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index c0ec905..b32a369 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -20,7 +20,9 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "access/sysattr.h"
 #include "access/xact.h"
+#include "catalog/pg_constraint.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
 #include "foreign/fdwapi.h"
@@ -87,6 +89,7 @@ static double preprocess_limit(PlannerInfo *root,
 				 double tuple_fraction,
 				 int64 *offset_est, int64 *count_est);
 static bool limit_needed(Query *parse);
+static void remove_useless_groupby_columns(PlannerInfo *root);
 static List *preprocess_groupclause(PlannerInfo *root, List *force);
 static List *extract_rollup_sets(List *groupingSets);
 static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
@@ -664,6 +667,9 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
 	}
 	parse->havingQual = (Node *) newHaving;
 
+	/* remove any redundant GROUP BY columns */
+	remove_useless_groupby_columns(root);
+
 	/*
 	 * If we have any outer joins, try to reduce them to plain inner joins.
 	 * This step is most easily done after we've done expression
@@ -3160,6 +3166,176 @@ limit_needed(Query *parse)
 	return false;				/* don't need a Limit plan node */
 }
 
+/*
+ * remove_useless_groupby_columns
+ *		Removes columns from the GROUP BY clause which are redundant due to
+ *		being functionally dependent on one or more other columns which are
+ *		present.
+ *
+ * We do support non-aggregated columns in the query's targetlist which are not
+ * present in the GROUP BY clause, providing these columns can be determined to
+ * be functionally dependent on the relation's primary key constraint. There
+ * are quite a few relational databases which don't allow this, which means it
+ * can be quite common for queries to be written so that the GROUP BY clause
+ * includes all of the columns which are not aggregated in the query's
+ * targetlist. These columns have no need to be there as long as the columns
+ * which make up the table's PRIMARY KEY are present in the GROUP BY clause.
+ * Here we test, for each relation whose columns appear in the GROUP BY clause,
+ * whether its primary key columns are all present, and if so we remove any
+ * columns that belong to the same relation but are not part of the primary key.
+ *
+ * In reality we could do more than just the primary key columns here as UNIQUE
+ * indexes with NOT NULL constraints on the columns allow us to also determine
+ * functional dependencies. However we mustn't go any further than what's done
+ * by check_functional_grouping().
+ */
+static void
+remove_useless_groupby_columns(PlannerInfo *root)
+{
+	Bitmapset	  **groupbyattnos;
+	Bitmapset	  **surplusvars;
+	ListCell	   *lc;
+	int				relid;
+	Query		   *parse = root->parse;
+
+	/* Nothing to do when there are fewer than two GROUP BY expressions */
+	if (list_length(parse->groupClause) < 2)
+		return;
+
+	/* Don't fiddle with the GROUP BY clause if the query has grouping sets */
+	if (parse->groupingSets)
+		return;
+
+	groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
+										   (list_length(parse->rtable) + 1));
+	surplusvars = NULL;			/* don't allocate unless required */
+
+	/*
+	 * Pre-process the GROUP BY clause and populate groupbyattnos with
+	 * varattnos of Vars which belong to the clause.
+	 */
+	foreach (lc, parse->groupClause)
+	{
+		SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+		TargetEntry *tle;
+		Var *var;
+
+		tle = get_sortgroupclause_tle(sgc, parse->targetList);
+		var = (Var *) tle->expr;
+
+		/*
+		 * Ignore non-Vars or Vars from a higher level.
+		 * XXX it may be useful to put more effort into removing non-Vars here.
+		 * If the clause is an Expr containing Vars from a single relation then
+		 * we may also be able to remove it, providing that we find all PRIMARY
+		 * KEY columns elsewhere in the GROUP BY clause. Perhaps we can improve
+		 * on this later, but for now it seems better to keep it simple.
+		 */
+		if (!IsA(var, Var) || var->varlevelsup > 0)
+			continue;
+
+		relid = var->varno;
+
+		Assert(relid <= list_length(parse->rtable));
+		groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
+						var->varattno - FirstLowInvalidHeapAttributeNumber);
+	}
+
+	/*
+	 * Look at each relation and check if any of the Vars which were found in
+	 * the GROUP BY clause are not needed due to all the PK constraint Vars
+	 * being present in the clause. We'll take note of any surplus Vars which
+	 * we find rather than removing them from the GROUP BY clause right away.
+	 * This allows us to make a single pass over the GROUP BY clause, rather
+	 * than one pass per relation.
+	 */
+	relid = 0;
+	foreach(lc, parse->rtable)
+	{
+		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
+		Bitmapset *relattnos;
+		Bitmapset *pkattnos;
+		Oid		  constraintOid;
+
+		relid++;
+
+		if (rte->rtekind != RTE_RELATION)
+			continue;
+
+		relattnos = groupbyattnos[relid];
+
+		/*
+		 * Clearly there's no room for removing Vars from this rel from the
+		 * GROUP BY clause if there were fewer than two Vars found to belong to
+		 * this rel.
+		 */
+		if (bms_num_members(relattnos) < 2)
+			continue;
+
+		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
+
+		/*
+		 * Can't remove any columns for this rel if there is no suitable
+		 * primary key constraint.
+		 */
+		if (pkattnos == NULL)
+			continue;
+
+		/*
+		 * If pkattnos is a proper subset of relattnos then we have some
+		 * items in the GROUP BY which can be removed.
+		 */
+		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+		{
+			parse->constraintDeps = lappend_oid(parse->constraintDeps,
+												constraintOid);
+
+			/*
+			 * Now record which Vars are not required in the GROUP BY. We must
+			 * first allocate space to do this, which we've delayed until here
+			 * to save on unnecessary allocations where no removals are
+			 * possible.
+			 */
+			if (surplusvars == NULL)
+				surplusvars = (Bitmapset **) palloc0(sizeof(Bitmapset *) *
+											(list_length(parse->rtable) + 1));
+
+			/*
+			 * Mark all Vars which we originally collected above minus the
+			 * primary key Vars as surplus. We'll remove all of these from the
+			 * GROUP BY in one pass, once we've checked all other relations
+			 * which are involved in the GROUP BY clause.
+			 */
+			surplusvars[relid] = bms_difference(relattnos, pkattnos);
+		}
+	}
+
+	/* build a new GROUP BY clause with all of the surplus Vars removed */
+	if (surplusvars != NULL)
+	{
+		List *new_groupby = NIL;
+
+		foreach (lc, root->parse->groupClause)
+		{
+			SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+			TargetEntry *tle;
+			Var *var;
+
+			tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+			var = (Var *) tle->expr;
+
+			/* re-add non-Vars and anything not marked as surplus */
+			if (!IsA(var, Var) ||
+				!bms_is_member(var->varattno -
+				FirstLowInvalidHeapAttributeNumber, surplusvars[var->varno]))
+				new_groupby = lappend(new_groupby, sgc);
+
+			/* XXX do we need to alter tleSortGroupRef to remove gaps? */
+		}
+		root->parse->groupClause = new_groupby;
+	}
+}
+
 
 /*
  * preprocess_groupclause - do preparatory work on GROUP BY clause
diff --git a/src/include/catalog/pg_constraint.h b/src/include/catalog/pg_constraint.h
index 07dbcea..bb8983c 100644
--- a/src/include/catalog/pg_constraint.h
+++ b/src/include/catalog/pg_constraint.h
@@ -250,6 +250,9 @@ extern void AlterConstraintNamespaces(Oid ownerId, Oid oldNspId,
 extern Oid	get_relation_constraint_oid(Oid relid, const char *conname, bool missing_ok);
 extern Oid	get_domain_constraint_oid(Oid typid, const char *conname, bool missing_ok);
 
+extern Bitmapset *get_primary_key_attnos(Oid relid, bool deferrableOk,
+										 Oid *constraintOid);
+
 extern bool check_functional_grouping(Oid relid,
 						  Index varno, Index varlevelsup,
 						  List *grouping_columns,
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index de826b5..65f6d4f 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -751,6 +751,69 @@ select max(unique2), generate_series(1,3) as g from tenk1 order by g desc;
  9999 | 1
 (3 rows)
 
+-- check group by clause is processed correctly to remove surplus columns
+create temp table t1 (a int, b int, c int, d int, primary key (a, b));
+create temp table t2 (x int, y int, z int, primary key (x, y));
+create temp table t3 (a int, b int, c int, primary key(a, b) deferrable);
+-- ensure only primary key columns remain in group by
+explain (costs off) select * from t1 group by a,b,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b
+   ->  Seq Scan on t1
+(3 rows)
+
+-- ensure c and d are not removed since the complete PK is not present in the
+-- group by clause
+explain (costs off) select a,c from t1 group by a,c,d;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, c, d
+   ->  Seq Scan on t1
+(3 rows)
+
+-- ensure non-pk columns are removed from group by for all relations.
+explain (costs off) select * from t1
+inner join t2 on t1.a = t2.x and t1.b = t2.y
+group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t2.z;
+                      QUERY PLAN                       
+-------------------------------------------------------
+ Group
+   Group Key: t1.a, t1.b, t2.x, t2.y
+   ->  Merge Join
+         Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
+         ->  Index Scan using t1_pkey on t1
+         ->  Index Scan using t2_pkey on t2
+(6 rows)
+
+-- ensure non-pk columns are only removed from t1.
+explain (costs off) select t1.*,t2.x,t2.z from t1
+inner join t2 on t1.a = t2.x and t1.b = t2.y
+group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
+                      QUERY PLAN                       
+-------------------------------------------------------
+ HashAggregate
+   Group Key: t1.a, t1.b, t2.x, t2.z
+   ->  Merge Join
+         Merge Cond: ((t1.a = t2.x) AND (t1.b = t2.y))
+         ->  Index Scan using t1_pkey on t1
+         ->  Index Scan using t2_pkey on t2
+(6 rows)
+
+-- ensure group by columns are not removed when pk is deferrable
+explain (costs off) select * from t3 group by a,b,c;
+      QUERY PLAN      
+----------------------
+ HashAggregate
+   Group Key: a, b, c
+   ->  Seq Scan on t3
+(3 rows)
+
+drop table t1;
+drop table t2;
+drop table t3;
 -- try it on an inheritance tree
 create table minmaxtest(f1 int);
 create table minmaxtest1() inherits (minmaxtest);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 64f046e..5bb406c 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3919,8 +3919,8 @@ select d.* from d left join (select distinct * from b) s
 explain (costs off)
 select d.* from d left join (select * from b group by b.id, b.c_id) s
   on d.a = s.id;
-                 QUERY PLAN                  
----------------------------------------------
+              QUERY PLAN               
+---------------------------------------
  Merge Left Join
    Merge Cond: (d.a = s.id)
    ->  Sort
@@ -3930,7 +3930,7 @@ select d.* from d left join (select * from b group by b.id, b.c_id) s
          Sort Key: s.id
          ->  Subquery Scan on s
                ->  HashAggregate
-                     Group Key: b.id, b.c_id
+                     Group Key: b.id
                      ->  Seq Scan on b
 (11 rows)
 
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 8d501dc..d09ff08 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -270,6 +270,35 @@ explain (costs off)
   select max(unique2), generate_series(1,3) as g from tenk1 order by g desc;
 select max(unique2), generate_series(1,3) as g from tenk1 order by g desc;
 
+-- check group by clause is processed correctly to remove surplus columns
+create temp table t1 (a int, b int, c int, d int, primary key (a, b));
+create temp table t2 (x int, y int, z int, primary key (x, y));
+create temp table t3 (a int, b int, c int, primary key(a, b) deferrable);
+
+-- ensure only primary key columns remain in group by
+explain (costs off) select * from t1 group by a,b,c,d;
+
+-- ensure c and d are not removed since the complete PK is not present in the
+-- group by clause
+explain (costs off) select a,c from t1 group by a,c,d;
+
+-- ensure non-pk columns are removed from group by for all relations.
+explain (costs off) select * from t1
+inner join t2 on t1.a = t2.x and t1.b = t2.y
+group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t2.z;
+
+-- ensure non-pk columns are only removed from t1.
+explain (costs off) select t1.*,t2.x,t2.z from t1
+inner join t2 on t1.a = t2.x and t1.b = t2.y
+group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;
+
+-- ensure group by columns are not removed when pk is deferrable
+explain (costs off) select * from t3 group by a,b,c;
+
+drop table t1;
+drop table t2;
+drop table t3;
+
 -- try it on an inheritance tree
 create table minmaxtest(f1 int);
 create table minmaxtest1() inherits (minmaxtest);
check_functional_grouping_refactor.patchapplication/octet-stream; name=check_functional_grouping_refactor.patchDownload
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index 3e70d07..af160d5 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -976,93 +976,32 @@ check_functional_grouping(Oid relid,
 						  List *grouping_columns,
 						  List **constraintDeps)
 {
-	bool		result = false;
-	Relation	pg_constraint;
-	HeapTuple	tuple;
-	SysScanDesc scan;
-	ScanKeyData skey[1];
+	Bitmapset	   *pkattnos;
+	Bitmapset	   *groupbyattnos = NULL;
+	Oid				constraintOid;
+	ListCell	   *gl;
 
-	/* Scan pg_constraint for constraints of the target rel */
-	pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);
+	pkattnos = get_primary_key_attnos(relid, false, &constraintOid);
 
-	ScanKeyInit(&skey[0],
-				Anum_pg_constraint_conrelid,
-				BTEqualStrategyNumber, F_OIDEQ,
-				ObjectIdGetDatum(relid));
+	/* if the rel has no PK, then we can't prove functional dependency */
+	if (pkattnos == NULL)
+		return false;
 
-	scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
-							  NULL, 1, skey);
-
-	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+	foreach(gl, grouping_columns)
 	{
-		Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
-		Datum		adatum;
-		bool		isNull;
-		ArrayType  *arr;
-		int16	   *attnums;
-		int			numkeys;
-		int			i;
-		bool		found_col;
-
-		/* Only PK constraints are of interest for now, see comment above */
-		if (con->contype != CONSTRAINT_PRIMARY)
-			continue;
-		/* Constraint must be non-deferrable */
-		if (con->condeferrable)
-			continue;
+		Var		   *gvar = (Var *) lfirst(gl);
 
-		/* Extract the conkey array, ie, attnums of PK's columns */
-		adatum = heap_getattr(tuple, Anum_pg_constraint_conkey,
-							  RelationGetDescr(pg_constraint), &isNull);
-		if (isNull)
-			elog(ERROR, "null conkey for constraint %u",
-				 HeapTupleGetOid(tuple));
-		arr = DatumGetArrayTypeP(adatum);		/* ensure not toasted */
-		numkeys = ARR_DIMS(arr)[0];
-		if (ARR_NDIM(arr) != 1 ||
-			numkeys < 0 ||
-			ARR_HASNULL(arr) ||
-			ARR_ELEMTYPE(arr) != INT2OID)
-			elog(ERROR, "conkey is not a 1-D smallint array");
-		attnums = (int16 *) ARR_DATA_PTR(arr);
-
-		found_col = false;
-		for (i = 0; i < numkeys; i++)
-		{
-			AttrNumber	attnum = attnums[i];
-			ListCell   *gl;
-
-			found_col = false;
-			foreach(gl, grouping_columns)
-			{
-				Var		   *gvar = (Var *) lfirst(gl);
-
-				if (IsA(gvar, Var) &&
-					gvar->varno == varno &&
-					gvar->varlevelsup == varlevelsup &&
-					gvar->varattno == attnum)
-				{
-					found_col = true;
-					break;
-				}
-			}
-			if (!found_col)
-				break;
-		}
-
-		if (found_col)
-		{
-			/* The PK is a subset of grouping_columns, so we win */
-			*constraintDeps = lappend_oid(*constraintDeps,
-										  HeapTupleGetOid(tuple));
-			result = true;
-			break;
-		}
+		if (IsA(gvar, Var) &&
+			gvar->varno == varno &&
+			gvar->varlevelsup == varlevelsup)
+			groupbyattnos = bms_add_member(groupbyattnos, gvar->varattno -
+										   FirstLowInvalidHeapAttributeNumber);
 	}
 
-	systable_endscan(scan);
-
-	heap_close(pg_constraint, AccessShareLock);
-
-	return result;
+	if (bms_is_subset(pkattnos, groupbyattnos))
+	{
+		*constraintDeps = lappend_oid(*constraintDeps, constraintOid);
+		return true;
+	}
+	return false;
 }
#18Julien Rouhaud
julien.rouhaud@dalibo.com
In reply to: David Rowley (#17)
Re: Removing Functionally Dependent GROUP BY Columns

On 23/01/2016 10:44, David Rowley wrote:

On 23 January 2016 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I spent some time looking through this but decided that it needs some work
to be committable.

The main thing I didn't like was that identify_key_vars seems to have a
rather poorly chosen API: it mixes up identifying a rel's pkey with
sorting through the GROUP BY list. I think it would be better to create
a function that, given a relid, hands back the OID of the pkey constraint
and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
no pkey or it's deferrable). The column numbers should be offset by
FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
represented correctly --- compare RelationGetIndexAttrBitmap().

That seems like a much better design to me too. I've made changes and
attached the updated patch.

The reason this seems like a more attractive solution is that the output
of the pg_constraint lookup becomes independent of the particular query
and thus is potentially cacheable (in the relcache, say). I do not think
we need to cache it right now but I'd like to have the option to do so.

I've not touched that yet, but good idea.

(I wonder BTW whether check_functional_grouping couldn't be refactored
to use the output of such a function, too.)

I've attached a separate patch for that too. It applies after the
prunegroupby patch.

This refactoring makes the code much better, +1 from me.

I also reviewed it, it looks fine. However, the following removed
comment could be kept for clarity:

- /* The PK is a subset of grouping_columns, so we win */

Some lesser points:

* I did not like where you plugged in the call to
remove_useless_groupby_columns; there are weird interactions with grouping
sets as to whether it will get called or not, and also that whole chunk
of code is due for refactoring. I'd be inclined to call it from
subquery_planner instead, maybe near where the HAVING clause preprocessing
happens.

I've moved the call to subquery_planner()

* You've got it iterating over every RTE, even those that aren't
RTE_RELATION RTEs. It's pure luck that identify_key_vars doesn't fail
outright when passed a bogus relid, and it's certainly wasteful.

:-( I've fixed that now.

* Both of the loops iterating over the groupClause neglect to check
varlevelsup, thus leading to assert failures or worse if an outer-level
Var is present in the GROUP BY list. (I'm pretty sure outer Vars can
still be present at this point, though I might be wrong.)

Fixed in the first loop, and the way I've rewritten the code to use
bms_difference, there's no need to check again in the 2nd loop.

* I'd be inclined to see if the relvarlists couldn't be converted into
bitmapsets too. Then the test to see if the pkey is a subset of the
grouping columns becomes a simple and cheap bms_is_subset test. You don't
really need surplusvars either, because you can instead test Vars for
membership in the pkey bitmapsets that pg_constraint.c will be handing
back.

I've changed it to use a bitmapset now, but I've kept surplusvars; having
it allows a single pass over the GROUP BY clause to remove the
surplus Vars, rather than doing it again and again for each relation.
I've also found a better way so that the array is only allocated if some
surplus Vars are found. I understand what you mean, and yes, it is
possible to get rid of it, but I'd still need to add something else to
mark that this rel's extra Vars are okay to be removed. I could do
that by adding another bitmapset and marking the relid in that, but I
quite like how I've changed it now, as it saves having to check
varlevelsup again on the Vars in the GROUP BY clause: I just use
bms_difference() on the originally found Vars, subtracting off the PK
vars, and assume all of those are surplus.

I wonder if in remove_useless_groupby_columns(), in the foreach loop you
could change the

+       if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+       {

by something like

+       if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
+           continue;
+
+       if (bms_is_subset(pkattnos, relattnos))
+       {

which may be cheaper.

Otherwise, I couldn't find any issue with this new version of the patch.

* What you did to join.sql/join.out seems a bit bizarre. The existing
test case that you modified was meant to test something else, and
conflating this behavior with the pre-existing one (and not documenting
it) is confusing as can be. A bit more work on regression test cases
seems indicated.

The history behind that is that at one point during developing the
patch that test had started failing due to the group by item being
removed therefore allowing the join removal conditions to be met. On
testing again with the old test query I see this no longer happens, so
I've removed the change, although the expected output still differs
due to the group by item being removed.

I'm going to set this back to "Waiting on Author". I think the basic
idea is good but it needs a rewrite.

Thanks for the review and the good ideas.

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19David Rowley
david.rowley@2ndquadrant.com
In reply to: Julien Rouhaud (#18)
Re: Removing Functionally Dependent GROUP BY Columns

On 24 January 2016 at 00:56, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:

I wonder if in remove_useless_groupby_columns(), in the foreach loop you
could change the

+       if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+       {

by something like

+       if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
+           continue;
+
+       if (bms_is_subset(pkattnos, relattnos))
+       {

which may be cheaper.

Thanks for looking over this again.

I actually had that part written quite a few different ways before I
finally decided to use bms_subset_compare(). I didn't benchmark, but I
figured 1 function call was better than 2: I had it as
bms_is_subset(pkattnos, relattnos) && !bms_is_subset(relattnos,
pkattnos), and again with !bms_equal() instead of the 2nd subset test,
before finally settling on bms_subset_compare(). I'm thinking that 3
function calls likely won't make things better. I can't imagine it's
going to cost much either way, so I doubt it's worth trying to check
what's faster. Although the thing about bms_num_members() is that it's
going to loop over each word in the bitmap no matter what, whereas a
subset check can abort early in some cases.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#20Julien Rouhaud
julien.rouhaud@dalibo.com
In reply to: David Rowley (#19)
Re: Removing Functionally Dependent GROUP BY Columns

On 23/01/2016 14:51, David Rowley wrote:

On 24 January 2016 at 00:56, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:

I wonder if in remove_useless_groupby_columns(), in the foreach loop you
could change the

+       if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+       {

by something like

+       if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
+           continue;
+
+       if (bms_is_subset(pkattnos, relattnos))
+       {

which may be cheaper.

Thanks for looking over this again.

I actually had that part written quite a few different ways before I
finally decided to use bms_subset_compare(). I didn't benchmark, but I
figured 1 function call was better than 2: I had it as
bms_is_subset(pkattnos, relattnos) && !bms_is_subset(relattnos,
pkattnos), and again with !bms_equal() instead of the 2nd subset test,
before finally settling on bms_subset_compare(). I'm thinking that 3
function calls likely won't make things better. I can't imagine it's
going to cost much either way, so I doubt it's worth trying to check
what's faster. Although the thing about bms_num_members() is that it's
going to loop over each word in the bitmap no matter what, whereas a
subset check can abort early in some cases.

FWIW, this code was a simplified example. bms_num_members(relattnos) is
already called a few lines before, so it'd be 1 function call against 2
function calls (and a var assignment).

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org


#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Julien Rouhaud (#18)
Re: Removing Functionally Dependent GROUP BY Columns

Julien Rouhaud <julien.rouhaud@dalibo.com> writes:

I wonder if in remove_useless_groupby_columns(), in the foreach loop you
could change the

+       if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+       {

by something like

+       if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
+           continue;
+
+       if (bms_is_subset(pkattnos, relattnos))
+       {

which may be cheaper.

FWIW, I really doubt that would be cheaper. The various flavors of
subset comparison are word-at-a-time bitmasking operations, but
bms_num_members has to grovel over individual bits; it's certain to
be more expensive than a subset test.

regards, tom lane


#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#17)
Re: Removing Functionally Dependent GROUP BY Columns

David Rowley <david.rowley@2ndquadrant.com> writes:

On 23 January 2016 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

* What you did to join.sql/join.out seems a bit bizarre. The existing
test case that you modified was meant to test something else, and
conflating this behavior with the pre-existing one (and not documenting
it) is confusing as can be. A bit more work on regression test cases
seems indicated.

The history behind that is that at one point during developing the
patch that test had started failing due to the group by item being
removed therefore allowing the join removal conditions to be met. On
testing again with the old test query I see this no longer happens, so
I've removed the change, although the expected output still differs
due to the group by item being removed.

Hmm ... but ... it seems to me that the test as it stands *should* fail
after this patch, because once the non-pkey grouping column is removed
the join removal optimization should apply. I think we should look a bit
more closely at what's happening there.

(IOW, I wasn't so much unhappy with the change to that test case as
that it was being used as the only test case for this new behavior.
I see you added some new, separate test cases, so that's good; but
there's something fishy if the existing case doesn't change behavior.)

regards, tom lane


#23David Rowley
david.rowley@2ndquadrant.com
In reply to: Tom Lane (#22)
Re: Removing Functionally Dependent GROUP BY Columns

On 24 January 2016 at 08:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <david.rowley@2ndquadrant.com> writes:

On 23 January 2016 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

* What you did to join.sql/join.out seems a bit bizarre. The existing
test case that you modified was meant to test something else, and
conflating this behavior with the pre-existing one (and not documenting
it) is confusing as can be. A bit more work on regression test cases
seems indicated.

The history behind that is that at one point during developing the
patch that test had started failing due to the group by item being
removed therefore allowing the join removal conditions to be met. On
testing again with the old test query I see this no longer happens, so
I've removed the change, although the expected output still differs
due to the group by item being removed.

Hmm ... but ... it seems to me that the test as it stands *should* fail
after this patch, because once the non-pkey grouping column is removed
the join removal optimization should apply. I think we should look a bit
more closely at what's happening there.

(IOW, I wasn't so much unhappy with the change to that test case as
that it was being used as the only test case for this new behavior.
I see you added some new, separate test cases, so that's good; but
there's something fishy if the existing case doesn't change behavior.)

Thanks for looking at this again.

I've looked into why the join is not removed: since the redundant
GROUP BY columns are removed during planning, and since the outer
query is planned before the subquery, when the join removal code
checks whether the subquery can be removed, the subquery is yet to be
planned, so it still contains the 2 GROUP BY items.

Perhaps the useless columns can be removed a bit earlier, perhaps in
parse analysis. I will look into that now.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#23)
Re: Removing Functionally Dependent GROUP BY Columns

David Rowley <david.rowley@2ndquadrant.com> writes:

I've looked into why the join is not removed: since the redundant
GROUP BY columns are removed during planning, and since the outer
query is planned before the subquery, when the join removal code
checks whether the subquery can be removed, the subquery is yet to be
planned, so it still contains the 2 GROUP BY items.

Hmm ... but why did it get removed in the earlier patch version, then?

Perhaps the useless columns can be removed a bit earlier, perhaps in
parse analysis. I will look into that now.

No; doing this in parse analysis will be sufficient reason to reject the
patch. That would mean adding a not-semantically-necessary dependency on
the pkey to a query when it is stored as a view. It has to be done at
planning time and no sooner.

It's possible that you could integrate it into some earlier phase of
planning, like prepjointree, but I think that would be messy and likely
not worth it. I don't see any existing query-tree traversal this could
piggyback on, and I doubt we want to add a new pass just for this.

regards, tom lane


#25David Rowley
david.rowley@2ndquadrant.com
In reply to: Tom Lane (#24)
Re: Removing Functionally Dependent GROUP BY Columns

On 25 January 2016 at 10:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <david.rowley@2ndquadrant.com> writes:

I've looked into why the join is not removed; since the redundant
GROUP BY columns are removed during planning, and since the outer
query is planned before the sub query, then when the join removal code
checks if the subquery can been removed, the subquery is yet to be
planned, so still contains the 2 GROUP BY items.

Hmm ... but why did it get removed in the earlier patch version, then?

I'm not sure now, it was months ago. Perhaps I misremembered and only
altered the test because I mistakenly anticipated it would break.

Perhaps the useless columns can be removed a bit earlier, perhaps in
parse analysis. I will look into that now.

No; doing this in parse analysis will be sufficient reason to reject the
patch. That would mean adding a not-semantically-necessary dependency on
the pkey to a query when it is stored as a view. It has to be done at
planning time and no sooner.

It's possible that you could integrate it into some earlier phase of
planning, like prepjointree, but I think that would be messy and likely
not worth it. I don't see any existing query-tree traversal this could
piggyback on, and I doubt we want to add a new pass just for this.

It seems like a bit of a corner case anyway. Maybe it's fine as is.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Rowley (#17)
Re: Removing Functionally Dependent GROUP BY Columns

This patch got a good share of review, so it's fair to move it to the
next commitfest. I marked it as ready-for-committer there which seems
to be the right status.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#17)
Re: Removing Functionally Dependent GROUP BY Columns

David Rowley <david.rowley@2ndquadrant.com> writes:

[ prune_group_by_clause_ab4f491_2016-01-23.patch ]
[ check_functional_grouping_refactor.patch ]

I've committed this with mostly-cosmetic revisions (principally, rewriting
a lot of the comments, which seemed on the sloppy side).

* Both of the loops iterating over the groupClause neglect to check
varlevelsup, thus leading to assert failures or worse if an outer-level
Var is present in the GROUP BY list. (I'm pretty sure outer Vars can
still be present at this point, though I might be wrong.)

Fixed in the first loop, and the way I've rewritten the code to use
bms_difference, there's no need to check again in the 2nd loop.

Um, AFAICS, you *do* need to check again in the second loop, else you'll
be accessing a surplusvars[] entry that might not exist at all, and in
any case might falsely tell you that you can exclude the outer var from
the new GROUP BY list.

That was the only actual bug I found, though I changed some other stuff.

regards, tom lane


#28David Rowley
david.rowley@2ndquadrant.com
In reply to: Tom Lane (#27)
Re: Removing Functionally Dependent GROUP BY Columns

On 12/02/2016 12:01 am, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

David Rowley <david.rowley@2ndquadrant.com> writes:

[ prune_group_by_clause_ab4f491_2016-01-23.patch ]
[ check_functional_grouping_refactor.patch ]

I've committed this with mostly-cosmetic revisions (principally, rewriting
a lot of the comments, which seemed on the sloppy side).

Many thanks for committing this and improving the comments.

* Both of the loops iterating over the groupClause neglect to check
varlevelsup, thus leading to assert failures or worse if an outer-level
Var is present in the GROUP BY list. (I'm pretty sure outer Vars can
still be present at this point, though I might be wrong.)

Fixed in the first loop, and the way I've rewritten the code to use
bms_difference, there's no need to check again in the 2nd loop.

Um, AFAICS, you *do* need to check again in the second loop, else you'll
be accessing a surplusvars[] entry that might not exist at all, and in
any case might falsely tell you that you can exclude the outer var from
the new GROUP BY list.

I can't quite understand what you're seeing here. As far as I can tell from
reading the code again (on my phone), the varlevelsup check in the 2nd loop
is not required. Any Var with varlevelsup above zero won't make it into the
surplus bitmap for the relation, as the bms_difference call just removes
the PK vars from the varlevelsup = 0 vars. So the bms_is_member should be
fine. I can't see what I'm missing. In either case the test does no harm.

That was the only actual bug I found, though I changed some other stuff.

#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#28)
Re: Removing Functionally Dependent GROUP BY Columns

David Rowley <david.rowley@2ndquadrant.com> writes:

On 12/02/2016 12:01 am, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

Um, AFAICS, you *do* need to check again in the second loop, else you'll
be accessing a surplusvars[] entry that might not exist at all, and in
any case might falsely tell you that you can exclude the outer var from
the new GROUP BY list.

I can't quite understand what you're seeing here.

The second loop is iterating through the original GROUP BY list, so it
will see again any outer Vars that were excluded by the first loop.
It needs to exclude them exactly the same, because they are outside
the scope of its data structures. Consider something like (admittedly
pretty silly, but legal SQL)

create table up (u1 int, u2 int, u3 int);
create table down (f1 int primary key, f2 int);

select * from othertable, up
where u1 in (select f2 from down group by f1, f2, up.u3);

up.u3 would have varlevelsup = 1, varno = 2, varattno = 3.
If you don't skip it then the surplusvars[var->varno] access
will be trying to fetch off the end of the surplusvars[] array,
because there is only one RTE in the subquery's rangetable
though there are two in the outer query's rangetable.

When I trace through this example, it manages not to crash because
in point of fact the outer Var has already been replaced by a Param.
But I don't think this code should be assuming that; it's an artifact
of the detailed order of processing in subquery_planner(), and could
easily change in the future, for example if we tried to move the
whole affair to the jointree preprocessing stage as we discussed earlier.

regards, tom lane


#30David Rowley
david.rowley@2ndquadrant.com
In reply to: Tom Lane (#29)
Re: Removing Functionally Dependent GROUP BY Columns

On 14/02/2016 5:11 pm, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

David Rowley <david.rowley@2ndquadrant.com> writes:

On 12/02/2016 12:01 am, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
I can't quite understand what you're seeing here.

The second loop is iterating through the original GROUP BY list, so it
will see again any outer Vars that were excluded by the first loop.
It needs to exclude them exactly the same, because they are outside
the scope of its data structures. Consider something like (admittedly
pretty silly, but legal SQL)

create table up (u1 int, u2 int, u3 int);
create table down (f1 int primary key, f2 int);

select * from othertable, up
where u1 in (select f2 from down group by f1, f2, up.u3);

up.u3 would have varlevelsup = 1, varno = 2, varattno = 3.
If you don't skip it then the surplusvars[var->varno] access
will be trying to fetch off the end of the surplusvars[] array,
because there is only one RTE in the subquery's rangetable
though there are two in the outer query's rangetable.

Thanks for explaining this. Clearly I missed the case of the varno pointing
off the end of the array. Thanks for noticing and fixing.