Remove useless GROUP BY columns considering unique index
Hi,
This idea first came from "remove_useless_groupby_columns does not need to record constraint dependencies" [0], which points out that
a unique index whose columns all have NOT NULL constraints could also do the work of a primary key when removing useless GROUP BY columns.
I studied it and implemented the idea.
Ex:
create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));
explain (costs off) select * from t2 group by a,b,c;
QUERY PLAN
----------------------
HashAggregate
Group Key: c
-> Seq Scan on t2
The plan drops columns a and b, as this patch does a little more than before.
For this query, since t2 has primary key (a, b), before this patch we could drop column c because {a, b} is the PK.
Now that we also have a unique index on (c) with a NOT NULL constraint, we can instead drop columns {a, b} and keep just {c}.
So we have multiple choices: group by a, b (c removed by the PK), or group by c (a, b removed by the unique not-null index).
I implemented it to choose the one with fewer columns, so that we can drop as many columns as possible.
I think it's good for the planner to save some cost, e.g. sorting fewer columns.
There may be a better choice for some reason (e.g. prefer keeping the PK for the planner?),
but I'm not sure about that, and it doesn't seem worth much extra complexity.
The NOT NULL property may also be implied by a primary key, e.g.:
create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(a));
Primary key (a, b) ensures a is NOT NULL, and we have a unique index on (a); but checking whether such a unique index could be used would introduce more complexity.
I also doubt it's worth doing that.
So my patch keeps it simple: check the unique index's columns; it's a valid candidate if all of them have NOT NULL constraints.
Then we choose the best one, the one with the fewest columns, in get_min_unique_not_null_attnos(); the reasoning is that fewer columns mean more GROUP BY columns can be removed.
create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
-- Test primary key beats unique not null index.
explain (costs off) select * from t3 group by a,b,c,d;
QUERY PLAN
----------------------
HashAggregate
Group Key: a, b
-> Seq Scan on t3
(3 rows)
create temp table t4 (a int, b int not null, c int not null, d int not null, primary key (a, b), unique(b, c), unique(d));
-- Test unique not null index with less columns wins.
explain (costs off) select * from t4 group by a,b,c,d;
QUERY PLAN
----------------------
HashAggregate
Group Key: d
-> Seq Scan on t4
(3 rows)
The unique indexes could overlap with the primary key and with each other.
create temp table t5 (a int not null, b int not null, c int not null, d int not null, unique (a, b), unique(b, c), unique(a, c, d));
-- Test unique not null indices have overlap.
explain (costs off) select * from t5 group by a,b,c,d;
QUERY PLAN
----------------------
HashAggregate
Group Key: a, b
-> Seq Scan on t5
(3 rows)
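The selection rule laid out above can be modeled with a short Python sketch (illustrative only; the patch itself is C working on attribute-number sets, and this function name is hypothetical):

```python
# Illustrative model: among unique indexes whose columns all have NOT NULL
# constraints and all appear in the GROUP BY clause, choose the one with
# the fewest columns; its columns become the surviving group key.

def min_unique_not_null_key(group_by, candidates):
    """group_by: set of GROUP BY columns.
    candidates: column sets of unique indexes whose columns are all
    NOT NULL (the primary key counts as one such candidate).
    Returns the surviving group key (unchanged if no candidate applies)."""
    usable = [c for c in candidates if c <= group_by]
    if not usable:
        return set(group_by)
    # min() keeps the first minimal candidate, which reproduces the t5
    # result above where (a, b) beats the equally-sized (b, c).
    return set(min(usable, key=len))

# t2 above: PK (a, b) vs. unique not-null (c); the smaller key {c} wins.
print(min_unique_not_null_key({"a", "b", "c"},
                              [{"a", "b"}, {"c"}]))  # prints {'c'}
```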
[0]: /messages/by-id/CAApHDvrdYa=VhOoMe4ZZjZ-G4ALnD-xuAeUNCRTL+PYMVN8OnQ@mail.gmail.com
Zhang Mingli
www.hashdata.xyz
Attachments:
v1-0001-Remove-useless-GROUP-BY-columns-considering-unique-i.patch (+229, -11)
On Fri, Dec 29, 2023 at 11:05 PM Zhang Mingli <zmlpostgres@gmail.com> wrote:
Hi,
This idea first came from "remove_useless_groupby_columns does not need to record constraint dependencies" [0], which points out that
a unique index whose columns all have NOT NULL constraints could also do the work of a primary key when removing useless GROUP BY columns.
I studied it and implemented the idea.
[0] /messages/by-id/CAApHDvrdYa=VhOoMe4ZZjZ-G4ALnD-xuAeUNCRTL+PYMVN8OnQ@mail.gmail.com
cosmetic issues:
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+-- materialized view
+--
+create temp table t1 (a int, b int, c int, primary key (a, b), unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a,
b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null,
primary key (a, b), unique(c, d));
+create temp table t4 (a int, b int not null, c int not null, d int
not null, primary key (a, b), unique(b, c), unique(d));
+create temp table t5 (a int not null, b int not null, c int not null,
d int not null, unique (a, b), unique(b, c), unique(a, c, d));
+
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+
+-- Test unique not null index beats primary key.
+explain (costs off) select * from t2 group by a,b,c;
+
+-- Test primary key beats unique not null index.
+explain (costs off) select * from t3 group by a,b,c,d;
+
+-- Test unique not null index with less columns wins.
+explain (costs off) select * from t4 group by a,b,c,d;
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t5 group by a,b,c,d;
+
+drop table t1;
+drop table t2;
+drop table t3;
+drop table t4;
+
the `materialized view` comment line seems unnecessary?
you forgot to drop table t5.
create temp table p_t2 (
a int not null,
b int not null,
c int not null,
d int,
unique(a), unique(a,b),unique(a,b,c)
) partition by list(a);
create temp table p_t2_1 partition of p_t2 for values in(1);
create temp table p_t2_2 partition of p_t2 for values in(2);
explain (costs off, verbose) select * from p_t2 group by a,b,c,d;
your patch output:
QUERY PLAN
--------------------------------------------------------------
HashAggregate
Output: p_t2.a, p_t2.b, p_t2.c, p_t2.d
Group Key: p_t2.a
-> Append
-> Seq Scan on pg_temp.p_t2_1
Output: p_t2_1.a, p_t2_1.b, p_t2_1.c, p_t2_1.d
-> Seq Scan on pg_temp.p_t2_2
Output: p_t2_2.a, p_t2_2.b, p_t2_2.c, p_t2_2.d
(8 rows)
so it seems to work with partitioned tables; maybe you should add some
test cases with partitioned tables.
- * XXX This is only used to create derived tables, so NO INHERIT constraints
- * are always skipped.
+ * XXX When used to create derived tables, set skip_no_inherit tp true,
+ * so that NO INHERIT constraints will be skipped.
*/
List *
-RelationGetNotNullConstraints(Oid relid, bool cooked)
+RelationGetNotNullConstraints(Oid relid, bool cooked, bool skip_no_inherit)
{
List *notnulls = NIL;
Relation constrRel;
@@ -815,7 +815,7 @@ RelationGetNotNullConstraints(Oid relid, bool cooked)
if (conForm->contype != CONSTRAINT_NOTNULL)
continue;
- if (conForm->connoinherit)
+ if (skip_no_inherit && conForm->connoinherit)
continue;
I don't think you need to refactor RelationGetNotNullConstraints.
However, I found it hard to reason about (which one is the parent and
which one is the child is very confusing to me),
so I used the following script to test it:
DROP TABLE ATACC1, ATACC2;
CREATE TABLE ATACC1 (a int);
CREATE TABLE ATACC2 (b int,c int, unique(c)) INHERITS (ATACC1);
ALTER TABLE ATACC1 ADD NOT NULL a;
-- ALTER TABLE ATACC1 ADD NOT NULL a NO INHERIT;
ALTER TABLE ATACC2 ADD unique(a);
explain (costs off, verbose) select * from ATACC2 group by a,b,c;
-------------------------
create table t_0_3 (a int not null, b int not null, c int not null, d int not null, unique (a, b), unique(b, c) DEFERRABLE, unique(d));
explain (costs off, verbose) select * from t_0_3 group by a,b,c,d;
QUERY PLAN
--------------------------------
HashAggregate
Output: a, b, c, d
Group Key: t_0_3.a, t_0_3.b
-> Seq Scan on public.t_0_3
Output: a, b, c, d
(5 rows)
the above part does not seem right; it should be `Group Key: t_0_3.d`.
So I changed it to:
/* Skip constraints that are not UNIQUE */
+ if (con->contype != CONSTRAINT_UNIQUE)
+ continue;
+
+ /* Skip unique constraints that are condeferred */
+ if (con->condeferrable && con->condeferred)
+ continue;
I guess you have probably noticed that in the function
remove_useless_groupby_columns,
get_primary_key_attnos is followed by get_min_unique_not_null_attnos.
Also, the main code of get_min_unique_not_null_attnos is very similar to
get_primary_key_attnos.
They both do `pg_constraint = table_open(ConstraintRelationId,
AccessShareLock);`.
You might need to explain why we must open pg_constraint twice:
either it's cheap to do, or there is another reason.
If scanning twice is expensive, we may need to refactor get_primary_key_attnos.
get_primary_key_attnos is only called in check_functional_grouping and
remove_useless_groupby_columns.
I added the patch to commitfest:
https://commitfest.postgresql.org/46/4751/
On Sat, 30 Dec 2023 at 04:05, Zhang Mingli <zmlpostgres@gmail.com> wrote:
So my patch keeps it simple: check the unique index's columns; it's a valid candidate if all of them have NOT NULL constraints.
And we choose the best one, the one with the fewest columns, in get_min_unique_not_null_attnos(); fewer columns mean more GROUP BY columns can be removed.
This patch no longer applies. We no longer catalogue NOT NULL
constraints, which this patch is coded to rely upon. (Likely it could
just look at pg_attribute.attnotnull instead)
Aside from that, I'm not sure about the logic to prefer removing
functionally dependent columns using the constraint with the least
columns. If you had the PRIMARY KEY on two int columns and a unique
index on a 1MB varlena column, I think using the primary key would be
a better choice to remove functionally dependent columns of. Maybe
it's ok to remove any functionally dependent columns on the primary
key first and try removing functionally dependent columns on unique
constraints as a 2nd step (or maybe only if none can be removed using
the PK?)
Also, why constraints rather than unique indexes? You'd need to check
for expression indexes and likely ignore those due to no ability to
detect NOT NULL for expressions.
Also, looking at the patch to how you've coded
get_min_unique_not_null_attnos(), it looks like you're very optimistic
about that being a constraint that has columns mentioned in the GROUP
BY clause. It looks like it won't work if the UNIQUE constraint with
the least columns gets no mention in the GROUP BY clause. That'll
cause performance regressions from what we have today when the primary
key is mentioned and columns can be removed using that but a UNIQUE
constraint exists which has no corresponding GROUP BY columns and
fewer unique columns than the PK. That's not good and it'll need to be
fixed.
Set to waiting on author.
David
On 12.09.24 03:43, David Rowley wrote:
On Sat, 30 Dec 2023 at 04:05, Zhang Mingli <zmlpostgres@gmail.com> wrote:
So my patch keeps it simple: check the unique index's columns; it's a valid candidate if all of them have NOT NULL constraints.
And we choose the best one, the one with the fewest columns, in get_min_unique_not_null_attnos(); fewer columns mean more GROUP BY columns can be removed.
This patch no longer applies. We no longer catalogue NOT NULL
constraints, which this patch is coded to rely upon.
Work is ongoing to revive the patch that catalogs not-null constraints:
<https://commitfest.postgresql.org/49/5224/>. This patch should
probably wait for that patch at the moment.
(Likely it could just look at pg_attribute.attnotnull instead)
That won't work because you can't record dependencies on that. (This is
one of the reasons for cataloging not-null constraints as real constraints.)
On Wed, 18 Sept 2024 at 19:28, Peter Eisentraut <peter@eisentraut.org> wrote:
On 12.09.24 03:43, David Rowley wrote:
(Likely it could just look at pg_attribute.attnotnull instead)
That won't work because you can't record dependencies on that. (This is
one of the reasons for cataloging not-null constraints as real constraints.)
I'm not seeing any need to record constraint dependencies for this
optimisation. It would be different for detecting functional
dependencies in a view using a unique constraint+not null constraints
for ungrouped columns, but that's not what this is. This is just a
planner optimisation. The plan can be invalidated by a relcache
invalidation, which will happen if someone does ALTER TABLE DROP NOT
NULL.
For reference, see 5b736e9cf.
David
Hi, all
I haven't paid attention to this topic in a long time; thanks all for the advice. I will study the suggestions and then update.
Thanks again.
Zhang Mingli
www.hashdata.xyz
On Sep 18, 2024 at 15:50 +0800, David Rowley <dgrowleyml@gmail.com>, wrote:
On Wed, 18 Sept 2024 at 19:28, Peter Eisentraut <peter@eisentraut.org> wrote:
On 12.09.24 03:43, David Rowley wrote:
(Likely it could just look at pg_attribute.attnotnull instead)
That won't work because you can't record dependencies on that. (This is
one of the reasons for cataloging not-null constraints as real constraints.)
I'm not seeing any need to record constraint dependencies for this
optimisation. It would be different for detecting functional
dependencies in a view using a unique constraint+not null constraints
for ungrouped columns, but that's not what this is. This is just a
planner optimisation. The plan can be invalidated by a relcache
invalidation, which will happen if someone does ALTER TABLE DROP NOT
NULL.
For reference, see 5b736e9cf.
David
On Thu, Sep 12, 2024 at 9:44 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, 30 Dec 2023 at 04:05, Zhang Mingli <zmlpostgres@gmail.com> wrote:
So my patch keeps it simple: check the unique index's columns; it's a valid candidate if all of them have NOT NULL constraints.
And we choose the best one, the one with the fewest columns, in get_min_unique_not_null_attnos(); fewer columns mean more GROUP BY columns can be removed.
This patch no longer applies. We no longer catalogue NOT NULL
constraints, which this patch is coded to rely upon. (Likely it could
just look at pg_attribute.attnotnull instead)
Aside from that, I'm not sure about the logic to prefer removing
functionally dependent columns using the constraint with the least
columns. If you had the PRIMARY KEY on two int columns and a unique
index on a 1MB varlena column, I think using the primary key would be
a better choice to remove functionally dependent columns of. Maybe
it's ok to remove any functionally dependant columns on the primary
key first and try removing functionally dependent columns on unique
constraints and a 2nd step (or maybe only if none can be removed using
the PK?)
Let's be conservative:
if the primary key is there, then use it to remove other useless columns;
if not, find the unique-not-null-index columns
and use them to remove other columns in the GROUP BY clause.
Also, why constraints rather than unique indexes? You'd need to check
for expression indexes and likely ignore those due to no ability to
detect NOT NULL for expressions.
some unique indexes don't have a pg_constraint entry,
for example:
create table t(a int);
create unique index idx_t on t(a);
Overall, I came up with the attached patch.
Instead of looping through pg_constraint, I loop over pg_index and extract indkey.
Multiple indkeys can match against the GROUP BY expressions;
doing some looping, I find the indkey that has the minimum number of columns.
For example, the unique-not-null column sets can be
(a,b)
(a,b,c)
Since (a,b) has only two columns, we use (a,b) to remove other columns
in the GROUP BY expression.
Some tests are copied from Zhang Mingli.
The code implementation is quite different.
Attachments:
v2-0001-remove-useless-group-by-columns-via-unique-index.patch (+278, -3)
On Fri, Nov 22, 2024 at 9:24 PM jian he <jian.universality@gmail.com> wrote:
Overall, I came up with the attached patch.
in v2:
create table t1 (a int, b int not null, c int, unique(b, c));
explain(costs off) select count(*) from t1 group by b,c,a;
QUERY PLAN
----------------------
HashAggregate
Group Key: b
-> Seq Scan on t1
so the v2 implementation is wrong.
I overlooked cases where only part of the unique key is not-null;
in that situation, we cannot remove useless GROUP BY columns.
v3 attached. In v3:
QUERY PLAN
----------------------
HashAggregate
Group Key: b, c, a
-> Seq Scan on t1
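The rule v3 enforces can be modeled with a short Python sketch (illustrative only; the function name is hypothetical and the patch itself is C):

```python
# Illustrative model of the fix: a unique index is a usable candidate only
# if *every* indexed column is NOT NULL. If any column is nullable, rows
# containing NULLs need not be distinct, so the functional dependency that
# justifies dropping GROUP BY columns does not hold.

def usable_candidates(unique_indexes, not_null_cols):
    """unique_indexes: list of tuples of column names, one per unique index.
    not_null_cols: set of columns carrying NOT NULL constraints."""
    return [idx for idx in unique_indexes
            if all(col in not_null_cols for col in idx)]

# t1 above: unique(b, c) but only b is NOT NULL, so nothing can be removed.
print(usable_candidates([("b", "c")], {"b"}))  # prints []
```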
Attachments:
v3-0001-remove-useless-group-by-columns-via-unique-not-nu.patch (+309, -4)
On Wed, 27 Nov 2024 at 19:51, jian he <jian.universality@gmail.com> wrote:
v3 attached. in v3:
1. I find the following quite strange:
postgres=# create table abcd (a int primary key, b int not null, c int
not null, d int not null, unique(b));
CREATE TABLE
postgres=# explain select b,c from abcd group by b,c;
QUERY PLAN
--------------------------------------------------------------
HashAggregate (cost=37.75..56.25 rows=1850 width=8)
Group Key: b, c
-> Seq Scan on abcd (cost=0.00..28.50 rows=1850 width=8)
(3 rows)
postgres=# alter table abcd drop constraint abcd_pkey;
ALTER TABLE
postgres=# explain select b,c from abcd group by b,c;
QUERY PLAN
--------------------------------------------------------------
HashAggregate (cost=33.12..51.62 rows=1850 width=8)
Group Key: b
-> Seq Scan on abcd (cost=0.00..28.50 rows=1850 width=8)
(3 rows)
Why does the patch only try to remove GROUP BY columns using unique
indexes when there's no primary key?
I don't really see why primary key should be treated specially. Why
does the patch not just unify the logic and collect up all unique
non-expression, non-partial indexes where all columns are NOT
NULL. You could maybe just add a special case to skip the NOT NULL
checking for indexes with indisprimary set.
2. In get_unique_not_null_attnos(), not_null_attnos could be a
Bitmapset with attnums offset by FirstLowInvalidHeapAttributeNumber.
That'll allow you to do a bms_is_member() call rather than a (possibly
expensive) list_member_int() call.
3. The logic in remove_useless_groupby_columns() looks much more
complex than it needs to be. It would be much more simple to find the
matching unique index with the least number of columns and use that
one. Just have a counter to record the bms_num_members() of the
columns of the best match so far and replace it when you find an index
with fewer columns. Please see adjust_group_pathkeys_for_groupagg()
for inspiration.
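That single-pass shape might look like the following Python sketch (a hedged model only; the names are hypothetical and the real code compares Bitmapsets in C):

```python
# Illustrative single-pass selection, per the suggestion above: no second
# list of matching indexes, just track the best match and its column count,
# replacing it whenever a matching index has fewer columns.

def best_unique_index(group_by_cols, index_col_sets):
    """index_col_sets: one set of column names per unique NOT NULL index."""
    best = None
    best_n = -1
    for cols in index_col_sets:
        if not cols <= group_by_cols:
            continue  # index references columns missing from the GROUP BY
        if best is None or len(cols) < best_n:
            best, best_n = cols, len(cols)
    return best  # None means no removal is possible

# t4 above: unique(a,b), unique(b,c), unique(d); the one-column index wins.
print(best_unique_index({"a", "b", "c", "d"},
                        [{"a", "b"}, {"b", "c"}, {"d"}]))  # prints {'d'}
```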
We also need to think about if the unique index with the least columns
is always the best one to use. Perhaps the index with the least
columns is a varlena column and there's some index with, say, two
INT4s. It would be much cheaper to hash a couple of INT4s than some
long varlena column. Maybe it's ok just to leave an XXX comment
explaining that we might want to think about doing that one day.
Alternatively, summing up the pg_statistic.stawidth values could work.
Likely it's better to make the patch work correctly before thinking
about that part.
The patch could also use a spell check:
"a input" (an input)
"soure" source??
"GROUOP BY"
David
On Wed, Nov 27, 2024 at 5:30 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 27 Nov 2024 at 19:51, jian he <jian.universality@gmail.com> wrote:
v3 attached. in v3:
1. I find the following quite strange:
postgres=# create table abcd (a int primary key, b int not null, c int
not null, d int not null, unique(b));
CREATE TABLE
postgres=# explain select b,c from abcd group by b,c;
QUERY PLAN
--------------------------------------------------------------
HashAggregate (cost=37.75..56.25 rows=1850 width=8)
Group Key: b, c
-> Seq Scan on abcd (cost=0.00..28.50 rows=1850 width=8)
(3 rows)
postgres=# alter table abcd drop constraint abcd_pkey;
ALTER TABLE
postgres=# explain select b,c from abcd group by b,c;
QUERY PLAN
--------------------------------------------------------------
HashAggregate (cost=33.12..51.62 rows=1850 width=8)
Group Key: b
-> Seq Scan on abcd (cost=0.00..28.50 rows=1850 width=8)
(3 rows)
Why does the patch only try to remove GROUP BY columns using unique
indexes when there's no primary key?
Fixed. Also added tests for it.
I don't really see why primary key should be treated specially. Why
does the patch not just unify the logic and collect up all unique
non-expression index, non-partial indexes where all columns are NOT
NULL. You could maybe just add a special case to skip the NOT NULL
checking for indexes with indisprimary set.
Now there is no need to scan pg_constraint; a single scan over pg_index collects the info.
I also removed get_primary_key_attnos.
2. In get_unique_not_null_attnos(), not_null_attnos could be a
Bitmapset with attnums offset by FirstLowInvalidHeapAttributeNumber.
That'll allow you to do a bms_is_member() call rather than a (possibly
expensive) list_member_int() call.
fixed
3. The logic in remove_useless_groupby_columns() looks much more
complex than it needs to be. It would be much more simple to find the
matching unique index with the least number of columns and use that
one. Just have a counter to record the bms_num_members() of the
columns of the best match so far and replace it when you find an index
with fewer columns. Please see adjust_group_pathkeys_for_groupagg()
for inspiration.
Now get_unique_not_null_attnos is:
List * get_unique_not_null_attnos(Oid relid, int *pk_pos)
The returned pk_pos is the index (zero-based) of the PK attnums within the returned list.
pk_pos is currently unused, but I guess it could be useful
for quickly retrieving the PK attnums.
We also need to think about if the unique index with the least columns
is always the best one to use. Perhaps the index with the least
columns is a varlena column and there's some index with, say, two
INT4s. It would be much cheaper to hash a couple of INT4s than some
long varlena column. Maybe it's ok just to leave an XXX comment
explaining that we might want to think about doing that one day.
Alternatively, summing up the pg_statistic.stawidth values could work.
Likely it's better to make the patch work correctly before thinking
about that part.
This part has not been addressed yet.
The v4 logic is to choose the one with the least number of columns;
if there is more than one with the least number of columns,
simply choose the first one in the matched list.
I think I addressed most of your concerns; the code now seems pretty neat.
Attachments:
v4-0001-remove-useless-group-by-columns-via-unique-not-nu.patch (+304, -22)
On Thu, 28 Nov 2024 at 17:39, jian he <jian.universality@gmail.com> wrote:
The v4 logic is to choose the one with the least number of columns;
if there is more than one with the least number of columns,
simply choose the first one in the matched list.
The new code inside remove_useless_groupby_columns() is still more
complex than what's needed. There's no need to build a 2nd list with
the matching indexes and loop over it. Just skip the non-matching
indexes and record the best match.
I imagined it would look a bit more like the attached patch. I also
removed the pk_pos column as I didn't see any point in it. You were
not using it. If some future code needs that, it can be added then. I
also fixed up a few comments that needed attention. I didn't look at
the regression tests being added. I think given that there's now no
special case for primary key indexes that we don't need a completely
new set of tests. I think it'll make more sense to adjust some of the
existing tests to use a unique constraint instead of a PK and then
adjust a column's NOT NULL property to check that part of the code is
working correctly.
Another issue with this is that the list of indexes being used is not
sorted by Oid. If you look at RelationGetIndexList() you'll see that
we perform a sort. That gives us more consistency in the planner. I
think this patch should be doing that too, otherwise, you could end up
with a plan change after some operation that changes the order that
the indexes are stored in the pg_index table. It's probably fairly
unlikely, but it is the sort of issue that someone will eventually
discover and report.
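The determinism concern can be sketched in Python (illustrative only; hypothetical names and made-up OIDs, while the real tie-break operates on the pg_index scan in C):

```python
# Illustrative sketch: if candidates tie on column count, the winner should
# not depend on catalogue scan order. Taking the minimum by
# (column count, index OID) mirrors the OID sort done in
# RelationGetIndexList and keeps the planner's choice stable.

def choose_stable(candidates):
    """candidates: list of (index_oid, column_set) pairs for unique
    NOT NULL indexes whose columns all appear in the GROUP BY clause."""
    return min(candidates, key=lambda c: (len(c[1]), c[0]))

# Both indexes have two columns; the lower-OID one wins in either order.
a = choose_stable([(7002, {"b", "c"}), (7001, {"a", "b"})])
b = choose_stable([(7001, {"a", "b"}), (7002, {"b", "c"})])
print(a == b)  # prints True
```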
David
Attachments:
v5-0001-remove-useless-group-by-columns-via-unique-not-nu.patch (+305, -12)
On Thu, Nov 28, 2024 at 2:11 PM David Rowley <dgrowleyml@gmail.com> wrote:
I think it'll make more sense to adjust some of the
existing tests to use a unique constraint instead of a PK and then
adjust a column's NOT NULL property to check that part of the code is
working correctly.
Looking around, I inserted some tests into indexing.sql,
create_index.sql, and aggregates.sql,
and also added tests for sorting by index OID.
Another issue with this is that the list of indexes being used is not
sorted by Oid. If you look at RelationGetIndexList() you'll see that
we perform a sort. That gives us more consistency in the planner. I
think this patch should be doing that too, otherwise, you could end up
with a plan change after some operation that changes the order that
the indexes are stored in the pg_index table. It's probably fairly
unlikely, but it is the sort of issue that someone will eventually
discover and report.
Thanks for the tip!
I have wondered about multiple matches; simply choosing the first one
may have surprising results.
I didn't know that in some cases we use list_sort to make the result
more deterministic.
When there are multiple matches, we need a deterministic way to choose which one.
So please check the attached.
Attachments:
v6-0001-remove-useless-group-by-columns-via-unique-not-nu.patch (+300, -16)
Minor changes in get_unique_not_null_attnos:
* cosmetic changes
* if the returned list's length is less than 2, there is no need to sort.
Attachments:
v7-0001-remove-useless-group-by-columns-via-unique-not-nu.patch (+303, -16)
On Fri, 29 Nov 2024 at 01:33, jian he <jian.universality@gmail.com> wrote:
On Thu, Nov 28, 2024 at 2:11 PM David Rowley <dgrowleyml@gmail.com> wrote:
I think it'll make more sense to adjust some of the
existing tests to use a unique constraint instead of a PK and then
adjust a column's NOT NULL property to check that part of the code is
working correctly.
Looking around, I inserted some tests into indexing.sql,
create_index.sql, aggregates.sql
also added tests for sort by index oid.
I meant the tests added by d4c3a156c. I don't think there's any need
to have tests in more than one file.
Another issue with this is that the list of indexes being used is not
sorted by Oid.
When there are multiple matches, we need a determined way to choose which one.
so please check attached.
Looking closer at get_relation_info(), it looks like the
RelOptInfo.indexlist will be in descending Oid order, per:
/*
* We've historically used lcons() here. It'd make more sense to
* use lappend(), but that causes the planner to change behavior
* in cases where two indexes seem equally attractive. For now,
* stick with lcons() --- few tables should have so many indexes
* that the O(N^2) behavior of lcons() is really a problem.
*/
indexinfos = lcons(info, indexinfos);
I'm starting to feel like we might be trying to include too much logic
to mimic this behaviour. The main problem is that the RelOptInfos
have not yet been created for the given relation when
remove_useless_groupby_columns() is called. That's why we're having to
query the catalogue tables like this. I do see a comment in
preprocess_groupclauses() that mentions:
* Note: we return a fresh List, but its elements are the same
* SortGroupClauses appearing in parse->groupClause. This is important
* because later processing may modify the processed_groupClause list.
So there's already a warning that there may still be future changes to
the processed_groupClause. It'd be nice if we could delay calling
remove_useless_groupby_columns() until the RelOptInfos are built as
we'd not have to query the catalogue tables to make this work. The
soonest point is just after the call to add_base_rels_to_query() in
query_planner(). That would mean doing more work on the
processed_groupClause in query_planner() rather than in
grouping_planner(), which is a little strange. It looks like the first
time we look at processed_groupClause and make a decision based on the
actual value of it is in standard_qp_callback(), i.e. after
add_base_rels_to_query().
I've attached an updated patch that gets rid of the
get_unique_not_null_attnos() function completely and uses the
RelOptInfo.indexlist and RelOptInfo.notnullattnums. I also realised
that there's no need to check for NOT NULL constraints with unique
indexes defined with NULLS NOT DISTINCT as having NULLs unique means
we can still determine the functional dependency. One other possibly
good outcome of this is that it's much easier to do the
pg_statistic.avgwidth thing I talked about as RelOptInfo stores those
in attr_widths, however, looking at
table_block_relation_estimate_size(), I see we might not always
populate those.
I ended up re-homing remove_useless_groupby_columns() in initsplan.c.
I couldn't see anywhere else it fitted.
David
Attachments:
v8-0001-remove-useless-group-by-columns-via-unique-not-nu.patch (+290, -165)
On Fri, 29 Nov 2024 at 15:02, David Rowley <dgrowleyml@gmail.com> wrote:
I've attached an updated patch that gets rid of the
get_unique_not_null_attnos() function completely and uses the
RelOptInfo.indexlist and RelOptInfo.notnullattnums.
I forgot to do a local commit before sending v8. Fixed in the attached v9.
David
Attachments:
v9-0001-remove-useless-group-by-columns-via-unique-not-nu.patch (+320, -165)
On 11/29/24 09:39, David Rowley wrote:
On Fri, 29 Nov 2024 at 15:02, David Rowley <dgrowleyml@gmail.com> wrote:
I've attached an updated patch that gets rid of the
get_unique_not_null_attnos() function completely and uses the
RelOptInfo.indexlist and RelOptInfo.notnullattnums.
I forgot to do a local commit before sending v8. Fixed in the attached v9.
1. Thread reference in the patch comment doesn't work.
2. Would it be possible to move remove_useless_groupby_columns in a
second commit? It would make the differences easier to review.
--
regards, Andrei Lepikhov
On Fri, Nov 29, 2024 at 2:36 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
I forgot to do a local commit before sending v8. Fixed in the attached v9.
1. Thread reference in the patch comment doesn't work.
fixed.
I found that a unique expression index can also be used for GROUP BY
column removal,
so I implemented it and also added two tests for it.
Now remove_useless_groupby_columns looks like:
if (index->indexprs == NIL)
{
--handle unique index
}
else
{
--handle unique expression index
}
what do you think?
Attachments:
v10-0001-remove-useless-group-by-columns-via-unique-not-n.patch (+382, -165)
On Fri, 29 Nov 2024 at 19:36, Andrei Lepikhov <lepihov@gmail.com> wrote:
1. Thread reference in the patch comment doesn't work.
2. May it be possible to move remove_useless_groupby_columns in the
second commit? It would be easier to review the differences.
For #1, I rewrote the commit message.
I've split into two patches for ease of review. See attached.
David
Attachments:
v10-0001-Move-remove_useless_groupby_columns-to-initsplan.patch (+167, -164)
v10-0002-Remove-redundant-GROUP-BY-columns-using-UNIQUE-i.patch (+179, -30)
On Fri, 29 Nov 2024 at 20:04, jian he <jian.universality@gmail.com> wrote:
I found that unique expression index can also be used for groupby
column removal.
so I implemented it, aslo added two tests on it.
what do you think?
I think it's likely just not common enough to be worthwhile, plus,
unfortunately, I don't think this optimisation is possible with what
we have today. Also, if it were possible you'd need to check the GROUP
BY expressions match the indexed expressions. You've not done that.
The reason I don't think it's possible is that we have no infrastructure
that allows us to tag functions or operators so that non-null input(s)
mean non-null outputs. We only have strict, which means null input
means null output. That's the opposite of what we'd need. It might
only be possible with a NULLS NOT DISTINCT index.
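The strictness point can be made concrete with a tiny Python model (illustrative only; None stands in for SQL NULL):

```python
# "Strict" means NULL in => NULL out; the property needed here is the
# converse, non-NULL in => non-NULL out, which PostgreSQL has no way to
# declare for a function. NULLIF is a counterexample: it maps perfectly
# non-NULL inputs to NULL, so a unique expression index over NOT NULL
# columns still cannot guarantee a non-NULL, functionally-determining key.

def sql_nullif(a, b):
    """Model of SQL's NULLIF(a, b), with None standing in for NULL."""
    return None if a == b else a

print(sql_nullif(42, 42))  # prints None: non-NULL inputs, NULL output
```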
David
On 29/11/2024 14:28, David Rowley wrote:
On Fri, 29 Nov 2024 at 19:36, Andrei Lepikhov <lepihov@gmail.com> wrote:
1. Thread reference in the patch comment doesn't work.
2. May it be possible to move remove_useless_groupby_columns in the
second commit? It would be easier to review the differences.
For #1, I rewrote the commit message.
I've split into two patches for ease of review. See attached.
Thanks!
Patch 0001 seems OK. The differentiation of 'planmain' and 'initsplan'
is not clear to me, but the code movement seems correct. I should say that
the comment in initsplan.c:
'... initialization routines'
looks strange. Is this module still about initialisation?
Patch 0002 looks helpful and performant. I propose to check 'relid > 0'
to avoid diving into 'foreach(lc, parse->rtable)' at all if nothing has
been found.
It may restrict future optimisations like [1], where the upper query
block can employ the 'uniqueness' of grouped clauses in selectivity
estimations. But I think it is OK for now.
NOTES:
1. Uniqueness is proved by a UNIQUE index. It might be a bit more
sophisticated to probe not only grouping qual but also employ
equivalence classes. In that case, we could process something like that:
CREATE TABLE test (
x int NOT NULL, y int NOT NULL, z int NOT NULL, w int);
CREATE UNIQUE INDEX ON test(y,z);
SELECT t2.z FROM test t2, test t1 WHERE t1.y=t2.y
GROUP BY t1.y,t2.z,t2.w;
2. The same idea could be used with DISTINCT statement. Right now we have:
SELECT DISTINCT y,z,w FROM test;
HashAggregate
Output: y, z, w
Group Key: test.y, test.z, test.w
-> Seq Scan on public.test
Output: x, y, z, w
[1]: /messages/by-id/50fe6779-ee2d-4256-bc64-cd661bc4029a@gmail.com
--
regards, Andrei Lepikhov