Removing redundant grouping columns

Started by Tom Laneover 3 years ago13 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

This patch is aimed at being smarter about cases where we have
redundant GROUP BY entries, for example

SELECT ... WHERE a.x = b.y GROUP BY a.x, b.y;

It's clearly not necessary to perform grouping using both columns.
Grouping by either one alone would produce the same results,
assuming compatible equality semantics. I'm not sure how often
such cases arise in the wild; but we have about ten of them in our
regression tests, which makes me think it's worth the trouble to
de-duplicate as long as it doesn't cost too much. And it doesn't,
because PathKey construction already detects exactly this sort of
redundancy. We need only do something with the knowledge.

We can't simply make the planner replace parse->groupClause with
a shortened list of non-redundant columns, because it's possible
that we prove all the columns redundant, as in

SELECT ... WHERE a.x = 1 GROUP BY a.x;

If we make parse->groupClause empty then some subsequent tests
will think no grouping was requested, leading to incorrect results.
So what I've done in the attached is to invent a new PlannerInfo
field processed_groupClause to hold the shortened list, and then run
around and use that instead of parse->groupClause where appropriate.

(Another way could be to invent a bool hasGrouping to remember whether
groupClause was initially nonempty, analogously to hasHavingQual.
I rejected that idea after finding that there were still a few
places where it's advantageous to use the original full list.)

Beyond that, there's not too much to this patch. I had to fix
nodeAgg.c to not crash when grouping on zero columns, and I spent
some effort on refactoring the grouping-clause preprocessing
logic in planner.c because it seemed to me to have gotten rather
unintelligible. I didn't add any new test cases, because the changes
in existing results seem to sufficiently prove that it works.

I'll stick this in the January CF.

regards, tom lane

Attachments:

remove-redundant-GROUP-BY-1.patchtext/x-diff; charset=us-ascii; name=remove-redundant-GROUP-BY-1.patchDownload+255-149
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: Removing redundant grouping columns

I wrote:

This patch is aimed at being smarter about cases where we have
redundant GROUP BY entries, for example
SELECT ... WHERE a.x = b.y GROUP BY a.x, b.y;

The cfbot didn't like this, because of a variable that wasn't
used in non-assert builds. Fixed in v2.

regards, tom lane

Attachments:

remove-redundant-GROUP-BY-2.patchtext/x-diff; charset=us-ascii; name=remove-redundant-GROUP-BY-2.patchDownload+256-151
#3Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#1)
Re: Removing redundant grouping columns

On Wed, Dec 28, 2022 at 6:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This patch is aimed at being smarter about cases where we have
redundant GROUP BY entries, for example

SELECT ... WHERE a.x = b.y GROUP BY a.x, b.y;

It's clearly not necessary to perform grouping using both columns.
Grouping by either one alone would produce the same results,
assuming compatible equality semantics. I'm not sure how often
such cases arise in the wild; but we have about ten of them in our
regression tests, which makes me think it's worth the trouble to
de-duplicate as long as it doesn't cost too much. And it doesn't,
because PathKey construction already detects exactly this sort of
redundancy. We need only do something with the knowledge.

While we are here, I wonder if we can do the same trick for
distinctClause, to cope with cases like

select distinct a.x, b.y from a, b where a.x = b.y;

And there is case from regression test 'select_distinct.sql' that can
benefit from this optimization.

--
-- Check mentioning same column more than once
--

EXPLAIN (VERBOSE, COSTS OFF)
SELECT count(*) FROM
(SELECT DISTINCT two, four, two FROM tenk1) ss;

Thanks
Richard

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#3)
Re: Removing redundant grouping columns

Richard Guo <guofenglinux@gmail.com> writes:

On Wed, Dec 28, 2022 at 6:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This patch is aimed at being smarter about cases where we have
redundant GROUP BY entries, for example
SELECT ... WHERE a.x = b.y GROUP BY a.x, b.y;

While we are here, I wonder if we can do the same trick for
distinctClause, to cope with cases like
select distinct a.x, b.y from a, b where a.x = b.y;

We do that already, no?

regression=# create table foo (x int, y int);
CREATE TABLE
regression=# explain select distinct * from foo where x = 1;
QUERY PLAN
-----------------------------------------------------------------
Unique (cost=38.44..38.50 rows=11 width=8)
-> Sort (cost=38.44..38.47 rows=11 width=8)
Sort Key: y
-> Seq Scan on foo (cost=0.00..38.25 rows=11 width=8)
Filter: (x = 1)
(5 rows)

regression=# explain select distinct * from foo where x = y;
QUERY PLAN
-----------------------------------------------------------------
Unique (cost=38.44..38.50 rows=11 width=8)
-> Sort (cost=38.44..38.47 rows=11 width=8)
Sort Key: x
-> Seq Scan on foo (cost=0.00..38.25 rows=11 width=8)
Filter: (x = y)
(5 rows)

But if you do

regression=# explain select * from foo where x = y group by x, y;
QUERY PLAN
-----------------------------------------------------------------
Group (cost=38.44..38.52 rows=11 width=8)
Group Key: x, y
-> Sort (cost=38.44..38.47 rows=11 width=8)
Sort Key: x
-> Seq Scan on foo (cost=0.00..38.25 rows=11 width=8)
Filter: (x = y)
(6 rows)

then you can see that the Sort step knows it need only consider
one column even though the Group step considers both.

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
Re: Removing redundant grouping columns

I wrote:

Richard Guo <guofenglinux@gmail.com> writes:

While we are here, I wonder if we can do the same trick for
distinctClause, to cope with cases like
select distinct a.x, b.y from a, b where a.x = b.y;

We do that already, no?

Oh, wait, I see what you mean: we are smart in code paths that rely
on distinct_pathkeys, but not in the hash-based code paths. Right,
that can be fixed the same way. 0001 attached is the same as before,
0002 adds similar logic for the distinctClause.

The plan change in expected/pg_trgm.out is surprising at first
glance, but I believe it's correct: the item that is being
dropped is a parameterless STABLE function, so its value is not
supposed to change for the duration of the scan.

regards, tom lane

Attachments:

v3-0001-remove-redundant-GROUP-BY.patchtext/x-diff; charset=us-ascii; name=v3-0001-remove-redundant-GROUP-BY.patchDownload+256-151
v3-0002-remove-redundant-DISTINCT.patchtext/x-diff; charset=us-ascii; name=v3-0002-remove-redundant-DISTINCT.patchDownload+47-17
#6vignesh C
vignesh21@gmail.com
In reply to: Tom Lane (#5)
Re: Removing redundant grouping columns

On Sat, 31 Dec 2022 at 02:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Richard Guo <guofenglinux@gmail.com> writes:

While we are here, I wonder if we can do the same trick for
distinctClause, to cope with cases like
select distinct a.x, b.y from a, b where a.x = b.y;

We do that already, no?

Oh, wait, I see what you mean: we are smart in code paths that rely
on distinct_pathkeys, but not in the hash-based code paths. Right,
that can be fixed the same way. 0001 attached is the same as before,
0002 adds similar logic for the distinctClause.

The plan change in expected/pg_trgm.out is surprising at first
glance, but I believe it's correct: the item that is being
dropped is a parameterless STABLE function, so its value is not
supposed to change for the duration of the scan.

The patch does not apply on top of HEAD as in [1]http://cfbot.cputube.org/patch_41_4083.log, please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
ff23b592ad6621563d3128b26860bcb41daf9542 ===
=== applying patch ./v3-0002-remove-redundant-DISTINCT.patch
....
Hunk #4 FAILED at 4704.
....
1 out of 10 hunks FAILED -- saving rejects to file
src/backend/optimizer/plan/planner.c.rej

[1]: http://cfbot.cputube.org/patch_41_4083.log

Regards,
Vignesh

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: vignesh C (#6)
Re: Removing redundant grouping columns

vignesh C <vignesh21@gmail.com> writes:

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

Yeah, sideswiped by 3c6fc5820 apparently. No substantive change needed.

regards, tom lane

Attachments:

v4-0001-remove-redundant-GROUP-BY.patchtext/x-diff; charset=us-ascii; name=v4-0001-remove-redundant-GROUP-BY.patchDownload+256-151
v4-0002-remove-redundant-DISTINCT.patchtext/x-diff; charset=us-ascii; name=v4-0002-remove-redundant-DISTINCT.patchDownload+47-17
#8Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#7)
Re: Removing redundant grouping columns

On Sun, Jan 15, 2023 at 5:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

vignesh C <vignesh21@gmail.com> writes:

The patch does not apply on top of HEAD as in [1], please post a rebased

patch:

Yeah, sideswiped by 3c6fc5820 apparently. No substantive change needed.

I looked through these two patches and they look good to me.

BTW, another run of rebase is needed, due to da5800d5fa.

Thanks
Richard

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: Removing redundant grouping columns

I wrote:

Yeah, sideswiped by 3c6fc5820 apparently. No substantive change needed.

And immediately sideswiped by da5800d5f.

If nobody has any comments on this, I'm going to go ahead and push
it. The value of the improvement is rapidly paling in comparison
to the patch's maintenance effort.

regards, tom lane

Attachments:

v5-0001-remove-redundant-GROUP-BY.patchtext/x-diff; charset=us-ascii; name=v5-0001-remove-redundant-GROUP-BY.patchDownload+256-151
v5-0002-remove-redundant-DISTINCT.patchtext/x-diff; charset=us-ascii; name=v5-0002-remove-redundant-DISTINCT.patchDownload+47-17
#10Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#9)
Re: Removing redundant grouping columns

On Wed, Jan 18, 2023 at 6:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Yeah, sideswiped by 3c6fc5820 apparently. No substantive change needed.

And immediately sideswiped by da5800d5f.

Yeah, I noticed this too yesterday. I reviewed through these two
patches yesterday and I think they are in good shape now.

Thanks
Richard

#11David Rowley
dgrowleyml@gmail.com
In reply to: Richard Guo (#10)
Re: Removing redundant grouping columns

On Wed, 18 Jan 2023 at 14:55, Richard Guo <guofenglinux@gmail.com> wrote:

Yeah, I noticed this too yesterday. I reviewed through these two
patches yesterday and I think they are in good shape now.

I'm currently reviewing the two patches.

David

#12David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#9)
Re: Removing redundant grouping columns

On Wed, 18 Jan 2023 at 11:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If nobody has any comments on this, I'm going to go ahead and push
it. The value of the improvement is rapidly paling in comparison
to the patch's maintenance effort.

No objections from me.

David

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#12)
Re: Removing redundant grouping columns

David Rowley <dgrowleyml@gmail.com> writes:

No objections from me.

Pushed, thanks for looking at it.

regards, tom lane