Removing useless DISTINCT clauses

Started by David Rowleyover 8 years ago39 messageshackers
Jump to latest
#1David Rowley
dgrowleyml@gmail.com

In [1]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d4c3a156cb46dcd1f9f97a8011bd94c544079bb5 we made a change to process the GROUP BY clause to remove any
group by items that are functionally dependent on some other GROUP BY
items.

This really just checks if a table's PK columns are entirely present
in the GROUP BY clause and removes anything else belonging to that
table.

All this seems to work well, but I totally failed to consider that the
exact same thing applies to DISTINCT too.

Over in [2]/messages/by-id/CAKJS1f9q0j3BgMUsDbtf9=ecfVLnqvkYB44MXj0gpVuamcN8Xw@mail.gmail.com, Rui Liu mentions that the planner could do a better job
for his case.

Using Rui Liu's example:

CREATE TABLE test_tbl ( k INT PRIMARY KEY, col text);
INSERT into test_tbl select generate_series(1,10000000), 'test';

Master:

postgres=# explain analyze verbose select distinct col, k from
test_tbl order by k limit 1000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1658556.19..1658563.69 rows=1000 width=9) (actual
time=8934.962..8935.495 rows=1000 loops=1)
Output: col, k
-> Unique (cost=1658556.19..1733557.50 rows=10000175 width=9)
(actual time=8934.961..8935.460 rows=1000 loops=1)
Output: col, k
-> Sort (cost=1658556.19..1683556.63 rows=10000175 width=9)
(actual time=8934.959..8935.149 rows=1000 loops=1)
Output: col, k
Sort Key: test_tbl.k, test_tbl.col
Sort Method: external merge Disk: 215128kB
-> Seq Scan on public.test_tbl (cost=0.00..154056.75
rows=10000175 width=9) (actual time=0.062..1901.728 rows=10000000
loops=1)
Output: col, k
Planning time: 0.092 ms
Execution time: 8958.687 ms
(12 rows)

Patched:

postgres=# explain analyze verbose select distinct col, k from
test_tbl order by k limit 1000;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.44..34.31 rows=1000 width=9) (actual time=0.030..0.895
rows=1000 loops=1)
Output: col, k
-> Unique (cost=0.44..338745.50 rows=10000175 width=9) (actual
time=0.029..0.814 rows=1000 loops=1)
Output: col, k
-> Index Scan using test_tbl_pkey on public.test_tbl
(cost=0.44..313745.06 rows=10000175 width=9) (actual time=0.026..0.452
rows=1000 loops=1)
Output: col, k
Planning time: 0.152 ms
Execution time: 0.985 ms
(8 rows)

A patch to implement this is attached.

I'll add it to the Jan commitfest. (I don't expect anyone to look at
this before then).

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d4c3a156cb46dcd1f9f97a8011bd94c544079bb5

[2]: /messages/by-id/CAKJS1f9q0j3BgMUsDbtf9=ecfVLnqvkYB44MXj0gpVuamcN8Xw@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

remove_useless_distinct_clauses.patchapplication/octet-stream; name=remove_useless_distinct_clauses.patchDownload+149-36
#2Jeff Janes
jeff.janes@gmail.com
In reply to: David Rowley (#1)
Re: [HACKERS] Removing useless DISTINCT clauses

On Mon, Nov 6, 2017 at 1:16 AM, David Rowley <david.rowley@2ndquadrant.com>
wrote:

In [1] we made a change to process the GROUP BY clause to remove any
group by items that are functionally dependent on some other GROUP BY
items.

This really just checks if a table's PK columns are entirely present
in the GROUP BY clause and removes anything else belonging to that
table.

All this seems to work well, but I totally failed to consider that the
exact same thing applies to DISTINCT too.

Over in [2], Rui Liu mentions that the planner could do a better job
for his case.

Using Rui Liu's example:

CREATE TABLE test_tbl ( k INT PRIMARY KEY, col text);
INSERT into test_tbl select generate_series(1,10000000), 'test';

Master:

postgres=# explain analyze verbose select distinct col, k from
test_tbl order by k limit 1000;
QUERY
PLAN
------------------------------------------------------------
------------------------------------------------------------
-------------------------
Limit (cost=1658556.19..1658563.69 rows=1000 width=9) (actual
time=8934.962..8935.495 rows=1000 loops=1)
Output: col, k
-> Unique (cost=1658556.19..1733557.50 rows=10000175 width=9)
(actual time=8934.961..8935.460 rows=1000 loops=1)
Output: col, k
-> Sort (cost=1658556.19..1683556.63 rows=10000175 width=9)
(actual time=8934.959..8935.149 rows=1000 loops=1)
Output: col, k
Sort Key: test_tbl.k, test_tbl.col
Sort Method: external merge Disk: 215128kB
-> Seq Scan on public.test_tbl (cost=0.00..154056.75
rows=10000175 width=9) (actual time=0.062..1901.728 rows=10000000
loops=1)
Output: col, k
Planning time: 0.092 ms
Execution time: 8958.687 ms
(12 rows)

Patched:

postgres=# explain analyze verbose select distinct col, k from
test_tbl order by k limit 1000;

QUERY PLAN
------------------------------------------------------------
------------------------------------------------------------
----------------------------------
Limit (cost=0.44..34.31 rows=1000 width=9) (actual time=0.030..0.895
rows=1000 loops=1)
Output: col, k
-> Unique (cost=0.44..338745.50 rows=10000175 width=9) (actual
time=0.029..0.814 rows=1000 loops=1)
Output: col, k
-> Index Scan using test_tbl_pkey on public.test_tbl
(cost=0.44..313745.06 rows=10000175 width=9) (actual time=0.026..0.452
rows=1000 loops=1)
Output: col, k
Planning time: 0.152 ms
Execution time: 0.985 ms
(8 rows)

A patch to implement this is attached.

Couldn't the Unique node be removed entirely? If k is a primary key, you
can't have duplicates in need of removal.

Or would that be a subject for a different patch?

I think remove_functionally_dependant_groupclauses should have a more
generic name, like remove_functionally_dependant_clauses.

Cheers,

Jeff

#3David Rowley
dgrowleyml@gmail.com
In reply to: Jeff Janes (#2)
Re: [HACKERS] Removing useless DISTINCT clauses

Hi Jeff,

Thanks for looking at the patch.

On 6 January 2018 at 10:34, Jeff Janes <jeff.janes@gmail.com> wrote:

Couldn't the Unique node be removed entirely? If k is a primary key, you
can't have duplicates in need of removal.

It probably could be, if there were no joins, but any join could
duplicate those rows, so we'd only be able to do that if there were no
joins, and I'm not sure we'd want to special case that. It seems just
too naive to be taken seriously. We could do a lot better...

Or would that be a subject for a different patch?

Yeah, I'd rather do that fix properly, and do have ideas on how to,
just it's a lot more work (basically testing if the joins can
duplicate rows or not (unique joins++)).

I think remove_functionally_dependant_groupclauses should have a more
generic name, like remove_functionally_dependant_clauses.

It's been a while since I looked at this. I remember thinking
something similar at the time but I must have not changed it.

I'll look again and post an updated patch.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#3)
Re: [HACKERS] Removing useless DISTINCT clauses

On 6 January 2018 at 23:08, David Rowley <david.rowley@2ndquadrant.com> wrote:

I think remove_functionally_dependant_groupclauses should have a more
generic name, like remove_functionally_dependant_clauses.

It's been a while since I looked at this. I remember thinking
something similar at the time but I must have not changed it.

I'll look again and post an updated patch.

I've attached a patch that renames the function as you've suggested.

Thanks again for the review.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

remove_useless_distinct_clauses_v2.patchapplication/octet-stream; name=remove_useless_distinct_clauses_v2.patchDownload+148-36
#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#4)
Re: [HACKERS] Removing useless DISTINCT clauses

David Rowley <david.rowley@2ndquadrant.com> writes:

[ remove_useless_distinct_clauses_v2.patch ]

This is a cute idea, but I'm troubled by a couple of points:

1. Once you don't have all the tlist items shown in DISTINCT, it really is
more like DISTINCT ON, seems like. I am not sure it's a good idea to set
hasDistinctOn, because that engages some planner behaviors we probably
don't want, but I'm also not sure we can get away with just ignoring the
difference. As an example, in allpaths.c there are assorted assumptions
that having a distinctClause but !hasDistinctOn means all output columns
of a subquery are listed in the distinctClause.

2. There's a comment in planner.c to the effect that

* When we have DISTINCT ON, we must sort by the more rigorous of
* DISTINCT and ORDER BY, else it won't have the desired behavior.
* Also, if we do have to do an explicit sort, we might as well use
* the more rigorous ordering to avoid a second sort later. (Note
* that the parser will have ensured that one clause is a prefix of
* the other.)

Removing random elements of the distinctClause will break its
correspondence with the sortClause, with probably bad results.

I do not remember for sure at the moment, but it may be that this
correspondence is only important for the case of DISTINCT ON, in which
case we could dodge the problem by not applying the optimization unless
it's plain DISTINCT. That doesn't help us with point 1 though.

BTW, my dictionary says it's "dependent" not "dependant".

regards, tom lane

#6David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#5)
Re: [HACKERS] Removing useless DISTINCT clauses

Hi Tom,

Thanks for looking at this.

On 10 January 2018 at 09:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:

This is a cute idea, but I'm troubled by a couple of points:

1. Once you don't have all the tlist items shown in DISTINCT, it really is
more like DISTINCT ON, seems like. I am not sure it's a good idea to set
hasDistinctOn, because that engages some planner behaviors we probably
don't want, but I'm also not sure we can get away with just ignoring the
difference. As an example, in allpaths.c there are assorted assumptions
that having a distinctClause but !hasDistinctOn means all output columns
of a subquery are listed in the distinctClause.

hmm, I see:

/*
* If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our
* time: all its output columns must be used in the distinctClause.
*/
if (subquery->distinctClause && !subquery->hasDistinctOn)
return;

I think that particular one would just fail to remove columns from the
subquery if there's a distinct clause (not distinct on). That seems
more like a missed optimisation rather than a future bug. Although it
probably would be a good idea to get rid of the assumption that a
non-NIL distinctClause && !hasDistinctOn means that all output target
items are part of that distinct clause.

I don't see anything else in allpaths.c that would be affected.
Keeping in mind we'll never completely remove a distinctClause, just
possibly remove some items from the list. Everything else just seems
to be checking distinctClause != NIL, or is only doing something if
hasDistinctOn == true.

I'll do some more analysis on places that distinctClause is being used
to check what's safe.

2. There's a comment in planner.c to the effect that

* When we have DISTINCT ON, we must sort by the more rigorous of
* DISTINCT and ORDER BY, else it won't have the desired behavior.
* Also, if we do have to do an explicit sort, we might as well use
* the more rigorous ordering to avoid a second sort later. (Note
* that the parser will have ensured that one clause is a prefix of
* the other.)

Removing random elements of the distinctClause will break its
correspondence with the sortClause, with probably bad results.

I do not remember for sure at the moment, but it may be that this
correspondence is only important for the case of DISTINCT ON, in which
case we could dodge the problem by not applying the optimization unless
it's plain DISTINCT. That doesn't help us with point 1 though.

Yeah, it seems better to only try and apply this for plain DISTINCT.

BTW, my dictionary says it's "dependent" not "dependant".

Oops. Thanks.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7Andres Freund
andres@anarazel.de
In reply to: David Rowley (#6)
Re: [HACKERS] Removing useless DISTINCT clauses

Hi,

On 2018-01-10 11:12:17 +1300, David Rowley wrote:

I'll do some more analysis on places that distinctClause is being used
to check what's safe.

This patch has been waiting on author since 2018-01-09, the next & last
CF has started. I'm inclined to mark this as returned with feedback.

Greetings,

Andres Freund

#8David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#7)
Re: [HACKERS] Removing useless DISTINCT clauses

On 2 March 2018 at 09:51, Andres Freund <andres@anarazel.de> wrote:

This patch has been waiting on author since 2018-01-09, the next & last
CF has started. I'm inclined to mark this as returned with feedback.

I'm planning on making the required changes at the weekend.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#5)
Re: [HACKERS] Removing useless DISTINCT clauses

On 10 January 2018 at 09:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:

1. Once you don't have all the tlist items shown in DISTINCT, it really is
more like DISTINCT ON, seems like. I am not sure it's a good idea to set
hasDistinctOn, because that engages some planner behaviors we probably
don't want, but I'm also not sure we can get away with just ignoring the
difference. As an example, in allpaths.c there are assorted assumptions
that having a distinctClause but !hasDistinctOn means all output columns
of a subquery are listed in the distinctClause.

Looking through allpaths.c for usages of distinctClauses it seems
they're all either interested in distinctClauses being non-NIL, which
we'll have no effect on. There's one that only cares about distinctOn
only, and then there's:

/*
* If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our
* time: all its output columns must be used in the distinctClause.
*/
if (subquery->distinctClause && !subquery->hasDistinctOn)
return;

I don't think removing this fast-path would currently have any benefit
as we're not setting the ressortgroupref for the targetlist items
which we've removed the distinctClause for.

If we did that then, you might think that the following code in
remove_unused_subquery_outputs() might be able to remove the
targetlist item for a useless distinct item we've removed:

/*
* If it has a sortgroupref number, it's used in some sort/group
* clause so we'd better not remove it. Also, don't remove any
* resjunk columns, since their reason for being has nothing to do
* with anybody reading the subquery's output. (It's likely that
* resjunk columns in a sub-SELECT would always have ressortgroupref
* set, but even if they don't, it seems imprudent to remove them.)
*/
if (tle->ressortgroupref || tle->resjunk)
continue;

but, it can't because we've not removed them yet.
remove_unused_subquery_outputs() is called before subquery_planner()
in set_subquery_pathlist(), so remove_useless_distinct_columns() won't
even have been called by the time we get to the above code. So it
looks like zeroing the ressortgroupref's of the TargetEntry items
belonging to the removed distinct items would be wasted effort. I also
can't imagine why we'd ever try to remove unused outputs from a
subquery that's already been planned since almost all of the benefits
are from allowing the planner more flexibility to do things like join
removal, so I don't think there's any danger of this code becoming
broken later.

I looked in other places in the planner which are looking at
distinctClauses. Many are just testing distinctClauses == NIL, which
are not affected here since we'll always leave at least 1 item in that
List.

Another place that looks interesting is query_is_distinct_for() in
analyzejoins.c. However, the subquery is being analyzed here long
before it's getting planned, so all the distinctClauses will be fully
intact at that time.

2. There's a comment in planner.c to the effect that

* When we have DISTINCT ON, we must sort by the more rigorous of
* DISTINCT and ORDER BY, else it won't have the desired behavior.
* Also, if we do have to do an explicit sort, we might as well use
* the more rigorous ordering to avoid a second sort later. (Note
* that the parser will have ensured that one clause is a prefix of
* the other.)

Removing random elements of the distinctClause will break its
correspondence with the sortClause, with probably bad results.

I do not remember for sure at the moment, but it may be that this
correspondence is only important for the case of DISTINCT ON, in which
case we could dodge the problem by not applying the optimization unless
it's plain DISTINCT. That doesn't help us with point 1 though.

Good point, although we could probably just apply the same
optimization to the ORDER BY to sidestep that. I've left a note in
that area as something to think about for another day, although I
think the refactoring done in this patch would make it a pretty easy
thing to fix.

BTW, my dictionary says it's "dependent" not "dependant".

You're right, regardless if that dictionary says Webster's or Oxford
on the front. Fixed.

I've attached an updated patch.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

remove_useless_distinct_clauses_v3.patchapplication/octet-stream; name=remove_useless_distinct_clauses_v3.patchDownload+174-37
#10Melanie Plageman
melanieplageman@gmail.com
In reply to: David Rowley (#9)
Re: Removing useless DISTINCT clauses

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

This is a review of the patch to remove useless DISTINCT columns from the DISTINCT clause
/messages/by-id/CAKJS1f8UMJ137sRuVSnEMDDpa57Q71JuLZi4yLCFMekNYVYqaQ@mail.gmail.com

Contents & Purpose
==================

This patch removes any additional columns in the DISTINCT clause when the
table's primary key columns are entirely present in the DISTINCT clause. This
optimization works because the PK columns functionally determine the other
columns in the relation. The patch includes regression test cases.

Initial Run
===========
The patch applies cleanly to HEAD. All tests which are run as part of the
`installcheck` target pass correctly (all existing tests pass, all new tests
pass with the patch applied and fail without it). All TAP tests pass. All tests
which are run as part of the `installcheck-world` target pass except one of the
Bloom contrib module tests in `contrib/bloom/sql/bloom.sql`,
`CREATE INDEX bloomidx ON tst USING bloom (i, t) WITH (col1 = 3);`
which is currently also failing (crashing) for me on master, so it is unrelated to the
patch. The test cases seem to cover the new behavior.

Functionality
=============
Given that this optimization is safe for the GROUP BY clause (you can remove
functionally dependent attributes from the clause there as well), the logic
does seem to follow for DISTINCT. It seems semantically correct to do this. As
for the implementation, the author seems to have reasonably addressed concerns
expressed over the distinction between DISTINCT and DISTINCT ON. As far as I
can see, this should be fine.

Code Review
===========
For a small performance hit but an improvement in readability, the length check
can be moved from the individual group by and distinct clause checks into the
helper function

if (list_length(parse->distinctClause) < 2)
return;

and

if (list_length(parse->groupClause) < 2)
return;

can be moved into `remove_functionally_dependent_clauses`

Comments/Documentation
=======================

The main helper function that is added `remove_functionally_dependent_clauses`
uses one style of comment--with the name of the function and then the rest of
the description indented which is found some places in the code,
/*
* remove_functionally_dependent_clauses
* Processes clauselist and removes any items which are deemed to be
* functionally dependent on other clauselist items.
*
* If any item from the list can be removed, then a new list is built which
* does not contain the removed items. If no item can be removed then the
* original list is returned.
*/

while other helper functions in the same file use a different style--all lines
flush to the side with a space. I'm not sure which is preferred

/*
* limit_needed - do we actually need a Limit plan node?
*
* If we have constant-zero OFFSET and constant-null LIMIT, we can skip adding
* a Limit node. This is worth checking for because "OFFSET 0" is a common
* locution for an optimization fence. (Because other places in the planner
...

it looks like the non-indented ones are from older commits (2013 vs 2016).

The list length change is totally optional, but I'll leave it as Waiting On Author in case the author wants to make that change.

Overall, LGTM.

The new status of this patch is: Waiting on Author

#11David Rowley
dgrowleyml@gmail.com
In reply to: Melanie Plageman (#10)
Re: Removing useless DISTINCT clauses

On 21 March 2018 at 16:29, Melanie Plageman <melanieplageman@gmail.com> wrote:

For a small performance hit but an improvement in readability, the length check
can be moved from the individual group by and distinct clause checks into the
helper function

if (list_length(parse->distinctClause) < 2)
return;

and

if (list_length(parse->groupClause) < 2)
return;

can be moved into `remove_functionally_dependent_clauses`

I remember thinking about this when writing, and I think I ended up
doing the check earlier as I thought that having 1 GROUP BY clause
item would be the most common case, so it seems like a good idea to
abort as early as possible in that case. That's most likely not the
case for DISTINCT.

I imagine it probably does not make that much difference anyway, so
I've moved it into the remove_functionally_dependent_clauses()
function, as mentioned.

The main helper function that is added `remove_functionally_dependent_clauses`
uses one style of comment--with the name of the function and then the rest of
the description indented which is found some places in the code,
/*
* remove_functionally_dependent_clauses
* Processes clauselist and removes any items which are deemed to be
* functionally dependent on other clauselist items.
*
* If any item from the list can be removed, then a new list is built which
* does not contain the removed items. If no item can be removed then the
* original list is returned.
*/

while other helper functions in the same file use a different style--all lines
flush to the side with a space. I'm not sure which is preferred

You might have a point, but remove_useless_groupby_columns already
does it this way, so I don't think changing that is a good idea.

The new status of this patch is: Waiting on Author

Thanks for reviewing this. I've attached an updated patch. I'll set
back to waiting for review again.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

remove_useless_distinct_clauses_v4.patchapplication/octet-stream; name=remove_useless_distinct_clauses_v4.patchDownload+174-41
#12Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Melanie Plageman (#10)
Re: Removing useless DISTINCT clauses

Melanie Plageman wrote:

Contents & Purpose
==================

This patch removes any additional columns in the DISTINCT clause when the
table's primary key columns are entirely present in the DISTINCT clause. This
optimization works because the PK columns functionally determine the other
columns in the relation. The patch includes regression test cases.

Would it be very difficult to extend that to "if any unique constraints are
contained in the DISTINCT clause"?

Yours,
Laurenz Albe

#13David Rowley
dgrowleyml@gmail.com
In reply to: Laurenz Albe (#12)
Re: Removing useless DISTINCT clauses

On 22 March 2018 at 21:16, Laurenz Albe <laurenz.albe@cybertec.at> wrote:

Would it be very difficult to extend that to "if any unique constraints are
contained in the DISTINCT clause"?

Unfortunately, it's quite a bit more work to make that happen. It's
not just unique constraints that are required to make this work. We
also need to pay attention to NOT NULL constraints too, as we're
unable to prove function dependency when the keys have NULLs, as there
may be any number of rows containing NULL values.

The problem is that in order to properly invalidate cached plans we
must record the constraint OIDs which the plan depends on. We can do
that for PK and UNIQUE constraints, unfortunately, we can't do it for
NOT NULL constraints as, at the moment, these are just properties of
pg_attribute. For this to be made to work we'd need to store those in
pg_constraint too, which is more work that I'm going to do for this
patch.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#14Laurenz Albe
laurenz.albe@cybertec.at
In reply to: David Rowley (#13)
Re: Removing useless DISTINCT clauses

David Rowley wrote:

Would it be very difficult to extend that to "if any unique constraints are
contained in the DISTINCT clause"?

Unfortunately, it's quite a bit more work to make that happen. It's
not just unique constraints that are required to make this work. We
also need to pay attention to NOT NULL constraints too, as we're
unable to prove function dependency when the keys have NULLs, as there
may be any number of rows containing NULL values.

The problem is that in order to properly invalidate cached plans we
must record the constraint OIDs which the plan depends on. We can do
that for PK and UNIQUE constraints, unfortunately, we can't do it for
NOT NULL constraints as, at the moment, these are just properties of
pg_attribute. For this to be made to work we'd need to store those in
pg_constraint too, which is more work that I'm going to do for this
patch.

That makes sense; thanks for explaining.

Yours,
Laurenz Albe

#15Jim Finnerty
jfinnert@amazon.com
In reply to: David Rowley (#13)
Re: Removing useless DISTINCT clauses

David, first of all, I'm delighted to see this change. I've been intending
to give you feedback on it for a while, but I've been flat out with work, so
I'm sorry.

Any improvement in our DISTINCT optimization would be helpful. Note,
though, that DISTINCT optimization can include the surrounding query block
context, because sometimes the DISTINCT is required, sometimes it is
prohibited, and sometimes it is optional. When DISTINCT is optional, you
can produce distinct rows if it is efficient and convenient to do so, as you
might from an index scan having leading keys that match the DISTINCT, but
you

Consider if you have a SELECT DISTINCT under a UNION that does a distinct on
the same grouping keys. In that case, the UNION will do the distinct
operation anyway, so the SELECT DISTINCT can become a SELECT.

Please search for and read the old-but-good paper entitled "Extensible/Rule
Based Query Optimization in Starburst", by Pirahesh et. al.

Distinctness can also be preserved across joins, so if you have a 'snowflake
query' type join, where all the joins are to a unique key, then the
distinctness of the other side of the join is preserved. For example, a
SELECT DISTINCT * FROM fact_table ... that joins from each column in its
compound primary key to a unique key of another (dimension) table would
remain distinct, and so you could drop the DISTINCT from the query.

This is a relatively weak area of PostgreSQL at the moment, so IMHO we
really need this.

Regarding constraint OIDs and invalidation when a NOT NULL column is
modified to remove this constraint, this would be accomplished with an ALTER
TABLE <t> ALTER COLUMN <c> <type> NULL.

Why wouldn't / can't the ALTER TABLE cause an invalidation of cached plans
that depend on table t?

--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

#16Melanie Plageman
melanieplageman@gmail.com
In reply to: David Rowley (#13)
Re: Removing useless DISTINCT clauses

On Thu, Mar 22, 2018 at 5:20 PM, David Rowley <david.rowley@2ndquadrant.com>
wrote:

The problem is that in order to properly invalidate cached plans we
must record the constraint OIDs which the plan depends on. We can do
that for PK and UNIQUE constraints, unfortunately, we can't do it for
NOT NULL constraints as, at the moment, these are just properties of
pg_attribute. For this to be made to work we'd need to store those in
pg_constraint too, which is more work that I'm going to do for this
patch.

I was just wondering, since get_primary_key_attnos opens pg_constraint and
just skips attributes with other constraint types than primary, what would
be the reason we wouldn't just also open pg_attribute and check for the
non-nullness of the non-primary key unique attributes?
Couldn't we add a function which opens both pg_attribute and pg_constraint
to get non-null unique attributes?
I suppose that would have a performance impact.
And, I did notice the FIXME in the comment before check_functional_grouping
which seems to make the argument that there are other use cases for wanting
the non-nullness represented in pg_constraint.

* Currently we only check to see if the rel has a primary key that is a
* subset of the grouping_columns. We could also use plain unique
constraints
* if all their columns are known not null, but there's a problem: we need
* to be able to represent the not-null-ness as part of the constraints
added
* to *constraintDeps. FIXME whenever not-null constraints get represented
* in pg_constraint.
*/
--
Melanie Plageman

#17David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#11)
Re: Removing useless DISTINCT clauses

On 22 March 2018 at 18:58, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 21 March 2018 at 16:29, Melanie Plageman <melanieplageman@gmail.com> wrote:

The new status of this patch is: Waiting on Author

Thanks for reviewing this. I've attached an updated patch. I'll set
back to waiting for review again.

I've attached another version of the patch. The only change is a
thinko in a comment which claimed it may be possible to do something
with DISTINCT ON clauses, but on further thought, I don't think that's
possible, so I've updated the comment to reflect my new thoughts.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

remove_useless_distinct_clauses_v5.patchapplication/octet-stream; name=remove_useless_distinct_clauses_v5.patchDownload+174-41
#18David Rowley
dgrowleyml@gmail.com
In reply to: Melanie Plageman (#16)
Re: Removing useless DISTINCT clauses

On 24 March 2018 at 05:55, Melanie Plageman <melanieplageman@gmail.com> wrote:

I was just wondering, since get_primary_key_attnos opens pg_constraint and
just skips attributes with other constraint types than primary, what would
be the reason we wouldn't just also open pg_attribute and check for the
non-nullness of the non-primary key unique attributes?
Couldn't we add a function which opens both pg_attribute and pg_constraint
to get non-null unique attributes?
I suppose that would have a performance impact.
And, I did notice the FIXME in the comment before check_functional_grouping
which seems to make the argument that there are other use cases for wanting
the non-nullness represented in pg_constraint.

It's certainly possible to do more here. I'm uncertain what needs to
be done in regards to cached plan invalidation, but we'd certainly
need to ensure cached plans are invalidated whenever the attnotnull is
changed.

There would be no need to look at the pg_attribute table directly. A
syscache function would just need added named get_attnotnull() which
would be akin to something like get_atttypmod().

I'm personally not planning on doing anything in that regard for this
patch. We have 1 week to go before PG11 feature freeze. My goals for
this patch was to apply the same optimization to DISTINCT as I did for
GROUP BY a while back, which is exactly what the patch does. I agree
that more would be nice, but unfortunately, time is finite.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#19David Rowley
dgrowleyml@gmail.com
In reply to: Jim Finnerty (#15)
Re: Removing useless DISTINCT clauses

On 24 March 2018 at 01:42, Jim Finnerty <jfinnert@amazon.com> wrote:

Distinctness can also be preserved across joins, so if you have a 'snowflake
query' type join, where all the joins are to a unique key, then the
distinctness of the other side of the join is preserved. For example, a
SELECT DISTINCT * FROM fact_table ... that joins from each column in its
compound primary key to a unique key of another (dimension) table would
remain distinct, and so you could drop the DISTINCT from the query.

I'm aware. It is something I'm interested in but would require several
orders of magnitude more work than what I've done for this patch. You
may have noticed the other work I did a while back to detect if joins
cause row duplicate or not, so it's certainly something I've thought
about.

If Amazon would like to sponsor work in this area then please see [1]https://wiki.postgresql.org/wiki/How_to_sponsor_a_feature.

It certainly would be great to see that happen.

[1]: https://wiki.postgresql.org/wiki/How_to_sponsor_a_feature

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#20Melanie Plageman
melanieplageman@gmail.com
In reply to: David Rowley (#19)
Re: Removing useless DISTINCT clauses

Hi David,
The updated patch looks good to me.
I've changed the status to "ready for committer"
Thanks

#21David Rowley
dgrowleyml@gmail.com
In reply to: Melanie Plageman (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#18)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#22)
#24David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#23)
#25Jim Finnerty
jfinnert@amazon.com
In reply to: David Rowley (#24)
#26David Rowley
dgrowleyml@gmail.com
In reply to: Jim Finnerty (#25)
#27Jim Finnerty
jfinnert@amazon.com
In reply to: David Rowley (#26)
#28Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jim Finnerty (#27)
#29Jim Finnerty
jfinnert@amazon.com
In reply to: Alvaro Herrera (#28)
#30David Rowley
dgrowleyml@gmail.com
In reply to: Jim Finnerty (#29)
#31Stephen Frost
sfrost@snowman.net
In reply to: David Rowley (#30)
#32David Rowley
dgrowleyml@gmail.com
In reply to: Stephen Frost (#31)
#33Stephen Frost
sfrost@snowman.net
In reply to: Jim Finnerty (#27)
#34Stephen Frost
sfrost@snowman.net
In reply to: David Rowley (#32)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#34)
#36Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#35)
#37David Rowley
dgrowleyml@gmail.com
In reply to: Stephen Frost (#36)
#38Jim Finnerty
jfinnert@amazon.com
In reply to: David Rowley (#37)
#39David Rowley
dgrowleyml@gmail.com
In reply to: Jim Finnerty (#38)