Group by reordering optimization

Started by Dmitry Dolgovover 5 years ago6 messageshackers
Jump to latest
#1Dmitry Dolgov
9erthalion6@gmail.com

Hi,

Better late than never, to follow up on the original thread [1]/messages/by-id/7c79e6a5-8597-74e8-0671-1c39d124c9d6@sigaev.ru I would like to
continue the discussion with the another version of the patch for group by
reordering optimization. To remind, it's about reordering of group by clauses
to do sorting more efficiently. The patch is rebased and modified to address
(at least partially) the suggestions about making it consider new additional
paths instead of changing original ones. It is still pretty much
proof-of-concept version though with many blind spots, but I wanted to start
kicking it and post at least something, otherwise it will never happen. An
incremental approach so to say.

In many ways it still contains the original code from Teodor. Changes and notes:

* Instead of changing the order directly, now patch creates another patch with
modifier order of clauses. It does so for the normal sort as well as for
incremental sort. The whole thing is done in two steps: first it finds a
potentially better ordering taking into account number of groups, widths and
comparison costs; afterwards this information is used to produce a cost
estimation. This is implemented via a separate create_reordered_sort_path to
not introduce too many changes, I couldn't find any better place.

* Function get_func_cost was removed at some point, but unfortunately this
patch was implemented before that, so it's still present there.

* For simplicity I've removed support in create_partial_grouping_paths, since
they were not covered by the existing tests anyway.

* The costing part is pretty rudimentary and looks only at the first group.
It's mostly hand crafted to pass the existing tests.

The question about handling skewed data sets is not addressed yet.

[1]: /messages/by-id/7c79e6a5-8597-74e8-0671-1c39d124c9d6@sigaev.ru

Attachments:

0001-Group-by-optimization.patchapplication/octet-stream; name=0001-Group-by-optimization.patchDownload+1086-63
#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dmitry Dolgov (#1)
Re: Group by reordering optimization

On Tue, Sep 01, 2020 at 01:15:31PM +0200, Dmitry Dolgov wrote:

Hi,

Better late than never, to follow up on the original thread [1] I would like to
continue the discussion with the another version of the patch for group by
reordering optimization. To remind, it's about reordering of group by clauses
to do sorting more efficiently. The patch is rebased and modified to address
(at least partially) the suggestions about making it consider new additional
paths instead of changing original ones. It is still pretty much
proof-of-concept version though with many blind spots, but I wanted to start
kicking it and post at least something, otherwise it will never happen. An
incremental approach so to say.

In many ways it still contains the original code from Teodor. Changes and notes:

* Instead of changing the order directly, now patch creates another patch with
modifier order of clauses. It does so for the normal sort as well as for
incremental sort. The whole thing is done in two steps: first it finds a
potentially better ordering taking into account number of groups, widths and
comparison costs; afterwards this information is used to produce a cost
estimation. This is implemented via a separate create_reordered_sort_path to
not introduce too many changes, I couldn't find any better place.

I haven't tested the patch with any queries, but I agree this seems like
the right approach in general.

I'm a bit worried about how complex the code in planner.c is getting -
the incremental sort patch already made it a bit too complex, and this
is just another step in that direction. I suppose we should refactor
add_paths_to_grouping_rel() by breaking it into smaller / more readable
pieces ...

* Function get_func_cost was removed at some point, but unfortunately this
patch was implemented before that, so it's still present there.

* For simplicity I've removed support in create_partial_grouping_paths, since
they were not covered by the existing tests anyway.

Hmmm, OK. I think that's something we'll need to address for the final
patch, but I agree we can add it after improving the costing etc.

* The costing part is pretty rudimentary and looks only at the first group.
It's mostly hand crafted to pass the existing tests.

OK, understood.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In reply to: Tomas Vondra (#2)
Re: Group by reordering optimization

On Tue, Sep 1, 2020 at 2:09 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

* Instead of changing the order directly, now patch creates another patch with
modifier order of clauses. It does so for the normal sort as well as for
incremental sort. The whole thing is done in two steps: first it finds a
potentially better ordering taking into account number of groups, widths and
comparison costs; afterwards this information is used to produce a cost
estimation. This is implemented via a separate create_reordered_sort_path to
not introduce too many changes, I couldn't find any better place.

I haven't tested the patch with any queries, but I agree this seems like
the right approach in general.

If we're creating a new sort path anyway, then perhaps we can also
change the collation -- it might be possible to "reduce" it to the "C"
collation without breaking queries.

This is admittedly pretty hard to do well. It could definitely work
out when we have to do a sort anyway -- a sort with high cardinality
abbreviated keys will be very fast (though we can't use abbreviated
keys with libc collations right now). OTOH, it would be quite
counterproductive if we were hoping to get an incremental sort that
used some available index that happens to use the default collation
(which is not the C collation in cases where this optimization is
expected to help).

--
Peter Geoghegan

#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Geoghegan (#3)
Re: Group by reordering optimization

On Tue, Sep 01, 2020 at 03:09:14PM -0700, Peter Geoghegan wrote:

On Tue, Sep 1, 2020 at 2:09 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

* Instead of changing the order directly, now patch creates another patch with
modifier order of clauses. It does so for the normal sort as well as for
incremental sort. The whole thing is done in two steps: first it finds a
potentially better ordering taking into account number of groups, widths and
comparison costs; afterwards this information is used to produce a cost
estimation. This is implemented via a separate create_reordered_sort_path to
not introduce too many changes, I couldn't find any better place.

I haven't tested the patch with any queries, but I agree this seems like
the right approach in general.

If we're creating a new sort path anyway, then perhaps we can also
change the collation -- it might be possible to "reduce" it to the "C"
collation without breaking queries.

This is admittedly pretty hard to do well. It could definitely work
out when we have to do a sort anyway -- a sort with high cardinality
abbreviated keys will be very fast (though we can't use abbreviated
keys with libc collations right now). OTOH, it would be quite
counterproductive if we were hoping to get an incremental sort that
used some available index that happens to use the default collation
(which is not the C collation in cases where this optimization is
expected to help).

Even if reducing collations like this was possible (I have no idea how
tricky it is, my knowledge of collations is pretty minimal and from what
I know I'm not dying to learn more), I suggest we consider that out of
scope for this particular patch.

There are multiple open issues already - deciding which pathkeys are
interesting, reasonable costing, etc. Once those issues are solved, we
can consider tweaking collations as an additional optimizations.

Or maybe we can consider it entirely separately, i.e. why would it
matter if we re-order the GROUP BY keys? The collation reduction can
just as well help even if we use the same pathkeys.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Tomas Vondra (#2)
Re: Group by reordering optimization

On Tue, Sep 01, 2020 at 11:08:54PM +0200, Tomas Vondra wrote:
On Tue, Sep 01, 2020 at 01:15:31PM +0200, Dmitry Dolgov wrote:

Hi,

Better late than never, to follow up on the original thread [1] I would like to
continue the discussion with the another version of the patch for group by
reordering optimization. To remind, it's about reordering of group by clauses
to do sorting more efficiently. The patch is rebased and modified to address
(at least partially) the suggestions about making it consider new additional
paths instead of changing original ones. It is still pretty much
proof-of-concept version though with many blind spots, but I wanted to start
kicking it and post at least something, otherwise it will never happen. An
incremental approach so to say.

In many ways it still contains the original code from Teodor. Changes and notes:

* Instead of changing the order directly, now patch creates another patch with
modifier order of clauses. It does so for the normal sort as well as for
incremental sort. The whole thing is done in two steps: first it finds a
potentially better ordering taking into account number of groups, widths and
comparison costs; afterwards this information is used to produce a cost
estimation. This is implemented via a separate create_reordered_sort_path to
not introduce too many changes, I couldn't find any better place.

I haven't tested the patch with any queries, but I agree this seems like
the right approach in general.

I'm a bit worried about how complex the code in planner.c is getting -
the incremental sort patch already made it a bit too complex, and this
is just another step in that direction. I suppose we should refactor
add_paths_to_grouping_rel() by breaking it into smaller / more readable
pieces ...

Yes, that was my impression as well. I'll try to make such refactoring
either as a separate patch or a part of the main one.

* Function get_func_cost was removed at some point, but unfortunately this
patch was implemented before that, so it's still present there.

* For simplicity I've removed support in create_partial_grouping_paths, since
they were not covered by the existing tests anyway.

Hmmm, OK. I think that's something we'll need to address for the final
patch, but I agree we can add it after improving the costing etc.

Sure, I plan to return it in time. IIUC in the original patch series
this code wasn't covered with tests, so I've decided to minimize the
changes.

#6Andrei Lepikhov
lepihov@gmail.com
In reply to: Tomas Vondra (#4)
Re: POC: GROUP BY optimization

On 9/2/20 9:12 PM, Tomas Vondra wrote:

We could simply use the input "tuples" value here, and then divide the
current and previous estimate to calculate the number of new groups.

Performing a review of this patch I made a number of changes (see
cleanup.txt). Maybe it will be useful.
As I see, the code, which implements the main idea, is quite stable.
Doubts localized in the cost estimation routine. Maybe try to finish
this work by implementing an conservative strategy to a cost estimation
of sorting?

--
regards,
Andrey Lepikhov
Postgres Professional

Attachments:

cleanup.txttext/plain; charset=UTF-8; name=cleanup.txtDownload+14-22
group-by-reorder-20211228.patchtext/x-patch; charset=UTF-8; name=group-by-reorder-20211228.patchDownload+1892-486