[PATCH] Use optimized single-datum tuplesort in ExecSort
Hello,
While testing the patch "Add proper planner support for ORDER BY / DISTINCT
aggregates" [0]https://commitfest.postgresql.org/33/3164/ I discovered the performance penalty from adding a sort node
essentially came from not using the single-datum tuplesort optimization in
ExecSort (contrary to the sorting done in ExecAgg).
I originally proposed this patch as a companion in the same thread [1]/messages/by-id/4480689.ObhdGn8bVM@aivenronan, but
following James suggestion I'm making a separate thread just for this as the
optimization is worthwhile independently of David's patch: it looks like we
can expect a 2x speedup on a "select a single ordered column" case.
The patch aimed to be as simple as possible: we only turn this optimization on
when the tuple being sorted has only one attribute, it is "byval" (so as not
to incur copies which would be hard to track in the execution tree) and
unbound (again, not having to deal with copying borrowed datum anywhere).
The attached patch is originally by me, with some cleanup by Ranier Vilela.
I'm sending Ranier's version here.
[0]: https://commitfest.postgresql.org/33/3164/
[1]: /messages/by-id/4480689.ObhdGn8bVM@aivenronan
--
Ronan Dunklau
Attachments:
v2-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchtext/x-patch; charset=UTF-8; name=v2-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchDownload+57-22
Em ter., 6 de jul. de 2021 às 03:15, Ronan Dunklau <ronan.dunklau@aiven.io>
escreveu:
Hello,
While testing the patch "Add proper planner support for ORDER BY /
DISTINCT
aggregates" [0] I discovered the performance penalty from adding a sort
node
essentially came from not using the single-datum tuplesort optimization in
ExecSort (contrary to the sorting done in ExecAgg).I originally proposed this patch as a companion in the same thread [1],
but
following James suggestion I'm making a separate thread just for this as
the
optimization is worthwhile independently of David's patch: it looks like
we
can expect a 2x speedup on a "select a single ordered column" case.The patch aimed to be as simple as possible: we only turn this
optimization on
when the tuple being sorted has only one attribute, it is "byval" (so as
not
to incur copies which would be hard to track in the execution tree) and
unbound (again, not having to deal with copying borrowed datum anywhere).The attached patch is originally by me, with some cleanup by Ranier
Vilela.
I'm sending Ranier's version here.
Nice Ronan.
But I think there is some confusion here.
The author is not you?
Just to clarify, at Commitfest, it was supposed to be the other way around.
You as an author and David as a reviewer.
I'll put myself as a reviewer too.
regards,
Ranier Vilela
Em ter., 6 de jul. de 2021 às 08:25, Ranier Vilela <ranier.vf@gmail.com>
escreveu:
Em ter., 6 de jul. de 2021 às 03:15, Ronan Dunklau <ronan.dunklau@aiven.io>
escreveu:Hello,
While testing the patch "Add proper planner support for ORDER BY /
DISTINCT
aggregates" [0] I discovered the performance penalty from adding a sort
node
essentially came from not using the single-datum tuplesort optimization
in
ExecSort (contrary to the sorting done in ExecAgg).I originally proposed this patch as a companion in the same thread [1],
but
following James suggestion I'm making a separate thread just for this as
the
optimization is worthwhile independently of David's patch: it looks like
we
can expect a 2x speedup on a "select a single ordered column" case.The patch aimed to be as simple as possible: we only turn this
optimization on
when the tuple being sorted has only one attribute, it is "byval" (so as
not
to incur copies which would be hard to track in the execution tree) and
unbound (again, not having to deal with copying borrowed datum anywhere).The attached patch is originally by me, with some cleanup by Ranier
Vilela.
I'm sending Ranier's version here.Nice Ronan.
But I think there is some confusion here.
The author is not you?Just to clarify, at Commitfest, it was supposed to be the other way around.
You as an author and David as a reviewer.
I'll put myself as a reviewer too.
Sorry David, my mistake.
I confused the numbers (id) of Commitfest.
regards,
Ranier Vilela
Adding David since this patch is likely a precondition for [1].
On Tue, Jul 6, 2021 at 2:15 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Hello,
While testing the patch "Add proper planner support for ORDER BY / DISTINCT
aggregates" [0] I discovered the performance penalty from adding a sort node
essentially came from not using the single-datum tuplesort optimization in
ExecSort (contrary to the sorting done in ExecAgg).I originally proposed this patch as a companion in the same thread [1], but
following James suggestion I'm making a separate thread just for this as the
optimization is worthwhile independently of David's patch: it looks like we
can expect a 2x speedup on a "select a single ordered column" case.The patch aimed to be as simple as possible: we only turn this optimization on
when the tuple being sorted has only one attribute, it is "byval" (so as not
to incur copies which would be hard to track in the execution tree) and
unbound (again, not having to deal with copying borrowed datum anywhere).
Thanks again for finding this and working up a patch.
I've taken a look, and while I haven't dug into testing it yet, I have
a few comments.
First, the changes are lacking any explanatory comments. Probably we
should follow how nodeAgg does this and add both comments to the
ExecSort function header as well as specific comments above the "if"
around the new tuplesort_begin_datum explaining the specific
conditions that are required for the optimization to be useful and
safe.
That leads to a question I had: I don't follow why bounded mode (when
using byval) needs to be excluded. Comments should be added if there's
a good reason (as noted above), but maybe it's a case we can handle
safely?
A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?
A few small code observations:
- In my view the addition of unlikely() in ExecSort is unlikely to be
of benefit because it's a single call for the entire node's execution
(not in the tuple loop).
- It seems clearer to change the "if (!node->is_single_val)" to flip
the true/false cases so we don't need the negation.
- I assume there are tests that likely already cover this case, but
it'd be worth verifying that.
Finally, I believe the same optimization likely ought to be added to
nodeIncrementalSort. It's less likely the tests there are sufficient
for both this and the original case, but we'll see.
Thanks,
James
1: /messages/by-id/CAApHDvpHzfo92=R4W0+xVua3BUYCKMckWAmo-2t_KiXN-wYH=w@mail.gmail.com
Em ter., 6 de jul. de 2021 às 10:19, James Coleman <jtc331@gmail.com>
escreveu:
Adding David since this patch is likely a precondition for [1].
On Tue, Jul 6, 2021 at 2:15 AM Ronan Dunklau <ronan.dunklau@aiven.io>
wrote:Hello,
While testing the patch "Add proper planner support for ORDER BY /
DISTINCT
aggregates" [0] I discovered the performance penalty from adding a sort
node
essentially came from not using the single-datum tuplesort optimization
in
ExecSort (contrary to the sorting done in ExecAgg).
I originally proposed this patch as a companion in the same thread [1],
but
following James suggestion I'm making a separate thread just for this as
the
optimization is worthwhile independently of David's patch: it looks like
we
can expect a 2x speedup on a "select a single ordered column" case.
The patch aimed to be as simple as possible: we only turn this
optimization on
when the tuple being sorted has only one attribute, it is "byval" (so as
not
to incur copies which would be hard to track in the execution tree) and
unbound (again, not having to deal with copying borrowed datum anywhere).Thanks again for finding this and working up a patch.
I've taken a look, and while I haven't dug into testing it yet, I have
a few comments.First, the changes are lacking any explanatory comments. Probably we
should follow how nodeAgg does this and add both comments to the
ExecSort function header as well as specific comments above the "if"
around the new tuplesort_begin_datum explaining the specific
conditions that are required for the optimization to be useful and
safe.That leads to a question I had: I don't follow why bounded mode (when
using byval) needs to be excluded. Comments should be added if there's
a good reason (as noted above), but maybe it's a case we can handle
safely?A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?A few small code observations:
- In my view the addition of unlikely() in ExecSort is unlikely to be
of benefit because it's a single call for the entire node's execution
(not in the tuple loop).
No objection. And I agree that testing is complex and needs to remain as it
is.
- It seems clearer to change the "if (!node->is_single_val)" to flip
the true/false cases so we don't need the negation.
I think yes, it can be better.
regards,
Ranier Vilela
On Tue, Jul 6, 2021 at 6:49 PM James Coleman <jtc331@gmail.com> wrote:
Adding David since this patch is likely a precondition for [1].
On Tue, Jul 6, 2021 at 2:15 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Hello,
While testing the patch "Add proper planner support for ORDER BY / DISTINCT
aggregates" [0] I discovered the performance penalty from adding a sort node
essentially came from not using the single-datum tuplesort optimization in
ExecSort (contrary to the sorting done in ExecAgg).I originally proposed this patch as a companion in the same thread [1], but
following James suggestion I'm making a separate thread just for this as the
optimization is worthwhile independently of David's patch: it looks like we
can expect a 2x speedup on a "select a single ordered column" case.The patch aimed to be as simple as possible: we only turn this optimization on
when the tuple being sorted has only one attribute, it is "byval" (so as not
to incur copies which would be hard to track in the execution tree) and
unbound (again, not having to deal with copying borrowed datum anywhere).Thanks again for finding this and working up a patch.
I've taken a look, and while I haven't dug into testing it yet, I have
a few comments.First, the changes are lacking any explanatory comments. Probably we
should follow how nodeAgg does this and add both comments to the
ExecSort function header as well as specific comments above the "if"
around the new tuplesort_begin_datum explaining the specific
conditions that are required for the optimization to be useful and
safe.That leads to a question I had: I don't follow why bounded mode (when
using byval) needs to be excluded. Comments should be added if there's
a good reason (as noted above), but maybe it's a case we can handle
safely?A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?
I think tuplesort_begin_datum, doesn't have any such limitation, it
can handle any type of Datum so I think we don't need to consider the
only attbyval, we can consider any type of attribute for this
optimization.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Thank you for the review, I will address those shortly, but will answer some
questions in the meantime.
First, the changes are lacking any explanatory comments. Probably we
should follow how nodeAgg does this and add both comments to the
ExecSort function header as well as specific comments above the "if"
around the new tuplesort_begin_datum explaining the specific
conditions that are required for the optimization to be useful and
safe.
Done, since I lifted the restrictions following your questions, there isn't
much left to comment. (see below)
That leads to a question I had: I don't follow why bounded mode (when
using byval) needs to be excluded. Comments should be added if there's
a good reason (as noted above), but maybe it's a case we can handle
safely?
I had test failures when trying to move the Datum around when performing a
bounded sort, but did not look into it at first.
Now I've looked into it, and the switch to a heapsort when using bounded mode
just unconditionnaly tried to free a tuple that was never there to begin with.
So if the SortTuple does not contain an actual tuple, but only a single datum,
do not do that.
I've updated the patch to fix this and enable the optimization in the case of
bounded sort.
A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?I think tuplesort_begin_datum, doesn't have any such limitation, it
can handle any type of Datum so I think we don't need to consider the
only attbyval, we can consider any type of attribute for this
optimization.
I've restricted the optimization to byval types because of the following
comment in nodeAgg.c:
/*
* Note: if input type is pass-by-ref, the datums returned by the
sort are
* freshly palloc'd in the per-query context, so we must be careful
to
* pfree them when they are no longer needed.
*/
As I was not sure how to handle that, I prefered the safety of not enabling
it. Since you both told me it should be safe, I've lifted that restriction
too.
A few small code observations:
- In my view the addition of unlikely() in ExecSort is unlikely to be
of benefit because it's a single call for the entire node's execution
(not in the tuple loop).
Done.
- It seems clearer to change the "if (!node->is_single_val)" to flip
the true/false cases so we don't need the negation.
Agreed, done.
- I assume there are tests that likely already cover this case, but
it'd be worth verifying that.
Yes many test cases cover that, but maybe it would be better to explictly
check for it on some cases: do you think we could add a debug message that can
be checked for ?
Finally, I believe the same optimization likely ought to be added to
nodeIncrementalSort. It's less likely the tests there are sufficient
for both this and the original case, but we'll see.
I will look into it, but isn't incrementalsort used to sort tuples on several
keys, when they are already sorted on the first ? In that case, I doubt we
would ever have a single-valued tuple here, except if there is a projection to
strip the tuple from extraneous attributes.
--
Ronan Dunklau
Attachments:
v3-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchtext/x-patch; charset=UTF-8; name=v3-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchDownload+74-26
On Tue, Jul 6, 2021 at 11:03 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Thank you for the review, I will address those shortly, but will answer some
questions in the meantime.First, the changes are lacking any explanatory comments. Probably we
should follow how nodeAgg does this and add both comments to the
ExecSort function header as well as specific comments above the "if"
around the new tuplesort_begin_datum explaining the specific
conditions that are required for the optimization to be useful and
safe.Done, since I lifted the restrictions following your questions, there isn't
much left to comment. (see below)That leads to a question I had: I don't follow why bounded mode (when
using byval) needs to be excluded. Comments should be added if there's
a good reason (as noted above), but maybe it's a case we can handle
safely?I had test failures when trying to move the Datum around when performing a
bounded sort, but did not look into it at first.Now I've looked into it, and the switch to a heapsort when using bounded mode
just unconditionnaly tried to free a tuple that was never there to begin with.
So if the SortTuple does not contain an actual tuple, but only a single datum,
do not do that.I've updated the patch to fix this and enable the optimization in the case of
bounded sort.
Awesome.
A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?I think tuplesort_begin_datum, doesn't have any such limitation, it
can handle any type of Datum so I think we don't need to consider the
only attbyval, we can consider any type of attribute for this
optimization.I've restricted the optimization to byval types because of the following
comment in nodeAgg.c:/*
* Note: if input type is pass-by-ref, the datums returned by the
sort are
* freshly palloc'd in the per-query context, so we must be careful
to
* pfree them when they are no longer needed.
*/As I was not sure how to handle that, I prefered the safety of not enabling
it. Since you both told me it should be safe, I've lifted that restriction
too.
To be clear, I don't know for certain it's safe [without extra work],
but even if it involves some extra manual pfree'ing (a la nodeAgg)
it's probably worth it. Maybe someone else will weigh in on whether or
not anything special is required here to ensure we don't leak memory
(I haven't looked in detail yet).
A few small code observations:
- In my view the addition of unlikely() in ExecSort is unlikely to be
of benefit because it's a single call for the entire node's execution
(not in the tuple loop).Done.
- It seems clearer to change the "if (!node->is_single_val)" to flip
the true/false cases so we don't need the negation.Agreed, done.
Thanks
- I assume there are tests that likely already cover this case, but
it'd be worth verifying that.Yes many test cases cover that, but maybe it would be better to explictly
check for it on some cases: do you think we could add a debug message that can
be checked for ?
Mostly I think we should verify code coverage and _maybe_ add a
specific test or two that we know execises this path. I don't know
that the debug message needs to be matched in the test (probably more
pain than it's worth), but the debug message ("enabling datum sort
optimizaton" or similar) might be good anyway.
I wonder if we need to change costing of sorts for this case. I don't
like having to do so, but it's a significant change in speed, so
probably should impact what plan gets chosen. Hopefully others will
weigh on this also.
Finally, I believe the same optimization likely ought to be added to
nodeIncrementalSort. It's less likely the tests there are sufficient
for both this and the original case, but we'll see.I will look into it, but isn't incrementalsort used to sort tuples on several
keys, when they are already sorted on the first ? In that case, I doubt we
would ever have a single-valued tuple here, except if there is a projection to
strip the tuple from extraneous attributes.
Yes and no. When incremental sort has to do a full sort there will
always be at least 2 attributes. But in prefix sort mode (see
prefixsort_state) only non-presorted columns are sorted (i.e., if
given a,b already sorted by a, then only b is sorted). So the
prefixsort_state could use this optimization.
James
Le mardi 6 juillet 2021, 17:37:53 CEST James Coleman a écrit :
Yes and no. When incremental sort has to do a full sort there will
always be at least 2 attributes. But in prefix sort mode (see
prefixsort_state) only non-presorted columns are sorted (i.e., if
given a,b already sorted by a, then only b is sorted). So the
prefixsort_state could use this optimization.
The optimization is not when we actually sort on a single key, but when we get
a single attribute in / out of the tuplesort. Since sorting always add
resjunk entries for the keys being sorted on, I don't think we can ever end up
in a situation where the optimization would kick in, since the entries for the
already-performed-sort keys will need to be present in the output.
Maybe if instead of adding resjunk entries to the whole query's targetlist,
sort and incrementalsort nodes were able to do a projection from the input
(needed tle + resjunk sorting tle) to a tuple containing only the needed tle
on output before actually sorting it, it would be possible, but that would be
quite a big design change.
In the meantime I fixed some formatting issues, please find attached a new
patch.
--
Ronan Dunklau
Attachments:
v4-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchtext/x-patch; charset=UTF-8; name=v4-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchDownload+77-26
On Wed, 7 Jul 2021 at 21:32, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
In the meantime I fixed some formatting issues, please find attached a new
patch.
I started to look at this.
First I wondered how often we might be able to apply this
optimisation, so I ran make check after adding some elog(NOTICE) calls
to output which method is going to be used just before we do the
tuplestore_begin_* calls. It looks like there are 614 instances of
Datum sorts and 4223 of tuple sorts. That's about 14.5% datum sorts.
223 of the 614 are byval types and the other 391 are byref. Not that
the regression tests are a good reflection of the real world, but if
it were then that's quite a good number of cases to be able to
optimise.
As for the patch, just a few things:
1. Can you add the missing braces in this if condition and the else
condition that belongs to it.
+ if (node->is_single_val)
+ for (;;)
+ {
2. I think it would nicer to name the new is_single_val field
"datumSort" instead. To me it seems more clear what it is for.
3. This seems to be a bug fix where byval datum sorts do not properly
handle bounded sorts. I think that maybe that should be fixed and
backpatched. I don't see anything that says Datum sorts can't be
bounded and if there were some restriction on that I'd expect
tuplesort_set_bound() to fail when the Tuplesortstate had been set up
with tuplesort_begin_datum().
static void
free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
{
- FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
- pfree(stup->tuple);
+ /*
+ * If the SortTuple is actually only a single Datum, which was not copied
+ * as it is a byval type, do not try to free it nor account for it in
+ * memory used.
+ */
+ if (stup->tuple)
+ {
+ FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ pfree(stup->tuple);
+ }
I can take this to another thread.
That's all I have for now.
David
Le lundi 12 juillet 2021, 15:11:17 CEST David Rowley a écrit :
On Wed, 7 Jul 2021 at 21:32, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
In the meantime I fixed some formatting issues, please find attached a new
patch.I started to look at this.
Thank you ! I'm attaching a new version of the patch taking your remarks into
account.
First I wondered how often we might be able to apply this
optimisation, so I ran make check after adding some elog(NOTICE) calls
to output which method is going to be used just before we do the
tuplestore_begin_* calls. It looks like there are 614 instances of
Datum sorts and 4223 of tuple sorts. That's about 14.5% datum sorts.
223 of the 614 are byval types and the other 391 are byref. Not that
the regression tests are a good reflection of the real world, but if
it were then that's quite a good number of cases to be able to
optimise.
That's an interesting stat.
As for the patch, just a few things:
1. Can you add the missing braces in this if condition and the else
condition that belongs to it.+ if (node->is_single_val) + for (;;) + {
Done.
2. I think it would nicer to name the new is_single_val field
"datumSort" instead. To me it seems more clear what it is for.
Done.
3. This seems to be a bug fix where byval datum sorts do not properly
handle bounded sorts. I think that maybe that should be fixed and
backpatched. I don't see anything that says Datum sorts can't be
bounded and if there were some restriction on that I'd expect
tuplesort_set_bound() to fail when the Tuplesortstate had been set up
with tuplesort_begin_datum().
I've kept this as-is for now, but I will remove it from my patch if it is
deemed worthy of back-patching in your other thread.
Regards,
--
Ronan Dunklau
Attachments:
v5-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchtext/x-patch; charset=UTF-8; name=v5-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchDownload+79-23
On Tue, 13 Jul 2021 at 01:59, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
3. This seems to be a bug fix where byval datum sorts do not properly
handle bounded sorts. I think that maybe that should be fixed and
backpatched. I don't see anything that says Datum sorts can't be
bounded and if there were some restriction on that I'd expect
tuplesort_set_bound() to fail when the Tuplesortstate had been set up
with tuplesort_begin_datum().I've kept this as-is for now, but I will remove it from my patch if it is
deemed worthy of back-patching in your other thread.
I've now pushed that bug fix so it's fine to remove the change to
tuplesort.c now.
I also did a round of benchmarking on this patch using the attached
script. Anyone wanting to run it will need to run make installcheck
first to create the required tables.
On an AMD machine, I got the following results.
Result in transactions per second.
Test master v5 patch compare
Test1 446.1 657.3 147.32%
Test2 315.8 314.0 99.44%
Test3 302.3 392.1 129.67%
Test4 232.7 230.7 99.12%
Test5 230.0 446.1 194.00%
Test6 199.5 217.9 109.23%
Test7 188.7 185.3 98.21%
Test8 385.4 544.0 141.17%
Tests 2, 4, 7 are designed to check if there is any regression from
doing the additional run-time checks to see if we're doing datumSort.
I measured a very small penalty from this. It's most visible in test7
with a drop of about 1.8%. Each test did OFFSET 1000000 as I didn't
want to measure the overhead of outputting tuples.
All the other tests show a pretty good gain. Test6 is testing a byref
type, so it appears the gains are not just from byval datums.
It would be good to see the benchmark script run on a few other
machines to get an idea if the gains and losses are consistent.
In theory, we likely could get rid of the small regression by having
two versions of ExecSort() and setting the correct one during
ExecInitSort() by setting the function pointer to the version we want
to use in sortstate->ss.ps.ExecProcNode. But maybe the small
regression is not worth going to that trouble over. I'm not aware of
any other executor nodes that have logic like that so maybe it would
be a bad idea to introduce something like that.
David
I've now pushed that bug fix so it's fine to remove the change to
tuplesort.c now.
Thanks, I've rebased the patch, please find attached the v6.
I also did a round of benchmarking on this patch using the attached
script. Anyone wanting to run it will need to run make installcheck
first to create the required tables.
I've run your benchmark, keeping the best of three runs each time.
This is an intel laptop, so as many things are running on it there is a lot of
noise...
Both standard and patched run come from a compilation with gcc -O2. No changes
have been done to the default settings.
Query # Master Patched Variation
1 884 1627 184.05%
2 364 375 103.02%
3 568 783 137.85%
4 296 297 100.34%
5 421 484 114.96%
6 359 408 113.65%
7 237 251 105.91%
8 806 1271 157.69%
Since I didn't reproduce your slowdown at all on the first run, I tried to
rerun the benchmark several times and for the "dubious cases" (2, 4 and 7),
the results are too jittery to conclude one way or another in my case. I
don't have access to proper hardware, so not sure if that would be useful in
any way to just run the bench for thousands of xacts instead. I would be
surprised the check adds that much to the whole execution though.
I attach a graph similar to yours for reference.
--
Ronan Dunklau
Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <ronan.dunklau@aiven.io>
escreveu:
I've now pushed that bug fix so it's fine to remove the change to
I would be
surprised the check adds that much to the whole execution though.
I think this branch is a misprediction.
In most cases is it not datumSort?
That's why I would like to use unlikely.
IMO all the tests should all be to verify past behavior first.
regards,
Ranier Vilela
On Wed, 14 Jul 2021 at 00:06, Ranier Vilela <ranier.vf@gmail.com> wrote:
Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <ronan.dunklau@aiven.io> escreveu:
I would be
surprised the check adds that much to the whole execution though.I think this branch is a misprediction.
It could be. I wondered that myself when I saw Ronan's results were
better than mine for 2,4 and 7. However, I think Ronan had quite a
bit of noise in his results as there's no reason for the speedup in
tests 2,4 and 7.
In most cases is it not datumSort?
who knows. Maybe someone's workload always requires the datum sort.
That's why I would like to use unlikely.
We really only use unlikely() in cases where we want to move code out
of line to a cold area because it's really never executed under normal
circumstances. We tend to do that for ERROR cases as we don't ever
really want to optimise for errors. We also sometimes do it when some
function has a branch to initialise something during the first call.
The case in question here does not fit for either of those two cases.
IMO all the tests should all be to verify past behavior first.
I'm not quire sure what you mean there.
David
Em ter., 13 de jul. de 2021 às 09:24, David Rowley <dgrowleyml@gmail.com>
escreveu:
On Wed, 14 Jul 2021 at 00:06, Ranier Vilela <ranier.vf@gmail.com> wrote:
Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <
ronan.dunklau@aiven.io> escreveu:
I would be
surprised the check adds that much to the whole execution though.I think this branch is a misprediction.
It could be. I wondered that myself when I saw Ronan's results were
better than mine for 2,4 and 7. However, I think Ronan had quite a
bit of noise in his results as there's no reason for the speedup in
tests 2,4 and 7.
In most cases is it not datumSort?
who knows. Maybe someone's workload always requires the datum sort.
That's why I would like to use unlikely.
We really only use unlikely() in cases where we want to move code out
of line to a cold area because it's really never executed under normal
circumstances. We tend to do that for ERROR cases as we don't ever
really want to optimise for errors. We also sometimes do it when some
function has a branch to initialise something during the first call.
The case in question here does not fit for either of those two cases.
Hum, I understand the usage cases now.
Thanks for the hint.
IMO all the tests should all be to verify past behavior first.
I'm not quire sure what you mean there.
I'm saying we could help the branch by keeping the same testing logic as
before and not reversing it.
Attached is a version to demonstrate this, I don't pretend to be v7.
I couldn't find a good phrase to the contrary:
"are we *not* using the single value optimization ?"
I don't have time to take the tests right now.
regards,
Ranier Vilela
Attachments:
allow_single_datum_sort.patchapplication/octet-stream; name=allow_single_datum_sort.patchDownload+69-21
Em ter., 13 de jul. de 2021 às 09:44, Ranier Vilela <ranier.vf@gmail.com>
escreveu:
Em ter., 13 de jul. de 2021 às 09:24, David Rowley <dgrowleyml@gmail.com>
escreveu:On Wed, 14 Jul 2021 at 00:06, Ranier Vilela <ranier.vf@gmail.com> wrote:
Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <
ronan.dunklau@aiven.io> escreveu:
I would be
surprised the check adds that much to the whole execution though.I think this branch is a misprediction.
It could be. I wondered that myself when I saw Ronan's results were
better than mine for 2,4 and 7. However, I think Ronan had quite a
bit of noise in his results as there's no reason for the speedup in
tests 2,4 and 7.In most cases is it not datumSort?
who knows. Maybe someone's workload always requires the datum sort.
That's why I would like to use unlikely.
We really only use unlikely() in cases where we want to move code out
of line to a cold area because it's really never executed under normal
circumstances. We tend to do that for ERROR cases as we don't ever
really want to optimise for errors. We also sometimes do it when some
function has a branch to initialise something during the first call.
The case in question here does not fit for either of those two cases.Hum, I understand the usage cases now.
Thanks for the hint.IMO all the tests should all be to verify past behavior first.
I'm not quire sure what you mean there.
I'm saying we could help the branch by keeping the same testing logic as
before and not reversing it.
Attached is a version to demonstrate this, I don't pretend to be v7.I couldn't find a good phrase to the contrary:
"are we *not* using the single value optimization ?"I don't have time to take the tests right now.
Finally I had time to benchmark (David's benchsort.sh)
ubuntu 64 bits (20.04) 8gb ram SSD 256GB.
Table with the best results of each.
HEAD v6 v7 v7b v6 vs
master v7 vs v6 v7b vs v6
Test1 288,149636 449,018541 469,757169 550,48505 155,83% 104,62% 122,60%
Test2 94,766955 95,451406 94,556249 94,718982 100,72% 99,06% 99,23%
Test3 190,521319 260,279802 259,597067 278,115296 136,61% 99,74% 106,85%
Test4 78,779344 78,253455 78,114068 77,941482 99,33% 99,82% 99,60%
Test5 131,362614 142,662223 136,436347 149,639041 108,60% 95,64% 104,89%
Test6 112,884298 124,181671 115,528328 127,58497 110,01% 93,03% 102,74%
Test7 69,308587 68,643067 66,10195 69,087544 99,04% 96,30% 100,65%
Test8 243,674171 364,681142 371,928453 419,259703 149,66% 101,99% 114,97%
I have no idea why v7 failed with test6?
v6 slowdown with test4 and test7.
v7b slowdown with test2 and test4, in relation with v7.
If field struct datumSort is not absolutely necessary, I think that v7 will
be better.
Attached the patchs and file results.
Attachments:
bench_datumsort.txttext/plain; charset=US-ASCII; name=bench_datumsort.txtDownload
v7-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchtext/x-patch; charset=US-ASCII; name=v7-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchDownload+59-12
v7b-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchtext/x-patch; charset=US-ASCII; name=v7b-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchDownload+58-12
Em ter., 13 de jul. de 2021 às 14:42, Ranier Vilela <ranier.vf@gmail.com>
escreveu:
Em ter., 13 de jul. de 2021 às 09:44, Ranier Vilela <ranier.vf@gmail.com>
escreveu:Em ter., 13 de jul. de 2021 às 09:24, David Rowley <dgrowleyml@gmail.com>
escreveu:On Wed, 14 Jul 2021 at 00:06, Ranier Vilela <ranier.vf@gmail.com> wrote:
Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <
ronan.dunklau@aiven.io> escreveu:
I would be
surprised the check adds that much to the whole execution though.I think this branch is a misprediction.
It could be. I wondered that myself when I saw Ronan's results were
better than mine for 2,4 and 7. However, I think Ronan had quite a
bit of noise in his results as there's no reason for the speedup in
tests 2,4 and 7.In most cases is it not datumSort?
who knows. Maybe someone's workload always requires the datum sort.
That's why I would like to use unlikely.
We really only use unlikely() in cases where we want to move code out
of line to a cold area because it's really never executed under normal
circumstances. We tend to do that for ERROR cases as we don't ever
really want to optimise for errors. We also sometimes do it when some
function has a branch to initialise something during the first call.
The case in question here does not fit for either of those two cases.Hum, I understand the usage cases now.
Thanks for the hint.IMO all the tests should all be to verify past behavior first.
I'm not quire sure what you mean there.
I'm saying we could help the branch by keeping the same testing logic as
before and not reversing it.
Attached is a version to demonstrate this, I don't pretend to be v7.I couldn't find a good phrase to the contrary:
"are we *not* using the single value optimization ?"I don't have time to take the tests right now.
Finally I had time to benchmark (David's benchsort.sh)
ubuntu 64 bits (20.04) 8gb ram SSD 256GB.
Table with the best results of each.HEAD v6 v7 v7b v6 vs
master v7 vs v6 v7b vs v6
Test1 288,149636 449,018541 469,757169 550,48505 155,83% 104,62% 122,60%
Test2 94,766955 95,451406 94,556249 94,718982 100,72% 99,06% 99,23%
Test3 190,521319 260,279802 259,597067 278,115296 136,61% 99,74% 106,85%
Test4 78,779344 78,253455 78,114068 77,941482 99,33% 99,82% 99,60%
Test5 131,362614 142,662223 136,436347 149,639041 108,60% 95,64% 104,89%
Test6 112,884298 124,181671 115,528328 127,58497 110,01% 93,03% 102,74%
Test7 69,308587 68,643067 66,10195 69,087544 99,04% 96,30% 100,65%
Test8 243,674171 364,681142 371,928453 419,259703 149,66% 101,99% 114,97%I have no idea why v7 failed with test6?
v6 slowdown with test4 and test7.
v7b slowdown with test2 and test4, in relation with v7.
v7b slowdown with test2 and test4, in relation with *v6*.
If field struct datumSort is not absolutely necessary, I think that v7
will be better.
*v7b* will be better.
Sorry for the noise.
regards,
Ranier Vilela
On Tue, 13 Jul 2021 at 15:15, David Rowley <dgrowleyml@gmail.com> wrote:
In theory, we likely could get rid of the small regression by having
two versions of ExecSort() and setting the correct one during
ExecInitSort() by setting the function pointer to the version we want
to use in sortstate->ss.ps.ExecProcNode.
Just to see how it would perform, I tried what I mentioned above. I've
included what I ended up with in the attached POC patch.
I got the following results on my AMD hardware.
Test master v8 patch comparison
Test1 448.0 671.7 149.9%
Test2 316.4 317.5 100.3%
Test3 299.5 381.6 127.4%
Test4 219.7 229.5 104.5%
Test5 226.3 254.6 112.5%
Test6 197.9 217.9 110.1%
Test7 179.2 185.3 103.4%
Test8 389.2 544.8 140.0%
This time I saw no regression on tests 2, 4 and 7.
I looked to see if there was anywhere else in the executor that
conditionally uses a different exec function in this way and found
nothing, so I'm not too sure if it's a good idea to start doing this.
It would be good to get a 2nd opinion about this idea. Also, more
benchmark results with v6 and v8 would be good too.
David
Attachments:
v8_poc_datum_tuple_sort.patchapplication/octet-stream; name=v8_poc_datum_tuple_sort.patchDownload+146-7
Em qua., 14 de jul. de 2021 às 07:14, David Rowley <dgrowleyml@gmail.com>
escreveu:
On Tue, 13 Jul 2021 at 15:15, David Rowley <dgrowleyml@gmail.com> wrote:
In theory, we likely could get rid of the small regression by having
two versions of ExecSort() and setting the correct one during
ExecInitSort() by setting the function pointer to the version we want
to use in sortstate->ss.ps.ExecProcNode.Just to see how it would perform, I tried what I mentioned above. I've
included what I ended up with in the attached POC patch.I got the following results on my AMD hardware.
Test master v8 patch comparison
Test1 448.0 671.7 149.9%
Test2 316.4 317.5 100.3%
Test3 299.5 381.6 127.4%
Test4 219.7 229.5 104.5%
Test5 226.3 254.6 112.5%
Test6 197.9 217.9 110.1%
Test7 179.2 185.3 103.4%
Test8 389.2 544.8 140.0%
I'm a little surprised by your results.
Test1 and Test8 look pretty good to me.
What is compiler and environment?
I repeated (3 times) the benchmark with v8 here,
and the results were not good.
HEAD v6 v7b v8
v6 vs head v8 vs v6 v8 vs v7b
Test1 288,149636 449,018541 550,48505 468,168165 155,83% 104,26% 85,05%
Test2 94,766955 95,451406 94,718982 94,800275 100,72% 99,32% 100,09%
Test3 190,521319 260,279802 278,115296 262,538383 136,61% 100,87% 94,40%
Test4 78,779344 78,253455 77,941482 78,471546 99,33% 100,28% 100,68%
Test5 131,362614 142,662223 149,639041 144,849303 108,60% 101,53% 96,80%
Test6 112,884298 124,181671 127,58497 124,29376 110,01% 100,09% 97,42%
Test7 69,308587 68,643067 69,087544 69,437312 99,04% 101,16% 100,51%
Test8 243,674171 364,681142 419,259703 369,239176 149,66% 101,25% 88,07%
This time I saw no regression on tests 2, 4 and 7.
I looked to see if there was anywhere else in the executor that
conditionally uses a different exec function in this way and found
nothing, so I'm not too sure if it's a good idea to start doing this.
Specialized functions can be a way to optimize. The compilers themselves do
it.
But the ExecSortTuple and ExecSortDatum are much more similar,
which can cause maintenance problems.
I don't think in this case it would be a good idea.
It would be good to get a 2nd opinion about this idea. Also, more
benchmark results with v6 and v8 would be good too.
Yeah, another different machine.
I would like to see other results with v7b.
Attached the file with all results from v8.
regards,
Ranier Vilela