[PATCH] Use optimized single-datum tuplesort in ExecSort
Hello,
While testing the patch "Add proper planner support for ORDER BY / DISTINCT
aggregates" [0], I discovered that the performance penalty from adding a sort
node essentially came from not using the single-datum tuplesort optimization
in ExecSort (contrary to the sorting done in ExecAgg).
I originally proposed this patch as a companion in the same thread [1], but
following James' suggestion I'm making a separate thread just for this, as the
optimization is worthwhile independently of David's patch: it looks like we
can expect a 2x speedup on a "select a single ordered column" case.
The patch aims to be as simple as possible: we only turn this optimization on
when the tuple being sorted has only one attribute, that attribute is "byval"
(so as not to incur copies which would be hard to track in the execution tree),
and the sort is unbounded (again, to avoid having to deal with copying a
borrowed datum anywhere).
The attached patch is originally by me, with some cleanup by Ranier Vilela.
I'm sending Ranier's version here.
[0]: https://commitfest.postgresql.org/33/3164/
[1]: /messages/by-id/4480689.ObhdGn8bVM@aivenronan
--
Ronan Dunklau
Attachments:
v2-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patch (text/x-patch)
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..363f1d90f0 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -85,16 +85,28 @@ ExecSort(PlanState *pstate)
outerNode = outerPlanState(node);
tupDesc = ExecGetResultType(outerNode);
-
- tuplesortstate = tuplesort_begin_heap(tupDesc,
- plannode->numCols,
- plannode->sortColIdx,
- plannode->sortOperators,
- plannode->collations,
- plannode->nullsFirst,
- work_mem,
- NULL,
- node->randomAccess);
+ if (unlikely(tupDesc->natts == 1 &&
+ !node->bounded &&
+ TupleDescAttr(tupDesc, 0)->attbyval))
+ {
+ node->is_single_val = true;
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ } else
+ tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plannode->numCols,
+ plannode->sortColIdx,
+ plannode->sortOperators,
+ plannode->collations,
+ plannode->nullsFirst,
+ work_mem,
+ NULL,
+ node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
@@ -103,15 +115,27 @@ ExecSort(PlanState *pstate)
* Scan the subplan and feed all the tuples to tuplesort.
*/
- for (;;)
- {
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
- }
+ if (!node->is_single_val)
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
+ else
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
/*
* Complete the sort.
@@ -150,9 +174,18 @@ ExecSort(PlanState *pstate)
* next fetch from the tuplesort.
*/
slot = node->ss.ps.ps_ResultTupleSlot;
- (void) tuplesort_gettupleslot(tuplesortstate,
- ScanDirectionIsForward(dir),
- false, slot, NULL);
+ if (node->is_single_val)
+ {
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ }
+ else
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+
return slot;
}
@@ -191,6 +224,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
sortstate->bounded = false;
sortstate->sort_Done = false;
sortstate->tuplesortstate = NULL;
+ sortstate->is_single_val = false;
/*
* Miscellaneous initialization
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0ec5509e7e..643f416c54 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
int64 bound_Done; /* value of bound we did the sort with */
void *tuplesortstate; /* private state of tuplesort.c */
bool am_worker; /* are we a worker? */
+ bool is_single_val; /* are we using the single value optimization ? */
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
On Tue, Jul 6, 2021 at 3:15 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Nice Ronan.
But I think there is some confusion here.
The author is not you?
Just to clarify, at Commitfest, it was supposed to be the other way around.
You as an author and David as a reviewer.
I'll put myself as a reviewer too.
regards,
Ranier Vilela
On Tue, Jul 6, 2021 at 8:25 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
Sorry David, my mistake.
I confused the numbers (id) of Commitfest.
regards,
Ranier Vilela
Adding David since this patch is likely a precondition for [1].
On Tue, Jul 6, 2021 at 2:15 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Thanks again for finding this and working up a patch.
I've taken a look, and while I haven't dug into testing it yet, I have
a few comments.
First, the changes are lacking any explanatory comments. Probably we
should follow how nodeAgg does this and add both comments to the
ExecSort function header as well as specific comments above the "if"
around the new tuplesort_begin_datum explaining the specific
conditions that are required for the optimization to be useful and
safe.
That leads to a question I had: I don't follow why bounded mode (when
using byval) needs to be excluded. Comments should be added if there's
a good reason (as noted above), but maybe it's a case we can handle
safely?
A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?
A few small code observations:
- In my view the addition of unlikely() in ExecSort is unlikely to be
of benefit because it's a single call for the entire node's execution
(not in the tuple loop).
- It seems clearer to change the "if (!node->is_single_val)" to flip
the true/false cases so we don't need the negation.
- I assume there are tests that likely already cover this case, but
it'd be worth verifying that.
Finally, I believe the same optimization likely ought to be added to
nodeIncrementalSort. It's less likely the tests there are sufficient
for both this and the original case, but we'll see.
Thanks,
James
1: /messages/by-id/CAApHDvpHzfo92=R4W0+xVua3BUYCKMckWAmo-2t_KiXN-wYH=w@mail.gmail.com
On Tue, Jul 6, 2021 at 10:19 AM James Coleman <jtc331@gmail.com> wrote:
that restriction. What's the difference here?A few small code observations:
- In my view the addition of unlikely() in ExecSort is unlikely to be
of benefit because it's a single call for the entire node's execution
(not in the tuple loop).
No objection. And I agree that testing is complex and needs to remain as it
is.
- It seems clearer to change the "if (!node->is_single_val)" to flip
the true/false cases so we don't need the negation.
I think yes, it can be better.
regards,
Ranier Vilela
On Tue, Jul 6, 2021 at 6:49 PM James Coleman <jtc331@gmail.com> wrote:
A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?
I think tuplesort_begin_datum doesn't have any such limitation, it
can handle any type of Datum so I think we don't need to consider
only attbyval, we can consider any type of attribute for this
optimization.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Thank you for the review, I will address those shortly, but will answer some
questions in the meantime.
First, the changes are lacking any explanatory comments. Probably we
should follow how nodeAgg does this and add both comments to the
ExecSort function header as well as specific comments above the "if"
around the new tuplesort_begin_datum explaining the specific
conditions that are required for the optimization to be useful and
safe.
Done, since I lifted the restrictions following your questions, there isn't
much left to comment. (see below)
That leads to a question I had: I don't follow why bounded mode (when
using byval) needs to be excluded. Comments should be added if there's
a good reason (as noted above), but maybe it's a case we can handle
safely?
I had test failures when trying to move the Datum around when performing a
bounded sort, but did not look into it at first.
Now I've looked into it, and the switch to a heapsort when using bounded mode
just unconditionally tried to free a tuple that was never there to begin with.
So if the SortTuple does not contain an actual tuple, but only a single datum,
do not do that.
I've updated the patch to fix this and enable the optimization in the case of
bounded sort.
A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?

I think tuplesort_begin_datum doesn't have any such limitation, it
can handle any type of Datum so I think we don't need to consider
only attbyval, we can consider any type of attribute for this
optimization.
I've restricted the optimization to byval types because of the following
comment in nodeAgg.c:
/*
 * Note: if input type is pass-by-ref, the datums returned by the sort are
 * freshly palloc'd in the per-query context, so we must be careful to
 * pfree them when they are no longer needed.
 */
As I was not sure how to handle that, I preferred the safety of not enabling
it. Since you both told me it should be safe, I've lifted that restriction
too.
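(For reference, the bookkeeping nodeAgg does for pass-by-ref datums coming out
of a datum tuplesort looks roughly like the sketch below; it is heavily
simplified from the loop in process_ordered_aggregate_single, and the variable
names (sortstate, newVal, oldVal, typByVal, ...) are illustrative rather than
the actual ones:)

while (tuplesort_getdatum(sortstate, true, &newVal, &newIsNull, NULL))
{
    /* ... use newVal, e.g. feed it to the transition function ... */

    /* free the previous pass-by-ref value once it is no longer needed */
    if (!oldIsNull && !typByVal)
        pfree(DatumGetPointer(oldVal));

    oldVal = newVal;
    oldIsNull = newIsNull;
}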
A few small code observations:
- In my view the addition of unlikely() in ExecSort is unlikely to be
of benefit because it's a single call for the entire node's execution
(not in the tuple loop).
Done.
- It seems clearer to change the "if (!node->is_single_val)" to flip
the true/false cases so we don't need the negation.
Agreed, done.
- I assume there are tests that likely already cover this case, but
it'd be worth verifying that.
Yes, many test cases cover that, but maybe it would be better to explicitly
check for it in some cases: do you think we could add a debug message that can
be checked for?
Finally, I believe the same optimization likely ought to be added to
nodeIncrementalSort. It's less likely the tests there are sufficient
for both this and the original case, but we'll see.
I will look into it, but isn't incremental sort used to sort tuples on several
keys, when they are already sorted on the first? In that case, I doubt we
would ever have a single-valued tuple here, unless there is a projection to
strip the extraneous attributes from the tuple.
--
Ronan Dunklau
Attachments:
v3-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patch (text/x-patch)
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..ff443d15a9 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
* which saves the results in a temporary file or memory. After the
* initial call, returns a tuple from the file with each call.
*
+ * The tuplesort can either occur on the whole tuple (this is the nominal
+ * case) or, when the input / output tuple consists of only one attribute,
+ * we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
* Conditions:
* -- none.
*
@@ -85,33 +89,59 @@ ExecSort(PlanState *pstate)
outerNode = outerPlanState(node);
tupDesc = ExecGetResultType(outerNode);
-
- tuplesortstate = tuplesort_begin_heap(tupDesc,
- plannode->numCols,
- plannode->sortColIdx,
- plannode->sortOperators,
- plannode->collations,
- plannode->nullsFirst,
- work_mem,
- NULL,
- node->randomAccess);
+ /*
+ * Switch to the tuplesort_*_datum interface when we have only one key,
+ * as it is much more efficient especially when the type is pass-by-value.
+ */
+ if (tupDesc->natts == 1)
+ {
+ node->is_single_val = true;
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ } else
+ tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plannode->numCols,
+ plannode->sortColIdx,
+ plannode->sortOperators,
+ plannode->collations,
+ plannode->nullsFirst,
+ work_mem,
+ NULL,
+ node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
/*
- * Scan the subplan and feed all the tuples to tuplesort.
+ * Scan the subplan and feed all the tuples to tuplesort,
+ * using either the _putdatum or _puttupleslot API as appropriate.
*/
-
- for (;;)
- {
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
- }
+ if (node->is_single_val)
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
+ else
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
/*
* Complete the sort.
@@ -150,9 +180,18 @@ ExecSort(PlanState *pstate)
* next fetch from the tuplesort.
*/
slot = node->ss.ps.ps_ResultTupleSlot;
- (void) tuplesort_gettupleslot(tuplesortstate,
- ScanDirectionIsForward(dir),
- false, slot, NULL);
+ if (node->is_single_val)
+ {
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ }
+ else
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+
return slot;
}
@@ -191,6 +230,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
sortstate->bounded = false;
sortstate->sort_Done = false;
sortstate->tuplesortstate = NULL;
+ sortstate->is_single_val = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 22972071ff..226a723167 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4773,6 +4773,13 @@ leader_takeover_tapes(Tuplesortstate *state)
static void
free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
{
- FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
- pfree(stup->tuple);
+ /* If the SortTuple is actually only a single Datum, which was not copied as
+ * it is a byval type, do not try to free it nor account for it in memory
+ * used.
+ */
+ if(stup->tuple)
+ {
+ FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ pfree(stup->tuple);
+ }
}
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0ec5509e7e..643f416c54 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
int64 bound_Done; /* value of bound we did the sort with */
void *tuplesortstate; /* private state of tuplesort.c */
bool am_worker; /* are we a worker? */
+ bool is_single_val; /* are we using the single value optimization ? */
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
On Tue, Jul 6, 2021 at 11:03 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Thank you for the review, I will address those shortly, but will answer some
questions in the meantime.

First, the changes are lacking any explanatory comments. Probably we
should follow how nodeAgg does this and add both comments to the
ExecSort function header as well as specific comments above the "if"
around the new tuplesort_begin_datum explaining the specific
conditions that are required for the optimization to be useful and
safe.

Done, since I lifted the restrictions following your questions, there isn't
much left to comment. (see below)

That leads to a question I had: I don't follow why bounded mode (when
using byval) needs to be excluded. Comments should be added if there's
a good reason (as noted above), but maybe it's a case we can handle
safely?

I had test failures when trying to move the Datum around when performing a
bounded sort, but did not look into it at first.
Now I've looked into it, and the switch to a heapsort when using bounded mode
just unconditionally tried to free a tuple that was never there to begin with.
So if the SortTuple does not contain an actual tuple, but only a single datum,
do not do that.

I've updated the patch to fix this and enable the optimization in the case of
bounded sort.
Awesome.
A second question: at first glance it's intuitively the case we might
not be able to handle byref values. But nodeAgg doesn't seem to have
that restriction. What's the difference here?

I think tuplesort_begin_datum doesn't have any such limitation, it
can handle any type of Datum so I think we don't need to consider
only attbyval, we can consider any type of attribute for this
optimization.

I've restricted the optimization to byval types because of the following
comment in nodeAgg.c:

/*
 * Note: if input type is pass-by-ref, the datums returned by the sort are
 * freshly palloc'd in the per-query context, so we must be careful to
 * pfree them when they are no longer needed.
 */

As I was not sure how to handle that, I preferred the safety of not enabling
it. Since you both told me it should be safe, I've lifted that restriction
too.
To be clear, I don't know for certain it's safe [without extra work],
but even if it involves some extra manual pfree'ing (a la nodeAgg)
it's probably worth it. Maybe someone else will weigh in on whether or
not anything special is required here to ensure we don't leak memory
(I haven't looked in detail yet).
A few small code observations:

- In my view the addition of unlikely() in ExecSort is unlikely to be
of benefit because it's a single call for the entire node's execution
(not in the tuple loop).

Done.

- It seems clearer to change the "if (!node->is_single_val)" to flip
the true/false cases so we don't need the negation.

Agreed, done.
Thanks
- I assume there are tests that likely already cover this case, but
it'd be worth verifying that.

Yes, many test cases cover that, but maybe it would be better to explicitly
check for it in some cases: do you think we could add a debug message that can
be checked for?
Mostly I think we should verify code coverage and _maybe_ add a
specific test or two that we know exercise this path. I don't know
that the debug message needs to be matched in the test (probably more
pain than it's worth), but the debug message ("enabling datum sort
optimization" or similar) might be good anyway.
I wonder if we need to change the costing of sorts for this case. I don't
like having to do so, but it's a significant change in speed, so it
probably should impact what plan gets chosen. Hopefully others will
weigh in on this also.
Finally, I believe the same optimization likely ought to be added to
nodeIncrementalSort. It's less likely the tests there are sufficient
for both this and the original case, but we'll see.

I will look into it, but isn't incremental sort used to sort tuples on several
keys, when they are already sorted on the first? In that case, I doubt we
would ever have a single-valued tuple here, unless there is a projection to
strip the extraneous attributes from the tuple.
Yes and no. When incremental sort has to do a full sort there will
always be at least 2 attributes. But in prefix sort mode (see
prefixsort_state) only non-presorted columns are sorted (i.e., if
given a,b already sorted by a, then only b is sorted). So the
prefixsort_state could use this optimization.
James
On Tuesday, July 6, 2021 at 5:37:53 PM CEST, James Coleman wrote:
Yes and no. When incremental sort has to do a full sort there will
always be at least 2 attributes. But in prefix sort mode (see
prefixsort_state) only non-presorted columns are sorted (i.e., if
given a,b already sorted by a, then only b is sorted). So the
prefixsort_state could use this optimization.
The optimization is not for when we actually sort on a single key, but for when
we get a single attribute in / out of the tuplesort. Since sorting always adds
resjunk entries for the keys being sorted on, I don't think we can ever end up
in a situation where the optimization would kick in, since the entries for the
already-performed-sort keys will need to be present in the output.
Maybe if instead of adding resjunk entries to the whole query's targetlist,
sort and incrementalsort nodes were able to do a projection from the input
(needed tle + resjunk sorting tle) to a tuple containing only the needed tle
on output before actually sorting it, it would be possible, but that would be
quite a big design change.
In the meantime I fixed some formatting issues, please find attached a new
patch.
--
Ronan Dunklau
Attachments:
v4-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patch (text/x-patch)
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..9d8b0a77da 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
* which saves the results in a temporary file or memory. After the
* initial call, returns a tuple from the file with each call.
*
+ * The tuplesort can either occur on the whole tuple (this is the nominal
+ * case) or, when the input / output tuple consists of only one attribute,
+ * we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
* Conditions:
* -- none.
*
@@ -86,32 +90,61 @@ ExecSort(PlanState *pstate)
outerNode = outerPlanState(node);
tupDesc = ExecGetResultType(outerNode);
- tuplesortstate = tuplesort_begin_heap(tupDesc,
- plannode->numCols,
- plannode->sortColIdx,
- plannode->sortOperators,
- plannode->collations,
- plannode->nullsFirst,
- work_mem,
- NULL,
- node->randomAccess);
+ /*
+ * Switch to the tuplesort_*_datum interface when we have only one
+ * key, as it is much more efficient especially when the type is
+ * pass-by-value.
+ */
+ if (tupDesc->natts == 1)
+ {
+ node->is_single_val = true;
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ }
+ else
+ tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plannode->numCols,
+ plannode->sortColIdx,
+ plannode->sortOperators,
+ plannode->collations,
+ plannode->nullsFirst,
+ work_mem,
+ NULL,
+ node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
/*
- * Scan the subplan and feed all the tuples to tuplesort.
+ * Scan the subplan and feed all the tuples to tuplesort, using either
+ * the _putdatum or _puttupleslot API as appropriate.
*/
-
- for (;;)
- {
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
- }
+ if (node->is_single_val)
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
+ else
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
/*
* Complete the sort.
@@ -150,9 +183,18 @@ ExecSort(PlanState *pstate)
* next fetch from the tuplesort.
*/
slot = node->ss.ps.ps_ResultTupleSlot;
- (void) tuplesort_gettupleslot(tuplesortstate,
- ScanDirectionIsForward(dir),
- false, slot, NULL);
+ if (node->is_single_val)
+ {
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ }
+ else
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+
return slot;
}
@@ -191,6 +233,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
sortstate->bounded = false;
sortstate->sort_Done = false;
sortstate->tuplesortstate = NULL;
+ sortstate->is_single_val = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 22972071ff..f785259da7 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4773,6 +4773,14 @@ leader_takeover_tapes(Tuplesortstate *state)
static void
free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
{
- FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
- pfree(stup->tuple);
+ /*
+ * If the SortTuple is actually only a single Datum, which was not copied
+ * as it is a byval type, do not try to free it nor account for it in
+ * memory used.
+ */
+ if (stup->tuple)
+ {
+ FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ pfree(stup->tuple);
+ }
}
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0ec5509e7e..643f416c54 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
int64 bound_Done; /* value of bound we did the sort with */
void *tuplesortstate; /* private state of tuplesort.c */
bool am_worker; /* are we a worker? */
+ bool is_single_val; /* are we using the single value optimization ? */
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
--
2.32.0
On Wed, 7 Jul 2021 at 21:32, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
In the meantime I fixed some formatting issues, please find attached a new
patch.
I started to look at this.
First I wondered how often we might be able to apply this
optimisation, so I ran make check after adding some elog(NOTICE) calls
to output which method is going to be used just before we do the
tuplesort_begin_* calls. It looks like there are 614 instances of
Datum sorts and 4223 of tuple sorts. That's about 14.5% datum sorts.
223 of the 614 are byval types and the other 391 are byref. Not that
the regression tests are a good reflection of the real world, but if
it were then that's quite a good number of cases to be able to
optimise.
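(The instrumentation described above could look roughly like the following,
placed just before the tuplesort_begin_* calls in ExecSort; this is only a
sketch of the idea, not the exact code that was used:)

if (tupDesc->natts == 1)
    elog(NOTICE, "datum sort, %s",
         TupleDescAttr(tupDesc, 0)->attbyval ? "byval" : "byref");
else
    elog(NOTICE, "tuple sort");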
As for the patch, just a few things:
1. Can you add the missing braces in this if condition and the else
condition that belongs to it.
+ if (node->is_single_val)
+ for (;;)
+ {
2. I think it would be nicer to name the new is_single_val field
"datumSort" instead. To me it seems more clear what it is for.
3. This seems to be a bug fix where byval datum sorts do not properly
handle bounded sorts. I think that maybe that should be fixed and
backpatched. I don't see anything that says Datum sorts can't be
bounded and if there were some restriction on that I'd expect
tuplesort_set_bound() to fail when the Tuplesortstate had been set up
with tuplesort_begin_datum().
static void
free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
{
- FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
- pfree(stup->tuple);
+ /*
+ * If the SortTuple is actually only a single Datum, which was not copied
+ * as it is a byval type, do not try to free it nor account for it in
+ * memory used.
+ */
+ if (stup->tuple)
+ {
+ FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ pfree(stup->tuple);
+ }
I can take this to another thread.
That's all I have for now.
David
On Monday, July 12, 2021 at 3:11:17 PM CEST, David Rowley wrote:
On Wed, 7 Jul 2021 at 21:32, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
In the meantime I fixed some formatting issues, please find attached a new
patch.

I started to look at this.
Thank you! I'm attaching a new version of the patch taking your remarks into
account.
First I wondered how often we might be able to apply this
optimisation, so I ran make check after adding some elog(NOTICE) calls
to output which method is going to be used just before we do the
tuplesort_begin_* calls. It looks like there are 614 instances of
Datum sorts and 4223 of tuple sorts. That's about 14.5% datum sorts.
223 of the 614 are byval types and the other 391 are byref. Not that
the regression tests are a good reflection of the real world, but if
it were then that's quite a good number of cases to be able to
optimise.
That's an interesting stat.
As for the patch, just a few things:
1. Can you add the missing braces in this if condition and the else
condition that belongs to it.

+ if (node->is_single_val)
+ for (;;)
+ {
Done.
2. I think it would nicer to name the new is_single_val field
"datumSort" instead. To me it seems more clear what it is for.
Done.
3. This seems to be a bug fix where byval datum sorts do not properly
handle bounded sorts. I think that maybe that should be fixed and
backpatched. I don't see anything that says Datum sorts can't be
bounded and if there were some restriction on that I'd expect
tuplesort_set_bound() to fail when the Tuplesortstate had been set up
with tuplesort_begin_datum().
I've kept this as-is for now, but I will remove it from my patch if it is
deemed worthy of back-patching in your other thread.
Regards,
--
Ronan Dunklau
Attachments:
v5-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patch (text/x-patch)
commit 91210952cf435995c9db6b539bf29f6fd744be70
Author: Ronan Dunklau <ronan.dunklau@aiven.io>
Date: Tue Jul 6 16:34:31 2021 +0200
Use optimized single-datum tuplesort in ExecSort
Previously, this optimization was only done in ExecAgg.
Doing this optimization on regular sort nodes provides for a nice
performance improvement, when the input / output tuple has only one
attribute.
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..df4e79c6ba 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
* which saves the results in a temporary file or memory. After the
* initial call, returns a tuple from the file with each call.
*
+ * The tuplesort can either occur on the whole tuple (this is the nominal
+ * case) or, when the input / output tuple consists of only one attribute,
+ * we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
* Conditions:
* -- none.
*
@@ -86,31 +90,64 @@ ExecSort(PlanState *pstate)
outerNode = outerPlanState(node);
tupDesc = ExecGetResultType(outerNode);
- tuplesortstate = tuplesort_begin_heap(tupDesc,
- plannode->numCols,
- plannode->sortColIdx,
- plannode->sortOperators,
- plannode->collations,
- plannode->nullsFirst,
- work_mem,
- NULL,
- node->randomAccess);
+ /*
+ * Switch to the tuplesort_*_datum interface when we have only one
+ * key, as it is much more efficient especially when the type is
+ * pass-by-value.
+ */
+ if (tupDesc->natts == 1)
+ {
+ node->datumSort = true;
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ }
+ else
+ tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plannode->numCols,
+ plannode->sortColIdx,
+ plannode->sortOperators,
+ plannode->collations,
+ plannode->nullsFirst,
+ work_mem,
+ NULL,
+ node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
/*
- * Scan the subplan and feed all the tuples to tuplesort.
+ * Scan the subplan and feed all the tuples to tuplesort, using either
+ * the _putdatum or _puttupleslot API as appropriate.
*/
-
- for (;;)
+ if (node->datumSort)
{
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
+ }
+ else
+ {
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
}
/*
@@ -150,9 +187,18 @@ ExecSort(PlanState *pstate)
* next fetch from the tuplesort.
*/
slot = node->ss.ps.ps_ResultTupleSlot;
- (void) tuplesort_gettupleslot(tuplesortstate,
- ScanDirectionIsForward(dir),
- false, slot, NULL);
+ if (node->datumSort)
+ {
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ }
+ else
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+
return slot;
}
@@ -191,6 +237,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
sortstate->bounded = false;
sortstate->sort_Done = false;
sortstate->tuplesortstate = NULL;
+ sortstate->datumSort = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 22972071ff..f785259da7 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4773,6 +4773,14 @@ leader_takeover_tapes(Tuplesortstate *state)
static void
free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
{
- FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
- pfree(stup->tuple);
+ /*
+ * If the SortTuple is actually only a single Datum, which was not copied
+ * as it is a byval type, do not try to free it nor account for it in
+ * memory used.
+ */
+ if (stup->tuple)
+ {
+ FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ pfree(stup->tuple);
+ }
}
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0ec5509e7e..89f944a9c4 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
int64 bound_Done; /* value of bound we did the sort with */
void *tuplesortstate; /* private state of tuplesort.c */
bool am_worker; /* are we a worker? */
+ bool datumSort; /* are we using the single value optimization ? */
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
On Tue, 13 Jul 2021 at 01:59, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
3. This seems to be a bug fix where byval datum sorts do not properly
handle bounded sorts. I think that maybe that should be fixed and
backpatched. I don't see anything that says Datum sorts can't be
bounded and if there were some restriction on that I'd expect
tuplesort_set_bound() to fail when the Tuplesortstate had been set up
with tuplesort_begin_datum().

I've kept this as-is for now, but I will remove it from my patch if it is
deemed worthy of back-patching in your other thread.
I've now pushed that bug fix so it's fine to remove the change to
tuplesort.c now.
I also did a round of benchmarking on this patch using the attached
script. Anyone wanting to run it will need to run make installcheck
first to create the required tables.
On an AMD machine, I got the following results.
Results are in transactions per second.
Test master v5 patch compare
Test1 446.1 657.3 147.32%
Test2 315.8 314.0 99.44%
Test3 302.3 392.1 129.67%
Test4 232.7 230.7 99.12%
Test5 230.0 446.1 194.00%
Test6 199.5 217.9 109.23%
Test7 188.7 185.3 98.21%
Test8 385.4 544.0 141.17%
Tests 2, 4, 7 are designed to check if there is any regression from
doing the additional run-time checks to see if we're doing datumSort.
I measured a very small penalty from this. It's most visible in test7
with a drop of about 1.8%. Each test did OFFSET 1000000 as I didn't
want to measure the overhead of outputting tuples.
All the other tests show a pretty good gain. Test6 is testing a byref
type, so it appears the gains are not just from byval datums.
It would be good to see the benchmark script run on a few other
machines to get an idea if the gains and losses are consistent.
In theory, we likely could get rid of the small regression by having
two versions of ExecSort() and setting the correct one during
ExecInitSort() by setting the function pointer to the version we want
to use in sortstate->ss.ps.ExecProcNode. But maybe the small
regression is not worth going to that trouble over. I'm not aware of
any other executor nodes that have logic like that so maybe it would
be a bad idea to introduce something like that.
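(A rough sketch of that idea, purely hypothetical and not part of any posted
patch; the names ExecSortTuple and ExecSortDatum are made up, and it assumes
the single-column check can be made in ExecInitSort once the outer node has
been initialized:)

static TupleTableSlot *ExecSortTuple(PlanState *pstate);   /* tuplesort_*_tupleslot path */
static TupleTableSlot *ExecSortDatum(PlanState *pstate);   /* tuplesort_*_datum path */

/* in ExecInitSort(), after the outer plan has been initialized */
outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags);

/* pick the ExecProcNode implementation once, instead of branching on every call */
if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
    sortstate->ss.ps.ExecProcNode = ExecSortDatum;
else
    sortstate->ss.ps.ExecProcNode = ExecSortTuple;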
David
I've now pushed that bug fix so it's fine to remove the change to
tuplesort.c now.
Thanks, I've rebased the patch, please find attached the v6.
I also did a round of benchmarking on this patch using the attached
script. Anyone wanting to run it will need to run make installcheck
first to create the required tables.
I've run your benchmark, keeping the best of three runs each time.
This is an Intel laptop, and with many other things running on it there is a
lot of noise...
Both the standard and patched runs come from a compilation with gcc -O2. No
changes have been made to the default settings.
Query # Master Patched Variation
1 884 1627 184.05%
2 364 375 103.02%
3 568 783 137.85%
4 296 297 100.34%
5 421 484 114.96%
6 359 408 113.65%
7 237 251 105.91%
8 806 1271 157.69%
Since I didn't reproduce your slowdown at all on the first run, I tried to
rerun the benchmark several times and for the "dubious cases" (2, 4 and 7),
the results are too jittery to conclude one way or another in my case. I
don't have access to proper hardware, so I'm not sure whether it would be
useful to just run the bench for thousands of xacts instead. I would be
surprised the check adds that much to the whole execution though.
I attach a graph similar to yours for reference.
--
Ronan Dunklau
Attachments:
v6-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patch (text/x-patch)
commit 01885709d1718710db1665bf2b3f39d83c720147
Author: Ronan Dunklau <ronan.dunklau@aiven.io>
Date: Tue Jul 6 16:34:31 2021 +0200
Use optimized single-datum tuplesort in ExecSort
Previously, this optimization was only done in ExecAgg.
Doing this optimization on regular sort nodes provides for a nice
performance improvement, when the input / output tuple has only one
attribute.
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..df4e79c6ba 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
* which saves the results in a temporary file or memory. After the
* initial call, returns a tuple from the file with each call.
*
+ * The tuplesort can either occur on the whole tuple (this is the nominal
+ * case) or, when the input / output tuple consists of only one attribute,
+ * we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
* Conditions:
* -- none.
*
@@ -86,31 +90,64 @@ ExecSort(PlanState *pstate)
outerNode = outerPlanState(node);
tupDesc = ExecGetResultType(outerNode);
- tuplesortstate = tuplesort_begin_heap(tupDesc,
- plannode->numCols,
- plannode->sortColIdx,
- plannode->sortOperators,
- plannode->collations,
- plannode->nullsFirst,
- work_mem,
- NULL,
- node->randomAccess);
+ /*
+ * Switch to the tuplesort_*_datum interface when we have only one
+ * key, as it is much more efficient especially when the type is
+ * pass-by-value.
+ */
+ if (tupDesc->natts == 1)
+ {
+ node->datumSort = true;
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ }
+ else
+ tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plannode->numCols,
+ plannode->sortColIdx,
+ plannode->sortOperators,
+ plannode->collations,
+ plannode->nullsFirst,
+ work_mem,
+ NULL,
+ node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
/*
- * Scan the subplan and feed all the tuples to tuplesort.
+ * Scan the subplan and feed all the tuples to tuplesort, using either
+ * the _putdatum or _puttupleslot API as appropriate.
*/
-
- for (;;)
+ if (node->datumSort)
{
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
+ }
+ else
+ {
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
}
/*
@@ -150,9 +187,18 @@ ExecSort(PlanState *pstate)
* next fetch from the tuplesort.
*/
slot = node->ss.ps.ps_ResultTupleSlot;
- (void) tuplesort_gettupleslot(tuplesortstate,
- ScanDirectionIsForward(dir),
- false, slot, NULL);
+ if (node->datumSort)
+ {
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ }
+ else
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+
return slot;
}
@@ -191,6 +237,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
sortstate->bounded = false;
sortstate->sort_Done = false;
sortstate->tuplesortstate = NULL;
+ sortstate->datumSort = false;
/*
* Miscellaneous initialization
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0ec5509e7e..89f944a9c4 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
int64 bound_Done; /* value of bound we did the sort with */
void *tuplesortstate; /* private state of tuplesort.c */
bool am_worker; /* are we a worker? */
+ bool datumSort; /* are we using the single value optimization ? */
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
On Tue, Jul 13, 2021 at 4:19 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
I would be
surprised the check adds that much to the whole execution though.
I think this branch is a misprediction.
In most cases is it not datumSort?
That's why I would like to use unlikely.
IMO all the tests should all be to verify past behavior first.
regards,
Ranier Vilela
On Wed, 14 Jul 2021 at 00:06, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Tue, Jul 13, 2021 at 4:19 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
I would be
surprised the check adds that much to the whole execution though.

I think this branch is a misprediction.
It could be. I wondered that myself when I saw Ronan's results were
better than mine for 2,4 and 7. However, I think Ronan had quite a
bit of noise in his results as there's no reason for the speedup in
tests 2,4 and 7.
In most cases is it not datumSort?
who knows. Maybe someone's workload always requires the datum sort.
That's why I would like to use unlikely.
We really only use unlikely() in cases where we want to move code out
of line to a cold area because it's really never executed under normal
circumstances. We tend to do that for ERROR cases as we don't ever
really want to optimise for errors. We also sometimes do it when some
function has a branch to initialise something during the first call.
The case in question here does not fit for either of those two cases.
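(For illustration, the two usual patterns are something like the fragments
below; these are hypothetical examples, not code from the patch, and "state"
and initialize_state() are made-up names:)

/* an error path we never want to optimise for */
if (unlikely(ptr == NULL))
    elog(ERROR, "unexpected NULL pointer");

/* one-time initialisation on the first call through a function */
if (unlikely(!state->initialized))
{
    initialize_state(state);    /* hypothetical helper */
    state->initialized = true;
}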
IMO all the tests should all be to verify past behavior first.
I'm not quite sure what you mean there.
David
On Tue, Jul 13, 2021 at 9:24 AM David Rowley <dgrowleyml@gmail.com> wrote:
Hum, I understand the usage cases now.
Thanks for the hint.
IMO all the tests should all be to verify past behavior first.
I'm not quite sure what you mean there.
I'm saying we could help the branch by keeping the same testing logic as
before and not reversing it.
Attached is a version to demonstrate this; I don't pretend it to be v7.
I couldn't find a good phrase for the opposite meaning:
"are we *not* using the single value optimization?"
I don't have time to run the tests right now.
regards,
Ranier Vilela
Attachments:
allow_single_datum_sort.patch (application/octet-stream)
commit 01885709d1718710db1665bf2b3f39d83c720147
Author: Ronan Dunklau <ronan.dunklau@aiven.io>
Date: Tue Jul 6 16:34:31 2021 +0200
Use optimized single-datum tuplesort in ExecSort
Previously, this optimization was only done in ExecAgg.
Doing this optimization on regular sort nodes provides for a nice
performance improvement, when the input / output tuple has only one
attribute.
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..df4e79c6ba 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
* which saves the results in a temporary file or memory. After the
* initial call, returns a tuple from the file with each call.
*
+ * The tuplesort can either occur on the whole tuple (this is the nominal
+ * case) or, when the input / output tuple consists of only one attribute,
+ * we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
* Conditions:
* -- none.
*
@@ -86,31 +90,64 @@ ExecSort(PlanState *pstate)
outerNode = outerPlanState(node);
tupDesc = ExecGetResultType(outerNode);
- tuplesortstate = tuplesort_begin_heap(tupDesc,
- plannode->numCols,
- plannode->sortColIdx,
- plannode->sortOperators,
- plannode->collations,
- plannode->nullsFirst,
- work_mem,
- NULL,
- node->randomAccess);
+ /*
+ * Switch to the tuplesort_*_datum interface when we have only one
+ * key, as it is much more efficient especially when the type is
+ * pass-by-value.
+ */
+ if (tupDesc->natts != 1)
+ tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plannode->numCols,
+ plannode->sortColIdx,
+ plannode->sortOperators,
+ plannode->collations,
+ plannode->nullsFirst,
+ work_mem,
+ NULL,
+ node->randomAccess);
+ else
+ {
+ node->tupleSort = false;
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ }
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
/*
- * Scan the subplan and feed all the tuples to tuplesort.
+ * Scan the subplan and feed all the tuples to tuplesort, using either
+ * the _putdatum or _puttupleslot API as appropriate.
*/
-
- for (;;)
+ if (node->tupleSort)
{
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
+ }
+ else
+ {
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
}
/*
@@ -150,9 +187,18 @@ ExecSort(PlanState *pstate)
* next fetch from the tuplesort.
*/
slot = node->ss.ps.ps_ResultTupleSlot;
- (void) tuplesort_gettupleslot(tuplesortstate,
- ScanDirectionIsForward(dir),
- false, slot, NULL);
+ if (node->tupleSort)
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+ else
+ {
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ }
+
return slot;
}
@@ -191,6 +237,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
sortstate->bounded = false;
sortstate->sort_Done = false;
sortstate->tuplesortstate = NULL;
+ sortstate->tupleSort = true;
/*
* Miscellaneous initialization
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0ec5509e7e..89f944a9c4 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
int64 bound_Done; /* value of bound we did the sort with */
void *tuplesortstate; /* private state of tuplesort.c */
bool am_worker; /* are we a worker? */
+ bool tupleSort; /* are we not using the single value optimization ? */
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
On Tue, Jul 13, 2021 at 9:44 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
Em ter., 13 de jul. de 2021 às 09:24, David Rowley <dgrowleyml@gmail.com>
escreveu:On Wed, 14 Jul 2021 at 00:06, Ranier Vilela <ranier.vf@gmail.com> wrote:
Em ter., 13 de jul. de 2021 às 04:19, Ronan Dunklau <
ronan.dunklau@aiven.io> escreveu:
I would be
surprised the check adds that much to the whole execution though.I think this branch is a misprediction.
It could be. I wondered that myself when I saw Ronan's results were
better than mine for 2,4 and 7. However, I think Ronan had quite a
bit of noise in his results as there's no reason for the speedup in
tests 2,4 and 7.
In most cases is it not datumSort?
who knows. Maybe someone's workload always requires the datum sort.
That's why I would like to use unlikely.
We really only use unlikely() in cases where we want to move code out
of line to a cold area because it's really never executed under normal
circumstances. We tend to do that for ERROR cases as we don't ever
really want to optimise for errors. We also sometimes do it when some
function has a branch to initialise something during the first call.
The case in question here does not fit either of those two cases.
Hum, I understand the usage cases now.
Thanks for the hint.
IMO all the tests should be made to verify the past behavior first.
I'm not quite sure what you mean there.
I'm saying we could help the branch by keeping the same testing logic as
before and not reversing it.
Attached is a version to demonstrate this; I don't pretend it to be v7.
I couldn't find a good phrase to the contrary:
"are we *not* using the single value optimization ?"
I don't have time to run the tests right now.
Finally I had time to benchmark (David's benchsort.sh)
ubuntu 64 bits (20.04) 8gb ram SSD 256GB.
Table with the best results of each.
Test   HEAD   v6   v7   v7b   v6 vs master   v7 vs v6   v7b vs v6
Test1 288,149636 449,018541 469,757169 550,48505 155,83% 104,62% 122,60%
Test2 94,766955 95,451406 94,556249 94,718982 100,72% 99,06% 99,23%
Test3 190,521319 260,279802 259,597067 278,115296 136,61% 99,74% 106,85%
Test4 78,779344 78,253455 78,114068 77,941482 99,33% 99,82% 99,60%
Test5 131,362614 142,662223 136,436347 149,639041 108,60% 95,64% 104,89%
Test6 112,884298 124,181671 115,528328 127,58497 110,01% 93,03% 102,74%
Test7 69,308587 68,643067 66,10195 69,087544 99,04% 96,30% 100,65%
Test8 243,674171 364,681142 371,928453 419,259703 149,66% 101,99% 114,97%
I have no idea why v7 did worse on test6.
v6 slowed down on test4 and test7.
v7b slowdown with test2 and test4, in relation with v7.
If field struct datumSort is not absolutely necessary, I think that v7 will
be better.
Attached are the patches and the results file.
Attachments:
v7-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchtext/x-patch; charset=US-ASCII; name=v7-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchDownload
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..e1adecafaa 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
* which saves the results in a temporary file or memory. After the
* initial call, returns a tuple from the file with each call.
*
+ * The tuplesort can either occur on the whole tuple (this is the nominal
+ * case) or, when the input / output tuple consists of only one attribute,
+ * we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
* Conditions:
* -- none.
*
@@ -86,7 +90,13 @@ ExecSort(PlanState *pstate)
outerNode = outerPlanState(node);
tupDesc = ExecGetResultType(outerNode);
- tuplesortstate = tuplesort_begin_heap(tupDesc,
+ /*
+ * Switch to the tuplesort_*_datum interface when we have only one
+ * key, as it is much more efficient especially when the type is
+ * pass-by-value.
+ */
+ if (tupDesc->natts != 1)
+ tuplesortstate = tuplesort_begin_heap(tupDesc,
plannode->numCols,
plannode->sortColIdx,
plannode->sortOperators,
@@ -95,23 +105,49 @@ ExecSort(PlanState *pstate)
work_mem,
NULL,
node->randomAccess);
+ else
+ {
+ node->tupleSort = false;
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ }
+
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
/*
- * Scan the subplan and feed all the tuples to tuplesort.
++ * Scan the subplan and feed all the tuples to tuplesort, using either
++ * the _putdatum or _puttupleslot API as appropriate.
*/
- for (;;)
- {
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
- }
+ if (node->tupleSort)
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
+ else
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
/*
* Complete the sort.
@@ -150,9 +186,18 @@ ExecSort(PlanState *pstate)
* next fetch from the tuplesort.
*/
slot = node->ss.ps.ps_ResultTupleSlot;
- (void) tuplesort_gettupleslot(tuplesortstate,
+ if (node->tupleSort)
+ (void) tuplesort_gettupleslot(tuplesortstate,
ScanDirectionIsForward(dir),
false, slot, NULL);
+ else
+ {
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ }
+
return slot;
}
@@ -191,6 +236,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
sortstate->bounded = false;
sortstate->sort_Done = false;
sortstate->tuplesortstate = NULL;
+ sortstate->tupleSort = true;
/*
* Miscellaneous initialization
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0ec5509e7e..4c41ab80b1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
int64 bound_Done; /* value of bound we did the sort with */
void *tuplesortstate; /* private state of tuplesort.c */
bool am_worker; /* are we a worker? */
+ bool tupleSort; /* are we *not* using the single value optimization ? */
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
v7b-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchtext/x-patch; charset=US-ASCII; name=v7b-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchDownload
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..7be852273e 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
* which saves the results in a temporary file or memory. After the
* initial call, returns a tuple from the file with each call.
*
+ * The tuplesort can either occur on the whole tuple (this is the nominal
+ * case) or, when the input / output tuple consists of only one attribute,
+ * we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
* Conditions:
* -- none.
*
@@ -44,6 +48,7 @@ ExecSort(PlanState *pstate)
ScanDirection dir;
Tuplesortstate *tuplesortstate;
TupleTableSlot *slot;
+ bool tupleSort = true;
CHECK_FOR_INTERRUPTS();
@@ -86,7 +91,13 @@ ExecSort(PlanState *pstate)
outerNode = outerPlanState(node);
tupDesc = ExecGetResultType(outerNode);
- tuplesortstate = tuplesort_begin_heap(tupDesc,
+ /*
+ * Switch to the tuplesort_*_datum interface when we have only one
+ * key, as it is much more efficient especially when the type is
+ * pass-by-value.
+ */
+ if (tupDesc->natts != 1)
+ tuplesortstate = tuplesort_begin_heap(tupDesc,
plannode->numCols,
plannode->sortColIdx,
plannode->sortOperators,
@@ -95,23 +106,49 @@ ExecSort(PlanState *pstate)
work_mem,
NULL,
node->randomAccess);
+ else
+ {
+ tupleSort = false;
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ }
+
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
/*
- * Scan the subplan and feed all the tuples to tuplesort.
++ * Scan the subplan and feed all the tuples to tuplesort, using either
++ * the _putdatum or _puttupleslot API as appropriate.
*/
- for (;;)
- {
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
- }
+ if (tupleSort)
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
+ else
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
/*
* Complete the sort.
@@ -150,9 +187,18 @@ ExecSort(PlanState *pstate)
* next fetch from the tuplesort.
*/
slot = node->ss.ps.ps_ResultTupleSlot;
- (void) tuplesort_gettupleslot(tuplesortstate,
+ if (tupleSort)
+ (void) tuplesort_gettupleslot(tuplesortstate,
ScanDirectionIsForward(dir),
false, slot, NULL);
+ else
+ {
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ }
+
return slot;
}
On Tue, 13 Jul 2021 at 14:42, Ranier Vilela <ranier.vf@gmail.com> wrote:
[... previous message and results table quoted in full ...]
v7b slowdown with test2 and test4, in relation with v7.
v7b slowdown with test2 and test4, in relation with *v6*.
If field struct datumSort is not absolutely necessary, I think that v7
will be better.
*v7b* will be better.
Sorry for the noise.
regards,
Ranier Vilela
On Tue, 13 Jul 2021 at 15:15, David Rowley <dgrowleyml@gmail.com> wrote:
In theory, we likely could get rid of the small regression by having
two versions of ExecSort() and setting the correct one during
ExecInitSort() by setting the function pointer to the version we want
to use in sortstate->ss.ps.ExecProcNode.
Just to see how it would perform, I tried what I mentioned above. I've
included what I ended up with in the attached POC patch.
I got the following results on my AMD hardware.
Test master v8 patch comparison
Test1 448.0 671.7 149.9%
Test2 316.4 317.5 100.3%
Test3 299.5 381.6 127.4%
Test4 219.7 229.5 104.5%
Test5 226.3 254.6 112.5%
Test6 197.9 217.9 110.1%
Test7 179.2 185.3 103.4%
Test8 389.2 544.8 140.0%
This time I saw no regression on tests 2, 4 and 7.
I looked to see if there was anywhere else in the executor that
conditionally uses a different exec function in this way and found
nothing, so I'm not too sure if it's a good idea to start doing this.
It would be good to get a 2nd opinion about this idea. Also, more
benchmark results with v6 and v8 would be good too.
David
Attachments:
v8_poc_datum_tuple_sort.patchapplication/octet-stream; name=v8_poc_datum_tuple_sort.patchDownload
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..f65755d019 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,16 @@
* which saves the results in a temporary file or memory. After the
* initial call, returns a tuple from the file with each call.
*
+ * There are two distinct ways that we perform this sort:
+ *
+ * 1) When the result is a single column we perform a Datum sort.
+ *
+ * 2) When the result contains multiple columns we perform a tuple sort.
+ *
+ * We could do this by always performing a tuple sort, however sorting
+ * Datums only can be significantly faster than sorting tuples,
+ * especially when the Datums are of a byval type.
+ *
* Conditions:
* -- none.
*
@@ -37,7 +47,7 @@
* ----------------------------------------------------------------
*/
static TupleTableSlot *
-ExecSort(PlanState *pstate)
+ExecSortTuple(PlanState *pstate)
{
SortState *node = castNode(SortState, pstate);
EState *estate;
@@ -50,7 +60,7 @@ ExecSort(PlanState *pstate)
/*
* get state info from node
*/
- SO1_printf("ExecSort: %s\n",
+ SO1_printf("ExecSortTuple: %s\n",
"entering routine");
estate = node->ss.ps.state;
@@ -68,7 +78,7 @@ ExecSort(PlanState *pstate)
PlanState *outerNode;
TupleDesc tupDesc;
- SO1_printf("ExecSort: %s\n",
+ SO1_printf("ExecSortTuple: %s\n",
"sorting subplan");
/*
@@ -80,7 +90,7 @@ ExecSort(PlanState *pstate)
/*
* Initialize tuplesort module.
*/
- SO1_printf("ExecSort: %s\n",
+ SO1_printf("ExecSortTuple: %s\n",
"calling tuplesort_begin");
outerNode = outerPlanState(node);
@@ -138,10 +148,10 @@ ExecSort(PlanState *pstate)
si = &node->shared_info->sinstrument[ParallelWorkerNumber];
tuplesort_get_stats(tuplesortstate, si);
}
- SO1_printf("ExecSort: %s\n", "sorting done");
+ SO1_printf("ExecSortTuple: %s\n", "sorting done");
}
- SO1_printf("ExecSort: %s\n",
+ SO1_printf("ExecSortTuple: %s\n",
"retrieving tuple from tuplesort");
/*
@@ -156,6 +166,127 @@ ExecSort(PlanState *pstate)
return slot;
}
+static TupleTableSlot *
+ExecSortDatum(PlanState *pstate)
+{
+ SortState *node = castNode(SortState, pstate);
+ EState *estate;
+ ScanDirection dir;
+ Tuplesortstate *tuplesortstate;
+ TupleTableSlot *slot;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * get state info from node
+ */
+ SO1_printf("ExecSort: %s\n",
+ "entering routine");
+
+ estate = node->ss.ps.state;
+ dir = estate->es_direction;
+ tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+
+ /*
+ * If first time through, read all tuples from outer plan and pass them to
+ * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
+ */
+
+ if (!node->sort_Done)
+ {
+ Sort *plannode = (Sort *) node->ss.ps.plan;
+ PlanState *outerNode;
+ TupleDesc tupDesc;
+
+ SO1_printf("ExecSortDatum: %s\n",
+ "sorting subplan");
+
+ /*
+ * Want to scan subplan in the forward direction while creating the
+ * sorted data.
+ */
+ estate->es_direction = ForwardScanDirection;
+
+ /*
+ * Initialize tuplesort module.
+ */
+ SO1_printf("ExecSortDatum: %s\n",
+ "calling tuplesort_begin");
+
+ outerNode = outerPlanState(node);
+ tupDesc = ExecGetResultType(outerNode);
+
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ if (node->bounded)
+ tuplesort_set_bound(tuplesortstate, node->bound);
+ node->tuplesortstate = (void *) tuplesortstate;
+
+ /*
+ * Scan the subplan and feed all the tuples to tuplesort.
+ */
+
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
+
+ /*
+ * Complete the sort.
+ */
+ tuplesort_performsort(tuplesortstate);
+
+ /*
+ * restore to user specified direction
+ */
+ estate->es_direction = dir;
+
+ /*
+ * finally set the sorted flag to true
+ */
+ node->sort_Done = true;
+ node->bounded_Done = node->bounded;
+ node->bound_Done = node->bound;
+ if (node->shared_info && node->am_worker)
+ {
+ TuplesortInstrumentation *si;
+
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+ si = &node->shared_info->sinstrument[ParallelWorkerNumber];
+ tuplesort_get_stats(tuplesortstate, si);
+ }
+ SO1_printf("ExecSortDatum: %s\n", "sorting done");
+ }
+
+ SO1_printf("ExecSortDatum: %s\n",
+ "retrieving datum from tuplesort");
+
+ /*
+ * Get the first or next datum from tuplesort. We clear the slot before
+ * doing this. This means when we run out of datums to get that the
+ * function returns false and the slot is left empty.
+ */
+ slot = node->ss.ps.ps_ResultTupleSlot;
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ return slot;
+}
+
/* ----------------------------------------------------------------
* ExecInitSort
*
@@ -177,7 +308,6 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
sortstate = makeNode(SortState);
sortstate->ss.ps.plan = (Plan *) node;
sortstate->ss.ps.state = estate;
- sortstate->ss.ps.ExecProcNode = ExecSort;
/*
* We must have random access to the sort output to do backward scan or
@@ -221,6 +351,15 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
ExecInitResultTupleSlotTL(&sortstate->ss.ps, &TTSOpsMinimalTuple);
sortstate->ss.ps.ps_ProjInfo = NULL;
+ /*
+ * We perform a Datum sort when we're sorting just a single column,
+ * otherwise we perform a tuple sort.
+ */
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
+ sortstate->ss.ps.ExecProcNode = ExecSortDatum;
+ else
+ sortstate->ss.ps.ExecProcNode = ExecSortTuple;
+
SO1_printf("ExecInitSort: %s\n",
"sort node initialized");
On Wed, 14 Jul 2021 at 07:14, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 13 Jul 2021 at 15:15, David Rowley <dgrowleyml@gmail.com> wrote:
In theory, we likely could get rid of the small regression by having
two versions of ExecSort() and setting the correct one during
ExecInitSort() by setting the function pointer to the version we want
to use in sortstate->ss.ps.ExecProcNode.
Just to see how it would perform, I tried what I mentioned above. I've
included what I ended up with in the attached POC patch.
I got the following results on my AMD hardware.
Test master v8 patch comparison
Test1 448.0 671.7 149.9%
Test2 316.4 317.5 100.3%
Test3 299.5 381.6 127.4%
Test4 219.7 229.5 104.5%
Test5 226.3 254.6 112.5%
Test6 197.9 217.9 110.1%
Test7 179.2 185.3 103.4%
Test8 389.2 544.8 140.0%
I'm a little surprised by your results.
Test1 and Test8 look pretty good to me.
What are the compiler and environment?
I repeated (3 times) the benchmark with v8 here,
and the results were not good.
Test   HEAD   v6   v7b   v8   v6 vs HEAD   v8 vs v6   v8 vs v7b
Test1 288,149636 449,018541 550,48505 468,168165 155,83% 104,26% 85,05%
Test2 94,766955 95,451406 94,718982 94,800275 100,72% 99,32% 100,09%
Test3 190,521319 260,279802 278,115296 262,538383 136,61% 100,87% 94,40%
Test4 78,779344 78,253455 77,941482 78,471546 99,33% 100,28% 100,68%
Test5 131,362614 142,662223 149,639041 144,849303 108,60% 101,53% 96,80%
Test6 112,884298 124,181671 127,58497 124,29376 110,01% 100,09% 97,42%
Test7 69,308587 68,643067 69,087544 69,437312 99,04% 101,16% 100,51%
Test8 243,674171 364,681142 419,259703 369,239176 149,66% 101,25% 88,07%
This time I saw no regression on tests 2, 4 and 7.
I looked to see if there was anywhere else in the executor that
conditionally uses a different exec function in this way and found
nothing, so I'm not too sure if it's a good idea to start doing this.
Specialized functions can be a way to optimize; compilers themselves do it.
But ExecSortTuple and ExecSortDatum are very similar to each other,
which can cause maintenance problems.
I don't think it would be a good idea in this case.
It would be good to get a 2nd opinion about this idea. Also, more
benchmark results with v6 and v8 would be good too.
Yeah, another different machine.
I would like to see other results with v7b.
Attached the file with all results from v8.
regards,
Ranier Vilela
Attachments:
On Wed, Jul 14, 2021 at 6:14 AM David Rowley <dgrowleyml@gmail.com> wrote:
It would be good to get a 2nd opinion about this idea. Also, more
benchmark results with v6 and v8 would be good too.
I tested this on an older Xeon, gcc 8.4 (here median of each test, full
results attached):
test HEAD v6 v8
Test1 588 1007 998
Test2 198 202 197
Test3 374 516 512
Test4 172 165 166
Test5 255 279 283
Test6 227 251 251
Test7 145 147 146
Test8 474 783 770
Test4 could be a regression, but 2 and 7 look fine here.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
On Thu, 15 Jul 2021 at 05:55, Ranier Vilela <ranier.vf@gmail.com> wrote:
I repeated (3 times) the benchmark with v8 here,
and the results were not good.
Do you have any good theories on why the additional branching that's
done in v7b vs v8 might cause it to run faster?
David
On Wed, 14 Jul 2021 at 20:43, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 15 Jul 2021 at 05:55, Ranier Vilela <ranier.vf@gmail.com> wrote:
I repeated (3 times) the benchmark with v8 here,
and the results were not good.
Do you have any good theories on why the additional branching that's
done in v7b vs v8 might cause it to run faster?
Branch prediction works best when the *more* probable path is taken;
otherwise a penalty occurs and the CPU must discard the speculated work.
In this case it seems to me that, most of the time, the tuple sort is the path taken.
So if the *datumSort* case is tested first and the *prediction* fails,
the CPU has more work to do to unwind the wrong path.
To help the branch predictor, test the more probable case first, everywhere:
if, switch, etc.
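As a small self-contained toy illustration of that ordering idea (hypothetical names, not code from any of the patches in this thread; whether the ordering actually matters here is exactly what is debated below):

#include <stdbool.h>

/* Hypothetical sketch: keep the case expected to be the common one in the
 * first arm of the branch, so the frequently taken path is the
 * fall-through one. */
static long
feed_rows(bool tuple_sort, long nrows)
{
	long	tuples = 0;
	long	datums = 0;

	for (long i = 0; i < nrows; i++)
	{
		if (tuple_sort)			/* assumed common case, tested first */
			tuples++;
		else					/* assumed rare case */
			datums++;
	}
	return tuples + datums;
}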
Another gain is the local variable tupleSort, which is cheaper to test
than dereferencing the node each time.
regards,
Ranier Vilela
On Thu, 15 Jul 2021 at 12:10, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Wed, 14 Jul 2021 at 20:43, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 15 Jul 2021 at 05:55, Ranier Vilela <ranier.vf@gmail.com> wrote:
I repeated (3 times) the benchmark with v8 here,
and the results were not good.
Do you have any good theories on why the additional branching that's
done in v7b vs v8 might cause it to run faster?
Branch Predictions works with *more* probable path,
otherwise a penalty occurs and the cpu must revert the results.
But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.
It seems much more likely to me that the results were just noisy. It
would be good to see if you can recreate them consistently.
David
On Wed, 14 Jul 2021 at 21:21, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 15 Jul 2021 at 12:10, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Wed, 14 Jul 2021 at 20:43, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 15 Jul 2021 at 05:55, Ranier Vilela <ranier.vf@gmail.com> wrote:
I repeated (3 times) the benchmark with v8 here,
and the results were not good.
Do you have any good theories on why the additional branching that's
done in v7b vs v8 might cause it to run faster?
Branch Predictions works with *more* probable path,
otherwise a penalty occurs and the cpu must revert the results.
But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.
In v8 the branch occurs at:
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
datumSort is tested first.
CPU time is the more expensive resource.
A branch is always executed; if it is the right path, we win,
otherwise a time penalty occurs.
It seems much more likely to me that the results were just noisy. It
would be good to see if you can recreate them consistently.
I do.
Can you please share results with v7b?
regards,
Ranier Vilela
On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Wed, 14 Jul 2021 at 21:21, David Rowley <dgrowleyml@gmail.com> wrote:
But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.
In v8 the branch occurs at:
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
You do know that branch is in a function that's only executed once
during executor initialization, right?
David
On Wed, 14 Jul 2021 at 22:22, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Wed, 14 Jul 2021 at 21:21, David Rowley <dgrowleyml@gmail.com> wrote:
But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.
In v8 the branch occurs at:
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
You do know that branch is in a function that's only executed once
during executor initialization, right?
The branch prediction should work better.
I have no idea why it works worse.
I redid all tests:
notebook 8GB RAM 256GB SSD
ubuntu 64 bits (20.04)
clang-12
high-power mode (charger plugged in)
default configuration (no settings changed)
Test   HEAD   v6   v7b   v8   v6 vs HEAD   v7b vs v6   v8 vs v7b
Test1 576,868013 940,947236 1090,253859 1016,0443 163,11% 115,87% 93,19%
Test2 184,748363 177,6254 177,346229 178,230258 96,14% 99,84% 100,50%
Test3 410,030055 541,889704 605,843924 534,946166 132,16% 111,80% 88,30%
Test4 153,331752 147,98418 148,010894 147,771155 96,51% 100,02% 99,84%
Test5 268,97555 301,979647 316,928492 300,94932 112,27% 104,95% 94,96%
Test6 234,910125 259,71483 269,851427 260,567637 110,56% 103,90% 96,56%
Test7 142,704153 136,09163 136,802695 136,935709 95,37% 100,52% 100,10%
Test8 498,634855 763,482151 867,350046 804,833884 153,11% 113,60% 92,79%
The values are higher here because the tests are now run with the CPU at full
power for all patches!
I think that more testing is needed with v7b and v8.
Anyway, having two almost identical functions (ExecSortTuple and ExecSortDatum)
is maybe not a good idea.
The results file is attached.
regards,
Ranier Vilela
Attachments:
On Thursday, 15 July 2021 01:30:26 CEST John Naylor wrote:
On Wed, Jul 14, 2021 at 6:14 AM David Rowley <dgrowleyml@gmail.com> wrote:
It would be good to get a 2nd opinion about this idea. Also, more
benchmark results with v6 and v8 would be good too.
Hello,
Thank you for trying this approach in v8, David!
I've decided to test on more "stable" hardware, an EC-2 medium instance,
compiling with Debian's gcc 8.3. That's still not ideal but a lot better than
a laptop.
To gather more meaningful results, I ran every pgbench for 30s instead of the
10 in the initial script provided by David. I ran the full script once for
HEAD, v6, v8, then a second time for HEAD, v6, v8 to try to eliminate noise
that could happen for 90 consecutive seconds, and took for each of those the
median of the 6 runs. It's much less noisy than my previous runs but still
not as stable as I'd like.
The results are attached in graph form, as well as the raw data if someone
wants it.
As a conclusion, I don't think it's worth it to introduce a separate
execprocnode function for that case. It is likely the minor differences still
observed can be explained by noise, as they fluctuate if you compare the min,
max, average or median values from the results.
Best regards,
--
Ronan Dunklau
On Thu, 15 Jul 2021 at 07:18, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
On Thursday, 15 July 2021 01:30:26 CEST John Naylor wrote:
On Wed, Jul 14, 2021 at 6:14 AM David Rowley <dgrowleyml@gmail.com> wrote:
It would be good to get a 2nd opinion about this idea. Also, more
benchmark results with v6 and v8 would be good too.
Hello,
Thank you for trying this approach in v8, David!
I've decided to test on more "stable" hardware, an EC-2 medium instance,
compiling with Debian's gcc 8.3. That's still not ideal but a lot better than
a laptop.
To gather more meaningful results, I ran every pgbench for 30s instead of the
10 in the initial script provided by David. I ran the full script once for
HEAD, v6, v8, then a second time for HEAD, v6, v8 to try to eliminate noise
that could happen for 90 consecutive seconds, and took for each of those the
median of the 6 runs. It's much less noisy than my previous runs but still
not as stable as I'd like.
The results are attached in graph form, as well as the raw data if someone
wants it.
As a conclusion, I don't think it's worth it to introduce a separate
execprocnode function for that case. It is likely the minor differences still
observed can be explained by noise, as they fluctuate if you compare the min,
max, average or median values from the results.
Is there a special reason to not share v7b tests and results?
IMHO it is much more branch-friendly.
regards,
Ranier Vilela
On Thursday, 15 July 2021 14:09:26 CEST Ranier Vilela wrote:
Is there a special reason to not share v7b tests and results?
The v7b patch is wrong, as it loses the type of tuplesort being used and as
such always tries to fetch results using tuplesort_gettupleslot after the first
tuple is fetched.
--
Ronan Dunklau
On Thu, 15 Jul 2021 at 09:27, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
On Thursday, 15 July 2021 14:09:26 CEST Ranier Vilela wrote:
Is there a special reason to not share v7b tests and results?
The v7b patch is wrong, as it loses the type of tuplesort being used
I don't see 'node->datumSort' being used anywhere else yet.
and as such always tries to fetch results using tuplesort_gettupleslot after
the first tuple is fetched.
Is that why it is faster than v6?
regards,
Ranier Vilela
On Wed, Jul 14, 2021 at 9:22 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Wed, 14 Jul 2021 at 21:21, David Rowley <dgrowleyml@gmail.com> wrote:
But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.
In v8 the branch occurs at:
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
You do know that branch is in a function that's only executed once
during executor initialization, right?
This is why I have a hard time believing there's a "real" change here
and not the result of either noise or something not really
controllable like executable layout changing.
James
On Fri, 16 Jul 2021 at 01:44, James Coleman <jtc331@gmail.com> wrote:
On Wed, Jul 14, 2021 at 9:22 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Wed, 14 Jul 2021 at 21:21, David Rowley <dgrowleyml@gmail.com> wrote:
But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.
In v8 the branch occurs at:
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
You do know that branch is in a function that's only executed once
during executor initialization, right?
This is why I have a hard time believing there's a "real" change here
and not the result of either noise or something not really
controllable like executable layout changing.
Yeah, I think we likely are at the level where layout changes in the
compiled code are going to make things hard to measure. I just want
to make sure we're not going to end up with some regression that's
actual and not random depending on layout changes of unrelated code.
I think a branch that's taken consistently *should* be predicted
correctly each time.
Anyway, I think all the comparisons with v7b can safely be ignored. As
Ronan pointed out, v7b has some issues due to it not recording the
sort method in the executor state that leads to it forgetting which
method it used once we start pulling tuples from it. The consequence
of that is that it calls tuplesort_gettupleslot() from the 2nd tuple
onwards regardless of whether we've done a datum or tuple sort.
Ronan's latest results plus John's make me think there's no need to
separate out the node function as I did in v8. However, I do think v6
could learn a little from v8. I think I'd rather see the sort method
determined in ExecInitSort() rather than ExecSort(). I think
minimising those few extra instructions in ExecSort() might help the
L1 instruction cache.
David
On Wed, 14 Jul 2021 at 22:22, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Wed, 14 Jul 2021 at 21:21, David Rowley <dgrowleyml@gmail.com> wrote:
But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.
In v8 the branch occurs at:
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
You do know that branch is in a function that's only executed once
during executor initialization, right?
There's a real difference between v8 and v6, if I understood correctly.
In v6 the branch is per tuple:
+ if (tupDesc->natts == 1)
In v8 the branch is per state:
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
I think that is a quite different way to solve the problem.
Or am I getting it wrong?
If the sortstate's number of attributes is equal to 1, does that hold the same
for each tuple?
Can you explain this, please?
regards,
Ranier Vilela
On Thu, 15 Jul 2021 at 11:19, David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 16 Jul 2021 at 01:44, James Coleman <jtc331@gmail.com> wrote:
On Wed, Jul 14, 2021 at 9:22 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Wed, 14 Jul 2021 at 21:21, David Rowley <dgrowleyml@gmail.com> wrote:
But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.
In v8 the branch occurs at:
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
You do know that branch is in a function that's only executed once
during executor initialization, right?
This is why I have a hard time believing there's a "real" change here
and not the result of either noise or something not really
controllable like executable layout changing.
Yeah, I think we likely are at the level where layout changes in the
compiled code are going to make things hard to measure. I just want
to make sure we're not going to end up with some regression that's
actual and not random depending on layout changes of unrelated code.
I think a branch that's taken consistently *should* be predicted
correctly each time.
Anyway, I think all the comparisons with v7b can safely be ignored. As
Ronan pointed out, v7b has some issues due to it not recording the
sort method in the executor state that leads to it forgetting which
method it used once we start pulling tuples from it. The reproductions
of that are it calling tuplesort_gettupleslot() from the 2nd tuple
onwards regardless of if we've done a datum or tuple sort.
Sorry for insisting on this.
Assuming v7b is doing it the wrong way (which I still don't think it is),
why is it still faster than v6 and v8?
regards,
Ranier Vilela
On Thursday, 15 July 2021 16:19:23 CEST David Rowley wrote:
Ronan's latest results plus John's make me think there's no need to
separate out the node function as I did in v8. However, I do think v6
could learn a little from v8. I think I'd rather see the sort method
determined in ExecInitSort() rather than ExecSort(). I think
minimising those few extra instructions in ExecSort() might help the
L1 instruction cache.
I'm not sure I understand what you expect from moving that to ExecInitSort?
Maybe we should also implement the tuplesort_state initialization in
ExecInitSort? (Not the actual feeding and sorting, of course.)
Please find attached a v9 just moving the flag setting to ExecInitSort, and my
apologies if I misunderstood your point.
--
Ronan Dunklau
Attachments:
v9-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchtext/x-patch; charset=UTF-8; name=v9-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patchDownload
commit 2bf8db23c8ddb53f7dbcba33c60350bda9d703ea
Author: Ronan Dunklau <ronan.dunklau@aiven.io>
Date: Tue Jul 6 16:34:31 2021 +0200
Use optimized single-datum tuplesort in ExecSort
Previously, this optimization was only done in ExecAgg.
Doing this optimization on regular sort nodes provides for a nice
performance improvement, when the input / output tuple has only one
attribute.
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..861e4c6e9e 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,10 @@
* which saves the results in a temporary file or memory. After the
* initial call, returns a tuple from the file with each call.
*
+ * The tuplesort can either occur on the whole tuple (this is the nominal
+ * case) or, when the input / output tuple consists of only one attribute,
+ * we switch to the tuplesort_*_datum API, optimized for that specific case.
+ *
* Conditions:
* -- none.
*
@@ -86,31 +90,63 @@ ExecSort(PlanState *pstate)
outerNode = outerPlanState(node);
tupDesc = ExecGetResultType(outerNode);
- tuplesortstate = tuplesort_begin_heap(tupDesc,
- plannode->numCols,
- plannode->sortColIdx,
- plannode->sortOperators,
- plannode->collations,
- plannode->nullsFirst,
- work_mem,
- NULL,
- node->randomAccess);
+ /*
+ * Switch to the tuplesort_*_datum interface when we have only one
+ * key, as it is much more efficient especially when the type is
+ * pass-by-value.
+ */
+ if (node->datumSort)
+ {
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ }
+ else
+ tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plannode->numCols,
+ plannode->sortColIdx,
+ plannode->sortOperators,
+ plannode->collations,
+ plannode->nullsFirst,
+ work_mem,
+ NULL,
+ node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
/*
- * Scan the subplan and feed all the tuples to tuplesort.
+ * Scan the subplan and feed all the tuples to tuplesort, using either
+ * the _putdatum or _puttupleslot API as appropriate.
*/
-
- for (;;)
+ if (node->datumSort)
{
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
+ }
+ else
+ {
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
}
/*
@@ -150,9 +186,18 @@ ExecSort(PlanState *pstate)
* next fetch from the tuplesort.
*/
slot = node->ss.ps.ps_ResultTupleSlot;
- (void) tuplesort_gettupleslot(tuplesortstate,
- ScanDirectionIsForward(dir),
- false, slot, NULL);
+ if (node->datumSort)
+ {
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ }
+ else
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+
return slot;
}
@@ -191,6 +236,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
sortstate->bounded = false;
sortstate->sort_Done = false;
sortstate->tuplesortstate = NULL;
+ sortstate->datumSort = false;
/*
* Miscellaneous initialization
@@ -220,6 +266,15 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
*/
ExecInitResultTupleSlotTL(&sortstate->ss.ps, &TTSOpsMinimalTuple);
sortstate->ss.ps.ps_ProjInfo = NULL;
+
+ /*
+ * We perform a Datum sort when we're sorting just a single column,
+ * otherwise we perform a tuple sort.
+ */
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
+ sortstate->datumSort = true;
+ else
+ sortstate->datumSort = false;
SO1_printf("ExecInitSort: %s\n",
"sort node initialized");
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0ec5509e7e..89f944a9c4 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
int64 bound_Done; /* value of bound we did the sort with */
void *tuplesortstate; /* private state of tuplesort.c */
bool am_worker; /* are we a worker? */
+ bool datumSort; /* are we using the single value optimization ? */
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
On Thu, Jul 15, 2021 at 10:19 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 16 Jul 2021 at 01:44, James Coleman <jtc331@gmail.com> wrote:
On Wed, Jul 14, 2021 at 9:22 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 15 Jul 2021 at 12:30, Ranier Vilela <ranier.vf@gmail.com> wrote:
On Wed, 14 Jul 2021 at 21:21, David Rowley <dgrowleyml@gmail.com> wrote:
But, in v8 there is no additional branch, so no branch to mispredict.
I don't really see how your explanation fits.
In v8 the branch occurs at:
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
You do know that branch is in a function that's only executed once
during executor initialization, right?
This is why I have a hard time believing there's a "real" change here
and not the result of either noise or something not really
controllable like executable layout changing.
Yeah, I think we likely are at the level where layout changes in the
compiled code are going to make things hard to measure. I just want
to make sure we're not going to end up with some regression that's
actual and not random depending on layout changes of unrelated code.
I think a branch that's taken consistently *should* be predicted
correctly each time.
Anyway, I think all the comparisons with v7b can safely be ignored. As
Ronan pointed out, v7b has some issues due to it not recording the
sort method in the executor state that leads to it forgetting which
method it used once we start pulling tuples from it. The reproductions
of that are it calling tuplesort_gettupleslot() from the 2nd tuple
onwards regardless of if we've done a datum or tuple sort.
Ronan's latest results plus John's make me think there's no need to
separate out the node function as I did in v8. However, I do think v6
could learn a little from v8. I think I'd rather see the sort method
determined in ExecInitSort() rather than ExecSort(). I think
minimising those few extra instructions in ExecSort() might help the
L1 instruction cache.
I ran master/v6/v8 tests for 90s each with David's test script on an
AWS c5n.metal instance (so should be immune to noise neighbor issues).
Here are comparative results:
Test1 Test2 Test3 Test4 Test5 Test6 Test7 Test8
v6 68.66% 0.05% 32.21% -0.83% 12.58% 10.42% -1.48% 50.98%
v8 69.78% -0.44% 32.45% -1.11% 12.01% 10.58% -1.40% 49.30%
So I see a consistent change in the data, but I don't really see a
good explanation for it not being noise. Can't prove that yet though.
James
On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
On Thursday, 15 July 2021 16:19:23 CEST David Rowley wrote:
Ronan's latest results plus John's make me think there's no need to
separate out the node function as I did in v8. However, I do think v6
could learn a little from v8. I think I'd rather see the sort method
determined in ExecInitSort() rather than ExecSort(). I think
minimising those few extra instructions in ExecSort() might help the
L1 instruction cache.I'm not sure I understand what you expect from moving that to ExecInitSort ?
The motivation was to reduce the extra code that's being added to
ExecSort. I checked the assembly of ExecSort on v6 and v9 and v6 was
544 lines of assembly and v9 is 534 lines.
Maybe we should also implement the tuplesort_state initialization in
ExecInitSort ? (not the actual feeding and sorting of course).
I don't think that would be a good idea. Setting the datumSort does
not require any new memory to be allocated. That's not the case for
the tuplesort_begin routines. The difference here is that we can
delay the memory allocation until we pull the first tuple and if we
don't pull any tuples from the outer node then there are no needless
allocations.
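To restate that with a small self-contained sketch (toy types and names that merely mirror the shape nodeSort.c already has, not a proposal for new executor code): the flag is essentially free to set at init time, while the expensive state stays lazily created on the first fetch.

#include <stdbool.h>
#include <stdlib.h>

/* Toy sketch, not PostgreSQL code: a cheap decision is made at init time,
 * while the expensive allocation is deferred to the first call that needs
 * it, so a node that is never pulled from never pays for it. */
typedef struct ToySortState
{
	bool	datum_sort;		/* decided once at init; no allocation */
	bool	sort_done;
	void   *sort_state;		/* stands in for the tuplesort state */
} ToySortState;

static void
toy_init(ToySortState *node, int natts)
{
	node->datum_sort = (natts == 1);
	node->sort_done = false;
	node->sort_state = NULL;
}

static void
toy_first_fetch(ToySortState *node)
{
	if (!node->sort_done)
	{
		/* deferred allocation, analogous to tuplesort_begin_* */
		node->sort_state = malloc(node->datum_sort ? 64 : 1024);
		node->sort_done = true;
	}
}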
Please find attached a v9 just moving the flag setting to ExecInitSort, and my
apologies if I misunderstood your point.
That's exactly what I meant. Thanks
David
On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Please find attached a v9 just moving the flag setting to ExecInitSort, and my
apologies if I misunderstood your point.
I took this and adjusted a few things and ended up with the attached patch.
The changes are fairly minor. I made the bracing consistent between
both tuplesort_begin calls. I rewrote the comment at the top of
ExecSort() to make it more clear about each method used.
I also adjusted the comment down at the end of ExecSort that was
mentioning something about tuplesort_gettupleslot returning NULL.
Your patch didn't touch this, but to me, the comment just looked wrong
both before and after the changes. tuplesort_gettupleslot returns
false and sets the slot to empty when it runs out of tuples. Anyway,
I wrote something there that I think improves that.
I feel like this patch is commit-worthy now. However, I'll leave it
for a few days, maybe until after the weekend as there's been a fair
bit of interest and I imagine someone will have comments to make.
David
Attachments:
v10-0001-Make-nodeSort.c-do-Datum-sorts-for-single-column.patchapplication/octet-stream; name=v10-0001-Make-nodeSort.c-do-Datum-sorts-for-single-column.patchDownload
From 254806136ae380fb49837856a2fc621117ec9e2a Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Fri, 16 Jul 2021 15:24:05 +1200
Subject: [PATCH v10] Make nodeSort.c do Datum sorts for single column sorts
Datum sorts can be significantly cheaper than tuple sorts, especially when
the data type being sorted is a pass-by-value data type.
---
src/backend/executor/nodeSort.c | 106 ++++++++++++++++++++++++--------
src/include/nodes/execnodes.h | 1 +
2 files changed, 82 insertions(+), 25 deletions(-)
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..c68d9e41c2 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,16 @@
* which saves the results in a temporary file or memory. After the
* initial call, returns a tuple from the file with each call.
*
+ * There are two distinct ways that this sort can be performed:
+ *
+ * 1) When the result is a single column we perform a Datum sort.
+ *
+ * 2) When the result contains multiple columns we perform a tuple sort.
+ *
+ * We could do this by always performing a tuple sort, however sorting
+ * Datums only can be significantly faster than sorting tuples,
+ * especially when the Datums are of a pass-by-value type.
+ *
* Conditions:
* -- none.
*
@@ -86,31 +96,56 @@ ExecSort(PlanState *pstate)
outerNode = outerPlanState(node);
tupDesc = ExecGetResultType(outerNode);
- tuplesortstate = tuplesort_begin_heap(tupDesc,
- plannode->numCols,
- plannode->sortColIdx,
- plannode->sortOperators,
- plannode->collations,
- plannode->nullsFirst,
- work_mem,
- NULL,
- node->randomAccess);
+ if (node->datumSort)
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ else
+ tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plannode->numCols,
+ plannode->sortColIdx,
+ plannode->sortOperators,
+ plannode->collations,
+ plannode->nullsFirst,
+ work_mem,
+ NULL,
+ node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
/*
- * Scan the subplan and feed all the tuples to tuplesort.
+ * Scan the subplan and feed all the tuples to tuplesort using the
+ * appropriate method based on the type of sort we're doing.
*/
-
- for (;;)
+ if (node->datumSort)
{
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
+ }
+ else
+ {
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
}
/*
@@ -144,15 +179,27 @@ ExecSort(PlanState *pstate)
SO1_printf("ExecSort: %s\n",
"retrieving tuple from tuplesort");
+ slot = node->ss.ps.ps_ResultTupleSlot;
+
/*
- * Get the first or next tuple from tuplesort. Returns NULL if no more
- * tuples. Note that we only rely on slot tuple remaining valid until the
- * next fetch from the tuplesort.
+ * Fetch the next sorted item from the appropriate tuplesort function. For
+ * datum sorts we must manage the slot ourselves and leave it clear when
+ * tuplesort_getdatum returns false to indicate there are no more datums.
+ * For tuple sorts, tuplesort_gettupleslot manages the slot for us and
+ * empties the slot when it runs out of tuples.
*/
- slot = node->ss.ps.ps_ResultTupleSlot;
- (void) tuplesort_gettupleslot(tuplesortstate,
- ScanDirectionIsForward(dir),
- false, slot, NULL);
+ if (node->datumSort)
+ {
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ }
+ else
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+
return slot;
}
@@ -221,6 +268,15 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
ExecInitResultTupleSlotTL(&sortstate->ss.ps, &TTSOpsMinimalTuple);
sortstate->ss.ps.ps_ProjInfo = NULL;
+ /*
+ * We perform a Datum sort when we're sorting just a single column,
+ * otherwise we perform a tuple sort.
+ */
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
+ sortstate->datumSort = true;
+ else
+ sortstate->datumSort = false;
+
SO1_printf("ExecInitSort: %s\n",
"sort node initialized");
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 105180764e..6bf650fdf3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
int64 bound_Done; /* value of bound we did the sort with */
void *tuplesortstate; /* private state of tuplesort.c */
bool am_worker; /* are we a worker? */
+ bool datumSort; /* Datum sort instead of tuple sort? */
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
--
2.30.2
On Fri, 16 Jul 2021 at 00:45, David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Please find attached a v9 just moving the flag setting to ExecInitSort, and my
apologies if I misunderstood your point.
I took this and adjusted a few things and ended up with the attached patch.
The changes are fairly minor. I made the bracing consistent between
both tuplesort_begin calls. I rewrote the comment at the top of
ExecSort() to make it more clear about each method used.
Regarding the braces, it's still not clear to me which style to follow.
I gave Ronan directions about it,
and I think it is maybe still not clear when to use them or not.
I also adjusted the comment down at the end of ExecSort that was
mentioning something about tuplesort_gettupleslot returning NULL.
Your patch didn't touch this, but to me, the comment just looked wrong
both before and after the changes. tuplesort_gettupleslot returns
false and sets the slot to empty when it runs out of tuples. Anyway,
I wrote something there that I think improves that.
Can help a little here, but, seems good to me.
I feel like this patch is commit-worthy now. However, I'll leave it
for a few days, maybe until after the weekend as there's been a fair
bit of interest and I imagine someone will have comments to make.
I've had a little lack of time,
but I finally understand v7b.
The struct field really is necessary, and v7b fails with the next tuple, OK.
The only conclusion I can come to is that it is faster because it fails to
sort correctly.
It's no use being faster and getting wrong results.
So, +1 from me to commit v10.
Thanks for working together.
Ranier Vilela
On Thu, Jul 15, 2021 at 11:45 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Please find attached a v9 just moving the flag setting to ExecInitSort, and my
apologies if I misunderstood your point.
I took this and adjusted a few things and ended up with the attached patch.
The changes are fairly minor. I made the bracing consistent between
both tuplesort_begin calls. I rewrote the comment at the top of
ExecSort() to make it more clear about each method used.
I also adjusted the comment down at the end of ExecSort that was
mentioning something about tuplesort_gettupleslot returning NULL.
Your patch didn't touch this, but to me, the comment just looked wrong
both before and after the changes. tuplesort_gettupleslot returns
false and sets the slot to empty when it runs out of tuples. Anyway,
I wrote something there that I think improves that.
I feel like this patch is commit-worthy now. However, I'll leave it
for a few days, maybe until after the weekend as there's been a fair
bit of interest and I imagine someone will have comments to make.
The only remaining question I have is whether or not costing needs to
The only remaining question I have is whether or not costing needs to
change, given the very significant speedup for datum sort.
James
On Sat, 17 Jul 2021 at 01:14, James Coleman <jtc331@gmail.com> wrote:
The only remaining question I have is whether or not costing needs to
change, given the very significant speedup for datum sort.
I'm looking at cost_tuplesort and the only thing that I think might
make sense would be to adjust how the input_bytes value is calculated.
For now, that's done with the following function that's used in quite
a number of places.
static double
relation_byte_size(double tuples, int width)
{
return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
}
It seems, at least in the case of Sort, that using SizeofHeapTupleHeader
is just always wrong as it should be SizeofMinimalTupleHeader. I know
that's also the case for Memoize too. I've not checked the other
locations.
The only thing I can really see that we might do would be not add the
MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
We'd need to pass down the number of attributes from
create_sort_path() so we'd know when and when not to add that. I'm not
saying that we should do this. I'm just saying that I don't really see
what else we might do.
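To make that idea concrete, here is a hedged sketch of what such an adjustment might look like (my illustration only, not a patch from this thread; it assumes the usual PostgreSQL headers for MAXALIGN and SizeofMinimalTupleHeader, and a hypothetical extra natts argument passed down from create_sort_path()):

/*
 * Hypothetical sketch: estimate the bytes fed to a Sort node, skipping the
 * per-tuple header overhead when a single-column Datum sort would be used.
 * "natts" is an assumed extra parameter that the existing callers do not
 * pass down today.
 */
static double
sort_input_bytes(double tuples, int width, int natts)
{
	if (natts == 1)
		return tuples * MAXALIGN(width);	/* no MinimalTuple is built */

	return tuples * (MAXALIGN(width) + MAXALIGN(SizeofMinimalTupleHeader));
}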
I can imagine another patch might just want to do a complete overhaul
of all locations that use relation_byte_size(). There are various
things that function just does not account for. e.g, the fact that we
allocate chunks in powers of 2 and that there's a chunk header added
on. Of course, "width" is just an estimate, so maybe trying to
calculate something too precisely wouldn't be too wise. However,
there's a bit of a chicken and the egg problem there as there'd be
little incentive to improve "width" unless we started making more
accurate use of the value.
Anyway, none of the above takes into account that the Datum sort is
just a little faster. The only thing that exists in the existing cost
model that we could use to adjust the cost of an in-memory sort is the
comparison_cost. The problem there is that the comparison is exactly
the same in both Datum and Tuple sorts. The only thing that really
changes between Datum and Tuple sort is the fact that we don't make a
MinimalTuple when doing a Datum sort. The cost model, unfortunately,
does not account for that. That kinda makes me think that we should
do nothing, as if we start to account for making MinimalTuples then
we'll just penalise Tuple sorts and that might cause someone to be
upset.
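For reference, the relevant part of cost_tuplesort() at the time of this thread looks
roughly like the following. This is a condensed paraphrase from memory, so consult
src/backend/optimizer/path/costsize.c for the authoritative version:

comparison_cost += 2.0 * cpu_operator_cost;		/* identical for Datum and tuple sorts */

if (output_bytes > sort_mem_bytes)
{
	/* external sort: comparisons plus disk I/O charges (omitted here) */
	*startup_cost = comparison_cost * tuples * LOG2(tuples);
}
else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
{
	/* bounded heap-sort keeping just the top-K tuples in memory */
	*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
}
else
{
	/* plain in-memory quicksort */
	*startup_cost = comparison_cost * tuples * LOG2(tuples);
}

/* small per-extracted-tuple surcharge; note no cpu_tuple_cost is charged */
*run_cost = cpu_operator_cost * tuples;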
David
Le samedi 17 juillet 2021, 10:36:09 CEST David Rowley a écrit :
On Sat, 17 Jul 2021 at 01:14, James Coleman <jtc331@gmail.com> wrote:
The only remaining question I have is whether or not costing needs to
change, given the very significant speedup for datum sort.
I'm looking at cost_tuplesort and the only thing that I think might
make sense would be to adjust how the input_bytes value is calculated.
For now, that's done with the following function that's used in quite
a number of places.
static double
relation_byte_size(double tuples, int width)
{
return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
}
It seems, at least in the case of Sort, that using SizeofHeapTupleHead
is just always wrong as it should be SizeofMinimalTupleHeader. I know
that's also the case for Memoize too. I've not checked the other
locations.
The only thing I can really see that we might do would be not add the
MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
We'd need to pass down the number of attributes from
create_sort_path() so we'd know when and when not to add that. I'm not
saying that we should do this. I'm just saying that I don't really see
what else we might do.
I can imagine another patch might just want to do a complete overhaul
of all locations that use relation_byte_size(). There are various
things that function just does not account for. e.g, the fact that we
allocate chunks in powers of 2 and that there's a chunk header added
on. Of course, "width" is just an estimate, so maybe trying to
calculate something too precisely wouldn't be too wise. However,
there's a bit of a chicken and the egg problem there as there'd be
little incentive to improve "width" unless we started making more
accurate use of the value.
Anyway, none of the above take into account that the Datum sort is
just a little faster, The only thing that exists in the existing cost
modal that we could use to adjust the cost of an in memory sort is the
comparison_cost. The problem there is that the comparison is exactly
the same in both Datum and Tuple sorts. The only thing that really
changes between Datum and Tuple sort is the fact that we don't make a
MinimalTuple when doing a Datum sort. The cost modal, unfortunately,
does not account for that. That kinda makes me think that we should
do nothing as if we start to account for making MemoryTuples then
we'll just penalise Tuple sorts and that might cause someone to be
upset.
Thank you for taking the time to perform that analysis. I agree with you: it
looks to me that if we were to start accounting for it, we would have to
make the change almost transparent for tuple sorts so that their cost stays roughly
the same, which is impossible since we don't apply the comparison cost to all
tuples but only to the number of tuples we actually expect to compare.
On the other hand, if we don't change the sorting cost and it just ends up
being faster in some cases I doubt anyone would complain.
--
Ronan Dunklau
On Sat, Jul 17, 2021 at 4:36 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, 17 Jul 2021 at 01:14, James Coleman <jtc331@gmail.com> wrote:
The only remaining question I have is whether or not costing needs to
change, given the very significant speedup for datum sort.
I'm looking at cost_tuplesort and the only thing that I think might
make sense would be to adjust how the input_bytes value is calculated.
For now, that's done with the following function that's used in quite
a number of places.
static double
relation_byte_size(double tuples, int width)
{
return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
}
It seems, at least in the case of Sort, that using SizeofHeapTupleHead
is just always wrong as it should be SizeofMinimalTupleHeader. I know
that's also the case for Memoize too. I've not checked the other
locations.
The only thing I can really see that we might do would be not add the
MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
We'd need to pass down the number of attributes from
create_sort_path() so we'd know when and when not to add that. I'm not
saying that we should do this. I'm just saying that I don't really see
what else we might do.
I can imagine another patch might just want to do a complete overhaul
of all locations that use relation_byte_size(). There are various
things that function just does not account for. e.g, the fact that we
allocate chunks in powers of 2 and that there's a chunk header added
on. Of course, "width" is just an estimate, so maybe trying to
calculate something too precisely wouldn't be too wise. However,
there's a bit of a chicken and the egg problem there as there'd be
little incentive to improve "width" unless we started making more
accurate use of the value.
Anyway, none of the above take into account that the Datum sort is
just a little faster, The only thing that exists in the existing cost
modal that we could use to adjust the cost of an in memory sort is the
comparison_cost. The problem there is that the comparison is exactly
the same in both Datum and Tuple sorts. The only thing that really
changes between Datum and Tuple sort is the fact that we don't make a
MinimalTuple when doing a Datum sort. The cost modal, unfortunately,
does not account for that. That kinda makes me think that we should
do nothing as if we start to account for making MemoryTuples then
we'll just penalise Tuple sorts and that might cause someone to be
upset.
To be clear up front: I'm in favor of the patch, and I don't want to
put unnecessary stumbling blocks up for it getting committed. So if we
decide to proceed as is, that's fine with me.
But I'm not sure that the "cost model, unfortunately, does not account
for that" is entirely accurate. The end of cost_tuplesort contains a
cost "per extracted tuple". It does, however, note that it doesn't
charge cpu_tuple_cost, which is maybe what you'd want in order to fully
incorporate this into the model. But given that this run_cost isn't about
accounting for comparison cost (that's been done earlier), which is the
part that'd be the same between tuple and datum sorts, it seems to me
that we could lower the cpu_operator_cost here by something like 10%
if it's byref and 30% if it's byval?
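A minimal sketch of what that discount might look like at the end of cost_tuplesort().
The single_datum and datum_byval flags are hypothetical inputs the function does not
take today, and the 0.9/0.7 factors are just the 10%/30% figures above:

{
	double		per_tuple_cost = cpu_operator_cost;

	/* hypothetical: planner knows a single-column Datum sort will be used */
	if (single_datum)
		per_tuple_cost *= datum_byval ? 0.7 : 0.9;

	*run_cost = per_tuple_cost * tuples;
}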
James
On Tue, 20 Jul 2021 at 01:10, James Coleman <jtc331@gmail.com> wrote:
To be clear up front: I'm in favor of the patch, and I don't want to
put unnecessary stumbling blocks up for it getting committed. So if we
decide to proceed as is, that's fine with me.
Thanks for making that clear.
But I'm not sure that the "cost model, unfortunately, does not account
for that" is entirely accurate. The end of cost_tuplesort contains a
cost "per extracted tuple". It does, however, note that it doesn't
charge cpu_tuple_cost, which maybe is what you'd want to fully
incorporate this into the model. But given this run_cost isn't about
accounting for comparison cost (that's been done earlier) which is the
part that'd be the same between tuple and datum sort, it seems to me
that we could lower the cpu_operator_cost here by something like 10%
if it's byref and 30% if it's byval?
I failed to notice that last part that adds the additional cpu_operator_cost.
The default cpu_operator_cost is 0.0025, so with the 10k tuple
benchmark, that adds an additional charge of 25 to the total cost.
If we take test 1 from my results on v5 as an example:
Test1: 446.1 (master) vs 657.3 (patched) = 147.32%
Looking at explain for that query:
regression=# explain select two from tenk1 order by two offset 1000000;
QUERY PLAN
----------------------------------------------------------------------
Limit (cost=1133.95..1133.95 rows=1 width=4)
-> Sort (cost=1108.97..1133.95 rows=9995 width=4)
Sort Key: two
-> Seq Scan on tenk1 (cost=0.00..444.95 rows=9995 width=4)
(4 rows)
If we want the costs to reflect reality again here then we'd have to
reduce 1133.95 by something like 147.32% (the performance difference).
That would bring the cost down to 769.72, which is a far bigger reduction
than the 25 we have to play with from the cpu_operator_cost * tuples
surcharge.
If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
cost would become 1126.45. That's not great considering the actual
performance indicates that 769.72 would be a better number.
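To spell out the arithmetic behind those numbers (my reconstruction from the figures
above, so treat it as illustrative):

    run-cost surcharge      : 1133.95 - 1108.97 ≈ 25   (cpu_operator_cost * 9995)
    30% discount on that    : 1133.95 - 0.3 * 25 ≈ 1126.45
    cost matching the observed 147.32% speedup: 1133.95 / 1.4732 ≈ 769.72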
If we look at John's result for test 1: He saw 588 tps on master and
998 on v8. 1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
to get close to reality on that machine.
My thoughts are that the small surcharge added at the end of
cost_tuplesort() is just not enough for us to play with. I think to
get us closer to fixing this correctly would require a redesign of the
tuplesort costing entirely. I think that would be about an order of
magnitude more effort than this patch was, so I really feel like I
don't want to do this.
I kinda feel that since the comparison_cost is always just 2.0 *
cpu_operator_cost regardless of the number of columns in the sort,
then if we add too many new smarts to try and properly adjust for this
new optimization, unless we do a completely new cost model for this,
then we might as well be putting lipstick on a pig.
It sounds like James mostly mentioned the costing just to ensure
it was properly considered and does not really feel strongly that it
needs to be adjusted. Does anyone else feel that we should be
adjusting it?
David
On Tue, 20 Jul 2021 at 05:35, David Rowley <dgrowleyml@gmail.com>
wrote:
On Tue, 20 Jul 2021 at 01:10, James Coleman <jtc331@gmail.com> wrote:
To be clear up front: I'm in favor of the patch, and I don't want to
put unnecessary stumbling blocks up for it getting committed. So if we
decide to proceed as is, that's fine with me.
Thanks for making that clear.
But I'm not sure that the "cost model, unfortunately, does not account
for that" is entirely accurate. The end of cost_tuplesort contains a
cost "per extracted tuple". It does, however, note that it doesn't
charge cpu_tuple_cost, which maybe is what you'd want to fully
incorporate this into the model. But given this run_cost isn't about
accounting for comparison cost (that's been done earlier) which is the
part that'd be the same between tuple and datum sort, it seems to me
that we could lower the cpu_operator_cost here by something like 10%
if it's byref and 30% if it's byval?
I failed to notice that last part that adds the additional
cpu_operator_cost.
The default cpu_operator_cost is 0.0025, so with the 10k tuple
benchmark, that adds an additional charge of 25 to the total cost.
If we take test 1 from my results on v5 as an example:
Test1 446.1 657.3 147.32%
Looking at explain for that query:
regression=# explain select two from tenk1 order by two offset 1000000;
QUERY PLAN
----------------------------------------------------------------------
Limit (cost=1133.95..1133.95 rows=1 width=4)
-> Sort (cost=1108.97..1133.95 rows=9995 width=4)
Sort Key: two
-> Seq Scan on tenk1 (cost=0.00..444.95 rows=9995 width=4)
(4 rows)
If we want the costs to reflect reality again here then we'd have
reduce 1133.95 by something like 147.32% (the performance difference).
That would bring the cost down to 769.72, which is way more than we
have to play with than the 25 that the cpu_operator_cost * tuples
gives us.
If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
cost would become 1126.45. That's not great considering the actual
performance indicates that 769.72 would be a better number.
If we look at John's result for test 1: He saw 588 tps on master and
998 on v8. 1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
to get close to reality on that machine.
My thoughts are that the small surcharge added at the end of
cost_tuplesort() is just not enough for us to play with. I think to
get us closer to fixing this correctly would require a redesign of the
tuplesort costing entirely. I think that would be about an order of
magnitude more effort than this patch was, so I really feel like I
don't want to do this.
I understand that redesign would require a lot of work,
but why not do it step by step?
I kinda feel that since the comparison_cost is always just 2.0 *
cpu_operator_cost regardless of the number of columns in the sort,
then if we add too many new smarts to try and properly adjust for this
new optimization, unless we do a completly new cost modal for this,
then we might as well be putting lipstick on a pig.
I think a first step could be naming this 2.0. Doesn't this
magic number deserve a proper name?
It sounds like James mostly just mentioned the sorting just to ensure
it was properly considered and does not really feel strongly that it
needs to be adjusted. Does anyone else feel that we should be
adjusting it?
I took a look at cost_tuplesort and I think that some small adjustments
could be made as part of the improvement process.
It is attached.
1. long is a very problematic type; wouldn't int64 be better?
2. Can 1024 be an int rather than a long?
3. Should the bare 2 constants all be changed to 2.0 (double)?
4. If the disk-based path is not needed, IMO can we avoid calling relation_byte_size?
Finally, it would be nice to at least document (add comments about) those
conclusions, wouldn't it?
regards,
Ranier Vilela
Attachments:
costsize.patchapplication/octet-stream; name=costsize.patchDownload
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8577c7b138..0798c8da0b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1796,10 +1796,9 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
Cost comparison_cost, int sort_mem,
double limit_tuples)
{
- double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
- long sort_mem_bytes = sort_mem * 1024L;
+ int64 sort_mem_bytes = sort_mem * 1024;
/*
* We want to be sure the cost of a sort is never estimated as zero, even
@@ -1812,7 +1811,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
comparison_cost += 2.0 * cpu_operator_cost;
/* Do we have a useful LIMIT? */
- if (limit_tuples > 0 && limit_tuples < tuples)
+ if (limit_tuples > 0.0 && limit_tuples < tuples)
{
output_tuples = limit_tuples;
output_bytes = relation_byte_size(output_tuples, width);
@@ -1820,7 +1819,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
else
{
output_tuples = tuples;
- output_bytes = input_bytes;
+ output_bytes = relation_byte_size(tuples, width);
}
if (output_bytes > sort_mem_bytes)
@@ -1828,6 +1827,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
/*
* We'll have to use a disk-based sort of all the tuples
*/
+ double input_bytes = relation_byte_size(tuples, width);
double npages = ceil(input_bytes / BLCKSZ);
double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
@@ -1853,7 +1853,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
*startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);
}
- else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
+ else if (tuples > 2.0 * output_tuples || input_bytes > sort_mem_bytes)
{
/*
* We'll use a bounded heap-sort keeping just K tuples in memory, for
On Fri, 16 Jul 2021 at 15:44, David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 16 Jul 2021 at 02:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Please find attached a v9 just moving the flag setting to ExecInitSort, and my
apologies if I misunderstood your point.
I took this and adjusted a few things and ended up with the attached patch.
Attaching the same v10 patch again so the CF bot picks up the correct
patch.
David
Attachments:
v10-0001-Make-nodeSort.c-do-Datum-sorts-for-single-column.patchapplication/octet-stream; name=v10-0001-Make-nodeSort.c-do-Datum-sorts-for-single-column.patchDownload
From 254806136ae380fb49837856a2fc621117ec9e2a Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Fri, 16 Jul 2021 15:24:05 +1200
Subject: [PATCH v10] Make nodeSort.c do Datum sorts for single column sorts
Datum sorts can be significantly cheaper than tuple sorts, especially when
the data type being sorted is a pass-by-value data type.
---
src/backend/executor/nodeSort.c | 106 ++++++++++++++++++++++++--------
src/include/nodes/execnodes.h | 1 +
2 files changed, 82 insertions(+), 25 deletions(-)
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b99027e0d7..c68d9e41c2 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -29,6 +29,16 @@
* which saves the results in a temporary file or memory. After the
* initial call, returns a tuple from the file with each call.
*
+ * There are two distinct ways that this sort can be performed:
+ *
+ * 1) When the result is a single column we perform a Datum sort.
+ *
+ * 2) When the result contains multiple columns we perform a tuple sort.
+ *
+ * We could do this by always performing a tuple sort, however sorting
+ * Datums only can be significantly faster than sorting tuples,
+ * especially when the Datums are of a pass-by-value type.
+ *
* Conditions:
* -- none.
*
@@ -86,31 +96,56 @@ ExecSort(PlanState *pstate)
outerNode = outerPlanState(node);
tupDesc = ExecGetResultType(outerNode);
- tuplesortstate = tuplesort_begin_heap(tupDesc,
- plannode->numCols,
- plannode->sortColIdx,
- plannode->sortOperators,
- plannode->collations,
- plannode->nullsFirst,
- work_mem,
- NULL,
- node->randomAccess);
+ if (node->datumSort)
+ tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
+ plannode->sortOperators[0],
+ plannode->collations[0],
+ plannode->nullsFirst[0],
+ work_mem,
+ NULL,
+ node->randomAccess);
+ else
+ tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plannode->numCols,
+ plannode->sortColIdx,
+ plannode->sortOperators,
+ plannode->collations,
+ plannode->nullsFirst,
+ work_mem,
+ NULL,
+ node->randomAccess);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
/*
- * Scan the subplan and feed all the tuples to tuplesort.
+ * Scan the subplan and feed all the tuples to tuplesort using the
+ * appropriate method based on the type of sort we're doing.
*/
-
- for (;;)
+ if (node->datumSort)
{
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
+ }
+ else
+ {
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
}
/*
@@ -144,15 +179,27 @@ ExecSort(PlanState *pstate)
SO1_printf("ExecSort: %s\n",
"retrieving tuple from tuplesort");
+ slot = node->ss.ps.ps_ResultTupleSlot;
+
/*
- * Get the first or next tuple from tuplesort. Returns NULL if no more
- * tuples. Note that we only rely on slot tuple remaining valid until the
- * next fetch from the tuplesort.
+ * Fetch the next sorted item from the appropriate tuplesort function. For
+ * datum sorts we must manage the slot ourselves and leave it clear when
+ * tuplesort_getdatum returns false to indicate there are no more datums.
+ * For tuple sorts, tuplesort_gettupleslot manages the slot for us and
+ * empties the slot when it runs out of tuples.
*/
- slot = node->ss.ps.ps_ResultTupleSlot;
- (void) tuplesort_gettupleslot(tuplesortstate,
- ScanDirectionIsForward(dir),
- false, slot, NULL);
+ if (node->datumSort)
+ {
+ ExecClearTuple(slot);
+ if (tuplesort_getdatum(tuplesortstate, ScanDirectionIsForward(dir),
+ &(slot->tts_values[0]), &(slot->tts_isnull[0]), NULL))
+ ExecStoreVirtualTuple(slot);
+ }
+ else
+ (void) tuplesort_gettupleslot(tuplesortstate,
+ ScanDirectionIsForward(dir),
+ false, slot, NULL);
+
return slot;
}
@@ -221,6 +268,15 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
ExecInitResultTupleSlotTL(&sortstate->ss.ps, &TTSOpsMinimalTuple);
sortstate->ss.ps.ps_ProjInfo = NULL;
+ /*
+ * We perform a Datum sort when we're sorting just a single column,
+ * otherwise we perform a tuple sort.
+ */
+ if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
+ sortstate->datumSort = true;
+ else
+ sortstate->datumSort = false;
+
SO1_printf("ExecInitSort: %s\n",
"sort node initialized");
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 105180764e..6bf650fdf3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2151,6 +2151,7 @@ typedef struct SortState
int64 bound_Done; /* value of bound we did the sort with */
void *tuplesortstate; /* private state of tuplesort.c */
bool am_worker; /* are we a worker? */
+ bool datumSort; /* Datum sort instead of tuple sort? */
SharedSortInfo *shared_info; /* one entry per worker */
} SortState;
--
2.30.2
On Tue, 20 Jul 2021 at 23:28, Ranier Vilela <ranier.vf@gmail.com> wrote:
I took a look at cost_tuplesort and I think that some small adjustments could be made as part of the improvement process.
It is attached.
1. long is a very problematic type; better int64?
2. 1024 can be int, not long?
3. 2 changed all to 2.0 (double)?
4. If disk-based is not needed, IMO can we avoid calling relation_byte_size?
Finally, to at least document (add comments) those conclusions,
would be nice, wouldn't it?
I don't think there's anything useful here. If you think otherwise,
please take it to another thread. Also, I'd recommend at least
compiling any patches you send to -hackers in the future. Going by the
CF bot, this one does not compile.
You might also want to read up on type promotion rules in C. Your
sort_mem calculation change does not do what you think it does. Class
it as homework to figure out what's wrong with it. No need to report
your findings here. Just thought it would be useful for you to learn
those things.
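For anyone following along, here is a self-contained illustration of the C integer
promotion issue being hinted at (my reading of it, not taken from the thread's patch):
with both operands of type int, the multiplication overflows before the result is
widened, whereas the original 1024L constant (or an explicit cast) widens first.

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int		sort_mem = INT_MAX / 512;	/* a large work_mem setting, in kB */

	/* int * int is evaluated in int: overflows (undefined behaviour) */
	int64_t	bad = sort_mem * 1024;

	/* widening one operand first keeps the arithmetic in 64 bits */
	int64_t	good = (int64_t) sort_mem * 1024;

	printf("%lld vs %lld\n", (long long) bad, (long long) good);
	return 0;
}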
David
On Tue, Jul 20, 2021 at 4:35 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 20 Jul 2021 at 01:10, James Coleman <jtc331@gmail.com> wrote:
To be clear up front: I'm in favor of the patch, and I don't want to
put unnecessary stumbling blocks up for it getting committed. So if we
decide to proceed as is, that's fine with me.
Thanks for making that clear.
But I'm not sure that the "cost model, unfortunately, does not account
for that" is entirely accurate. The end of cost_tuplesort contains a
cost "per extracted tuple". It does, however, note that it doesn't
charge cpu_tuple_cost, which maybe is what you'd want to fully
incorporate this into the model. But given this run_cost isn't about
accounting for comparison cost (that's been done earlier) which is the
part that'd be the same between tuple and datum sort, it seems to me
that we could lower the cpu_operator_cost here by something like 10%
if it's byref and 30% if it's byval?
I failed to notice that last part that adds the additional cpu_operator_cost.
The default cpu_operator_cost is 0.0025, so with the 10k tuple
benchmark, that adds an additional charge of 25 to the total cost.
If we take test 1 from my results on v5 as an example:
Test1 446.1 657.3 147.32%
Looking at explain for that query:
regression=# explain select two from tenk1 order by two offset 1000000;
QUERY PLAN
----------------------------------------------------------------------
Limit (cost=1133.95..1133.95 rows=1 width=4)
-> Sort (cost=1108.97..1133.95 rows=9995 width=4)
Sort Key: two
-> Seq Scan on tenk1 (cost=0.00..444.95 rows=9995 width=4)
(4 rows)
If we want the costs to reflect reality again here then we'd have
reduce 1133.95 by something like 147.32% (the performance difference).
That would bring the cost down to 769.72, which is way more than we
have to play with than the 25 that the cpu_operator_cost * tuples
gives us.
If we reduced the 25 by 30% in this case, we'd get 17.5 and the total
cost would become 1126.45. That's not great considering the actual
performance indicates that 769.72 would be a better number.
If we look at John's result for test 1: He saw 588 tps on master and
998 on v8. 1133.95 / 998.0 * 588.0 = 668.09, so we'd need even more
to get close to reality on that machine.
My thoughts are that the small surcharge added at the end of
cost_tuplesort() is just not enough for us to play with. I think to
get us closer to fixing this correctly would require a redesign of the
tuplesort costing entirely. I think that would be about an order of
magnitude more effort than this patch was, so I really feel like I
don't want to do this.
I kinda feel that since the comparison_cost is always just 2.0 *
cpu_operator_cost regardless of the number of columns in the sort,
then if we add too many new smarts to try and properly adjust for this
new optimization, unless we do a completly new cost modal for this,
then we might as well be putting lipstick on a pig.
It sounds like James mostly just mentioned the sorting just to ensure
it was properly considered and does not really feel strongly that it
needs to be adjusted. Does anyone else feel that we should be
adjusting it?
Thanks for doing the math measuring how much we could impact things.
I'm +lots on getting this committed as is.
Thanks all for your work on the improvement!
James
On Wed, 21 Jul 2021 at 13:39, James Coleman <jtc331@gmail.com> wrote:
Thanks for doing the math measuring how much we could impact things.
I'm +lots on getting this committed as is.
Ok good. I plan on taking a final look at the v10 patch tomorrow
morning NZ time (about 12 hours from now) and if all is well, I'll
push it.
If anyone feels differently, please let me know before then.
David
From: David Rowley <dgrowleyml@gmail.com>
On Wed, 21 Jul 2021 at 13:39, James Coleman <jtc331@gmail.com> wrote:
Thanks for doing the math measuring how much we could impact things.
I'm +lots on getting this committed as is.
Ok good. I plan on taking a final look at the v10 patch tomorrow morning NZ
time (about 12 hours from now) and if all is well, I'll push it.
If anyone feels differently, please let me know before then.
Hi,
I noticed a minor thing about the v10 patch.
-
- for (;;)
+ if (node->datumSort)
{
- slot = ExecProcNode(outerNode);
-
- if (TupIsNull(slot))
- break;
-
- tuplesort_puttupleslot(tuplesortstate, slot);
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ slot_getsomeattrs(slot, 1);
+ tuplesort_putdatum(tuplesortstate,
+ slot->tts_values[0],
+ slot->tts_isnull[0]);
+ }
+ }
+ else
+ {
+ for (;;)
+ {
+ slot = ExecProcNode(outerNode);
+
+ if (TupIsNull(slot))
+ break;
+ tuplesort_puttupleslot(tuplesortstate, slot);
+ }
The above could perhaps be made shorter, like the following?
for (;;)
{
slot = ExecProcNode(outerNode);
if (TupIsNull(slot))
break;
if (node->datumSort)
{
slot_getsomeattrs(slot, 1);
tuplesort_putdatum(tuplesortstate,
slot->tts_values[0],
slot->tts_isnull[0]);
}
else
tuplesort_puttupleslot(tuplesortstate, slot);
}
Best regards,
houzj
On Thu, 22 Jul 2021 at 12:27, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
The above seems can be shorter like the following ?
for (;;)
{
slot = ExecProcNode(outerNode);
if (TupIsNull(slot))
break;
if (node->datumSort)
{
slot_getsomeattrs(slot, 1);
tuplesort_putdatum(tuplesortstate,
slot->tts_values[0],
slot->tts_isnull[0]);
}
else
tuplesort_puttupleslot(tuplesortstate, slot);
}
I don't think that's a good change. It puts the branch inside the
loop that pulls all tuples from the subplan. Given the loop is likely
to be very hot combined with the fact that it's so simple, I'd much
rather have two separate loops to keep the extra branch outside the
loop. It's true the branch predictor is likely to get the prediction
correct on each iteration, but unless the compiler rewrites this into
two loops, the comparison and jump must be done on every iteration.
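To illustrate the shape being discussed (a toy sketch, not PostgreSQL code;
process_datum and process_tuple are just stand-ins for the two tuplesort put routines):

/* branch re-tested on every iteration of the hot loop */
for (i = 0; i < n; i++)
{
	if (use_datum)			/* loop-invariant condition */
		process_datum(items[i]);
	else
		process_tuple(items[i]);
}

/* branch tested once; two tight loops with no per-iteration test */
if (use_datum)
{
	for (i = 0; i < n; i++)
		process_datum(items[i]);
}
else
{
	for (i = 0; i < n; i++)
		process_tuple(items[i]);
}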
David
On July 22, 2021 8:38 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 22 Jul 2021 at 12:27, houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
wrote:
The above seems can be shorter like the following ?
for (;;)
{
slot = ExecProcNode(outerNode);
if (TupIsNull(slot))
break;
if (node->datumSort)
{
slot_getsomeattrs(slot, 1);
tuplesort_putdatum(tuplesortstate,
slot->tts_values[0],
slot->tts_isnull[0]);
}
else
tuplesort_puttupleslot(tuplesortstate, slot);
}
I don't think that's a good change. It puts the branch inside the loop the pulls
all tuples from the subplan. Given the loop is likely to be very hot combined
with the fact that it's so simple, I'd much rather have two separate loops to
keep the extra branch outside the loop. It's true the branch predictor is likely
to get the prediction correct on each iteration, but unless the compiler
rewrites this into two loops then the comparison and jump must be done per
loop.
Ah, you are right, I missed that. Thanks for the explanation.
Best regards,
houzj
On Wed, 21 Jul 2021 at 22:09, David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 21 Jul 2021 at 13:39, James Coleman <jtc331@gmail.com> wrote:
Thanks for doing the math measuring how much we could impact things.
I'm +lots on getting this committed as is.
Ok good. I plan on taking a final look at the v10 patch tomorrow
morning NZ time (about 12 hours from now) and if all is well, I'll
push it.
Pushed.
David