Add proper planner support for ORDER BY / DISTINCT aggregates
A few years ago I wrote a patch to implement the missing aggregate
combine functions for array_agg and string_agg [1]. In the end, the
patch was rejected due to some concern [2] that if we allow these
aggregates to run in parallel then it might mess up the order in which
values are being aggregated by some unsuspecting user who didn't
include an ORDER BY clause in the aggregate. It was mentioned at the
time that, because nodeAgg.c performs so terribly with ORDER BY
aggregates, we couldn't possibly ask users who were upset by this
change to include an ORDER BY in their aggregate functions.
I'd still quite like to finish off the remaining aggregate functions
that don't have a combine function, so to get that going again
I've written some code that gets the query planner onboard with
picking a plan that allows nodeAgg.c to skip doing any internal sorts
when the input results are already sorted according to what's required
by the aggregate function.
Of course, the query could have many aggregates all with differing
ORDER BY clauses. Since we reuse the same input for each, it might not
be possible to feed each aggregate with suitably sorted input. When
the input is not sorted, nodeAgg.c still must perform the sort as it
does today. So unfortunately we can't remove the nodeAgg.c code for
that.
The best I could come up with is just to pick a sort order that suits
the first ORDER BY aggregate, then spin through the remaining ones to
see if any has a stricter order requirement that would suit both it
and the first one. The planner already does something similar for
window functions, although it's not comprehensive enough to look
beyond the first window for windows with a stricter sort requirement.
This allows us to give presorted input to both aggregates in the following case:
SELECT agg(a ORDER BY a),agg2(a ORDER BY a,b) ...
but just the first agg in this one:
SELECT agg(a ORDER BY a),agg2(a ORDER BY c) ...
In order to make DISTINCT work, I had to add a new expression
evaluation operator to allow filtering when the current value is the
same as the last value. The current unpatched code implements
DISTINCT while reading back the sorted tuplestore data. Since I don't
have a tuplestore in the pre-sorted version, I needed to cache the
last Datum, or last tuple, somewhere. I ended up putting that in the
AggStatePerTransData struct. I'm not quite sure I like that, but I
don't really see where else it could go.
When testing the performance of all this, I found that when a suitable
index exists to provide pre-sorted input for the aggregation,
performance does improve. Unfortunately, it looks like things get more
complex when no index exists. In this case, since we're setting
pathkeys to tell the planner we need a plan that provides pre-sorted
input to the aggregates, the planner will add a sort below the
aggregate node. I initially didn't see any problem with that as it
just moves the sort to a Sort node rather than having it done
implicitly inside nodeAgg.c. The problem is, it just does not perform
as well. I guess this is because when the sort is done inside
nodeAgg.c, the transition function is called in a tight loop while
reading records back from the tuplestore. In the patched version,
there's an additional node transition in between nodeAgg and nodeSort
and that causes slower performance. For now, I'm not quite sure what
to do about that. We set the plan pathkeys well before we could
possibly decide if asking for pre-sorted input for the aggregates
would be a good idea or not.
Please find attached my WIP patch. It's WIP due to what I mentioned
in the above paragraph and also because I've not bothered to add JIT
support for the new expression evaluation steps.
I'll add this to the July commitfest.
David
[1]: /messages/by-id/CAKJS1f9sx_6GTcvd6TMuZnNtCh0VhBzhX6FZqw17TgVFH-ga_A@mail.gmail.com
[2]: /messages/by-id/6538.1522096067@sss.pgh.pa.us
Attachments:
wip_planner_support_for_orderby_distinct_aggs_v0.patch
This allows us to give presorted input to both aggregates in the following case:
SELECT agg(a ORDER BY a),agg2(a ORDER BY a,b) ...
but just the first agg in this one:
SELECT agg(a ORDER BY a),agg2(a ORDER BY c) ...
I don't know if it's acceptable, but in the case where you have both an
aggregate with an ORDER BY clause and another aggregate without the clause,
the output for the unordered one will change and use the same ordering, maybe
surprising the unsuspecting user. Would that be acceptable?
When testing the performance of all this I found that when a suitable
index exists to provide pre-sorted input for the aggregation that the
performance does improve. Unfortunately, it looks like things get more
complex when no index exists. In this case, since we're setting
pathkeys to tell the planner we need a plan that provides pre-sorted
input to the aggregates, the planner will add a sort below the
aggregate node. I initially didn't see any problem with that as it
just moves the sort to a Sort node rather than having it done
implicitly inside nodeAgg.c. The problem is, it just does not perform
as well. I guess this is because when the sort is done inside
nodeAgg.c that the transition function is called in a tight loop while
reading records back from the tuplestore. In the patched version,
there's an additional node transition in between nodeAgg and nodeSort
and that causes slower performance. For now, I'm not quite sure what
to do about that. We set the plan pathkeys well before we could
possibly decide if asking for pre-sorted input for the aggregates
would be a good idea or not.
I was curious about the performance implication of that additional transition,
and could not reproduce a significant difference. I may be doing something
wrong: how did you highlight it?
Regards,
--
Ronan Dunklau
On Fri, 2 Jul 2021 at 19:54, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
I don't know if it's acceptable, but in the case where you have both an
aggregate with an ORDER BY clause and another aggregate without the clause,
the output for the unordered one will change and use the same ordering, maybe
surprising the unsuspecting user. Would that be acceptable?
That's a good question. There was an argument in [1] that mentions
that there might be a group of people who rely on aggregation being
done in a certain order where they're not specifying an ORDER BY
clause in the aggregate. If that group of people exists, then it's
possible they might be upset in the scenario that you describe.
I also think that it's going to be pretty hard to make significant
gains in this area if we are too scared to make changes to undefined
behaviour. You wouldn't have to look too hard in the pgsql-general
mailing list to find someone complaining that their query output is in
the wrong order on some query that does not have an ORDER BY. We
pretty much always tell those people that the order is undefined
without an ORDER BY. I'm not too sure why Tom in [1] classes the ORDER
BY aggregate people any differently. We'll be stuck forever here and
in many other areas if we're too scared to change the order of
aggregation. You could argue that something like parallelism has
changed that for people already. I think the multi-batch Hash
Aggregate code could also change this.
I was curious about the performance implication of that additional transition,
and could not reproduce a signifcant difference. I may be doing something
wrong: how did you highlight it ?
It was pretty basic. I just created a table with two columns and no
index and did something like SELECT a,SUM(b ORDER BY b) from ab GROUP
BY 1; the new code will include a Sort due to lack of any index and
the old code would have done a sort inside nodeAgg.c. I imagine that
the overhead comes from the fact that in the patched version nodeAgg.c
must ask its subnode (nodeSort.c) for the next tuple each time,
whereas unpatched nodeAgg.c already has all the tuples in a tuplestore
and can fetch them very quickly in a tight loop.
David
On 2/07/21 8:39 pm, David Rowley wrote:
On Fri, 2 Jul 2021 at 19:54, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
I don't know if it's acceptable, but in the case where you have both an
aggregate with an ORDER BY clause and another aggregate without the clause,
the output for the unordered one will change and use the same ordering, maybe
surprising the unsuspecting user. Would that be acceptable?
That's a good question. There was an argument in [1] that mentions
that there might be a group of people who rely on aggregation being
done in a certain order where they're not specifying an ORDER BY
clause in the aggregate. If that group of people exists, then it's
possible they might be upset in the scenario that you describe.
[...]
I've always worked on the assumption that if I do not specify an ORDER
BY clause then the rdbms is expected to present rows in the order most
efficient for it to do so. If pg orders rows when not requested then
this could add extra elapsed time to the query, which might be
significant for large queries.
I don't know of any rdbms that guarantees the order of returned rows
when an ORDER BY clause is not used.
So I think that pg has no obligation to protect the sensibilities of
naive users in this case, especially at the expense of users that want
queries to complete as quickly as possible. IMHO
Cheers,
Gavin
Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
On 2/07/21 8:39 pm, David Rowley wrote:
That's a good question. There was an argument in [1] that mentions
that there might be a group of people who rely on aggregation being
done in a certain order where they're not specifying an ORDER BY
clause in the aggregate. If that group of people exists, then it's
possible they might be upset in the scenario that you describe.
So I think that pg has no obligation to protect the sensibilities of
naive users in this case, especially at the expense of users that want
queries to complete as quickly as possible. IMHO
I agree. We've broken such expectations in the past and I don't
have much hesitation about breaking them again.
regards, tom lane
Le vendredi 2 juillet 2021, 10:39:44 CEST David Rowley a écrit :
On Fri, 2 Jul 2021 at 19:54, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
I don't know if it's acceptable, but in the case where you have both an
aggregate with an ORDER BY clause and another aggregate without the
clause, the output for the unordered one will change and use the same
ordering, maybe surprising the unsuspecting user. Would that be acceptable?
That's a good question. There was an argument in [1] that mentions
that there might be a group of people who rely on aggregation being
done in a certain order where they're not specifying an ORDER BY
clause in the aggregate. If that group of people exists, then it's
possible they might be upset in the scenario that you describe.
I also think that it's going to be pretty hard to make significant
gains in this area if we are too scared to make changes to undefined
behaviour. You wouldn't have to look too hard in the pgsql-general
mailing list to find someone complaining that their query output is in
the wrong order on some query that does not have an ORDER BY. We
pretty much always tell those people that the order is undefined
without an ORDER BY. I'm not too sure why Tom in [1] classes the ORDER
BY aggregate people any differently. We'll be stuck forever here and
in many other areas if we're too scared to change the order of
aggregation. You could argue that something like parallelism has
changed that for people already. I think the multi-batch Hash
Aggregate code could also change this.
I would agree with that.
I was curious about the performance implication of that additional
transition, and could not reproduce a significant difference. I may be
doing something wrong: how did you highlight it?
It was pretty basic. I just created a table with two columns and no
index and did something like SELECT a,SUM(b ORDER BY b) from ab GROUP
BY 1; the new code will include a Sort due to lack of any index and
the old code would have done a sort inside nodeAgg.c. I imagine that
the overhead comes from the fact that in the patched version nodeAgg.c
must ask its subnode (nodeSort.c) for the next tuple each time,
whereas unpatched nodeAgg.c already has all the tuples in a tuplestore
and can fetch them very quickly in a tight loop.
Ok, I reproduced that case, just not with a GROUP BY: when adding the GROUP BY,
a sort node is added in both cases (master and your patch), except that with
your patch we sort on both keys and that doesn't really incur a performance
penalty.
I think the overhead occurs because in the ExecAgg case, we use the
tuplesort_*_datum API as an optimization when we have a single column as an
input, which the ExecSort code doesn't. Maybe it would be worth it to try to
use that API in sort nodes too, when it can be done.
David
--
Ronan Dunklau
Ok, I reproduced that case, just not with a GROUP BY: when adding the GROUP
BY, a sort node is added in both cases (master and your patch), except that
with your patch we sort on both keys and that doesn't really incur a
performance penalty.
I think the overhead occurs because in the ExecAgg case, we use the
tuplesort_*_datum API as an optimization when we have a single column as an
input, which the ExecSort code doesn't. Maybe it would be worth it to try to
use that API in sort nodes too, when it can be done.
Please find attached a POC patch to do just that.
The switch to the single-datum tuplesort is done when there is only one
attribute, it is byval (to avoid having to deal with copy of the references
everywhere) and we are not in bound mode (to also avoid having to move things
around).
A naive run of make check passes with this, but I may have overlooked things.
Should I add this separately to the commitfest?
For the record, here are the times I got on my laptop, on master vs David's
patch vs both. Values are an average of 100 runs, as reported by pgbench
--no-vacuum -f <file.sql> -t 100. There is a good amount of noise, but the
simple "select one ordered column" case seems worth the optimization.
Only shared_buffers and work_mem have been set to 2GB each.
Setup 1: single table, 1 000 000 tuples, no index
CREATE TABLE tbench (
a int,
b int
);
INSERT INTO tbench (a, b) SELECT a, b FROM generate_series(1, 100) a,
generate_series(1, 10000) b;
Test 1: Single-column ordered select (order by b since the table is already
sorted by a)
select b from tbench order by b;
master: 303.661ms
with mine: 148.571ms
Test 2: Ordered sum (using b so that the input is not presorted)
select sum(b order by b) from tbench;
master: 112.379ms
with david's patch: 144.469ms
with david's patch + mine: 97ms
Test 3: Ordered sum + group by
select b, sum(a order by a) from tbench GROUP BY b;
master: 316.117ms
with david's patch: 297.079 ms
with david's patch + mine: 294.601 ms
Setup 2: same as before, but adding an index on (b, a)
CREATE INDEX ON tbench (b, a);
Test 2: Ordered sum:
select sum(a order by a) from tbench;
master: 111.847 ms
with david's patch: 48.088 ms
with david's patch + mine: 47.678 ms
Test 3: Ordered sum + group by:
select a, sum(b order by b) from tbench GROUP BY a;
master: 76.873 ms
with david's patch: 61.105 ms
with david's patch + mine: 62.672 ms
--
Ronan Dunklau
Attachments:
0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patch
Please find attached a POC patch to do just that.
The switch to the single-datum tuplesort is done when there is only one
attribute, it is byval (to avoid having to deal with copy of the
references
everywhere) and we are not in bound mode (to also avoid having to move
things
around).
Hi, nice results!
I have a few suggestions and questions to your patch:
1. Why did you move the declaration of the variable *plannode?
I think this is unnecessary; it extends the scope.
2. Why do you declare a new variable TupleDesc out_tuple_desc in
ExecInitSort?
I think this is unnecessary too, but maybe I missed something.
3. I inverted the order of the checks on this line; I think "!node->bounded"
is cheaper than TupleDescAttr(tupDesc, 0)->attbyval.
4. Once the order of execution is changed, this test is unlikely to pass,
so adding unlikely() helps the branch.
5. I think you added a loop invariant inside the loop
("if (node->is_single_val)")?
Would two separate loops not be better?
For your convenience, I attached a v2 version (with style changes); please
take a look, and can you repeat your tests?
regards,
Ranier Vilela
Attachments:
v2-0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patch
Le lundi 5 juillet 2021, 16:51:59 CEST Ranier Vilela a écrit :
Please find attached a POC patch to do just that.
The switch to the single-datum tuplesort is done when there is only one
attribute, it is byval (to avoid having to deal with copy of the references
everywhere) and we are not in bound mode (to also avoid having to move
things
around).
Hi, nice results!
I have a few suggestions and questions to your patch:
Thank you for those !
1. Why did you move the declaration of the variable *plannode?
I think this is unnecessary; it extends the scope.
Sorry, I should have cleaned it up before sending.
2. Why do you declare a new variable TupleDesc out_tuple_desc in
ExecInitSort?
I think this is unnecessary too, but maybe I missed something.
Same as the above, thanks for the two.
3. I inverted the order of the checks on this line; I think "!node->bounded"
is cheaper than TupleDescAttr(tupDesc, 0)->attbyval.
I'm not sure it matters since it's done once per sort, but OK.
4. Once the order of execution is changed, this test is unlikely to pass,
so adding unlikely() helps the branch.
Ok.
5. I think you added a loop invariant inside the loop
("if (node->is_single_val)")?
Would two separate loops not be better?
Ok for me.
For your convenience, I attached a v2 version (with style changes); please
take a look, and can you repeat your tests?
Tested it quickly, and did not see any change performance-wise that cannot
be attributed to noise on my laptop, but it's fine.
Thank you for the fixes !
regards,
Ranier Vilela
--
Ronan Dunklau
Em seg., 5 de jul. de 2021 às 12:07, Ronan Dunklau <ronan.dunklau@aiven.io>
escreveu:
Le lundi 5 juillet 2021, 16:51:59 CEST Ranier Vilela a écrit :
Please find attached a POC patch to do just that.
The switch to the single-datum tuplesort is done when there is only one
attribute, it is byval (to avoid having to deal with copy of the references
everywhere) and we are not in bound mode (to also avoid having to move
things
around).
Hi, nice results!
I have a few suggestions and questions to your patch:
Thank you for those !
1. Why did you move the declaration of the variable *plannode?
I think this is unnecessary; it extends the scope.
Sorry, I should have cleaned it up before sending.
2. Why do you declare a new variable TupleDesc out_tuple_desc in
ExecInitSort?
I think this is unnecessary too, but maybe I missed something.
Same as the above, thanks for the two.
3. I inverted the order of the checks on this line; I think "!node->bounded"
is cheaper than TupleDescAttr(tupDesc, 0)->attbyval.
I'm not sure it matters since it's done once per sort, but OK.
4. Once the order of execution is changed, this test is unlikely to pass,
so adding unlikely() helps the branch.
Ok.
5. I think you added a loop invariant inside the loop
("if (node->is_single_val)")?
Would two separate loops not be better?
Ok for me.
For your convenience, I attached a v2 version (with style changes); please
take a look, and can you repeat your tests?
Tested it quickly, and did not see any change performance-wise that cannot
be attributed to noise on my laptop, but it's fine.
Thanks for testing again.
Thank you for the fixes !
You are welcome.
regards,
Ranier Vilela
On Sat, Jun 12, 2021 at 11:07 AM David Rowley <dgrowleyml@gmail.com> wrote:
A few years ago I wrote a patch to implement the missing aggregate
combine functions for array_agg and string_agg [1]. In the end, the
patch was rejected due to some concern [2] that if we allow these
aggregates to run in parallel then it might mess up the order in which
values are being aggregated by some unsuspecting user who didn't
include an ORDER BY clause in the aggregate. It was mentioned at the
time that due to nodeAgg.c performing so terribly with ORDER BY
aggregates that we couldn't possibly ask users who were upset by this
change to include an ORDER BY in their aggregate functions.
I'd still quite like to finish off the remaining aggregate functions
that still don't have a combine function, so to get that going again
I've written some code that gets the query planner onboard with
picking a plan that allows nodeAgg.c to skip doing any internal sorts
when the input results are already sorted according to what's required
by the aggregate function.
Of course, the query could have many aggregates all with differing
ORDER BY clauses. Since we reuse the same input for each, it might not
be possible to feed each aggregate with suitably sorted input. When
the input is not sorted, nodeAgg.c still must perform the sort as it
does today. So unfortunately we can't remove the nodeAgg.c code for
that.
The best I could come up with is just to pick a sort order that suits
the first ORDER BY aggregate, then spin through the remaining ones to
see if any has a stricter order requirement that would suit both it
and the first one. The planner already does something similar for
window functions, although it's not comprehensive enough to look
beyond the first window for windows with a stricter sort requirement.
I think this is a reasonable choice. I could imagine a more complex
method, say, counting the number of aggregates benefiting from a given
sort, and choosing the one that benefits the most (and this could be
further complicated by ranking based on "cost" -- not costing in the
normal sense since we don't have that at this point), but I think it'd
take a lot of convincing that that was valuable.
This allows us to give presorted input to both aggregates in the following case:
SELECT agg(a ORDER BY a),agg2(a ORDER BY a,b) ...
but just the first agg in this one:
SELECT agg(a ORDER BY a),agg2(a ORDER BY c) ...
In order to make DISTINCT work, I had to add a new expression
evaluation operator to allow filtering if the current value is the
same as the last value. The current unpatched code implements
distinct when reading back the sorted tuplestore data. Since I don't
have a tuplestore with the pre-sorted version, I needed to cache the
last Datum, or last tuple somewhere. I ended up putting that in the
AggStatePerTransData struct. I'm not quite sure if I like that, but I
don't really see where else it could go.
That sounds like what nodeIncrementalSort.c's isCurrentGroup() does,
except it's implemented inline there. Not anything you need to change
in this patch, but noting it in case it triggers a useful thought for
you or for me later on.
When testing the performance of all this I found that when a suitable
index exists to provide pre-sorted input for the aggregation that the
performance does improve. Unfortunately, it looks like things get more
complex when no index exists. In this case, since we're setting
pathkeys to tell the planner we need a plan that provides pre-sorted
input to the aggregates, the planner will add a sort below the
aggregate node. I initially didn't see any problem with that as it
just moves the sort to a Sort node rather than having it done
implicitly inside nodeAgg.c. The problem is, it just does not perform
as well. I guess this is because when the sort is done inside
nodeAgg.c that the transition function is called in a tight loop while
reading records back from the tuplestore. In the patched version,
there's an additional node transition in between nodeAgg and nodeSort
and that causes slower performance. For now, I'm not quite sure what
to do about that. We set the plan pathkeys well before we could
possibly decide if asking for pre-sorted input for the aggregates
would be a good idea or not.
This opens up another path for significant plan benefits too: if
there's now an explicit sort node, then it's possible for that node to
be an incremental sort node, which isn't something nodeAgg is capable
of utilizing currently.
Please find attached my WIP patch. It's WIP due to what I mentioned
in the above paragraph and also because I've not bothered to add JIT
support for the new expression evaluation steps.
I looked this over (though didn't get a chance to play with it).
I'm wondering about the changes to the test output in tuplesort.out.
It looks like where a merge join used to be proceeding with an
explicit sort with DESC it's now sorting with (implicit) ASC, and then
an explicit sort node using DESC is above the merge join node. Two
thoughts:
1. It seems like losing the "proper" sort order in the JOIN node isn't
preferable. Given both are full sorts, it may not actually be a
significant performance difference on its own (it can't short
circuit), but it means precluding using incremental sort in the new
sort node.
2. In an ideal world we'd push the more complex sort down into the
merge join rather than doing a sort on "col1" and then a sort on
"col1, col2". That's likely beyond this patch, but I haven't
investigated at all.
Thanks for your work on this; both this patch and the original one
you're trying to enable seem like great additions.
James
On Mon, Jul 5, 2021 at 8:08 AM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
Ok, I reproduced that case, just not with a GROUP BY: when adding the GROUP
BY, a sort node is added in both cases (master and your patch), except that
with your patch we sort on both keys and that doesn't really incur a
performance penalty.
I think the overhead occurs because in the ExecAgg case, we use the
tuplesort_*_datum API as an optimization when we have a single column as an
input, which the ExecSort code doesn't. Maybe it would be worth it to try to
use that API in sort nodes too, when it can be done.
Please find attached a POC patch to do just that.
The switch to the single-datum tuplesort is done when there is only one
attribute, it is byval (to avoid having to deal with copy of the references
everywhere) and we are not in bound mode (to also avoid having to move things
around).A naive run on make check pass on this, but I may have overlooked things.
Should I add this separately to the commitfest ?
It seems like a pretty obvious win on its own, and, I'd expect, will
need less discussion than David's patch, so my vote is to make it a
separate thread. The patch tester wants the full series attached each
time, and even without that it's difficult to discuss multiple patches
on a single thread.
If you make a separate thread and CF entry, please CC me and add me as
a reviewer on the CF entry.
One thing from a quick read through of the patch: your changes near
the end of ExecSort, in ExecInitSort, and in execnodes.h need
indentation fixes.
Thanks,
James
If you make a separate thread and CF entry, please CC me and add me as
a reviewer on the CF entry.
Ok, I started a new thread and added it to the next CF:
https://commitfest.postgresql.org/34/3237/
--
Ronan Dunklau
On Mon, 5 Jul 2021 at 18:38, Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
I think the overhead occurs because in the ExecAgg case, we use the
tuplesort_*_datum API as an optimization when we have a single column as an
input, which the ExecSort code doesn't. Maybe it would be worth it to try to
use that API in sort nodes too, when it can be done.
That's a really great find! Looks like I was wrong to assume that the
extra overhead was from transitioning between nodes.
I ran the performance results locally here with:
create table t1(a int not null);
create table t2(a int not null, b int not null);
create table t3(a int not null, b int not null, c int not null);
insert into t1 select x from generate_Series(1,1000000)x;
insert into t2 select x,x from generate_Series(1,1000000)x;
insert into t3 select x,x,1 from generate_Series(1,1000000)x;
vacuum freeze analyze t1,t2,t3;
select1: select sum(a order by a) from t1;
select2: select sum(a order by b) from t2;
select3: select c,sum(a order by b) from t3 group by c;
master = https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=8aafb02616753f5c6c90bbc567636b73c0cbb9d4
patch1 = /messages/by-id/attachment/123546/wip_planner_support_for_orderby_distinct_aggs_v0.patch
patch2 = /messages/by-id/attachment/124238/0001-Allow-Sort-nodes-to-use-the-fast-single-datum-tuples.patch
The attached graph shows the results.
It's very good to see that with both patches applied there's no
regression. I'm a bit surprised there's much performance gain here
given that I didn't add any indexes to provide any presorted input.
David
Attachments:
On Sun, 13 Jun 2021 at 03:07, David Rowley <dgrowleyml@gmail.com> wrote:
Please find attached my WIP patch. It's WIP due to what I mentioned
in the above paragraph and also because I've not bothered to add JIT
support for the new expression evaluation steps.
I've split this patch into two parts.
0001 Adds planner support for ORDER BY aggregates.
0002 is a WIP patch for DISTINCT support. This still lacks JIT
support, and I'm still not certain of the best place to store the
previous value or tuple to determine whether the current one is
distinct from it.
The 0001 patch is fairly simple and does not require much in the way
of changes in the planner aside from standard_qp_callback().
Surprisingly the executor does not need a great deal of work here
either. It's just mostly about skipping the normal agg(.. ORDER BY)
code when the Aggref is presorted.
David
Attachments:
v2-0001-Add-planner-support-for-ORDER-BY-aggregates.patch
v2-0002-WIP-Add-planner-support-for-DISTINCT-aggregates.patch
Em seg., 12 de jul. de 2021 às 09:04, David Rowley <dgrowleyml@gmail.com>
escreveu:
On Sun, 13 Jun 2021 at 03:07, David Rowley <dgrowleyml@gmail.com> wrote:
Please find attached my WIP patch. It's WIP due to what I mentioned
in the above paragraph and also because I've not bothered to add JIT
support for the new expression evaluation steps.
I've split this patch into two parts.
Hi, I'll take a look.
0001 Adds planner support for ORDER BY aggregates.
/* Normal transition function without ORDER BY / DISTINCT. */
Is it possible to avoid entering the args initialization if 'argno >=
pertrans->numTransInputs'?
Like this:
if (!pertrans->aggsortrequired && argno < pertrans->numTransInputs)
And what if argno is '>' than pertrans->numTransInputs?
Shouldn't the test be inside the loop?
+ /*
+ * Don't initialize args for any ORDER BY clause that might
+ * exist in a presorted aggregate.
+ */
+ if (argno >= pertrans->numTransInputs)
+ break;
I think we can either reduce the scope of the variable 'sortlist' or
simply remove it:
a)
+ /* Determine pathkeys for aggregate functions with an ORDER BY */
+ if (parse->groupingSets == NIL && root->numOrderedAggs > 0 &&
+ (qp_extra->groupClause == NIL || root->group_pathkeys))
+ {
+ ListCell *lc;
+ List *pathkeys = NIL;
+
+ foreach(lc, root->agginfos)
+ {
+ AggInfo *agginfo = (AggInfo *) lfirst(lc);
+ Aggref *aggref = agginfo->representative_aggref;
+ List *sortlist;
+
or better
b)
+ /* Determine pathkeys for aggregate functions with an ORDER BY */
+ if (parse->groupingSets == NIL && root->numOrderedAggs > 0 &&
+ (qp_extra->groupClause == NIL || root->group_pathkeys))
+ {
+ ListCell *lc;
+ List *pathkeys = NIL;
+
+ foreach(lc, root->agginfos)
+ {
+ AggInfo *agginfo = (AggInfo *) lfirst(lc);
+ Aggref *aggref = agginfo->representative_aggref;
+
+ if (AGGKIND_IS_ORDERED_SET(aggref->aggkind))
+ continue;
+
+ /* DISTINCT aggregates not yet supported by the planner */
+ if (aggref->aggdistinct != NIL)
+ continue;
+
+ if (aggref->aggorder == NIL)
+ continue;
+
+ /*
+ * Find the pathkeys with the most sorted derivative of the first
+ * Aggref. For example, if we determine the pathkeys for the first
+ * Aggref to be {a}, and we find another with {a,b}, then we use
+ * {a,b} since it's useful for more Aggrefs than just {a}. We
+ * currently ignore anything that might have a longer list of
+ * pathkeys than the first Aggref if it is not contained in the
+ * pathkeys for the first agg. We can't practically plan for all
+ * orders of each Aggref, so this seems like the best compromise.
+ */
+ if (pathkeys == NIL)
+ {
+ pathkeys = make_pathkeys_for_sortclauses(root, aggref->aggorder,
+ aggref->args);
+ aggref->aggpresorted = true;
+ }
+ else
+ {
+ List *pathkeys2 = make_pathkeys_for_sortclauses(root,
+ aggref->aggorder,
+ aggref->args);
0002 is a WIP patch for DISTINCT support. This still lacks JIT
support and I'm still not certain of the best place to store the
previous value or tuple to determine if the current one is distinct
from it.
In the 0002 patch, I think that you can reduce the scope of the variable 'aggstate'?
+ EEO_CASE(EEOP_AGG_PRESORTED_DISTINCT_SINGLE)
+ {
+ AggStatePerTrans pertrans = op->d.agg_presorted_distinctcheck.pertrans;
+ Datum value = pertrans->transfn_fcinfo->args[1].value;
+ bool isnull = pertrans->transfn_fcinfo->args[1].isnull;
+
+ if (!pertrans->haslast ||
+ pertrans->lastisnull != isnull ||
+ !DatumGetBool(FunctionCall2Coll(&pertrans->equalfnOne,
+ pertrans->aggCollation,
+ pertrans->lastdatum, value)))
+ {
+ if (pertrans->haslast && !pertrans->inputtypeByVal)
+ pfree(DatumGetPointer(pertrans->lastdatum));
+
+ pertrans->haslast = true;
+ if (!isnull)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+
+ /*
+ * XXX is it worth having a dedicated ByVal version of this
+ * operation so that we can skip switching memory contexts
+ * and do a simple assign rather than datumCopy below?
+ */
+ MemoryContext oldContext;
+
+ oldContext =
+ MemoryContextSwitchTo(aggstate->curaggcontext->ecxt_per_tuple_memory);
What do you think?
regards,
Ranier Vilela
Thanks for having a look at this.
On Tue, 13 Jul 2021 at 11:04, Ranier Vilela <ranier.vf@gmail.com> wrote:
0001 Adds planner support for ORDER BY aggregates.
/* Normal transition function without ORDER BY / DISTINCT. */
Is it possible to avoid entering the loop to initialize args if 'argno >= pertrans->numTransInputs'?
Like this:
if (!pertrans->aggsortrequired && argno < pertrans->numTransInputs)
And what if argno is '>' than pertrans->numTransInputs?
Shouldn't the test be inside the loop?
+ /*
+ * Don't initialize args for any ORDER BY clause that might
+ * exist in a presorted aggregate.
+ */
+ if (argno >= pertrans->numTransInputs)
+ break;
The idea is to stop the loop before processing any Aggref arguments
that might belong to the ORDER BY clause. We must still process other
arguments up to the ORDER BY args though, so we can't skip this loop.
Note that we're doing argno++ inside the loop. If we had a
for_each_to() macro, similar to for_each_from() but allowing us to
specify an end element, then we could use that instead; but we don't,
and we still must initialize the transition arguments.
I think that you can reduce the scope of the variable 'sortlist', or simply remove it?
I've adjusted the scope of this. I didn't want to remove it because
it's kinda useful to have it that way otherwise the 0002 patch would
need to add it.
0002 is a WIP patch for DISTINCT support. This still lacks JIT
support and I'm still not certain of the best place to store the
previous value or tuple to determine if the current one is distinct
from it.
In the 0002 patch, I think that you can reduce the scope of the variable 'aggstate'?
+ EEO_CASE(EEOP_AGG_PRESORTED_DISTINCT_SINGLE)
Yeah, that could be done.
I've attached the updated patches.
David
Attachments:
v3-0001-Add-planner-support-for-ORDER-BY-aggregates.patch (application/octet-stream, +115 -13)
v3-0002-WIP-Add-planner-support-for-DISTINCT-aggregates.patch (application/octet-stream, +176 -39)
On Tue, 13 Jul 2021 at 01:44, David Rowley <dgrowleyml@gmail.com>
wrote:
Thanks for having a look at this.
On Tue, 13 Jul 2021 at 11:04, Ranier Vilela <ranier.vf@gmail.com> wrote:
0001 Adds planner support for ORDER BY aggregates.
/* Normal transition function without ORDER BY / DISTINCT. */
Is it possible to avoid entering the loop to initialize args if 'argno >= pertrans->numTransInputs'?
Like this:
if (!pertrans->aggsortrequired && argno < pertrans->numTransInputs)
And what if argno is '>' than pertrans->numTransInputs?
Shouldn't the test be inside the loop?
+ /*
+ * Don't initialize args for any ORDER BY clause that might
+ * exist in a presorted aggregate.
+ */
+ if (argno >= pertrans->numTransInputs)
+ break;
The idea is to stop the loop before processing any Aggref arguments
that might belong to the ORDER BY clause.
Yes, I get the idea.
We must still process other
arguments up to the ORDER BY args though,
I may have misunderstood, but aren't the other arguments within
pertrans->numTransInputs?
so we can't skip this loop.
The unanswered question is whether *argno* can be '>='
pertrans->numTransInputs before entering the loop?
If it *can*, the loop might be useless in that case.
Note that we're doing argno++ inside the loop.
Another question: if *argno* can be '>' than pertrans->numTransInputs
before the loop, will this test fail?
if (argno == pertrans->numTransInputs)
I think that you can reduce the scope of the variable 'sortlist', or
simply remove it?
I've adjusted the scope of this. I didn't want to remove it because
it's kinda useful to have it that way otherwise the 0002 patch would
need to add it.
Nice.
0002 is a WIP patch for DISTINCT support. This still lacks JIT
support and I'm still not certain of the best place to store the
previous value or tuple to determine if the current one is distinct
from it.
In the 0002 patch, I think that you can reduce the scope of the variable
'aggstate'?
+ EEO_CASE(EEOP_AGG_PRESORTED_DISTINCT_SINGLE)
Yeah, that could be done.
I've attached the updated patches.
Thanks.
regards,
Ranier Vilela
On Tue, 13 Jul 2021 at 23:45, Ranier Vilela <ranier.vf@gmail.com> wrote:
The unanswered question is whether *argno* can be '>=' pertrans->numTransInputs
before entering the loop?
If it *can*, the loop might be useless in that case.
Note that we're doing argno++ inside the loop.
Another question: if *argno* can be '>' than pertrans->numTransInputs
before the loop, will this test fail?
if (argno == pertrans->numTransInputs)
argno is *always* 0 before the loop starts.
David
On Tue, 13 Jul 2021 at 22:15, David Rowley <dgrowleyml@gmail.com>
wrote:
On Tue, 13 Jul 2021 at 23:45, Ranier Vilela <ranier.vf@gmail.com> wrote:
The unanswered question is whether *argno* can be '>='
pertrans->numTransInputs before entering the loop?
If it *can*, the loop might be useless in that case.
Note that we're doing argno++ inside the loop.
Another question: if *argno* can be '>' than pertrans->numTransInputs
before the loop, will this test fail?
if (argno == pertrans->numTransInputs)
argno is *always* 0 before the loop starts.
Good. Thanks.
regards,
Ranier Vilela