Is postgres able to share sorts required by common partition window functions?
Hi all,
I'm trying to optimize the following query on postgres 11.6 (running on
Aurora)
select distinct
  c1,
  first_value(c2) OVER (PARTITION BY c1 order by c2) AS c2,
  first_value(c3) OVER (PARTITION BY c1 order by c3) AS c3,
  first_value(c4) OVER (PARTITION BY c1 order by c4) AS c4
from
  t;
From the explain plan (attached at the end of the email) I see that
postgresql is doing several sorts: one per window function plus one for the
distinct, which seems expected.
However, since all the window functions partition on the same column, I
would have expected postgresql to "share" a preliminary sort on c1 that
would then reduce the work for all the window functions, but it doesn't.
I even created an index on c1, hoping that postgresql would use it to
minimize the cost of the sorts, but I couldn't make it use the index.
Is there something I am missing?
You can find below a script to set up a table and data to reproduce, as
well as the explain plan.
*Setup Script*
create table t (
  pk varchar(200) PRIMARY KEY,
  c1 varchar(200),
  c2 varchar(200),
  c3 varchar(200),
  c4 varchar(200)
);

create index i1 on t (c1);

insert into t
  (pk, c1, c2, c3, c4)
select
  generate_series::text AS pk,
  'Grp' || (generate_series / 4)::text AS c1,
  generate_series::text AS c2,
  generate_series::text AS c3,
  generate_series::text AS c4
from generate_series(0, 1000000);
*Explain Plan*
Unique  (cost=808480.87..820980.88 rows=1000001 width=123) (actual time=7131.675..7781.082 rows=250001 loops=1)
  ->  Sort  (cost=808480.87..810980.87 rows=1000001 width=123) (actual time=7131.673..7603.926 rows=1000001 loops=1)
        Sort Key: c1, (first_value(c2) OVER (?)), (first_value(c3) OVER (?)), (first_value(c4) OVER (?))
        Sort Method: external merge  Disk: 59640kB
        ->  WindowAgg  (cost=558937.90..578937.92 rows=1000001 width=123) (actual time=5179.374..6268.937 rows=1000001 loops=1)
              ->  Sort  (cost=558937.90..561437.90 rows=1000001 width=91) (actual time=5179.355..5679.136 rows=1000001 loops=1)
                    Sort Key: c1, c4
                    Sort Method: external merge  Disk: 52912kB
                    ->  WindowAgg  (cost=336736.93..356736.95 rows=1000001 width=91) (actual time=3260.950..4389.116 rows=1000001 loops=1)
                          ->  Sort  (cost=336736.93..339236.93 rows=1000001 width=59) (actual time=3260.934..3778.385 rows=1000001 loops=1)
                                Sort Key: c1, c3
                                Sort Method: external merge  Disk: 46176kB
                                ->  WindowAgg  (cost=141877.96..161877.98 rows=1000001 width=59) (actual time=1444.692..2477.284 rows=1000001 loops=1)
                                      ->  Sort  (cost=141877.96..144377.96 rows=1000001 width=27) (actual time=1444.669..1906.993 rows=1000001 loops=1)
                                            Sort Key: c1, c2
                                            Sort Method: external merge  Disk: 39424kB
                                            ->  Seq Scan on t  (cost=0.00..18294.01 rows=1000001 width=27) (actual time=0.011..177.815 rows=1000001 loops=1)
Planning Time: 0.214 ms
Execution Time: 7839.646 ms
Does this give the same result and do the optimization you want?
select
  c1,
  min(c2) AS c2,
  min(c3) AS c3,
  min(c4) AS c4
from
  t
group by
  c1;
Hi Michael,
I simplified the real query before posting it here, and I now realize that
I oversimplified things.
Unfortunately the real query cannot be rewritten with a GROUP BY: some of
the window functions are more complex, with ORDER BY clauses using
expressions involving multiple columns.
The following example would be more realistic:
select distinct
  c1,
  first_value(c2) OVER (PARTITION BY c1 order by c2, c4) AS c2,
  first_value(c3) OVER (PARTITION BY c1 order by coalesce(c4, '000'), c3) AS c3
from
  t;
On Mon, Jul 6, 2020 at 9:55 PM Michael Lewis <mlewis@entrata.com> wrote:
Does this give the same result and do the optimization you want?
[...]
Distinct is a great way to get quick results when writing quick & dirty
queries, but I rarely see it perform better than a rewrite that avoids the
need for it: it collects a ton of rows, orders them, and throws away
duplicates in the process. I don't love the idea of that extra work. Did
you say you have an index on c1?
select
  c1,
  sub1.c2,
  sub2.c3
from
  t
  join lateral (
    select c2 from t t2 where t2.c1 = t.c1
    order by c2, c4 limit 1
  ) as sub1 on true
  join lateral (
    select c3 from t t3 where t3.c1 = t.c1
    order by coalesce(c4, '000'), c3 limit 1
  ) as sub2 on true;
select
  c1,
  (select c2 from t t2 where t2.c1 = t.c1
   order by c2, c4 limit 1) AS c2,
  (select c3 from t t3 where t3.c1 = t.c1
   order by coalesce(c4, '000'), c3 limit 1) AS c3
from
  t;
I don't know the data, but I assume there may be many rows with the same c1
value, so you would likely benefit from getting that distinct set first,
like below, as your FROM table.
*(select c1 from t group by c1) AS t*
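Putting the distinct-set idea together with the lateral approach, a rough
sketch of the combined rewrite (untested against your data; the grp/sub1/sub2
aliases are just illustrative names I picked):

```sql
-- Sketch: build the distinct c1 set once, then probe per group with
-- LIMIT 1 lateral subqueries instead of sorting the whole table.
select
  grp.c1,
  sub1.c2,
  sub2.c3
from (select c1 from t group by c1) as grp
  join lateral (
    select c2 from t where t.c1 = grp.c1
    order by c2, c4 limit 1
  ) as sub1 on true
  join lateral (
    select c3 from t where t.c1 = grp.c1
    order by coalesce(c4, '000'), c3 limit 1
  ) as sub2 on true;
```

Each lateral probe could then potentially use an index whose leading column
is c1 rather than sorting the full table per window function.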
On Monday, July 6, 2020, Michael Lewis <mlewis@entrata.com> wrote:
Did you say you have an index on c1?
[...]
I don't know the data, but I assume there may be many rows with the same
c1 value, so then you would likely benefit from getting that distinct set
first like below as your FROM table.
Re-reading the original email I see both the answer to your question and
the data being queried.
David J.
On Monday, July 6, 2020, Sebastien Arod <sebastien.arod@gmail.com> wrote:
I would have expected postgresql to "share" a preliminary sort on c1 that
would then be useful to reduce the work on all window functions but it
doesn't.
The plan shown does share - the output of one sort goes into another.
Subsequent sorts still have to happen but they should be faster as the
first field is already grouped. Doesn’t change the plan though.
I even created an index on c1 hoping that postgresql would be able to use
it in order to minimize the cost of the sorts but I couldn't make it use it.
Use it how? You are still evaluating 250k groups so doing anything
piece-wise seems like an expected loss compared to sequential scan.
David J.
Michael, David thanks for your quick replies.
*@Michael*
I initially dismissed writing this query using joins or subselects because
the real query has about 80 columns and I was afraid that having 80
joins/subselects would cause issues with postgresql, including the planner
falling back to GEQO.
I'll test it anyway with real data and see how it behaves.
*@David About Subsequent sorts*
From my experiments the time of subsequent sorts is not reduced by a
previous sort; it is in fact higher.
In the examples below:
- In Query 1, the Sort node on key (c1, c2) takes ~1893 ms excluding child
nodes.
- In Query 2, the Sort node on key c1 takes ~1558 ms and the Sort node on
key (c1, c2) takes ~2964 ms excluding child nodes.
The latter is higher than the time for the same sort key in Query 1, even
though in Query 2 postgresql could have benefited from the preliminary sort
on c1.
I can't figure out why it would be slower.
*Query1*
explain analyze select
  first_value(c2) OVER (PARTITION BY c1 order by c2)
from t;
*Explain Plan Query1*
WindowAgg  (cost=135042.46..155042.48 rows=1000001 width=47) (actual time=1555.005..2639.659 rows=1000001 loops=1)
  ->  Sort  (cost=135042.46..137542.46 rows=1000001 width=15) (actual time=1554.989..2070.462 rows=1000001 loops=1)
        Sort Key: c1, c2
        Sort Method: external merge  Disk: 25960kB
        ->  Seq Scan on t  (cost=0.00..18294.01 rows=1000001 width=15) (actual time=0.013..177.037 rows=1000001 loops=1)
Planning Time: 0.115 ms
Execution Time: 2693.935 ms
*Query2*
explain analyze select
  first_value(c2) OVER (PARTITION BY c1),
  first_value(c2) OVER (PARTITION BY c1 order by c2)
from t;
*Explain Plan Query2*
WindowAgg  (cost=313730.43..333730.45 rows=1000001 width=79) (actual time=4692.257..5702.226 rows=1000001 loops=1)
  ->  Sort  (cost=313730.43..316230.43 rows=1000001 width=47) (actual time=4692.152..5174.928 rows=1000001 loops=1)
        Sort Key: c1, c2
        Sort Method: external merge  Disk: 32688kB
        ->  WindowAgg  (cost=135042.46..152542.48 rows=1000001 width=47) (actual time=1188.109..2210.845 rows=1000001 loops=1)
              ->  Sort  (cost=135042.46..137542.46 rows=1000001 width=15) (actual time=1188.099..1709.253 rows=1000001 loops=1)
                    Sort Key: c1
                    Sort Method: external merge  Disk: 25960kB
                    ->  Seq Scan on t  (cost=0.00..18294.01 rows=1000001 width=15) (actual time=0.014..150.435 rows=1000001 loops=1)
Planning Time: 0.059 ms
Execution Time: 5756.591 ms
*@David About using an index*
You are correct that a simple index cannot be leveraged efficiently.
However, a covering index including all columns and starting with the
partition column c1 could be used to avoid sorting on c1, right?
But in my tests it is not used. In the following example covering_index
could be used to avoid sorting on c1, but it isn't:
*Covering Index*
create index covering_index on t (c1, c2, c3, c4);
*Query*
explain analyze select
  c1,
  first_value(c2) OVER (PARTITION BY c1 order by c2, c4) AS c2,
  first_value(c3) OVER (PARTITION BY c1 order by coalesce(c4, '000'), c3) AS c3
from
  t;
*Plan*
WindowAgg  (cost=417853.93..440353.95 rows=1000001 width=123) (actual time=3071.980..4184.030 rows=1000001 loops=1)
  ->  Sort  (cost=417853.93..420353.93 rows=1000001 width=91) (actual time=3071.962..3575.641 rows=1000001 loops=1)
        Sort Key: c1, (COALESCE(c4, '000'::character varying)), c3
        Sort Method: external merge  Disk: 52912kB
        ->  WindowAgg  (cost=193152.96..215652.98 rows=1000001 width=91) (actual time=1268.373..2271.463 rows=1000001 loops=1)
              ->  Sort  (cost=193152.96..195652.96 rows=1000001 width=59) (actual time=1268.357..1711.526 rows=1000001 loops=1)
                    Sort Key: c1, c2, c4
                    Sort Method: external merge  Disk: 46168kB
                    ->  Seq Scan on t  (cost=0.00..18294.01 rows=1000001 width=59) (actual time=0.014..163.347 rows=1000001 loops=1)
Planning Time: 0.081 ms
Execution Time: 4250.367 ms
However, the index is used when the PARTITION BY plus ORDER BY of one of
the window functions matches a prefix of the index. For instance, the query:
select
  c1,
  first_value(c2) OVER (PARTITION BY c1 order by c2) AS c2,
  first_value(c3) OVER (PARTITION BY c1 order by coalesce(c4, '000'), c3) AS c3
from
  t;
uses covering_index and avoids the sort on (c1, c2).
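For reference, this is how I check which variant gets the index; the plan
shape in the comment is what I observed on my setup, not a guarantee:

```sql
explain (costs off)
select
  c1,
  first_value(c2) OVER (PARTITION BY c1 order by c2) AS c2,
  first_value(c3) OVER (PARTITION BY c1 order by coalesce(c4, '000'), c3) AS c3
from t;
-- When the index is picked, the innermost node is an index scan on
-- covering_index instead of a Seq Scan feeding a Sort on (c1, c2).
```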
On Tue, Jul 7, 2020 at 12:48 AM David G. Johnston <
david.g.johnston@gmail.com> wrote:
[...]
On Monday, July 6, 2020, Michael Lewis <mlewis@entrata.com> wrote:
Did you say you have an index on c1?
[...]
Re-reading the original email I see both the answer to your question and
the data being queried.
David J.
Thanks David. I meant it as a rhetorical question, since yes of course
there was an index. I also didn't trust the example to be true to real data
in terms of c1 values distribution.
On Tue, Jul 7, 2020 at 9:01 AM Sebastien Arod <sebastien.arod@gmail.com>
wrote:
I initially dismissed writing this query using joins or subselects because
the real query has about 80 columns and I was afraid that having 80
joins/subselects would cause issues with postgresql, including the planner
falling back to GEQO.
I'll test it anyway with real data and see how it behaves.
Contrived and overly simplified examples often lead to uninformed, bad
advice. I would not attempt joins unless the number of distinct c1 values
is relatively small, perhaps. It might go fine though; depending on your
query and the statistics on the table, perhaps join_collapse_limit = 1
would be prudent to constrain the planner to your desired plan and avoid
any chance of the genetic optimizer getting involved.
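If it helps, this is roughly what I mean, scoped to a single transaction
(the exact values here are assumptions to tune, not recommendations):

```sql
BEGIN;
-- Keep joins in the order written and skip the exhaustive join search:
SET LOCAL join_collapse_limit = 1;
SET LOCAL from_collapse_limit = 1;
-- Or instead, raise the threshold so ~80 relations still go through the
-- standard planner rather than GEQO (planning time may grow a lot):
-- SET LOCAL geqo_threshold = 100;
-- ... run the big query here ...
COMMIT;
```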
Sort Method: external merge Disk: 52912kB
Sort Method: external merge Disk: 46168kB
What is your work_mem set to? Would it be possible to set it higher (for
this process) to avoid spilling to disk?
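For example, something like this in the session (256MB is an arbitrary
figure; size it to the instance, keeping in mind each sort node can use up
to work_mem):

```sql
SET work_mem = '256MB';  -- session-local; the spills above were ~40-60MB
explain analyze
select distinct
  c1,
  first_value(c2) OVER (PARTITION BY c1 order by c2) AS c2,
  first_value(c3) OVER (PARTITION BY c1 order by c3) AS c3,
  first_value(c4) OVER (PARTITION BY c1 order by c4) AS c4
from t;
-- Success looks like "Sort Method: quicksort" in the plan instead of
-- "Sort Method: external merge  Disk: ...".
RESET work_mem;
```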