Windowing Function Patch Review -> Performance Comparison.
All,
This is my first patch review for PostgreSQL. I did submit a patch last
commit fest (Boyer-Moore) so I feel I should review one this commit fest.
I'm quite new to PostgreSQL so please don't rely on me totally. I'll do my
best. Heikki is also reviewing this patch which makes me feel better.
My aim is to get the author as much feedback as quickly as possible. For
this reason I'm going to break my review down into the following
topics.
1. Does patch apply cleanly?
2. Any compiler warnings?
3. Do the results follow the SQL standard?
4. Performance comparison: does it perform better than alternative ways of
doing things (self joins, subqueries, etc.)?
5. Performance, no comparison: how does it perform with larger tables?
Things I probably won't attempt to review:
Source code: best practices, making use of existing APIs, etc. I'd rather
leave that for Heikki and possibly others who join in reviewing this patch.
It's not that I'm too lazy, just that I don't feel I know the source
well enough. Plus it's a complex patch.
Really I should follow my list in order, but I'm going to do number 4 first
in order to get some quick feedback to the author.
I've created some "real world" tests where windowing functions will be
useful. I created some tables and populated them with data. I then wrote two
queries: one making use of windowing functions, the other using a method
without windowing functions.
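For illustration, here's the shape of one such pair (a sketch using a hypothetical marathon table, not the exact attached tests):

```sql
-- hypothetical table: big_marathon(runnerid int, time interval)

-- windowing version: rank finishers by time
SELECT runnerid, time,
       dense_rank() OVER (ORDER BY time) AS position
FROM big_marathon;

-- version without windowing functions: correlated subquery
SELECT runnerid, time,
       (SELECT count(DISTINCT m2.time)
        FROM big_marathon m2
        WHERE m2.time < m1.time) + 1 AS position
FROM big_marathon m1
ORDER BY time;
```

The correlated subquery rescans the table once per row, which is where the gap in the numbers below comes from.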
Test Results:
Test     Normal    Windowing   UOM          Increase %
Test 1   498.00      578.00    Trans/Sec        16.06%
Test 2   336.00      481.00    Trans/Sec        43.15%
Test 3     1.30        8.45    Trans/Sec       550.00%
Test 4   424.00      629.00    Trans/Sec        48.35%
Test 5     8.89    31052.69    Trans/Hour   349114.85%
Test 6   253.00      305.00    Trans/Sec        20.55%
(Please see attached document for the actual tests)
Note: The above results depend heavily on the data set. Most of my
tests use a very small volume of data. Tests 3 and 5 use more data than the
other tests. It's quite obvious that the more data there is in my tests, the
bigger the margin between the two methods becomes. I originally ran test 3
with 40000 rows to simulate a large marathon, but the "normal query" was
going to take hours... I reduced the rows to 10000.
Observations:
Tests 3 and 5 did not seem to make use of an index to get a sorted list of
results. I disabled enable_seqscan but the planner still failed to choose
index_scan. Is there any reason for this? Perhaps I'm missing something.
Hitoshi, can you take a look at this?
Tests:
Please see attached file. Perhaps there were more efficient ways to write certain
queries; I just couldn't think of them...
Please let me know if you feel I should be conducting the review in another
way.
David.
Attachments:
Here is another way to solve "big marathon" without window functions (and
many other kinds of "windowing" queries, especially those that do not
specify "rows preceding" etc.).
It could be considered a very dirty hack, but it could give you an
insight into the performance of the "windowed" query with an indexscan instead
of a seqscan.
create function var_set (text,text) returns text
as
'
select set_config (''public.''||$2||pg_backend_pid(), $1, false);
' LANGUAGE 'sql';
create function var_get (text) returns text
as
'
select current_setting(''public.''||$1||pg_backend_pid());
' LANGUAGE 'sql';
create operator >>> (procedure = var_set, leftarg = text, rightarg = text);
create operator <<< (procedure = var_get, rightarg = text);
-- init values
select ''>>>'prev_time', '0'>>>'dense_rank';
-- marathon query
select *
from (
select (((case when time::text = <<<'prev_time' then 0 else 1
end)+(<<<'dense_rank')::int4)::text>>>'dense_rank')::int4 as position,
runnerid, time
from big_marathon
order by time
) results
where position = 2;
Best regards,
Vladimir Sitnikov
Just a small correction: there should be time::text>>>'prev_time' for the
calculations to be correct:
select *
from (
select (((case when time::text = <<<'prev_time' then 0 else 1
end)+(<<<'dense_rank')::int4)::text>>>'dense_rank')::int4 as position,
runnerid, time, time::text>>>'prev_time'
from big_marathon
order by time
) results
where position = 2;
-- meter_readings
select '' >>> 'lag';
select date, reading::numeric-(case lag when '' then null else lag
end)::numeric as used
from (
select date, <<<'lag' as lag, reading::text >>>'lag' as reading
from meter_readings
order by date
) as t
order by used asc nulls last limit 1;
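For reference, the windowed equivalent of this meter_readings query (a sketch, assuming reading is a numeric column in the same meter_readings(date, reading) table) would presumably be:

```sql
-- find the day with the smallest usage since the previous reading
SELECT date,
       reading - lag(reading) OVER (ORDER BY date) AS used
FROM meter_readings
ORDER BY used ASC NULLS LAST
LIMIT 1;
```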
Best regards,
Vladimir Sitnikov
2008/11/2 David Rowley <dgrowley@gmail.com>:
Observations:
Tests 3 and 5 did not seem to make use of an index to get a sorted list of
results. I disabled enable_seqscan but the planner still failed to choose
index_scan. Is there any reason for this? Perhaps I'm missing something.
Hitoshi, can you take a look at this?
Ah, good point. It's probably because I haven't paid attention to choosing
index_scan for the upper sort node. I just add the sort node regardless of
what the node below it is, so the information that a sort node sits above
may need to be pushed down to the scan choice process. Could someone point
out how to do it, or which part of the existing code would be a
good guide?
Tests:
Please see attached file. Perhaps there were more efficient ways to write certain
queries; I just couldn't think of them...
Please let me know if you feel I should be conducting the review in another
way.
Thanks for your test. Though I didn't post them publicly, I've also tested real
problems, and the patch performed better than I expected. If you can manage it,
could you add self-join cases? Something like:
-- normal
SELECT t1.id, t1.grp, count(t2.id) + 1 AS row_number
FROM t t1
LEFT JOIN t t2 ON t1.grp = t2.grp AND t1.id > t2.id
GROUP BY t1.id, t1.grp;
-- windowing
SELECT id, grp, row_number() OVER (PARTITION BY grp ORDER BY id)
FROM t;
Regards,
--
Hitoshi Harada
Hitoshi Harada Wrote:
Thanks for your test. Though I didn't post them publicly, I've also tested real
problems, and the patch performed better than I expected. If you can manage it,
could you add self-join cases? Something like:
OK, I did self joins for some. I don't know if it's possible for all of them.
Test     Sub query   Self join   Vladimir   Windowing   UOM          Increase %
Test 1      498.00         N/A        N/A      578.00   Trans/Sec        16.06%
Test 2      336.00      424.00     182.78      481.00   Trans/Sec        13.44%
Test 3        1.30        7.59       1.90        8.45   Trans/Sec        11.33%
Test 4      424.00      361.00     182.00      629.00   Trans/Sec        48.35%
Test 5        8.89         N/A    5844.16    31052.69   Trans/Hour      431.35%
Test 6      253.00         N/A        N/A      305.00   Trans/Sec        20.55%
See attached for details.
The increase % column is now:
Window / max(Sub query, Self join, Vladimir) - 1
Vladimir, I've included your tests too. I understand that people will
probably use this method, as sometimes there is little choice if you need
the performance.
Hitoshi Harada Wrote:
2008/11/2 David Rowley <dgrowley@gmail.com>:
Observations:
Tests 3 and 5 did not seem to make use of an index to get a sorted list of
results. I disabled enable_seqscan but the planner still failed to choose
index_scan. Is there any reason for this? Perhaps I'm missing something.
Hitoshi, can you take a look at this?
Ah, good point. It's probably because I haven't paid attention to choosing
index_scan for the upper sort node. I just add the sort node regardless of
what the node below it is, so the information that a sort node sits above
may need to be pushed down to the scan choice process. Could someone point
out how to do it, or which part of the existing code would be a
good guide?
I know you need to wait for an answer about this, so I'd like to delay any
further performance tests until that's sorted out as it should affect
performance of larger tables quite a bit.
David.
Attachments:
2008/11/2 David Rowley <dgrowley@gmail.com>:
Hitoshi Harada Wrote:
2008/11/2 David Rowley <dgrowley@gmail.com>:
Observations:
Tests 3 and 5 did not seem to make use of an index to get a sorted list of
results. I disabled enable_seqscan but the planner still failed to choose
index_scan. Is there any reason for this? Perhaps I'm missing something.
Hitoshi, can you take a look at this?
Ah, good point. It's probably because I haven't paid attention to choosing
index_scan for the upper sort node. I just add the sort node regardless of
what the node below it is, so the information that a sort node sits above
may need to be pushed down to the scan choice process. Could someone point
out how to do it, or which part of the existing code would be a good guide?
I know you need to wait for an answer about this, so I'd like to delay any
further performance tests until that's sorted out as it should affect
performance of larger tables quite a bit.
I found how to do it, though only for the case you gave. Thinking
about planner optimization of the Window nodes (and their attached
Sort nodes), we must consider the execution order of more than one
node. The test case takes care of only one window, but there
may be more window/sort node sets, and it is too difficult to choose
the best execution order across them while also considering a lower
indexscan, a mergejoin in a subquery, and sort-based GROUP BY. So I
didn't touch the complicated planner jungle. I rewrote the patch so
that only the bottom window's sort can consider an indexscan. Deeper
optimizations are beyond my capability.
Attached is a delta patch against the last one. Also see the git diff:
http://git.postgresql.org/?p=~davidfetter/window_functions/.git;a=commitdiff;h=bbba638f721a7e1d11cb3ee6af3bc1d7d3c11aa8;hp=48b73ee574779a14a3c36d373d8544d59a5b8b46
Regards,
--
Hitoshi Harada
Attachments:
window_functions.delta.patch.20081103
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 1479,1484 **** _outPlannerInfo(StringInfo str, PlannerInfo *node)
--- 1479,1485 ----
WRITE_NODE_FIELD(group_pathkeys);
WRITE_NODE_FIELD(distinct_pathkeys);
WRITE_NODE_FIELD(sort_pathkeys);
+ WRITE_NODE_FIELD(window_pathkeys);
WRITE_FLOAT_FIELD(total_table_pages, "%.0f");
WRITE_FLOAT_FIELD(tuple_fraction, "%.4f");
WRITE_BOOL_FIELD(hasJoinRTEs);
*** a/src/backend/optimizer/plan/planmain.c
--- b/src/backend/optimizer/plan/planmain.c
***************
*** 233,238 **** query_planner(PlannerInfo *root, List *tlist,
--- 233,239 ----
*/
root->query_pathkeys = canonicalize_pathkeys(root, root->query_pathkeys);
root->group_pathkeys = canonicalize_pathkeys(root, root->group_pathkeys);
+ root->window_pathkeys = canonicalize_pathkeys(root, root->window_pathkeys);
root->distinct_pathkeys = canonicalize_pathkeys(root, root->distinct_pathkeys);
root->sort_pathkeys = canonicalize_pathkeys(root, root->sort_pathkeys);
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
***************
*** 902,907 **** grouping_planner(PlannerInfo *root, double tuple_fraction)
--- 902,948 ----
else
root->group_pathkeys = NIL;
+ /*
+ * Currently we don't relocate each Window node based on
+ * cost estimation; it'd be better to think about the order
+ * of each node execution. But for now we only think about
+ * the bottom node pathkeys. This should be fixed.
+ */
+ if (parse->windowList)
+ {
+ ListCell *l;
+
+ foreach(l, parse->windowList)
+ {
+ List *partition_pathkeys = NIL;
+ List *order_pathkeys = NIL;
+ WindowClause *wc = (WindowClause *) lfirst(l);
+
+ if (wc->partitionClause &&
+ grouping_is_sortable(wc->partitionClause))
+ partition_pathkeys =
+ make_pathkeys_for_sortclauses(root,
+ wc->partitionClause,
+ tlist,
+ false);
+
+ if (wc->orderClause &&
+ grouping_is_sortable(wc->orderClause))
+ order_pathkeys =
+ make_pathkeys_for_sortclauses(root,
+ wc->orderClause,
+ tlist,
+ false);
+
+ root->window_pathkeys = list_concat(partition_pathkeys, order_pathkeys);
+ /*
+ * Window node may be stacked more than one, but
+ * what is effective to query_planner() is only the bottom pathkeys.
+ */
+ break;
+ }
+ }
+
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
root->distinct_pathkeys =
***************
*** 954,959 **** grouping_planner(PlannerInfo *root, double tuple_fraction)
--- 995,1002 ----
*/
if (root->group_pathkeys)
root->query_pathkeys = root->group_pathkeys;
+ else if (root->window_pathkeys)
+ root->query_pathkeys = root->window_pathkeys;
else if (list_length(root->distinct_pathkeys) >
list_length(root->sort_pathkeys))
root->query_pathkeys = root->distinct_pathkeys;
***************
*** 1277,1283 **** grouping_planner(PlannerInfo *root, double tuple_fraction)
continue;
/*
! * Currently, Window Partitioning strategy is only by Sort.
* So just join partitionClause and orderClause
* to match Grouping. Hashing algorithm will be considered later.
*/
--- 1320,1326 ----
continue;
/*
! * Currently, Window partitioning is only by Sort.
* So just join partitionClause and orderClause
* to match Grouping. Hashing algorithm will be considered later.
*/
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
***************
*** 174,179 **** typedef struct PlannerInfo
--- 174,181 ----
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
List *sort_pathkeys; /* sortClause pathkeys, if any */
+ List *window_pathkeys; /* pathkeys of bottom Window, if any */
+
List *initial_rels; /* RelOptInfos we are now trying to join */
MemoryContext planner_cxt; /* context holding PlannerInfo */
Hitoshi Harada wrote:
Tests 3 and 5 did not seem to make use of an index to get a sorted list of
results. I disabled enable_seqscan but the planner still failed to choose
index_scan. Is there any reason for this? Perhaps I'm missing something.
Hitoshi, can you take a look at this?
Ah, good point. It's probably because I haven't paid attention to choosing
index_scan for the upper sort node. I just add the sort node regardless of
what the node below it is, so the information that a sort node sits above
may need to be pushed down to the scan choice process. Could someone point
out how to do it, or which part of the existing code would be a good guide?
I know you need to wait for an answer about this, so I'd like to delay
any further performance tests until that's sorted out, as it should affect
performance of larger tables quite a bit.
I found how to do it, though only for the case you gave. Thinking
about planner optimization of the Window nodes (and their attached
Sort nodes), we must consider the execution order of more than one
node. The test case takes care of only one window, but there
may be more window/sort node sets, and it is too difficult to choose
the best execution order across them while also considering a lower
indexscan, a mergejoin in a subquery, and sort-based GROUP BY. So I
didn't touch the complicated planner jungle. I rewrote the patch so
that only the bottom window's sort can consider an indexscan. Deeper
optimizations are beyond my capability.
I've just looked into what some other implementations do. Sybase seems to do
exactly what you've done: it only looks at the first window clause in the
query. Oracle seems to use the index regardless of the position of the
window clause. To me what you've done seems fine for now. Perhaps
something could be done later to improve on this. Maybe someone else has
ideas about how to do it?
It seems quite similar to SELECT MAX(idxcol), MAX(idxcol2), where the planner
often makes use of two indexes when available, yet that case is probably far
simpler as there is always just one row. Costing would likely be more
complex with the windowing functions version.
Good work.
I'll continue with more benchmarks soon.
David.
Hitoshi Harada wrote:
I found how to do it, though only for the case you gave. Thinking
about planner optimization of the Window nodes (and their attached
Sort nodes), we must consider the execution order of more than one
node. The test case takes care of only one window, but there
may be more window/sort node sets, and it is too difficult to choose
the best execution order across them while also considering a lower
indexscan, a mergejoin in a subquery, and sort-based GROUP BY. So I
didn't touch the complicated planner jungle. I rewrote the patch so
that only the bottom window's sort can consider an indexscan. Deeper
optimizations are beyond my capability.
Sorry for the delay on this. I've updated the benchmark results using the
new patch you sent today. I did a dump and reload of the tables; since some
of the numbers are randomly generated, I wouldn't want to compare them to the
old results for any of the tests. This is a completely new list, run against
CVS head as of this morning.
Test     Sub Query   Self Join   Vladimir   Windowing   UOM   Window over best alt
Test 1     504           N/A        N/A       568       TPS         12.7%
Test 2     340.9         425        182       450.38    TPS          6.0%
Test 3       1.304         8.12       1.963     7.52    TPS         -7.4%
Test 4     422           365        195       630       TPS         49.3%
Test 5       8.874       N/A       5825     31203       TPH        435.6%
Test 6     251           N/A        N/A       300       TPS         19.5%
Only tests 3 and 5 made use of the index scan. Performance dropped slightly
on test 3, but there's not much point paying much attention to that, since
we're probably close to the crossover point between a seqscan and an indexscan,
where the planner's decision is not as critical.
Certain self-join methods I used don't implement the exact requirements I
stated at the top of the test. For example, the meter reading self join
requires that no days be missed.
Maybe multi-window optimisation is one for 8.5's TODO.
I've attached the test scripts.
David.
Attachments:
Hitoshi Harada wrote:
I found how to do it, though only for the case you gave. Thinking
about planner optimization of the Window nodes (and their attached
Sort nodes), we must consider the execution order of more than one
node. The test case takes care of only one window, but there
may be more window/sort node sets, and it is too difficult to choose
the best execution order across them while also considering a lower
indexscan, a mergejoin in a subquery, and sort-based GROUP BY. So I
didn't touch the complicated planner jungle. I rewrote the patch so
that only the bottom window's sort can consider an indexscan. Deeper
optimizations are beyond my capability.
After more playing around with a few queries and testing performance on
larger tables, I discovered something strange in the plan for this query.
david=# explain select date,lag(date,1) over (order by date) from
meter_Readings order by date;
                                        QUERY PLAN
------------------------------------------------------------------------------------------------
Sort (cost=1038.73..1063.74 rows=10001 width=4)
  Sort Key: date
  -> Window (cost=0.00..374.27 rows=10001 width=4)
     -> Index Scan using meter_readings_pkey on meter_readings (cost=0.00..299.27 rows=10001 width=4)
(4 rows)
Is the final sort still required? Is it not already sorted by the window?
The table I was testing on spilled the sort to disk; I would probably not have
noticed otherwise.
David.
2008/11/10 David Rowley <dgrowley@gmail.com>:
Hitoshi Harada wrote:
I found how to do it, though only for the case you gave. Thinking
about planner optimization of the Window nodes (and their attached
Sort nodes), we must consider the execution order of more than one
node. The test case takes care of only one window, but there
may be more window/sort node sets, and it is too difficult to choose
the best execution order across them while also considering a lower
indexscan, a mergejoin in a subquery, and sort-based GROUP BY. So I
didn't touch the complicated planner jungle. I rewrote the patch so
that only the bottom window's sort can consider an indexscan. Deeper
optimizations are beyond my capability.
After more playing around with a few queries and testing performance on
larger tables, I discovered something strange in the plan for this query.
david=# explain select date,lag(date,1) over (order by date) from
meter_Readings order by date;
                                        QUERY PLAN
------------------------------------------------------------------------------------------------
Sort (cost=1038.73..1063.74 rows=10001 width=4)
  Sort Key: date
  -> Window (cost=0.00..374.27 rows=10001 width=4)
     -> Index Scan using meter_readings_pkey on meter_readings (cost=0.00..299.27 rows=10001 width=4)
(4 rows)
Is the final sort still required? Is it not already sorted by the window?
Oh, I forgot to mention it. This behavior is also fixed; it now
works without a sort above the window node. I don't remember at all why I
did it that way, and there's no comment around it, but the regression tests
showed there is no problem without it.
Regards,
--
Hitoshi Harada
Hitoshi Harada wrote:
david=# explain select date,lag(date,1) over (order by date) from
meter_Readings order by date;
                                        QUERY PLAN
------------------------------------------------------------------------------------------------
Sort (cost=1038.73..1063.74 rows=10001 width=4)
  Sort Key: date
  -> Window (cost=0.00..374.27 rows=10001 width=4)
     -> Index Scan using meter_readings_pkey on meter_readings (cost=0.00..299.27 rows=10001 width=4)
(4 rows)
Is the final sort still required? Is it not already sorted by the window?
Oh, I forgot to mention it. This behavior is also fixed; it now
works without a sort above the window node. I don't remember at all why I
did it that way, and there's no comment around it, but the regression tests
showed there is no problem without it.
Regards,
--
Hitoshi Harada
Fantastic news! That will speed up the few test queries I have quite a bit;
the sort was spilling to disk, so performance was dropping considerably. This
might be a good time to say that the hardware I'm testing on is ultra
slow: a 600 MHz Pentium III with only 28MB of shared buffers.
I've done some more performance tests with the patch. Realising that I
really need to be comparing the performance with something else, I decided to
have a query process a large table with and then without a windowing function,
just to see how much the query slows. Of course there is going to be an
overhead to using a tuplestore; I just wanted to see how much. My results are
not really very interesting, so I'll just include them at the bottom of the
email for those who want to see.
Casting my mind back to when the patch always did a sort before a
window with an ORDER BY even when a btree index was there, it's really come
a long way. I've little idea how difficult it would be to implement better
planning for the following. I suppose if it's difficult then it's probably
better to wait for the patch to be committed first, or maybe it's something
for 8.5.
SELECT department,
SUM(Salary),
ROW_NUMBER() OVER (ORDER BY department),
SUM(SUM(salary)) OVER (ORDER BY department DESC)
FROM employees
GROUP BY department ORDER BY department;
This query performs more sorts than are really needed. I suppose the most
efficient way to process it would be to process the 2nd window first, then
the 1st, before returning the results in the same order as the 1st window.
Currently the plan looks like:
                              QUERY PLAN
---------------------------------------------------------------------------------------------
Sort (cost=1.33..1.34 rows=3 width=9)
  Sort Key: department
  -> Window (cost=1.25..1.31 rows=3 width=9)
     -> Sort (cost=1.25..1.26 rows=3 width=9)
        Sort Key: department
        -> Window (cost=1.17..1.23 rows=3 width=9)
           -> Sort (cost=1.17..1.18 rows=3 width=9)
              Sort Key: department
              -> HashAggregate (cost=1.10..1.15 rows=3 width=9)
                 -> Seq Scan on employees (cost=0.00..1.06 rows=6 width=9)
Maybe it's possible to sort the processing order of the windows based on the
ORDER BY clauses, putting any that match the ORDER BY of the final results
last. I've not looked into this in much detail. Currently I cannot see any
scenario where it would be bad.
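A hand-rewritten sketch of that idea (same employees table as above): push the DESC window into a subquery so the outer ASC window's sort also satisfies the final ORDER BY, saving one sort.

```sql
SELECT department,
       salary_sum,
       row_number() OVER (ORDER BY department) AS rn,
       running_desc
FROM (
    -- the DESC window runs first, inside the subquery
    SELECT department,
           SUM(salary) AS salary_sum,
           SUM(SUM(salary)) OVER (ORDER BY department DESC) AS running_desc
    FROM employees
    GROUP BY department
) s
-- the ASC window's sort then doubles as the final ORDER BY
ORDER BY department;
```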
What do you think?
David
CREATE TABLE bigtable (
id SERIAL NOT NULL PRIMARY KEY,
timestamp TIMESTAMP NOT NULL
);
-- about 383MB of data
INSERT INTO bigtable (timestamp)
SELECT NOW() + (CAST(RANDOM() * 10 AS INT) || ' secs')::INTERVAL
FROM generate_series(1,10000000);
CREATE INDEX bigtable_timestamp_idx ON bigtable (timestamp);
VACUUM ANALYZE bigtable;
-- base line test
david=# SELECT COUNT(*) FROM (select id,timestamp from bigtable order by id)
t;
count
----------
10000000
(1 row)
Time: 72862.238 ms
-- lag test
david=# SELECT COUNT(*) FROM (select id,LAG(timestamp,1) OVER (order by id)
from bigtable order by id) t;
count
----------
10000000
(1 row)
Time: 257713.350 ms
david=# SELECT COUNT(*) FROM (select id,NTILE(10) OVER (order by id) from
bigtable order by id) t;
count
----------
10000000
(1 row)
Time: 183131.425 ms
2008/11/15 David Rowley <dgrowley@gmail.com>:
Hitoshi Harada wrote:
david=# explain select date,lag(date,1) over (order by date) from
meter_Readings order by date;
                                        QUERY PLAN
------------------------------------------------------------------------------------------------
Sort (cost=1038.73..1063.74 rows=10001 width=4)
  Sort Key: date
  -> Window (cost=0.00..374.27 rows=10001 width=4)
     -> Index Scan using meter_readings_pkey on meter_readings (cost=0.00..299.27 rows=10001 width=4)
(4 rows)
Is the final sort still required? Is it not already sorted by the window?
Oh, I forgot to mention it. This behavior is also fixed; it now
works without a sort above the window node. I don't remember at all why I
did it that way, and there's no comment around it, but the regression tests
showed there is no problem without it.
Regards,
--
Hitoshi Harada
Fantastic news! That will speed up the few test queries I have quite a bit;
the sort was spilling to disk, so performance was dropping considerably. This
might be a good time to say that the hardware I'm testing on is ultra
slow: a 600 MHz Pentium III with only 28MB of shared buffers.
I've done some more performance tests with the patch. Realising that I
really need to be comparing the performance with something else, I decided to
have a query process a large table with and then without a windowing function,
just to see how much the query slows. Of course there is going to be an
overhead to using a tuplestore; I just wanted to see how much. My results are
not really very interesting, so I'll just include them at the bottom of the
email for those who want to see.
Thanks for your test code. I've attached the result of your test, with
another query added. The added row_number() query uses a row-buffer
strategy that doesn't hold a tuplestore but only scans and returns rows.
So the difference between the row_number() (44 sec) and ntile() (61 sec)
cases roughly shows how much overhead the tuplestore adds. I had expected
the row_number() case to be even more efficient, but still we can see it
works as an optimization.
Casting my mind back to when the patch always did a sort before a
window with an ORDER BY even when a btree index was there, it's really come
a long way. I've little idea how difficult it would be to implement better
planning for the following. I suppose if it's difficult then it's probably
better to wait for the patch to be committed first, or maybe it's something
for 8.5.
Yeah, the planner around GROUP BY, ORDER BY, DISTINCT, indexscans, and
windows is quite complicated. I think I can manage to do it if there's
enough time left, but first the basic design should be vetted by the
core hackers. I am waiting for further review and responses from Heikki
and others.
SELECT department,
SUM(Salary),
ROW_NUMBER() OVER (ORDER BY department),
SUM(SUM(salary)) OVER (ORDER BY department DESC)
FROM employees
GROUP BY department ORDER BY department;
This query performs more sorts than are really needed. I suppose the most
efficient way to process it would be to process the 2nd window first, then
the 1st, before returning the results in the same order as the 1st window.
Currently the plan looks like:
                              QUERY PLAN
---------------------------------------------------------------------------------------------
Sort (cost=1.33..1.34 rows=3 width=9)
  Sort Key: department
  -> Window (cost=1.25..1.31 rows=3 width=9)
     -> Sort (cost=1.25..1.26 rows=3 width=9)
        Sort Key: department
        -> Window (cost=1.17..1.23 rows=3 width=9)
           -> Sort (cost=1.17..1.18 rows=3 width=9)
              Sort Key: department
              -> HashAggregate (cost=1.10..1.15 rows=3 width=9)
                 -> Seq Scan on employees (cost=0.00..1.06 rows=6 width=9)
Maybe it's possible to sort the processing order of the windows based on the
ORDER BY clauses, putting any that match the ORDER BY of the final results
last. I've not looked into this in much detail. Currently I cannot see any
scenario where it would be bad.
What do you think?
The patch doesn't have the infrastructure to relocate each Window node
yet. When relocating them we must re-number wfunc->winref, so it's a
bit of work, though possible. Also, I can imagine fixing only the case
above, but I'm afraid that without a good overall sketch of the planner,
side effects would come up... For instance, what if the window query is
a subquery under a sort-based GROUP BY that assumes its subplan
doesn't sort? Maybe in that case the upper GROUP BY doesn't notice the
sort-key change caused by relocating a window node in the subplan.
Regards,
--
Hitoshi Harada
$ uname -srpio
Linux 2.6.9-55.0.9.ELsmp i686 i386 GNU/Linux
# cpuinfo
Intel(R) Xeon(R) CPU X3210 @ 2.13GHz
sample=# explain analyze select id, timestamp from bigtable order by id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using bigtable_pkey on bigtable (cost=0.00..286808.13
rows=10000000 width=12) (actual time=0.021..11996.336 rows=10000000
loops=1)
Total runtime: 20924.200 ms
(2 rows)
sample=# explain analyze select id, row_number() OVER (order by id)
from bigtable order by id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Window (cost=0.00..361808.13 rows=10000000 width=4) (actual
time=0.080..35265.786 rows=10000000 loops=1)
-> Index Scan using bigtable_pkey on bigtable
(cost=0.00..286808.13 rows=10000000 width=4) (actual
time=0.075..13279.216 rows=10000000 loops=1)
Total runtime: 44067.815 ms
(3 rows)
sample=# explain analyze select id,LAG(timestamp,1) over (order by id)
from bigtable order by id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Window (cost=0.00..411808.13 rows=10000000 width=12) (actual
time=30240.715..70439.057 rows=10000000 loops=1)
-> Index Scan using bigtable_pkey on bigtable
(cost=0.00..286808.13 rows=10000000 width=12) (actual
time=0.077..15302.919 rows=10000000 loops=1)
Total runtime: 79658.314 ms
(3 rows)
sample=# explain analyze select id, ntile(10) OVER (order by id) from
bigtable order by id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Window (cost=0.00..386808.13 rows=10000000 width=4) (actual
time=25158.467..52250.418 rows=10000000 loops=1)
-> Index Scan using bigtable_pkey on bigtable
(cost=0.00..286808.13 rows=10000000 width=4) (actual
time=0.075..13088.061 rows=10000000 loops=1)
Total runtime: 61275.279 ms
(3 rows)