Is PostgreSQL 9.3 using indexes for pipelined top-N window function queries?

Started by Behrang Saeedzadehabout 12 years ago6 messagesgeneral
Jump to latest
#1Behrang Saeedzadeh
behrangsa@gmail.com

Hi,

I just stumbled upon this article from 2012 [1]http://use-the-index-luke.com/sql/partial-results/window-functions, according to which
(emphasis mine):

Window functions offer yet another way to implement pagination in SQL. This
is a flexible, and above all, standards-compliant method. However, only SQL
Server and the Oracle database can use them for a pipelined top-N
query. *PostgreSQL
does not use indexes for those queries and therefore executes them very
inefficiently.* MySQL does not support window functions at all.

Is this still the case? Or is PostgreSQL 9.3 capable to execute suchlike
queries efficiently?

[1]: http://use-the-index-luke.com/sql/partial-results/window-functions

Best regards,
Behrang Saeedzadeh

#2Thomas Kellerer
spam_eater@gmx.net
In reply to: Behrang Saeedzadeh (#1)
Re: Is PostgreSQL 9.3 using indexes for pipelined top-N window function queries?

Behrang Saeedzadeh, 15.02.2014 02:35:

Hi,

I just stumbled upon this article from 2012 [1], according to which
(emphasis mine):

Window functions offer yet another way to implement pagination in
SQL. This is a flexible, and above all, standards-compliant method.
However, only SQL Server and the Oracle database can use them for a
pipelined top-N query. */PostgreSQL does not use indexes for those
queries and therefore executes them very inefficiently./* MySQL does
not support window functions at all.

Is this still the case? Or is PostgreSQL 9.3 capable to execute
suchlike queries efficiently?

[1] http://use-the-index-luke.com/sql/partial-results/window-functions

My local Postgres 9.3 installation does use an index for such a query.

I ran a quick (an un-scientific) test on a sample table filled with auto-generated test data:

postgres=> \d+ products
Table "public.products"
Column | Type | Modifiers | Storage | Stats target | Description
-------------------+------------------------+-----------+----------+--------------+-------------
product_id | integer | not null | plain | |
ean_code | bigint | not null | plain | |
product_name | character varying(100) | not null | extended | |
manufacturer_name | character varying | not null | extended | |
price | numeric(10,2) | not null | main | |
publish_date | date | not null | plain | |
Indexes:
"products_pkey" PRIMARY KEY, btree (product_id)
"idx_publish_date" btree (publish_date, product_id)
Has OIDs: no

postgres=> select count(*) from products;
count
---------
1000000
(1 row)

Then I tried the following statement:

select *
from (
select products.*,
row_number() over (order by publish_date, product_id) as rn
from products
) tmp
where rn between 200 and 300
order by publish_date, product_id;

http://explain.depesz.com/s/5u9

And Postgres does use the index idx_publish_date.

Interesting enough: my local Oracle 11.2 does *not* use an index scan for the above test (same test data).

On the other hand Oracle's table scan is much faster (about ~0.5 seconds) for the first "pages" but than gets slower when increasing the limits of the pagincation.

Oracle takes over 5 seconds when changing the limit to "between 900000 and 900100" whereas Postgres execution time pretty much stays the same.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Merlin Moncure
mmoncure@gmail.com
In reply to: Behrang Saeedzadeh (#1)
Re: Is PostgreSQL 9.3 using indexes for pipelined top-N window function queries?

On Fri, Feb 14, 2014 at 7:35 PM, Behrang Saeedzadeh <behrangsa@gmail.com> wrote:

Hi,

I just stumbled upon this article from 2012 [1], according to which
(emphasis mine):

Window functions offer yet another way to implement pagination in SQL. This
is a flexible, and above all, standards-compliant method. However, only SQL
Server and the Oracle database can use them for a pipelined top-N query.
PostgreSQL does not use indexes for those queries and therefore executes
them very inefficiently. MySQL does not support window functions at all.

Is this still the case? Or is PostgreSQL 9.3 capable to execute suchlike
queries efficiently?

oracle:
SELECT *
FROM ( SELECT sales.*
, ROW_NUMBER() OVER (ORDER BY sale_date DESC
, sale_id DESC) rn
FROM sales
) tmp
WHERE rn between 11 and 20
ORDER BY sale_date DESC, sale_id DESC;

postgres:
SELECT * FROM sales s
WHERE (sale_date, sale_id) < (last_date, last_Id)
ORDER BY sale_date DESC, sale_id DESC
LIMIT 10;

The postgres variant is superior in my opinion (it will be faster for
large offsets). last_date, last_id are the lowest values you
previously read off. It will use an index on those two columns if you
have one. One interesting distinction is that the postgres variant
will always move forward while the oracle variant can appear to move
backwards if you are doing a non transactional scan.

Also, you can always use a cursor in either database.

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Markus Winand
markus.winand@winand.at
In reply to: Merlin Moncure (#3)
Re: Is PostgreSQL 9.3 using indexes for pipelined top-N window function queries?

Hi!

I'd like to clarify this as the original author of the page in question.

Fist of all, I also recommend the row-value syntax as you can see on the "previous" page:
http://use-the-index-luke.com/sql/partial-results/fetch-next-page

I've also explained this procedure at conferences. Here are the slides:
http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way

And now about this quote:

However, only SQL Server and the Oracle database can use them for a pipelined top-N query. PostgreSQL does not use indexes for those queries and therefore executes them very inefficiently.

The very important thing here is "pipelined top-N query". This term is introduced two pages earlier:
http://use-the-index-luke.com/sql/partial-results/top-n-queries

The pipelined top-n query has two very important properties:
(1) it utilizes the index order to avoid the sort operation required to satisfy ORDER BY clause
(2) it realizes that it can stop processing as soon as it has delivered enough rows.

The execution plan from Thomas Kellerer sees to fulfill requirement (1) but definitively not (2).

Even with 9.3.2, I were not able to reproduce the result of Thomas (not showing any sort operation in the execution plan) with the test data I also published at my website:
http://use-the-index-luke.com/sql/example-schema/postgresql/partial-results

Then I started fiddling around with the planner's cost settings and finally managed to get a plan similar to Thomas' when setting random_page_cost to 0.1 (setting it to 1, equal to seq_page_cost was not enough). However, that proves the point that PostgreSQL can use an index to avoid the sort operation caused by order by (even for window functions). I'd be curious what settings caused Thomas to get this result.

The second requirement for pipelined top-n queries is not satisfied in Thomas' execution plan: it does read the full index (actual rows=1000000), and applies the window function over all rows. Only in the end it throws away all non-confirming rows (Rows Removed by Filter: 999899). A pipelined top-n execution would not cause more than 300 rows read from the index, and only 200 rows removed by filter. That's what the Oracle DB and SQL Server manage to do (Thomas: ping me to track down why Oracle didn't for you). Considering that PG can use the index order to avoid the sort, it still doesn't make very much sense if it cannot abort the index scan after fetching enough rows. So, not using the index might even be right choice in unfiltered cases like this.

Interestingly, I did not even get an Index Only Scan when just selecting column from the index with random_page_cost=1 and seq_page_cost=1. Again I had to reduce random_page_cost further down (0.1) to get an Index Only Scan. That does not make sense to me. When both _page_costs are same, the Index Only Scan should get lower costs because the index is smaller (338mb vs. 31 mb in my case). On top of that, the Index Only Scan avoids the sort. However, with 9.3.2 I get the same cost for Index Only Scan as for Index Scan (had to enable_seqscan=off and enable_bitmapscan=off to get that).

So, I have to change my page (&book) to say something like this:

PostgreSQL does not abort the index scan after fetching enough rows for those queries and therefore executes them very inefficiently.

Thanks for the hint and always feel free to put my on CC regarding questions about stuff on Use The Index, Luke!

-markus

ps.: It's perfectly possible that PG could use indexes for window-functions before 9.3. I did definitively not fiddle around with cost settings at that time to force it into this plan.
pps.: sorry for the delay, I'm not subscribed (just too much) but somebody was nice enough to ping me about this.
ppps: then I wondered how to properly reply without having the original messages. So I downloaded the .mbox from the archive and pushed reply there. Hope it ends up in the right thread :)

Markus Winand
markus.winand@winand.at
T +43 1 9444047

"A wonderful book…I highly recommend it." -Anders Janmyr
http://sql-performance-explained.com/

Maderspergerstr. 1-3/9/11
1160 Wien
AUSTRIA

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Pavel Stehule
pavel.stehule@gmail.com
In reply to: Markus Winand (#4)
Re: Is PostgreSQL 9.3 using indexes for pipelined top-N window function queries?

2014-03-14 13:02 GMT+01:00 Markus Winand <markus.winand@winand.at>:

Hi!

I'd like to clarify this as the original author of the page in question.

Fist of all, I also recommend the row-value syntax as you can see on the
"previous" page:
http://use-the-index-luke.com/sql/partial-results/fetch-next-page

I've also explained this procedure at conferences. Here are the slides:

http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way

And now about this quote:

However, only SQL Server and the Oracle database can use them for a

pipelined top-N query. PostgreSQL does not use indexes for those queries
and therefore executes them very inefficiently.

The very important thing here is "pipelined top-N query". This term is
introduced two pages earlier:
http://use-the-index-luke.com/sql/partial-results/top-n-queries

The pipelined top-n query has two very important properties:
(1) it utilizes the index order to avoid the sort operation required to
satisfy ORDER BY clause
(2) it realizes that it can stop processing as soon as it has delivered
enough rows.

The execution plan from Thomas Kellerer sees to fulfill requirement (1)
but definitively not (2).

Even with 9.3.2, I were not able to reproduce the result of Thomas (not
showing any sort operation in the execution plan) with the test data I also
published at my website:

http://use-the-index-luke.com/sql/example-schema/postgresql/partial-results

Then I started fiddling around with the planner's cost settings and
finally managed to get a plan similar to Thomas' when setting
random_page_cost to 0.1 (setting it to 1, equal to seq_page_cost was not
enough). However, that proves the point that PostgreSQL can use an index to
avoid the sort operation caused by order by (even for window functions).
I'd be curious what settings caused Thomas to get this result.

The second requirement for pipelined top-n queries is not satisfied in
Thomas' execution plan: it does read the full index (actual rows=1000000),
and applies the window function over all rows. Only in the end it throws
away all non-confirming rows (Rows Removed by Filter: 999899). A pipelined
top-n execution would not cause more than 300 rows read from the index, and
only 200 rows removed by filter. That's what the Oracle DB and SQL Server
manage to do (Thomas: ping me to track down why Oracle didn't for you).
Considering that PG can use the index order to avoid the sort, it still
doesn't make very much sense if it cannot abort the index scan after
fetching enough rows. So, not using the index might even be right choice in
unfiltered cases like this.

Interestingly, I did not even get an Index Only Scan when just selecting
column from the index with random_page_cost=1 and seq_page_cost=1. Again I
had to reduce random_page_cost further down (0.1) to get an Index Only
Scan. That does not make sense to me. When both _page_costs are same, the
Index Only Scan should get lower costs because the index is smaller (338mb
vs. 31 mb in my case). On top of that, the Index Only Scan avoids the sort.
However, with 9.3.2 I get the same cost for Index Only Scan as for Index
Scan (had to enable_seqscan=off and enable_bitmapscan=off to get that).

So, I have to change my page (&book) to say something like this:

PostgreSQL does not abort the index scan after fetching enough rows for

those queries and therefore executes them very inefficiently.

I am thinking so LIMIT is not propagated to down - so window function
should be calculated in full range.

Regards

Pavel

Show quoted text

Thanks for the hint and always feel free to put my on CC regarding
questions about stuff on Use The Index, Luke!

-markus

ps.: It's perfectly possible that PG could use indexes for
window-functions before 9.3. I did definitively not fiddle around with cost
settings at that time to force it into this plan.
pps.: sorry for the delay, I'm not subscribed (just too much) but somebody
was nice enough to ping me about this.
ppps: then I wondered how to properly reply without having the original
messages. So I downloaded the .mbox from the archive and pushed reply
there. Hope it ends up in the right thread :)

Markus Winand
markus.winand@winand.at
T +43 1 9444047

"A wonderful book…I highly recommend it." -Anders Janmyr
http://sql-performance-explained.com/

Maderspergerstr. 1-3/9/11
1160 Wien
AUSTRIA

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Thomas Kellerer
spam_eater@gmx.net
In reply to: Markus Winand (#4)
Re: Is PostgreSQL 9.3 using indexes for pipelined top-N window function queries?

Markus,

thanks for your reply.

The pipelined top-n query has two very important properties:
(1) it utilizes the index order to avoid the sort operation required to satisfy ORDER BY clause
(2) it realizes that it can stop processing as soon as it has delivered enough rows.

The execution plan from Thomas Kellerer sees to fulfill requirement
(1) but definitively not (2).

Even with 9.3.2, I were not able to reproduce the result of Thomas
(not showing any sort operation in the execution plan) with the test
data I also published at my website:
http://use-the-index-luke.com/sql/example-schema/postgresql/partial-results

Then I started fiddling around with the planner's cost settings and
finally managed to get a plan similar to Thomas' when setting
random_page_cost to 0.1 (setting it to 1, equal to seq_page_cost was
not enough). However, that proves the point that PostgreSQL can use
an index to avoid the sort operation caused by order by (even for
window functions). I'd be curious what settings caused Thomas to get
this result.

Good point, I should adopt the habit to mention config settings for this kind of things:

shared_buffers = 2048MB
temp_buffers = 16MB
work_mem = 32MB
seq_page_cost = 1.0
random_page_cost = 1.5
cpu_tuple_cost = 0.001
cpu_index_tuple_cost = 0.001
effective_cache_size = 2048MB

The second requirement for pipelined top-n queries is not satisfied
in Thomas' execution plan: it does read the full index (actual
rows=1000000), and applies the window function over all rows. Only in
the end it throws away all non-confirming rows (Rows Removed by
Filter: 999899). A pipelined top-n execution would not cause more
than 300 rows read from the index, and only 200 rows removed by
filter. That's what the Oracle DB and SQL Server manage to do
(Thomas: ping me to track down why Oracle didn't for you).
Considering that PG can use the index order to avoid the sort, it
still doesn't make very much sense if it cannot abort the index scan
after fetching enough rows. So, not using the index might even be
right choice in unfiltered cases like this.

I probably should have read the definition of "index usage" more carefully, thanks for the clarification.

I created the testdata from your webpage locally and then ran the window function statement from
http://use-the-index-luke.com/sql/example-schema/postgresql/partial-results

SELECT *
FROM ( SELECT sales.*
, ROW_NUMBER() OVER (ORDER BY sale_date DESC
, sale_id DESC) rn
FROM sales
) tmp
WHERE rn between 11 and 20
ORDER BY sale_date DESC, sale_id DESC;

This does an "Index Scan Backward" on an index defined as (sale_date, sale_id) but takes nearly 2.5 seconds on my computer.

The version using OFFSET .. LIMIT took about 0.05 seconds and increases when the offset increases which is to be expected - whereas the window function version is pretty much constant even with a startpoint of 100001 - but the offset version is still *much* faster then.

Regards
Thomas

P.S.: Btw: thanks for your book and your webpage, both are very insipring.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general