Avoid using index scan backward when limit order desc

Started by Christophe Escobarover 9 years ago3 messagesgeneral

Jump to latest

Christophe Escobar

christophe.esco@gmail.com

over 9 years ago

Hi,

I am using PostgreSQL 9.4.8 on x86_64-unknown-linux-gnu, compiled by gcc
(Debian 4.9.2-10) 4.9.2, 64-bit.

I have a notification table with about ~45 000 000 rows.

I have some performance issues when trying to fetch rows from the table
with a specific query, I suspect the planner to choose the wrong index
because of the limit.
The query look like this: SELECT * FROM notifications WHERE bucket_id IN
(30231,30230,30104) AND type IN ('foo', 'bar') ORDER BY created_at DESC
LIMIT 20;
It returns 0 rows for this example.

The table looks like this with the indices:

Before testing, I have run a VACUUM ANALYZE, and here are the statistics on
the indices:

SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE '%notifications%';
relname | relkind | reltuples
| relpages
---------------------------------------------------+--------
-+-------------+----------
notifications_pkey | i | 4.55221e+07
| 124820
index_notifications_on_account_id | i | 4.55221e+07
| 124820
index_notifications_on_bucket_id | i | 4.55221e+07
| 124819
index_notifications_on_item_id | i | 4.55221e+07
| 124821
index_notifications_on_version_id | i | 4.55221e+07
| 124821
index_notifications_on_created_at_and_bucket_id | i | 4.55221e+07
| 175281
index_notifications_on_type_and_bucket_id | i | 4.55221e+07
| 188281
notifications_id_seq | S | 1
| 1
notifications | r | 4.55225e+07
| 566412

I tried three different EXPLAIN ANALYZE, on a subset of my table (with the
entire table, I have yet to see what is the total duration of the query
when using LIMIT 20, but it takes more than 5 minutes which is not
acceptable for my use case).

** Without limit **

EXPLAIN ANALYZE SELECT * FROM notifications WHERE bucket_id IN
(30231,30230,30104) AND type IN ('foo', 'bar') ORDER BY created_at DESC;

------------------------------------------------------------
------------------------------------------------------------
------------------------------------
Sort (cost=7474.92..7480.08 rows=2061 width=187) (actual
time=0.149..0.149 rows=0 loops=1)
Sort Key: created_at
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on notifications (cost=71.68..7361.47 rows=2061
width=187) (actual time=0.136..0.136 rows=0 loops=1)
Recheck Cond: (((type)::text = ANY ('{foo,bar}'::text[])) AND
(bucket_id = ANY ('{30231,30230,30104}'::integer[])))
-> Bitmap Index Scan on index_notifications_on_type_and_bucket_id
(cost=0.00..71.16 rows=2061 width=0) (actual time=0.135..0.135 rows=0
loops=1)
Index Cond: (((type)::text = ANY ('{foo,bar}'::text[])) AND
(bucket_id = ANY ('{30231,30230,30104}'::integer[])))
Planning time: 0.715 ms
Execution time: 0.198 ms

** With LIMIT 20 **

EXPLAIN ANALYZE SELECT * FROM notifications WHERE bucket_id IN
(30231,30230,30104) AND type IN ('foo', 'bar') ORDER BY created_at DESC
limit 20;

------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------------------
Limit (cost=0.43..3341.84 rows=20 width=187) (actual
time=60133.701..60133.701 rows=0 loops=1)
-> Index Scan Backward using
index_notifications_on_created_at_and_bucket_id
on notifications (cost=0.43..344332.66 rows=2061 width=187) (actual
time=60133.695..60133.695 rows=0 loops=1)
Filter: (((type)::text = ANY ('{foo,bar}'::text[])) AND (bucket_id
= ANY ('{30231,30230,30104}'::integer[])))
Rows Removed by Filter: 3441510
Planning time: 1.034 ms
Execution time: 60133.740 ms

** With limit 50 **

EXPLAIN ANALYZE SELECT * FROM notifications WHERE bucket_id IN
(30231,30230,30104) AND type IN ('foo', 'bar') ORDER BY created_at DESC
limit 50;

------------------------------------------------------------
------------------------------------------------------------
------------------------------------------
Limit (cost=7429.94..7430.06 rows=50 width=187) (actual time=0.111..0.111
rows=0 loops=1)
-> Sort (cost=7429.94..7435.09 rows=2061 width=187) (actual
time=0.110..0.110 rows=0 loops=1)
Sort Key: created_at
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on notifications (cost=71.68..7361.47
rows=2061 width=187) (actual time=0.107..0.107 rows=0 loops=1)
Recheck Cond: (((type)::text = ANY ('{foo,bar}'::text[]))
AND (bucket_id = ANY ('{30231,30230,30104}'::integer[])))
-> Bitmap Index Scan on
index_notifications_on_type_and_bucket_id
(cost=0.00..71.16 rows=2061 width=0) (actual time=0.105..0.105 rows=0
loops=1)
Index Cond: (((type)::text = ANY
('{foo,bar}'::text[])) AND (bucket_id = ANY ('{30231,30230,30104}'::intege
r[])))
Planning time: 0.151 ms
Execution time: 0.139 ms

As you can see, when I have the LIMIT 20, the execution time takes around 1
minutes (on a very small subset of the entire table).
Actually I have tried different LIMIT, and when the LIMIT is <= 45, it will
use the index scan backward.

Removing the index 'index_notifications_on_created_at_and_bucket_id' may
prevent the planner from choosing the index scan backward for this query,
but this index is used for other querying on that table...

1) Why is the planner changing index scanning at the threshold of 45 for
the LIMIT ? Why not 50 ? 100 ? I may take the solution in my application to
have a LIMIT > 45 in order to prevent the performance issue, but am I sure
that this threshold will always be the same ?

2) Is it possible for a specific query to force the planner on choosing a
given index or preventing it from choosing one ?

What kind of other options do I have to solve this performance issue ?

Thanks in advance for any help,

Regards,

--
Christophe Escobar

Melvin Davidson

melvin6925@gmail.com

over 9 years ago

In reply to: Christophe Escobar (#1)

Re: Avoid using index scan backward when limit order desc

On Mon, Dec 19, 2016 at 12:05 PM, Christophe Escobar <
christophe.esco@gmail.com> wrote:

Hi,

I am using PostgreSQL 9.4.8 on x86_64-unknown-linux-gnu, compiled by gcc
(Debian 4.9.2-10) 4.9.2, 64-bit.

I have a notification table with about ~45 000 000 rows.

I have some performance issues when trying to fetch rows from the table
with a specific query, I suspect the planner to choose the wrong index
because of the limit.
The query look like this: SELECT * FROM notifications WHERE bucket_id IN
(30231,30230,30104) AND type IN ('foo', 'bar') ORDER BY created_at DESC
LIMIT 20;
It returns 0 rows for this example.

The table looks like this with the indices:

Table "notifications"
Column | Type |
Modifiers
----------------------------+-----------------------------+-
-----------------------------------------------------------
id | integer | not null
default nextval('notifications_id_seq'::regclass)
account_id | integer |
version_id | integer |
item_id | integer |
type | character varying(255) |
created_at | timestamp without time zone |
updated_at | timestamp without time zone |
meta_data | text |
bucket_id | integer |
Indexes:
"notifications_pkey" PRIMARY KEY, btree (id)
"index_notifications_on_account_id" btree (account_id)
"index_notifications_on_bucket_id" btree (bucket_id)
"index_notifications_on_item_id" btree (item_id)
"index_notifications_on_created_at_and_bucket_id" btree (created_at,
bucket_id)
"index_notifications_on_type_and_bucket_id" btree (type, bucket_id)
"index_notifications_on_version_id" btree (version_id)

Before testing, I have run a VACUUM ANALYZE, and here are the statistics
on the indices:

SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE '%notifications%';
relname | relkind | reltuples
| relpages
---------------------------------------------------+--------
-+-------------+----------
notifications_pkey | i | 4.55221e+07
| 124820
index_notifications_on_account_id | i |
4.55221e+07 | 124820
index_notifications_on_bucket_id | i |
4.55221e+07 | 124819
index_notifications_on_item_id | i |
4.55221e+07 | 124821
index_notifications_on_version_id | i |
4.55221e+07 | 124821
index_notifications_on_created_at_and_bucket_id | i |
4.55221e+07 | 175281
index_notifications_on_type_and_bucket_id | i |
4.55221e+07 | 188281
notifications_id_seq | S | 1
| 1
notifications | r | 4.55225e+07
| 566412

I tried three different EXPLAIN ANALYZE, on a subset of my table (with the
entire table, I have yet to see what is the total duration of the query
when using LIMIT 20, but it takes more than 5 minutes which is not
acceptable for my use case).

** Without limit **

EXPLAIN ANALYZE SELECT * FROM notifications WHERE bucket_id IN
(30231,30230,30104) AND type IN ('foo', 'bar') ORDER BY created_at DESC;

------------------------------------------------------------
------------------------------------------------------------
------------------------------------
Sort (cost=7474.92..7480.08 rows=2061 width=187) (actual
time=0.149..0.149 rows=0 loops=1)
Sort Key: created_at
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on notifications (cost=71.68..7361.47 rows=2061
width=187) (actual time=0.136..0.136 rows=0 loops=1)
Recheck Cond: (((type)::text = ANY ('{foo,bar}'::text[])) AND
(bucket_id = ANY ('{30231,30230,30104}'::integer[])))
-> Bitmap Index Scan on index_notifications_on_type_and_bucket_id
(cost=0.00..71.16 rows=2061 width=0) (actual time=0.135..0.135 rows=0
loops=1)
Index Cond: (((type)::text = ANY ('{foo,bar}'::text[])) AND
(bucket_id = ANY ('{30231,30230,30104}'::integer[])))
Planning time: 0.715 ms
Execution time: 0.198 ms

** With LIMIT 20 **

EXPLAIN ANALYZE SELECT * FROM notifications WHERE bucket_id IN
(30231,30230,30104) AND type IN ('foo', 'bar') ORDER BY created_at DESC
limit 20;

------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------------------
Limit (cost=0.43..3341.84 rows=20 width=187) (actual
time=60133.701..60133.701 rows=0 loops=1)
-> Index Scan Backward using index_notifications_on_created_at_and_bucket_id
on notifications (cost=0.43..344332.66 rows=2061 width=187) (actual
time=60133.695..60133.695 rows=0 loops=1)
Filter: (((type)::text = ANY ('{foo,bar}'::text[])) AND
(bucket_id = ANY ('{30231,30230,30104}'::integer[])))
Rows Removed by Filter: 3441510
Planning time: 1.034 ms
Execution time: 60133.740 ms

** With limit 50 **

EXPLAIN ANALYZE SELECT * FROM notifications WHERE bucket_id IN
(30231,30230,30104) AND type IN ('foo', 'bar') ORDER BY created_at DESC
limit 50;

------------------------------------------------------------
------------------------------------------------------------
------------------------------------------
Limit (cost=7429.94..7430.06 rows=50 width=187) (actual
time=0.111..0.111 rows=0 loops=1)
-> Sort (cost=7429.94..7435.09 rows=2061 width=187) (actual
time=0.110..0.110 rows=0 loops=1)
Sort Key: created_at
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on notifications (cost=71.68..7361.47
rows=2061 width=187) (actual time=0.107..0.107 rows=0 loops=1)
Recheck Cond: (((type)::text = ANY ('{foo,bar}'::text[]))
AND (bucket_id = ANY ('{30231,30230,30104}'::integer[])))
-> Bitmap Index Scan on index_notifications_on_type_and_bucket_id
(cost=0.00..71.16 rows=2061 width=0) (actual time=0.105..0.105 rows=0
loops=1)
Index Cond: (((type)::text = ANY
('{foo,bar}'::text[])) AND (bucket_id = ANY ('{30231,30230,30104}'::intege
r[])))
Planning time: 0.151 ms
Execution time: 0.139 ms

As you can see, when I have the LIMIT 20, the execution time takes around
1 minutes (on a very small subset of the entire table).
Actually I have tried different LIMIT, and when the LIMIT is <= 45, it
will use the index scan backward.

Removing the index 'index_notifications_on_created_at_and_bucket_id' may
prevent the planner from choosing the index scan backward for this query,
but this index is used for other querying on that table...

1) Why is the planner changing index scanning at the threshold of 45 for
the LIMIT ? Why not 50 ? 100 ? I may take the solution in my application to
have a LIMIT > 45 in order to prevent the performance issue, but am I sure
that this threshold will always be the same ?

2) Is it possible for a specific query to force the planner on choosing a
given index or preventing it from choosing one ?

What kind of other options do I have to solve this performance issue ?

Thanks in advance for any help,

Regards,

--
Christophe Escobar

*You can temporarily disable index scanning for a session with*

*SET enable_indexscan = off;*

*and/orSET enable_indexonlyscan = off;*
--
*Melvin Davidson*
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Christophe Escobar (#1)

Re: Avoid using index scan backward when limit order desc

Christophe Escobar <christophe.esco@gmail.com> writes:

I have some performance issues when trying to fetch rows from the table
with a specific query, I suspect the planner to choose the wrong index
because of the limit.
...
1) Why is the planner changing index scanning at the threshold of 45 for
the LIMIT ?

That's just where the crossover point happens to fall for the estimated
costs of the two plans. Of course, the costs are based on the estimate
of 2061 rows matching the WHERE conditions, and since the true figure
is 0 rows, the cost estimates are bad :-(

2) Is it possible for a specific query to force the planner on choosing a
given index or preventing it from choosing one ?

In this example I'd be inclined to prevent the ORDER BY from being
considered while choosing an index, which you can do with the traditional
optimization fence of OFFSET 0:

SELECT * FROM
(SELECT * FROM notifications
WHERE bucket_id IN (30231,30230,30104) AND type IN ('foo', 'bar')
OFFSET 0) ss
ORDER BY created_at DESC limit 20;

Of course, if you have cases where the WHERE conditions will select
a large number of rows, this will prevent the planner from making
a wise choice in those cases --- the use of the created_at index isn't
inherently stupid, it all depends on how many rows match the WHERE.

Depending on how wedded you are to your current data representation,
it might be better to redesign the table so that you don't need to
test two independent columns to select the rows you care about.
That would improve the odds of getting a decent rowcount estimate.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general