Optimizer misses big in 10.4 with BRIN index

Started by Arcadiy Ivanovover 7 years ago7 messages
#1Arcadiy Ivanov
arcadiy@gmail.com

Hi,

Before spamming the list with reproduction examples I want to make sure
the issue isn't already known.

Moving to 10.4 from 9.2 (AWS RDS but repro on local laptop as well)
we've discovered that an optimizer prefers a seq scan to fully analyzed
consistent BRIN index, ending up with a query that is 4.8s long on
seqscan vs 56ms when forcing use of BRIN (85 times difference).
The size of the dataset is millions of rows and with extremely high
probability the rows are naturally clustered on BRIN index column.

Anybody observed anything like that?

========================================================
schema0=# SET enable_seqscan=false;
SET
schema0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT data,
count(*) OVER() AS full_count FROM schema0_lab.data_table WHERE segment
= 'pb1'
                    AND (data->>'tradeDate')::numeric >= '1531267200'
                    AND (data->>'tradeDate')::numeric <= '1531353600'
                    AND data->>'tradeStatus' = 'Replaced'
                    ORDER BY (data->>'tradeDate')::numeric
                    DESC;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=617851.03..617851.24 rows=84 width=1219) (actual
time=55.765..55.794 rows=611 loops=1)
   Output: data, (count(*) OVER (?)), (((data ->>
'tradeDate'::text))::numeric)
   Sort Key: (((data_table.data ->> 'tradeDate'::text))::numeric) DESC
   Sort Method: quicksort  Memory: 1256kB
   Buffers: shared hit=824
   ->  WindowAgg  (cost=1231.98..617848.34 rows=84 width=1219) (actual
time=52.688..55.068 rows=611 loops=1)
         Output: data, count(*) OVER (?), ((data ->>
'tradeDate'::text))::numeric
         Buffers: shared hit=824
         ->  Gather  (cost=1231.98..617846.66 rows=84 width=1179)
(actual time=8.247..51.804 rows=611 loops=1)
               Output: data
               Workers Planned: 2
               Workers Launched: 2
               Buffers: shared hit=824
               ->  Parallel Bitmap Heap Scan on schema0_lab.data_table 
(cost=231.98..616838.26 rows=35 width=1179) (actual time=3.850..46.704
rows=204 loops=3)
                     Output: data
                     Recheck Cond: ((((data_table.data ->>
'tradeDate'::text))::numeric >= '1531267200'::numeric) AND
(((data_table.data ->> 'tradeDate'::text))::numeric <=
'1531353600'::numeric))
                     Rows Removed by Index Recheck: 4404
                     Filter: (((data_table.segment)::text =
'pb1'::text) AND ((data_table.data ->> 'tradeStatus'::text) =
'Replaced'::text))
                     Heap Blocks: lossy=794
                     Buffers: shared hit=2334
                     Worker 0: actual time=3.572..44.145 rows=236 loops=1
                       Buffers: shared hit=749
                     Worker 1: actual time=0.326..45.184 rows=212 loops=1
                       Buffers: shared hit=761
                     ->  Bitmap Index Scan on tradedate_idx
(cost=0.00..231.96 rows=3377106 width=0) (actual time=4.500..4.500
rows=23040 loops=1)
                           Index Cond: ((((data_table.data ->>
'tradeDate'::text))::numeric >= '1531267200'::numeric) AND
(((data_table.data ->> 'tradeDate'::text))::numeric <=
'1531353600'::numeric))
                           Buffers: shared hit=30
 Planning time: 0.246 ms
 Execution time: 56.209 ms
(29 rows)

schema0=# SET enable_seqscan=true;
SET
schema0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT data,
count(*) OVER() AS full_count FROM schema0_lab.data_table WHERE segment
= 'pb1'
                    AND (data->>'tradeDate')::numeric >= '1531267200'
                    AND (data->>'tradeDate')::numeric <= '1531353600'
                    AND data->>'tradeStatus' = 'Replaced'
                    ORDER BY (data->>'tradeDate')::numeric
                    DESC;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------
 Sort  (cost=617619.05..617619.26 rows=84 width=1219) (actual
time=4823.081..4823.106 rows=611 loops=1)
   Output: data, (count(*) OVER (?)), (((data ->>
'tradeDate'::text))::numeric)
   Sort Key: (((data_table.data ->> 'tradeDate'::text))::numeric) DESC
   Sort Method: quicksort  Memory: 1256kB
   Buffers: shared hit=839 read=187353
   ->  WindowAgg  (cost=1000.00..617616.36 rows=84 width=1219) (actual
time=4820.005..4822.390 rows=611 loops=1)
         Output: data, count(*) OVER (?), ((data ->>
'tradeDate'::text))::numeric
         Buffers: shared hit=839 read=187353
         ->  Gather  (cost=1000.00..617614.68 rows=84 width=1179)
(actual time=3.262..4819.362 rows=611 loops=1)
               Output: data
               Workers Planned: 2
               Workers Launched: 2
               Buffers: shared hit=839 read=187353
               ->  Parallel Seq Scan on schema0_lab.data_table
(cost=0.00..616606.28 rows=35 width=1179) (actual time=6.135..4814.826
rows=204 loops=3)
                     Output: data
                     Filter: (((data_table.segment)::text =
'pb1'::text) AND ((data_table.data ->> 'tradeStatus'::text) =
'Replaced'::text) AND (((data_table.data ->>
'tradeDate'::text))::numeric >= '1531267200'::numeric) AND
(((data_table.data
 ->> 'tradeDate'::text))::numeric <= '1531353600'::numeric))
                     Rows Removed by Filter: 1125498
                     Buffers: shared hit=2400 read=561439
                     Worker 0: actual time=11.414..4812.744 rows=220
loops=1
                       Buffers: shared hit=775 read=187333
                     Worker 1: actual time=4.249..4813.264 rows=220 loops=1
                       Buffers: shared hit=786 read=186753
 Planning time: 0.232 ms
 Execution time: 4823.412 ms
(24 rows)

schema0=# \d schema0_lab.data_table

                                         Table "schema0_lab.data_table"
   Column   |          Type          | Collation | Nullable
|                     Default
------------+------------------------+-----------+----------+--------------------------------------------------
 id         | integer                |           | not null |
nextval('schema0_lab.data_table_id_seq'::regclass)
 address    | character varying(128) |           | not null |
 segment    | character varying(128) |           | not null |
 data       | jsonb                  |           | not null |
Indexes:
    "data_table_pkey" PRIMARY KEY, btree (id)
    "tradedate_idx" brin (((data ->> 'tradeDate'::text)::numeric)) WITH
(autosummarize='true')

--
Arcadiy Ivanov
arcadiy@gmail.com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov

#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Arcadiy Ivanov (#1)
Re: Optimizer misses big in 10.4 with BRIN index

Hi,

On 07/25/2018 03:58 PM, Arcadiy Ivanov wrote:

                     ->  Bitmap Index Scan on tradedate_idx
(cost=0.00..231.96 rows=3377106 width=0) (actual time=4.500..4.500
rows=23040 loops=1)
                           Index Cond: ((((data_table.data ->>
'tradeDate'::text))::numeric >= '1531267200'::numeric) AND
(((data_table.data ->> 'tradeDate'::text))::numeric <=
'1531353600'::numeric))

My guess is this is the root cause - the estimated number of rows is
much higher than in practice (3377106 vs. 23040), so at the end the
seqscan is considered to be slightly cheaper and wins. But the actual
row count is ~150x lower, making the bitmap index scan way faster.

IMHO you'll need to find a way to improve the estimates, which may be
difficult. The first thing I'd try is creating an expression index on
the expression you use in the WHERE clause. Something like

CREATE INDEX ON data_table (((data_table.data ->>
'tradeDate'::text))::numeric);

And then ANALYZE the table again ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3David Rowley
david.rowley@2ndquadrant.com
In reply to: Tomas Vondra (#2)
Re: Optimizer misses big in 10.4 with BRIN index

On 26 July 2018 at 04:50, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

My guess is this is the root cause - the estimated number of rows is much
higher than in practice (3377106 vs. 23040), so at the end the seqscan is
considered to be slightly cheaper and wins. But the actual row count is
~150x lower, making the bitmap index scan way faster.

Isn't the 23040 just the totalpages * 10 per `return totalpages * 10;`
in bringetbitmap()?

The BRIN index costing was changed quite dramatically in [1]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7e534adcdc70866e7be74d626b0ed067c890a251 which
first appears in pg10. Previous to that patch BRIN assumed it would
touch total_pages * pred_selectivity blocks in the bitmap scan. That
worked well when the table happened to be perfectly ordered by the
column of the BRIN index, but was pretty broken when the order was
random. That lead to BRIN indexes being used when they really
shouldn't have been. The patch tried to fix that by using what
information it could. That was basically the correlation statistics
from pg_statistic. It appeared that the patch was producing more
realistic plans than unpatched, so we went with it, but it's
definitely not perfect; what statistics are?

It would be quite interesting to know the result of:

select stakind3 from pg_Statistic where starelid =
'schema0_lab.data_table'::regclass;

I also see the estimated costs of either plan are very close, so the
chosen plan may well be quite susceptible to changing after different
ANALYZE runs on the table.

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7e534adcdc70866e7be74d626b0ed067c890a251

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4Emre Hasegeli
emre@hasegeli.com
In reply to: David Rowley (#3)
Re: Optimizer misses big in 10.4 with BRIN index

Isn't the 23040 just the totalpages * 10 per `return totalpages * 10;`
in bringetbitmap()?

Yes, it is just confusing. The correct value is on one level up of
the tree. It is 204 + 4404 rows removed by index recheck = 4608, so
the estimate is not only 150x but 733x off :(.

The sequential scan plan shows 204 + 1125498 rows removed by filter =
1125702 as the actual table size. However the former plan estimates
to get 3377106 rows from the index. That is 3x of the table size.
The selectivity estimation cannot be greater than 1. If I am not
missing anything, the general statistics of this table should be
seriously outdated.

#5Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Emre Hasegeli (#4)
Re: Optimizer misses big in 10.4 with BRIN index

On 07/26/2018 10:11 AM, Emre Hasegeli wrote:

Isn't the 23040 just the totalpages * 10 per `return totalpages * 10;`
in bringetbitmap()?

Yes, it is just confusing. The correct value is on one level up of
the tree. It is 204 + 4404 rows removed by index recheck = 4608, so
the estimate is not only 150x but 733x off :(.

The sequential scan plan shows 204 + 1125498 rows removed by filter =
1125702 as the actual table size. However the former plan estimates
to get 3377106 rows from the index. That is 3x of the table size.
The selectivity estimation cannot be greater than 1. If I am not
missing anything, the general statistics of this table should be
seriously outdated.

Hmmm, yeah. It's s bot confusing, and the parallel plan does not improve
the situation either :-(

Arcadiy, can you provide plans with parallel query disabled? Or even
better, produce a test case that reproduces this (using synthetic data,
anonymized data or something like that, if needed).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6Arcadiy Ivanov
arcadiy@gmail.com
In reply to: Tomas Vondra (#5)
Re: Optimizer misses big in 10.4 with BRIN index

On 07/26/2018 07:27 AM, Tomas Vondra wrote:

Arcadiy, can you provide plans with parallel query disabled? Or even
better, produce a test case that reproduces this (using synthetic
data, anonymized data or something like that, if needed).

So I basically spent most of the time trying to create a reproducible
case and I can say I failed. I can however reproduce this with specific
large data set with specific data distribution, but not an artificial
one. Unfortunately data is restricted so I have to obfuscate the data
structures. That said, I am able to reproduce this sporadically on local
machine after exporting from RDS 10.4. On some days reproduction happens
always, on others it's hard to reproduce at all. So here it goes in
sequential order:

** Data is imported into database
======================
db0=# \d+ schema0.table0
                                         Table "schema0.table0"
   Column   |          Type          | Collation | Nullable | Default |
Storage  | Stats target | Description
------------+------------------------+-----------+----------+---------+----------+--------------+-------------
 field1     | character varying(128) |           | not null |         |
extended |              |
 field2     | character varying(128) |           | not null |         |
extended |              |
 field3     | character varying(128) |           | not null |         |
extended |              |
 field4     | bigint                 |           | not null |         |
plain    |              |
 field5     | bigint                 |           | not null |         |
plain    |              |
 field6     | bigint                 |           | not null |         |
plain    |              |
 data       | jsonb                  |           | not null |         |
extended |              |
Indexes:
    "date_idx" brin (((data ->> 'date'::text)::numeric)) WITH
(autosummarize='true')
Options: autovacuum_vacuum_scale_factor=0.0,
autovacuum_vacuum_threshold=5000, autovacuum_analyze_scale_factor=0.0,
autovacuum_analyze_threshold=5000, fillfactor=75

** Data distribution in the table
======================
db0# SELECT (data->>'date')::numeric as date, count(1) FROM
schema0.table0 GROUP BY date ORDER BY date;
    date    |  count
------------+---------
 1527580800 |       5
 1527753600 |      72
 1527840000 |      36
 1528012800 |      18
 1528099200 |      72
 1528185600 |    7266
 1528272000 |     336
 1528358400 |     230
 1528444800 |      90
 1528704000 |      76
 1528790400 |       4
 1528876800 |    9060
 1528934400 |       6
 1528963200 |      33
 1529193600 |       6
 1529280000 |      18
 1529294400 |  541344
 1529380800 | 1113121
 1529467200 |  794082
 1529553600 |  604752
 1529812800 |  135252
 1529899200 |   63222
 1529985600 |   31134
 1530072000 |   25392
 1530158400 |      90
 1530417600 |      48
 1530504000 |     144
 1530518400 |       1
 1530590400 |    1958
 1530604800 |       1
 1530763200 |   48614
 1530849600 |       9
 1531022400 |       3
 1531108800 |       3
 1531195200 |       1
*** This is the range we're interested in
 1531281600 |    3713
*******
 1531627200 |     300
 1531713600 |    1113
 1531800000 |      30
 1531886400 |   36000
 1532232000 |   24900
 1532577600 |  409364
 1532664000 |    2050
 1532836800 |  334782
 1532923200 |    6771
(45 rows)

** No analysis was run after import. Let's see how we do.
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1102962.86..1103079.54 rows=1000 width=38) (actual
time=7234.091..7234.697 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=9 read=350293
   ->  Gather Merge  (cost=1102962.86..1105002.57 rows=17482 width=38)
(actual time=7234.089..7234.589 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=9 read=350293
         ->  Sort  (cost=1101962.84..1101984.69 rows=8741 width=38)
(actual time=7227.938..7227.985 rows=964 loops=3)
               Output: ctid, (((data ->> 'date'::text))::numeric)
               Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
               Sort Method: quicksort  Memory: 102kB
               Buffers: shared hit=144 read=1048851
               Worker 0: actual time=7223.770..7223.820 rows=1272 loops=1
                 Buffers: shared hit=68 read=348293
               Worker 1: actual time=7226.520..7226.589 rows=1299 loops=1
                 Buffers: shared hit=67 read=350265
               ->  Parallel Seq Scan on schema0.table0
(cost=0.00..1101390.58 rows=8741 width=38) (actual time=68.862..7227.366
rows=1238 loops=3)
                     Output: ctid, ((data ->> 'date'::text))::numeric
                     Filter: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                     Rows Removed by Filter: 1397270
                     Buffers: shared hit=32 read=1048849
                     Worker 0: actual time=50.494..7223.107 rows=1272
loops=1
                       Buffers: shared hit=11 read=348293
                     Worker 1: actual time=53.408..7225.849 rows=1299
loops=1
                       Buffers: shared hit=12 read=350263
 Planning time: 2.096 ms
 Execution time: 7235.867 ms
(28 rows)

** Ok, no statistics, Par Seq Scan is selected, it's expected.
=======================
db0=# VACUUM ANALYZE VERBOSE schema0.table0;
INFO:  vacuuming "schema0.table0"
INFO:  index "date_idx" now contains 8194 row versions in 45 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "table0": found 0 removable, 4195522 nonremovable row versions in
1048881 out of 1048881 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1243
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 2.36 s, system: 4.84 s, elapsed: 7.48 s.
INFO:  vacuuming "pg_toast.pg_toast_7888880"
INFO:  index "pg_toast_7888880_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "pg_toast_7888880": found 0 removable, 0 nonremovable row
versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1243
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  analyzing "schema0.table0"
INFO:  "table0": scanned 30000 of 1048881 pages, containing 120000 live
rows and 0 dead rows; 30000 rows in sample, 4195524 estimated total rows
VACUUM

** Statistics are collected. Let's explain again
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7498.674..7498.791 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=74 read=411764
   ->  Sort  (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7498.672..7498.731 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
         Sort Method: top-N heapsort  Memory: 95kB
         Buffers: shared hit=74 read=411764
         ->  Bitmap Heap Scan on schema0.table0 (cost=162.16..2371.97
rows=482 width=38) (actual time=58.322..7497.656 rows=3713 loops=1)
               Output: ctid, ((data ->> 'date'::text))::numeric
               Recheck Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
               Rows Removed by Index Recheck: 1643390
               Heap Blocks: lossy=411776
               Buffers: shared hit=74 read=411764
               ->  Bitmap Index Scan on date_idx (cost=0.00..162.03
rows=557 width=0) (actual time=40.243..40.243 rows=4117760 loops=1)
                     Index Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                     Buffers: shared hit=62
 Planning time: 1.388 ms
 Execution time: 7498.907 ms
(19 rows)

** Wait... What??? "Rows Removed by Index Recheck: 1643390" but data is
almost sequential! Let's take a look at it.
=======================
db0=# SELECT ctid, (data->>'date')::numeric as date FROM schema0.table0
WHERE (data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
   ctid    |    date
-----------+------------
 (4456,2)  | 1531281600
 (4468,2)  | 1531281600
 (9301,2)  | 1531281600
 (9310,3)  | 1531281600
...
 (9317,4)  | 1531281600
 (9325,3)  | 1531281600
 (9325,4)  | 1531281600
 (9326,1)  | 1531281600
 (9328,1)  | 1531281600
 (9328,2)  | 1531281600
 (9345,1)  | 1531281600
...
 (9365,2)  | 1531281600
 (9386,2)  | 1531281600
 (9386,3)  | 1531281600
 (9386,4)  | 1531281600
 (9418,4)  | 1531281600
...
 (9419,4)  | 1531281600
 (12921,2) | 1531281600
...
 (12962,2) | 1531281600
 (12983,1) | 1531281600
 (12983,2) | 1531281600
 (12983,3) | 1531281600
 (12983,4) | 1531281600
 (13028,3) | 1531281600
 (13028,4) | 1531281600
 (14113,1) | 1531281600
 (14113,2) | 1531281600
 (15297,1) | 1531281600
...
 (15317,2) | 1531281600
 (15350,1) | 1531281600
 (15350,2) | 1531281600
 (15350,3) | 1531281600
 (15350,4) | 1531281600
 (15386,3) | 1531281600
 (15386,4) | 1531281600
 (17620,3) | 1531281600
...
 (17648,4) | 1531281600
 (17652,1) | 1531281600
...
 (17653,2) | 1531281600
 (17658,1) | 1531281600
...
 (17659,2) | 1531281600
 (21229,2) | 1531281600
 (21243,1) | 1531281600
...
 (21246,4) | 1531281600
 (35794,2) | 1531281600
...
 (35798,1) | 1531281600
 (35827,4) | 1531281600
...
 (35887,2) | 1531281600
 (37006,3) | 1531281600
...
 (37080,1) | 1531281600
 (37157,2) | 1531281600
 (37157,3) | 1531281600
 (37157,4) | 1531281600
 (39475,1) | 1531281600
...
 (39552,4) | 1531281600
 (39623,4) | 1531281600
 (39653,1) | 1531281600
 (39653,2) | 1531281600
 (39653,3) | 1531281600
 (39653,4) | 1531281600
 (39674,1) | 1531281600
 (39674,2) | 1531281600
 (39674,3) | 1531281600
 (39674,4) | 1531281600
 (40829,3) | 1531281600
 (40829,4) | 1531281600
 (41946,4) | 1531281600
...
 (41962,2) | 1531281600
(1000 rows)

** Ok, fine, there are large gaps. Let's make **absolutely sure** the
data is sequential.
=======================
db0=# CREATE INDEX date_ord_idx ON schema0.table0 (((table0.data ->>
'date')::numeric));
CREATE INDEX

** Not running any analysis on a new index, shouldn't matter if stats
are wrong, expecting scan and sort.
=======================
db0=# CLUSTER VERBOSE schema0.table0 USING date_ord_idx;
INFO:  clustering "schema0.table0" using sequential scan and sort
INFO:  "table0": found 0 removable, 4195522 nonremovable row versions in
1048881 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 873.12 s, system: 19.18 s, elapsed: 1019.36 s.
CLUSTER

** Ok, now we're clustered. Dropping order b-tree index.
=======================
db0=# DROP INDEX schema0.date_ord_idx ;
DROP INDEX

** Let's explain again
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7339.584..7339.703 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=56 read=410752
   ->  Sort  (cost=2393.45..2394.66 rows=482 width=38) (actual
time=7339.583..7339.634 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
         Sort Method: top-N heapsort  Memory: 95kB
         Buffers: shared hit=56 read=410752
         ->  Bitmap Heap Scan on schema0.table0 (cost=162.16..2371.97
rows=482 width=38) (actual time=59.252..7338.635 rows=3713 loops=1)
               Output: ctid, ((data ->> 'date'::text))::numeric
               Recheck Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
               Rows Removed by Index Recheck: 1639294
               Heap Blocks: lossy=410752
               Buffers: shared hit=56 read=410752
               ->  Bitmap Index Scan on date_idx (cost=0.00..162.03
rows=557 width=0) (actual time=24.545..24.545 rows=4107520 loops=1)
                     Index Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                     Buffers: shared hit=56
 Planning time: 0.515 ms
 Execution time: 7339.772 ms
(19 rows)

**  "Rows Removed by Index Recheck: 1639294" - OK, statistics are
pooched post-clustering. Let's reanalyze.
=======================
db0=# ANALYZE VERBOSE schema0.table0;
INFO:  analyzing "schema0.table0"
INFO:  "table0": scanned 30000 of 1048881 pages, containing 120000 live
rows and 0 dead rows; 30000 rows in sample, 4195524 estimated total rows
ANALYZE

**  Let's try this again...
=======================
db0=# EXPLAIN (analyze, verbose, costs, buffers) SELECT ctid,
(data->>'date')::numeric as date FROM schema0.table0 WHERE
(data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=5444.33..5445.71 rows=555 width=38) (actual
time=6533.824..6533.927 rows=1000 loops=1)
   Output: ctid, (((data ->> 'date'::text))::numeric)
   Buffers: shared hit=410808
   ->  Sort  (cost=5444.33..5445.71 rows=555 width=38) (actual
time=6533.822..6533.862 rows=1000 loops=1)
         Output: ctid, (((data ->> 'date'::text))::numeric)
         Sort Key: table0.ctid, (((table0.data ->>
'date'::text))::numeric) DESC
         Sort Method: top-N heapsort  Memory: 95kB
         Buffers: shared hit=410808
         ->  Bitmap Heap Scan on schema0.table0 (cost=162.22..5419.03
rows=555 width=38) (actual time=59.316..6532.913 rows=3713 loops=1)
               Output: ctid, ((data ->> 'date'::text))::numeric
               Recheck Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
               Rows Removed by Index Recheck: 1639294
               Heap Blocks: lossy=410752
               Buffers: shared hit=410808
               ->  Bitmap Index Scan on date_idx (cost=0.00..162.08
rows=1339 width=0) (actual time=30.859..30.859 rows=4107520 loops=1)
                     Index Cond: ((((table0.data ->>
'date'::text))::numeric >= '1531267200'::numeric) AND (((table0.data ->>
'date'::text))::numeric <= '1531353600'::numeric))
                     Buffers: shared hit=56
 Planning time: 1.100 ms
 Execution time: 6534.026 ms
(19 rows)

** This cannot be possible! The table must be clustered on the same
expression we cluster BRIN! Let's look at the data.
=======================
db0=# SELECT ctid, (data->>'date')::numeric as date FROM schema0.table0
WHERE (data->>'date')::numeric >= '1531267200'
                    AND (data->>'date')::numeric <= '1531353600'
                    ORDER BY ctid, (data->>'date')::numeric
                    DESC LIMIT 1000;
   ctid    |    date
-----------+------------
 (4155,2)  | 1531281600
...
 (4159,1)  | 1531281600
 (4188,4)  | 1531281600
 (4189,1)  | 1531281600
 (4189,2)  | 1531281600
 (4189,3)  | 1531281600
 (4189,4)  | 1531281600
 (4231,4)  | 1531281600
 (4247,3)  | 1531281600
 (4247,4)  | 1531281600
 (4248,1)  | 1531281600
 (4248,2)  | 1531281600
 (5367,3)  | 1531281600
...
 (5368,4)  | 1531281600
 (5376,3)  | 1531281600
...
 (5441,1)  | 1531281600
 (5518,2)  | 1531281600
 (5518,3)  | 1531281600
 (5518,4)  | 1531281600
 (7836,1)  | 1531281600
 (7836,2)  | 1531281600
 (7851,3)  | 1531281600
...
 (7913,4)  | 1531281600
 (7984,4)  | 1531281600
 (8014,1)  | 1531281600
 (8014,2)  | 1531281600
 (8014,3)  | 1531281600
 (8014,4)  | 1531281600
 (8035,1)  | 1531281600
 (8035,2)  | 1531281600
 (8035,3)  | 1531281600
 (8035,4)  | 1531281600
 (9190,3)  | 1531281600
 (9190,4)  | 1531281600
 (10307,4) | 1531281600
...
 (10349,3) | 1531281600
 (10371,3) | 1531281600
...
 (10392,2) | 1531281600
 (10445,4) | 1531281600
 (10446,1) | 1531281600
 (10446,2) | 1531281600
 (11580,3) | 1531281600
...
 (11586,2) | 1531281600
 (11595,3) | 1531281600
...
 (11598,2) | 1531281600
 (11600,1) | 1531281600
...
 (11601,2) | 1531281600
 (11609,1) | 1531281600
...
 (11621,2) | 1531281600
 (11621,3) | 1531281600
 (11664,3) | 1531281600
...
 (11684,3) | 1531281600
 (11726,2) | 1531281600
...
 (11770,4) | 1531281600
 (12891,2) | 1531281600
 (14057,1) | 1531281600
...
 (14058,2) | 1531281600
 (14064,3) | 1531281600
...
 (14071,3) | 1531281600
(1000 rows)

****

The issue seems to be with both clustering and BRIN stats.

Fillfactor on the table is not a factor, can be 100 can be 75, I'm able
to reproduce and fail to reproduce with both.
Clustering on `date_ord_idx` can be done with ANALYZE and without, in
some cases PG picks scan and sort, sometimes it uses index regardless of
stats.
Sometimes the table is properly clustered, sometimes it isn't
afterwards. Sometimes if you cluster first on the unanalyzed
`date_ord_idx`, clustering doesn't fix the problem, but then analyzing
and clustering on `date_ord_idx` again does fix the issue. Sometimes
neither helps.
Full vacuum analyze may help, or may not help, depending on the phase of
the moon.
When BRIN is actually fixed, the explain has a few hundred rows
eliminated by the recheck, not >1M, and queries are 50 - 70 ms long for
date range.

If anybody can give a suggestion on where in the code to start looking,
I'll be extra-grateful.

--
Arcadiy Ivanov
arcadiy@gmail.com | @arcivanov | https://ivanov.biz
https://github.com/arcivanov

#7Emre Hasegeli
emre@hasegeli.com
In reply to: Arcadiy Ivanov (#6)
Re: Optimizer misses big in 10.4 with BRIN index

So I basically spent most of the time trying to create a reproducible case
and I can say I failed. I can however reproduce this with specific large
data set with specific data distribution, but not an artificial one.

The query plans posted that has the statistics prefer Bitmap Index
Scan. This is not reproduction of the originally posted case.

** Wait... What??? "Rows Removed by Index Recheck: 1643390" but data is
almost sequential! Let's take a look at it.

I don't think it has anything to do with query planning. Have you
tried "pages_per_range" option of BRIN?