BRIN index which is much faster never chosen by planner

Started by Jeremy Finzel over 6 years ago, 17 messages
#1 Jeremy Finzel
finzelj@gmail.com

Good Afternoon,

I posted about this on another thread here
</messages/by-id/CAMa1XUiH3hL3KGwdKGjnJdJeo2A5H1o1uhtXWBkmMqixrDCWMA@mail.gmail.com>,
but the topic was not precisely planner issues, so I wanted to post it here.

I am running Postgres 11.5. I have a table that is insert-only and has 312
million rows. It is also pruned continuously to keep only the past year of
data. The size of the table is 223 GB with indexes, 140 GB without. One of
the fields is rec_insert_time timestamptz. Here are all potentially relevant
table stats:

schemaname | foo
relname | log_table
n_tup_ins | 86850506
n_tup_upd | 0
n_tup_del | 68916115
n_tup_hot_upd | 0
n_live_tup | 312810691
n_dead_tup | 9405132
n_mod_since_analyze | 11452
last_vacuum | 2019-09-20 09:41:43.78383-05
last_autovacuum | 2019-10-04 13:56:16.810355-05
last_analyze | 2019-10-10 09:34:26.807695-05
last_autoanalyze |
vacuum_count | 2
autovacuum_count | 1
analyze_count | 13
autoanalyze_count | 0
total_relation_size | 223 GB
relation_size | 139 GB
table_size | 140 GB
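
(These figures come from pg_stat_user_tables plus the relation-size
functions; roughly something like this, as a sketch rather than the exact
query:)

SELECT s.schemaname, s.relname,
       s.n_tup_ins, s.n_tup_upd, s.n_tup_del, s.n_tup_hot_upd,
       s.n_live_tup, s.n_dead_tup, s.n_mod_since_analyze,
       s.last_vacuum, s.last_autovacuum, s.last_analyze, s.last_autoanalyze,
       s.vacuum_count, s.autovacuum_count, s.analyze_count, s.autoanalyze_count,
       pg_size_pretty(pg_total_relation_size(s.relid)) AS total_relation_size,
       pg_size_pretty(pg_relation_size(s.relid))       AS relation_size,
       pg_size_pretty(pg_table_size(s.relid))          AS table_size
FROM pg_stat_user_tables s
WHERE s.schemaname = 'foo' AND s.relname = 'log_table';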

I have a simple query looking at past 10 days based on rec_insert_time, and
it will not choose the BRIN index even with several configurations. Here
are all my relevant indexes (I intentionally do not have a btree on
rec_insert_time because I believe BRIN *should* fit better here):

"log_table_brand_id_product_rec_insert_time_idx" btree (brand_id, product,
rec_insert_time)
"log_table_rec_insert_time_idx" brin (rec_insert_time)
"log_table_rec_insert_time_idx1" brin (rec_insert_time) WITH
(pages_per_range='64')
"rec_insert_time_brin_1000" brin (rec_insert_time) WITH
(pages_per_range='1000')
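
(For reference, the BRIN ones were created along these lines; pages_per_range
is the storage parameter that controls how many heap pages each range summary
covers, default 128:)

CREATE INDEX log_table_rec_insert_time_idx ON log_table USING brin (rec_insert_time);
CREATE INDEX log_table_rec_insert_time_idx1 ON log_table USING brin (rec_insert_time)
    WITH (pages_per_range = 64);
CREATE INDEX rec_insert_time_brin_1000 ON log_table USING brin (rec_insert_time)
    WITH (pages_per_range = 1000);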

And here is the SQL:
SELECT
  category,
  source,
  MIN(rec_insert_time) OVER (PARTITION BY source ORDER BY rec_insert_time) AS first_source_time,
  MAX(rec_insert_time) OVER (PARTITION BY source ORDER BY rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
        category, source(field1) AS source, rec_insert_time
      FROM log_table l
      INNER JOIN public.small_join_table filter ON filter.category = l.category
      WHERE field1 IS NOT NULL AND l.category = 'music'
        AND l.rec_insert_time >= now() - interval '10 days'
      ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;

This query will choose a seq scan on log_table every time in spite of these
BRIN indexes on rec_insert_time.
@Michael Lewis <mlewis@entrata.com> had suggested I check
default_statistics_target for this column. I raised it to 5000 for this
column and it still chooses a seq scan.

Here is default chosen plan (takes 2 minutes 12 seconds):
WindowAgg (cost=24437881.80..24437897.70 rows=707 width=120) (actual
time=132173.173..132173.222 rows=53 loops=1)
Output: unique_cases.category, unique_cases.source,
min(unique_cases.rec_insert_time) OVER (?),
max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
-> Sort (cost=24437881.80..24437883.56 rows=707 width=104) (actual
time=132173.146..132173.149 rows=53 loops=1)
Output: unique_cases.source, unique_cases.rec_insert_time,
unique_cases.category
Sort Key: unique_cases.source, unique_cases.rec_insert_time
Sort Method: quicksort Memory: 32kB
Buffers: shared hit=391676 read=17772642 dirtied=4679 written=7
-> Subquery Scan on unique_cases (cost=24437834.20..24437848.34
rows=707 width=104) (actual time=132172.950..132173.062 rows=53 loops=1)
Output: unique_cases.source, unique_cases.rec_insert_time,
unique_cases.category
Buffers: shared hit=391676 read=17772642 dirtied=4679
written=7
-> Unique (cost=24437834.20..24437841.27 rows=707
width=124) (actual time=132172.946..132173.048 rows=53 loops=1)
Output: l.category, (source(l.field1)),
l.rec_insert_time, l.brand_id, l.last_change, l.log_id
Buffers: shared hit=391676 read=17772642 dirtied=4679
written=7
-> Sort (cost=24437834.20..24437835.96 rows=707
width=124) (actual time=132172.939..132172.962 rows=466 loops=1)
Output: l.category, (source(l.field1)),
l.rec_insert_time, l.brand_id, l.last_change, l.log_id
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
Sort Method: quicksort Memory: 90kB
Buffers: shared hit=391676 read=17772642
dirtied=4679 written=7
-> Nested Loop (cost=0.00..24437800.73
rows=707 width=124) (actual time=4096.253..132171.425 rows=466 loops=1)
Output: l.category, source(l.field1),
l.rec_insert_time, l.brand_id, l.last_change, l.log_id
Inner Unique: true
Join Filter: ((l.category)::text =
filter.category)
Rows Removed by Join Filter: 346704
Buffers: shared hit=391676 read=17772642
dirtied=4679 written=7
-> Seq Scan on foo.log_table l
(cost=0.00..24420483.80 rows=707 width=99) (actual
time=4095.763..132112.686 rows=466 loops=1)
Output: <hidden>
Filter: ((l.field1 IS NOT NULL) AND
(l.category = 'music'::name) AND (l.rec_insert_time >= (now() - '10
days'::interval)))
Rows Removed by Filter: 312830265
Buffers: shared hit=391675
read=17772636 dirtied=4679 written=7
-> Materialize (cost=0.00..33.98
rows=1399 width=8) (actual time=0.001..0.036 rows=745 loops=466)
Output: filter.category
Buffers: shared hit=1 read=6
-> Seq Scan on
public.small_join_table filter (cost=0.00..26.99 rows=1399 width=8)
(actual time=0.054..0.189 rows=745 loops=1)
Output: filter.category
Buffers: shared hit=1 read=6
Planning Time: 0.552 ms
Execution Time: 132173.657 ms
(38 rows)

Here is the plan I get when I turn off seqscan and indexscan!!! (21
seconds):
SET enable_seqscan TO false;
SET enable_indexscan TO false;

WindowAgg (cost=24363224.32..24363240.85 rows=735 width=120) (actual
time=21337.992..21338.040 rows=53 loops=1)
Output: unique_cases.category, unique_cases.source,
min(unique_cases.rec_insert_time) OVER (?),
max(unique_cases.rec_insert_time) OVER (?), unique_cases.rec_insert_time
Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
-> Sort (cost=24363224.32..24363226.15 rows=735 width=104) (actual
time=21337.965..21337.968 rows=53 loops=1)
Output: unique_cases.source, unique_cases.rec_insert_time,
unique_cases.category
Sort Key: unique_cases.source, unique_cases.rec_insert_time
Sort Method: quicksort Memory: 32kB
Buffers: shared hit=1471 read=1509030 dirtied=121 written=1631
-> Subquery Scan on unique_cases (cost=24363174.62..24363189.32
rows=735 width=104) (actual time=21337.777..21337.889 rows=53 loops=1)
Output: unique_cases.source, unique_cases.rec_insert_time,
unique_cases.category
Buffers: shared hit=1471 read=1509030 dirtied=121
written=1631
-> Unique (cost=24363174.62..24363181.97 rows=735
width=124) (actual time=21337.772..21337.874 rows=53 loops=1)
Output: l.category, (source(l.field1)),
l.rec_insert_time, l.brand_id, l.last_change, l.log_id
Buffers: shared hit=1471 read=1509030 dirtied=121
written=1631
-> Sort (cost=24363174.62..24363176.46 rows=735
width=124) (actual time=21337.767..21337.791 rows=466 loops=1)
Output: l.category, (source(l.field1)),
l.rec_insert_time, l.brand_id, l.last_change, l.log_id
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
Sort Method: quicksort Memory: 90kB
Buffers: shared hit=1471 read=1509030
dirtied=121 written=1631
-> Nested Loop (cost=2393.31..24363139.63
rows=735 width=124) (actual time=824.212..21336.263 rows=466 loops=1)
Output: l.category, source(l.field1),
l.rec_insert_time, l.brand_id, l.last_change, l.log_id
Inner Unique: true
Buffers: shared hit=1471 read=1509030
dirtied=121 written=1631
-> Bitmap Heap Scan on foo.log_table l
(cost=2391.71..24360848.29 rows=735 width=99) (actual
time=824.133..21329.054 rows=466 loops=1)
Output: <hidden>
Recheck Cond: (l.rec_insert_time >=
(now() - '10 days'::interval))
Rows Removed by Index Recheck:
8187584
Filter: ((l.field1 IS NOT NULL) AND
(l.category = 'music'::name))
Rows Removed by Filter: 19857107
Heap Blocks: lossy=1509000
Buffers: shared hit=73 read=1509030
dirtied=121 written=1631
-> Bitmap Index Scan on
rec_insert_time_brin_1000 (cost=0.00..2391.52 rows=157328135 width=0)
(actual time=72.391..72.391 rows=15090000 loops=1)
Index Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
Buffers: shared hit=29 read=74
-> Bitmap Heap Scan on
public.small_join_table filter (cost=1.60..3.12 rows=1 width=8) (actual
time=0.010..0.010 rows=1 loops=466)
Output: filter.category,
filter.type, filter.location
Recheck Cond: (filter.category =
(l.category)::text)
Heap Blocks: exact=466
Buffers: shared hit=1398
-> Bitmap Index Scan on
small_join_table_pkey (cost=0.00..1.60 rows=1 width=0) (actual
time=0.007..0.007 rows=1 loops=466)
Index Cond: (filter.category =
(l.category)::text)
Buffers: shared hit=932
Planning Time: 1.869 ms
Execution Time: 21338.244 ms
(44 rows)

Notice it chooses the smallest BRIN index with 1000 pages per range, and
this is far faster than the seq scan.

I do believe the estimate is actually way off. Just a plain EXPLAIN of the
latter estimates 10x more rows than actual:
WindowAgg (cost=24354689.19..24354705.07 rows=706 width=120)
-> Sort (cost=24354689.19..24354690.95 rows=706 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
-> Subquery Scan on unique_cases (cost=24354641.66..24354655.78
rows=706 width=104)
-> Unique (cost=24354641.66..24354648.72 rows=706
width=124)
-> Sort (cost=24354641.66..24354643.42 rows=706
width=124)
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
-> Nested Loop (cost=2385.42..24354608.25
rows=706 width=124)
-> Bitmap Heap Scan on log_table l
(cost=2383.82..24352408.26 rows=706 width=99)
Recheck Cond: (rec_insert_time >=
(now() - '10 days'::interval))
Filter: ((field1 IS NOT NULL) AND
(category = 'music'::name))
-> Bitmap Index Scan on
rec_insert_time_brin_1000 (cost=0.00..2383.64 rows=156577455 width=0)
Index Cond: (rec_insert_time >= (now() - '10 days'::interval))
-> Bitmap Heap Scan on small_join_table
filter (cost=1.60..3.12 rows=1 width=8)
Recheck Cond: (category =
(l.category)::text)
-> Bitmap Index Scan on
small_join_table_pkey (cost=0.00..1.60 rows=1 width=0)
Index Cond: (category =
(l.category)::text)
(17 rows)

Here is EXPLAIN only of the default chosen plan:
WindowAgg (cost=24437857.18..24437873.07 rows=706 width=120)
-> Sort (cost=24437857.18..24437858.95 rows=706 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
-> Subquery Scan on unique_cases (cost=24437809.66..24437823.78
rows=706 width=104)
-> Unique (cost=24437809.66..24437816.72 rows=706
width=124)
-> Sort (cost=24437809.66..24437811.42 rows=706
width=124)
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
-> Nested Loop (cost=0.00..24437776.25
rows=706 width=124)
Join Filter: ((l.category)::text =
filter.category)
-> Seq Scan on log_table l
(cost=0.00..24420483.80 rows=706 width=99)
Filter: ((field1 IS NOT NULL) AND
(category = 'music'::name) AND (rec_insert_time >= (now() - '10
days'::interval)))
-> Materialize (cost=0.00..33.98
rows=1399 width=8)
-> Seq Scan on small_join_table
filter (cost=0.00..26.99 rows=1399 width=8)
(13 rows)

Any insight into this is much appreciated. This is just one example of
many similar issues I have run into when trying to get BRIN indexes to
scale predictably with insert-only workloads.

Thanks!
Jeremy

#2 Michael Lewis
mlewis@entrata.com
In reply to: Jeremy Finzel (#1)
Re: BRIN index which is much faster never chosen by planner

Since the optimizer is choosing a seq scan over an index scan when it seems
like it has good row estimates in both cases, to me that may mean the cost of
scanning the index is expected to be high. Is this workload on SSD? Has the
random_page_cost config been decreased from the default of 4 (compared with a
cost of 1 unit for a sequential scan)?

Your buffer hits aren't great. What is shared_buffers set to? How much RAM
is on this cluster?

With this table being insert-only, one assumes correlation is very high on
the data in this column as shown in pg_stats, but have you confirmed that?

To me, DISTINCT ON is often a bad code smell and can probably be rewritten
to be much more efficient with GROUP BY, LATERAL with ORDER BY, or some other
tool. Same with the window function. It is a powerful tool, but sometimes
not the right one.

Is "source" a function that is called on field1? What is it doing/how is it
defined?

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeremy Finzel (#1)
Re: BRIN index which is much faster never chosen by planner

On Thu, Oct 10, 2019 at 04:58:11PM -0500, Jeremy Finzel wrote:

...

Notice it chooses the smallest BRIN index with 1000 pages per range, and
this is far faster than the seq scan.

I do believe the estimate is actually way off. Just a plain EXPLAIN of the
latter estimates 10x more rows than actual:
WindowAgg (cost=24354689.19..24354705.07 rows=706 width=120)
-> Sort (cost=24354689.19..24354690.95 rows=706 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
-> Subquery Scan on unique_cases (cost=24354641.66..24354655.78
rows=706 width=104)
-> Unique (cost=24354641.66..24354648.72 rows=706
width=124)
-> Sort (cost=24354641.66..24354643.42 rows=706
width=124)
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
-> Nested Loop (cost=2385.42..24354608.25
rows=706 width=124)
-> Bitmap Heap Scan on log_table l
(cost=2383.82..24352408.26 rows=706 width=99)
Recheck Cond: (rec_insert_time >=
(now() - '10 days'::interval))
Filter: ((field1 IS NOT NULL) AND
(category = 'music'::name))
-> Bitmap Index Scan on
rec_insert_time_brin_1000 (cost=0.00..2383.64 rows=156577455 width=0)
Index Cond: (rec_insert_time >= (now() - '10 days'::interval))
-> Bitmap Heap Scan on small_join_table
filter (cost=1.60..3.12 rows=1 width=8)
Recheck Cond: (category =
(l.category)::text)
-> Bitmap Index Scan on
small_join_table_pkey (cost=0.00..1.60 rows=1 width=0)
Index Cond: (category =
(l.category)::text)
(17 rows)

Here is EXPLAIN only of the default chosen plan:
WindowAgg (cost=24437857.18..24437873.07 rows=706 width=120)
-> Sort (cost=24437857.18..24437858.95 rows=706 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
-> Subquery Scan on unique_cases (cost=24437809.66..24437823.78
rows=706 width=104)
-> Unique (cost=24437809.66..24437816.72 rows=706
width=124)
-> Sort (cost=24437809.66..24437811.42 rows=706
width=124)
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
-> Nested Loop (cost=0.00..24437776.25
rows=706 width=124)
Join Filter: ((l.category)::text =
filter.category)
-> Seq Scan on log_table l
(cost=0.00..24420483.80 rows=706 width=99)
Filter: ((field1 IS NOT NULL) AND
(category = 'music'::name) AND (rec_insert_time >= (now() - '10
days'::interval)))
-> Materialize (cost=0.00..33.98
rows=1399 width=8)
-> Seq Scan on small_join_table
filter (cost=0.00..26.99 rows=1399 width=8)
(13 rows)

Any insight into this is much appreciated. This is just one example of
many similar issues I have been finding with BRIN indexes scaling
predictably with insert-only workloads.

It's quite an interesting planning issue. The cost estimates are:

-> Seq Scan on foo.log_table l
(cost=0.00..24420483.80 rows=707 width=99) (actual

while for the bitmap heap scan it looks like this:

-> Bitmap Heap Scan on foo.log_table l
(cost=2391.71..24360848.29 rows=735 width=99) (actual

So the planner actually thinks the bitmap heap scan is a tad *cheaper*
but picks the seq scan anyway. This is likely because we don't really
compare the exact costs, but do a fuzzy comparison - the plan has to be
at least 1% cheaper to dominate the existing plan. This allows us to
save some work when replacing the paths.

In this case the difference is only about 0.2%, so we keep the seqscan
path. The real question is why the planner came to this cost, when it
got pretty good row estimates etc.

Looking at the cost_bitmap_heap_scan() I think the costing issue comes
mostly from this bit:

/*
* For small numbers of pages we should charge spc_random_page_cost
* apiece, while if nearly all the table's pages are being read, it's more
* appropriate to charge spc_seq_page_cost apiece. The effect is
* nonlinear, too. For lack of a better idea, interpolate like this to
* determine the cost per page.
*/
if (pages_fetched >= 2.0)
cost_per_page = spc_random_page_cost -
(spc_random_page_cost - spc_seq_page_cost)
* sqrt(pages_fetched / T);
else
cost_per_page = spc_random_page_cost;

The index scan is estimated to return 157328135 rows, i.e. about 50% of
the table (apparently it's ~10x more than the actual number). This is
produced by compute_bitmap_pages() which also computes pages_fetched,
and I guess that's going to be pretty close to all pages, because with
T = 18350080 (which is 140GB) and using

pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);

we get 29731418 (which is more than 18350080, so it gets clamped).

So it kinda seems like the optimizer believes it'll have to scan the
whole table anyway. In reality, of course, the number of tuples
returned by the index is 10x lower, so the formula above would give us
only about 10975262 pages (so ~1/2 the table). The actual number of
pages is however even lower - only about 1509030, i.e. ~8% of the table.
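
To make that concrete, a quick back-of-the-envelope check (a sketch, just
plugging the numbers above into the formula):

-- pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched), clamped to T
SELECT (2.0 * 18350080 * 157328135) / (2.0 * 18350080 + 157328135) AS pages_fetched_raw,
       least((2.0 * 18350080 * 157328135) / (2.0 * 18350080 + 157328135),
             18350080)                                             AS pages_fetched;
-- the raw value (~29.7M) exceeds T, so it clamps to all 18350080 pages, at which
-- point the cost_per_page interpolation above charges roughly spc_seq_page_cost per page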

So this seems like a combination of multiple issues. Firstly, the bitmap
index scan on rec_insert_time_brin_1000 estimate seems somewhat poor. It
might be worth increasing stats target on that column, or something like
that. Not sure, but it's about the only "fixable" thing here, I think.

The other issue is that the estimation of pages fetched using bitmap
heap scan is rather crude - but that's simply hard, and I don't think we
can fundamentally improve this.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4 David Rowley
david.rowley@2ndquadrant.com
In reply to: Tomas Vondra (#3)
Re: BRIN index which is much faster never chosen by planner

On Fri, 11 Oct 2019 at 12:13, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

The index scan is estimated to return 157328135 rows, i.e. about 50% of
the table (apparently it's ~10x more than the actual number).

Don't pay too much attention to the actual row counts from bitmap
index scans of brin indexes. The value is entirely made up, per:

/*
* XXX We have an approximation of the number of *pages* that our scan
* returns, but we don't have a precise idea of the number of heap tuples
* involved.
*/
return totalpages * 10;

in bringetbitmap().

(Ideally EXPLAIN would be written in such a way that it didn't even
show the actual rows for node types that don't return rows. However,
I'm sure that would break many explain parsing tools)

The planner might be able to get a better estimate on the number of
matching rows if the now() - interval '10 days' expression was
replaced with 'now'::timestamptz - interval '10 days'. However, care
would need to be taken to ensure the plan is never prepared since
'now' is evaluated during parse. The same care must be taken when
creating views, functions, stored procedures and the like.

The planner will just estimate the selectivity of now() - interval '10
days' by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
thinks it'll get 1/3rd of the table. Using 'now' will allow the
planner to look up actual statistics on that column, which will likely
give a much better estimate and, by the looks of it, likely will
result in one of those BRIN indexes being used.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#5 Michael Lewis
mlewis@entrata.com
In reply to: David Rowley (#4)
Re: BRIN index which is much faster never chosen by planner

On Thu, Oct 10, 2019 at 6:22 PM David Rowley <david.rowley@2ndquadrant.com>
wrote:

The planner might be able to get a better estimate on the number of
matching rows if the now() - interval '10 days' expression was
replaced with 'now'::timestamptz - interval '10 days'. However, care
would need to be taken to ensure the plan is never prepared since
'now' is evaluated during parse. The same care must be taken when
creating views, functions, stored procedures and the like.

The planner will just estimate the selectivity of now() - interval '10
days' by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
thinks it'll get 1/3rd of the table. Using 'now' will allow the
planner to lookup actual statistics on that column which will likely
give a much better estimate, which by the looks of it, likely will
result in one of those BRIN index being used.

This surprised me a bit, and would have significant implications. I tested
a few different tables in our system and get the same row count estimate
with either WHERE condition. Perhaps I am missing a critical piece of what
you said.

explain
select * from charges where posted_on > now() - interval '10 days';

explain
select * from charges where posted_on > 'now'::timestamptz - interval '10
days';

#6 David Rowley
david.rowley@2ndquadrant.com
In reply to: Michael Lewis (#5)
Re: BRIN index which is much faster never chosen by planner

On Fri, 11 Oct 2019 at 17:48, Michael Lewis <mlewis@entrata.com> wrote:

On Thu, Oct 10, 2019 at 6:22 PM David Rowley <david.rowley@2ndquadrant.com> wrote:

The planner will just estimate the selectivity of now() - interval '10
days' by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
thinks it'll get 1/3rd of the table. Using 'now' will allow the
planner to lookup actual statistics on that column which will likely
give a much better estimate, which by the looks of it, likely will
result in one of those BRIN index being used.

This surprised me a bit, and would have significant implications. I tested a few different tables in our system and get the same row count estimate with either WHERE condition. Perhaps I am missing a critical piece of what you said.

explain
select * from charges where posted_on > now() - interval '10 days';

explain
select * from charges where posted_on > 'now'::timestamptz - interval '10 days';

You're right. On looking more closely at the code, it uses
estimate_expression_value(), which performs additional constant
folding of expressions for selectivity purposes only. It does end up
calling the now() function and evaluating the now() - interval '10
days' expression into a Const.

The header comment for that function reads:

* estimate_expression_value
*
* This function attempts to estimate the value of an expression for
* planning purposes. It is in essence a more aggressive version of
* eval_const_expressions(): we will perform constant reductions that are
* not necessarily 100% safe, but are reasonable for estimation purposes.

So I take back what I said about using 'now'::timestamptz instead of now().

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7 Jeremy Finzel
finzelj@gmail.com
In reply to: David Rowley (#4)
Re: BRIN index which is much faster never chosen by planner

On Thu, Oct 10, 2019 at 7:22 PM David Rowley <david.rowley@2ndquadrant.com>
wrote:

The planner might be able to get a better estimate on the number of
matching rows if the now() - interval '10 days' expression was
replaced with 'now'::timestamptz - interval '10 days'. However, care
would need to be taken to ensure the plan is never prepared since
'now' is evaluated during parse. The same care must be taken when
creating views, functions, stored procedures and the like.

You are on to something here, I think, with the now() function, even if the
above suggestion is not exactly right, as you said further down. I am finding
that a hard-coded timestamp gives the right query plan. I also tested the same
with an even bigger window (the last 16 days) and it still chooses the BRIN
index.

foo_prod=# EXPLAIN
foo_prod-# SELECT
foo_prod-# category, source, MIN(rec_insert_time) OVER (partition by
source order by rec_insert_time) AS first_source_time, MAX(rec_insert_time)
OVER (partition by source order by rec_insert_time) AS last_source_time
foo_prod-# FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
foo_prod(# category, source(field1) AS source, rec_insert_time
foo_prod(# FROM log_table l
foo_prod(# INNER JOIN public.small_join_table filter ON filter.category =
l.category
foo_prod(# WHERE field1 IS NOT NULL AND l.category = 'music'
foo_prod(# AND l.rec_insert_time >= now() - interval '10 days'
foo_prod(# ORDER BY brand_id, last_change, log_id, rec_insert_time DESC)
unique_cases;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=24436329.10..24436343.56 rows=643 width=120)
-> Sort (cost=24436329.10..24436330.70 rows=643 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
-> Subquery Scan on unique_cases (cost=24436286.24..24436299.10
rows=643 width=104)
-> Unique (cost=24436286.24..24436292.67 rows=643
width=124)
-> Sort (cost=24436286.24..24436287.85 rows=643
width=124)
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
-> Nested Loop (cost=0.00..24436256.25
rows=643 width=124)
Join Filter: ((l.category)::text =
filter.category)
-> Seq Scan on small_join_table filter
(cost=0.00..26.99 rows=1399 width=8)
-> Materialize (cost=0.00..24420487.02
rows=643 width=99)
-> Seq Scan on log_table l
(cost=0.00..24420483.80 rows=643 width=99)
Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10
days'::interval)))
(13 rows)

foo_prod=# SELECT now() - interval '10 days';
?column?
-------------------------------
2019-10-01 08:20:38.115471-05
(1 row)

foo_prod=# EXPLAIN
SELECT
category, source, MIN(rec_insert_time) OVER (partition by source order by
rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition
by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= '2019-10-01 08:20:38.115471-05'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=19664576.17..19664590.63 rows=643 width=120)
-> Sort (cost=19664576.17..19664577.77 rows=643 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
-> Subquery Scan on unique_cases (cost=19664533.31..19664546.17
rows=643 width=104)
-> Unique (cost=19664533.31..19664539.74 rows=643
width=124)
-> Sort (cost=19664533.31..19664534.92 rows=643
width=124)
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
-> Nested Loop (cost=3181.19..19664503.32
rows=643 width=124)
-> Gather (cost=3180.91..19662574.92
rows=643 width=99)
Workers Planned: 3
-> Parallel Bitmap Heap Scan on
log_table l (cost=2180.91..19661510.62 rows=207 width=99)
Recheck Cond: (rec_insert_time >= '2019-10-01 08:20:38.115471-05'::timestamp with time zone)
Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name))
-> Bitmap Index Scan on
rec_insert_time_brin_1000 (cost=0.00..2180.75 rows=142602171 width=0)
Index Cond:
(rec_insert_time >= '2019-10-01 08:20:38.115471-05'::timestamp with time
zone)

Let me know if this rings any bells! I will respond to other comments with
other replies.

Thanks,
Jeremy

#8 Jeremy Finzel
finzelj@gmail.com
In reply to: Michael Lewis (#2)
Re: BRIN index which is much faster never chosen by planner

Dear Michael,

On Thu, Oct 10, 2019 at 5:20 PM Michael Lewis <mlewis@entrata.com> wrote:

Since the optimizer is choosing a seq scan over index scan when it seems
like it has good row estimates in both cases, to me that may mean costs of
scanning index are expected to be high. Is this workload on SSD? Has the
random_page_cost config been decreased from default 4 (compared with cost
of 1 unit for sequential scan)?

It's 1.5

Your buffer hits aren't great. What is shared_buffers set to? How much ram
on this cluster?

shared_buffers is 4GB. It has 500G of RAM, but the server has several
clusters on it.

With this table being insert only, one assumes correlation is very high on
the data in this column as shown in pg_stats, but have your confirmed?

Yes, but the issue isn't with the BRIN index performing badly or being
fragmented. It's that it performs great (7x faster than the seq scan) but
Postgres doesn't pick it. I have also seen this same issue in other
attempts I have made to use BRIN.

To me, distinct ON is often a bad code smell and probably can be
re-written to be much more efficient with GROUP BY, lateral & order by, or
some other tool. Same with the window function. It is a powerful tool, but
sometimes not the right one.

I don't really agree, but it's beside the point because the issue is not in
aggregation. It's pre-aggregation. Indeed, if I run my query as a simple
select (as I tried), it's the exact same planning issue. (In my experience,
DISTINCT ON is the fastest for the given example. Same with window functions,
which prevent inefficient self-joins.)

Is "source" a function that is called on field1? What is it doing/how is
it defined?

I can't see how that matters either, but the "source" function is a mask
for a built-in pg function that is trivial. This whole query is masked so
as not to expose our actual prod query, but I hope it's still
understandable enough :).

My question is not how to make this query faster in general. It's that I
want to use BRIN indexes very much, but I'm not sure I can trust they will
scale with the right query plan like I know BTREE will.

Thanks!
Jeremy

#9 Jeremy Finzel
finzelj@gmail.com
In reply to: Tomas Vondra (#3)
Re: BRIN index which is much faster never chosen by planner

On Thu, Oct 10, 2019 at 6:13 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

So this seems like a combination of multiple issues. Firstly, the bitmap
index scan on rec_insert_time_brin_1000 estimate seems somewhat poor. It
might be worth increasing stats target on that column, or something like
that. Not sure, but it's about the only "fixable" thing here, I think.

In the OP I had mentioned that I already increased it to 5000, and it made
no difference. Ah fine.... let's go ahead and try 10000... still no change:

foo_prod=# ALTER TABLE log_table ALTER COLUMN rec_insert_time SET
STATISTICS 10000;
ALTER TABLE
foo_prod=# ANALYZE log_table;
ANALYZE
foo_prod=# EXPLAIN
SELECT
category, source, MIN(rec_insert_time) OVER (partition by source order by
rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition
by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= now() - interval '10 days'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=24451299.20..24451313.21 rows=623 width=120)
-> Sort (cost=24451299.20..24451300.75 rows=623 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
-> Subquery Scan on unique_cases (cost=24451257.82..24451270.28
rows=623 width=104)
-> Unique (cost=24451257.82..24451264.05 rows=623
width=124)
-> Sort (cost=24451257.82..24451259.38 rows=623
width=124)
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
-> Nested Loop (cost=0.00..24451228.90
rows=623 width=124)
Join Filter: ((l.category)::text =
filter.category)
-> Seq Scan on small_join_table filter
(cost=0.00..26.99 rows=1399 width=8)
-> Materialize (cost=0.00..24435949.31
rows=623 width=99)
-> Seq Scan on log_table l
(cost=0.00..24435946.20 rows=623 width=99)
Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10
days'::interval)))
(13 rows)

Thanks,
Jeremy

#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeremy Finzel (#7)
Re: BRIN index which is much faster never chosen by planner

On Fri, Oct 11, 2019 at 09:08:05AM -0500, Jeremy Finzel wrote:

On Thu, Oct 10, 2019 at 7:22 PM David Rowley <david.rowley@2ndquadrant.com>
wrote:

The planner might be able to get a better estimate on the number of
matching rows if the now() - interval '10 days' expression was
replaced with 'now'::timestamptz - interval '10 days'. However, care
would need to be taken to ensure the plan is never prepared since
'now' is evaluated during parse. The same care must be taken when
creating views, functions, stored procedures and the like.

You are on to something here I think with the now() function, even if above
suggestion is not exactly right as you said further down. I am finding a
hard-coded timestamp gives the right query plan. I also tested same with
even bigger window (last 16 days) and it yet still chooses the brin index.

foo_prod=# EXPLAIN
foo_prod-# SELECT
foo_prod-# category, source, MIN(rec_insert_time) OVER (partition by
source order by rec_insert_time) AS first_source_time, MAX(rec_insert_time)
OVER (partition by source order by rec_insert_time) AS last_source_time
foo_prod-# FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
foo_prod(# category, source(field1) AS source, rec_insert_time
foo_prod(# FROM log_table l
foo_prod(# INNER JOIN public.small_join_table filter ON filter.category =
l.category
foo_prod(# WHERE field1 IS NOT NULL AND l.category = 'music'
foo_prod(# AND l.rec_insert_time >= now() - interval '10 days'
foo_prod(# ORDER BY brand_id, last_change, log_id, rec_insert_time DESC)
unique_cases;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=24436329.10..24436343.56 rows=643 width=120)
-> Sort (cost=24436329.10..24436330.70 rows=643 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
-> Subquery Scan on unique_cases (cost=24436286.24..24436299.10
rows=643 width=104)
-> Unique (cost=24436286.24..24436292.67 rows=643
width=124)
-> Sort (cost=24436286.24..24436287.85 rows=643
width=124)
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
-> Nested Loop (cost=0.00..24436256.25
rows=643 width=124)
Join Filter: ((l.category)::text =
filter.category)
-> Seq Scan on small_join_table filter
(cost=0.00..26.99 rows=1399 width=8)
-> Materialize (cost=0.00..24420487.02
rows=643 width=99)
-> Seq Scan on log_table l
(cost=0.00..24420483.80 rows=643 width=99)
Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10
days'::interval)))
(13 rows)

foo_prod=# SELECT now() - interval '10 days';
?column?
-------------------------------
2019-10-01 08:20:38.115471-05
(1 row)

foo_prod=# EXPLAIN
SELECT
category, source, MIN(rec_insert_time) OVER (partition by source order by
rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition
by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= '2019-10-01 08:20:38.115471-05'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=19664576.17..19664590.63 rows=643 width=120)
-> Sort (cost=19664576.17..19664577.77 rows=643 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
-> Subquery Scan on unique_cases (cost=19664533.31..19664546.17
rows=643 width=104)
-> Unique (cost=19664533.31..19664539.74 rows=643
width=124)
-> Sort (cost=19664533.31..19664534.92 rows=643
width=124)
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
-> Nested Loop (cost=3181.19..19664503.32
rows=643 width=124)
-> Gather (cost=3180.91..19662574.92
rows=643 width=99)
Workers Planned: 3
-> Parallel Bitmap Heap Scan on
log_table l (cost=2180.91..19661510.62 rows=207 width=99)
Recheck Cond: (rec_insert_time >= '2019-10-01 08:20:38.115471-05'::timestamp with time zone)
Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name))
-> Bitmap Index Scan on
rec_insert_time_brin_1000 (cost=0.00..2180.75 rows=142602171 width=0)
Index Cond:
(rec_insert_time >= '2019-10-01 08:20:38.115471-05'::timestamp with time
zone)

Let me know if this rings any bells! I will respond to other comments with
other replies.

My guess - it's (at least partially) due to cpu_operator_cost,
associated with the now() call. When replaced with a literal, this cost
disappears and so the total query cost decreases.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11 Jeremy Finzel
finzelj@gmail.com
In reply to: Tomas Vondra (#3)
Re: BRIN index which is much faster never chosen by planner

The other issue is that the estimation of pages fetched using bitmap
heap scan is rather crude - but that's simply hard, and I don't think we
can fundamentally improve this.

I wanted to follow up on this specific issue. Isn't this the heart of the
matter and a fundamental problem? If I want to rely on BRIN indexes in a
straightforward case like the one explained in the OP, but I don't know if the
planner will be nearly reliable enough, how can I depend on them in
production? Is this not considered a planner bug, or should this kind of case
be documented as problematic for BRIN? As another way to look at it: is there
a configuration parameter that could be added, specific to BRIN or bitmap
scans, to provide help in cases like this?

On freshly analyzed tables, I tried my original query again and found that
even with now() - 3 days it does not choose the BRIN index. In fact it
chose another btree on the table like (id1, id2, rec_insert_time). With
warm cache, the pg-chosen plan takes 40 seconds to execute, whereas when I
force a BRIN scan it takes only 4 seconds.

I could understand more if the execution times were close, but the actual
BRIN index is orders of magnitude faster than the plan Postgres is
choosing. I appreciate the feedback on this very much, as I am quite eager
to use BRIN indexes!!!

Thanks,
Jeremy

#12 David Rowley
david.rowley@2ndquadrant.com
In reply to: Jeremy Finzel (#11)
Re: BRIN index which is much faster never chosen by planner

On Tue, 15 Oct 2019 at 08:43, Jeremy Finzel <finzelj@gmail.com> wrote:

I wanted to follow up on this specific issue. Isn't this the heart of the matter and a fundamental problem? If I want to rely on BRIN indexes as in a straightforward case as explained in OP, but I don't know if the planner will be nearly reliable enough, how can I depend on them in production? Is this not considered a planner bug or should this kind of case be documented as problematic for BRIN? As another way to look at it: is there a configuration parameter that could be added specific to BRIN or bitmapscan to provide help to cases like this?

On freshly analyzed tables, I tried my original query again and found that even with now() - 3 days it does not choose the BRIN index. In fact, it chose another btree on the table like (id1, id2, rec_insert_time). With warm cache, the pg-chosen plan takes 40 seconds to execute, whereas when I force a BRIN scan it takes only 4 seconds.

Another thing which you might want to look at is the correlation
column in the pg_stats view for the rec_insert_time column. Prior
to 7e534adcd, BRIN indexes were costed based on the selectivity
estimate alone. There was no accounting for the fact that the pages
for those records might have been spread out over the entire table.
Post 7e534adcd, we use the correlation estimate to try to estimate
how many pages (more specifically "ranges") we're likely to hit based
on that and the selectivity estimate. This commit intended to fix the
issue we had with BRIN indexes being selected far too often. Of
course, the correlation is based on the entire table; if there are
subsets of the table that are perhaps perfectly correlated, then the
planner is not going to know about that. It's possible that some of
your older rec_insert_times are spread out far more than the newer
ones. As a test, you could try creating a new table and copying the
records over to it in rec_insert_time order and seeing if the BRIN
index is selected for that table (after having performed an ANALYZE).

It would be interesting if you could show the pg_stats row for the
column so that we can see if the correlation is low.
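
Something like this would show it (schema/table names as used earlier in the
thread):

SELECT attname, null_frac, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'foo'
  AND tablename = 'log_table'
  AND attname = 'rec_insert_time';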

You can see from the code below that the final selectivity is strongly
influenced by the correlation value (REF: brincostestimate):

qualSelectivity = clauselist_selectivity(root, indexQuals,
baserel->relid,
JOIN_INNER, NULL);

/* work out the actual number of ranges in the index */
indexRanges = Max(ceil((double) baserel->pages / statsData.pagesPerRange),
1.0);

/*
* Now calculate the minimum possible ranges we could match with if all of
* the rows were in the perfect order in the table's heap.
*/
minimalRanges = ceil(indexRanges * qualSelectivity);

/*
* Now estimate the number of ranges that we'll touch by using the
* indexCorrelation from the stats. Careful not to divide by zero (note
* we're using the absolute value of the correlation).
*/
if (*indexCorrelation < 1.0e-10)
estimatedRanges = indexRanges;
else
estimatedRanges = Min(minimalRanges / *indexCorrelation, indexRanges);

/* we expect to visit this portion of the table */
selec = estimatedRanges / indexRanges;

CLAMP_PROBABILITY(selec);
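
To put rough numbers on it (purely back-of-the-envelope, assuming ~18.35M heap
pages, pages_per_range = 1000, the default 1/3 inequality selectivity for the
now()-based qual, and, for illustration only, a correlation of about 0.2):

SELECT ceil(18350080 / 1000.0)                         AS index_ranges,     -- ~18351
       ceil(ceil(18350080 / 1000.0) / 3.0)             AS minimal_ranges,   -- ~6117
       least(ceil(ceil(18350080 / 1000.0) / 3.0) / 0.2,
             ceil(18350080 / 1000.0))                  AS estimated_ranges; -- clamps to index_ranges
-- estimated_ranges / index_ranges then comes out at 1.0, i.e. the planner expects
-- to visit every range, so the BRIN path is costed much like reading the whole heap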

My overall view on this is that the BRIN index is not that great here, since
using it doesn't eliminate that many rows.

From above we see:

Bitmap Heap Scan on foo.log_table l (cost=2391.71..24360848.29 rows=735 width=99) (actual time=824.133..21329.054 rows=466 loops=1)

Output: <hidden>
Recheck Cond:
(l.rec_insert_time >= (now() - '10 days'::interval))
Rows Removed by Index Recheck: 8187584
Filter: ((l.field1 IS NOT NULL)
AND (l.category = 'music'::name))
Rows Removed by Filter: 19857107
Heap Blocks: lossy=1509000

So you have just 466 rows matching these quals, but the executor had
to scan 1.5 million pages to get those, filtering out 8.1 million rows
on the recheck and then 19.8 million on the filter. You've mentioned that
the table's heap is 139 GB, which is about 18 million pages. It seems
your query would perform much better if you had a btree index such as
(category, rec_insert_time) WHERE field1 IS NOT NULL.
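
i.e. something along the lines of (just a sketch; index name made up):

CREATE INDEX log_table_category_rec_insert_time_idx
    ON log_table (category, rec_insert_time)
    WHERE field1 IS NOT NULL;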

Of course, you've mentioned that you are finding the plan executes more
quickly when it uses the BRIN index, but I think you're going to find
BRIN unreliable for anything other than INSERT-only tables in which
records are always inserted with an ever-increasing or decreasing value
in the BRIN-indexed column. If you start performing UPDATEs then that's
going to create holes that new records will fill, causing the
correlation to start dropping and the BRIN index scan cost to go up.

On the other hand, if you think you can do better than what was done
in 7e534adcd, then it would be good to see someone working on it. I'm
sure something better can be done. It's just not that easy to do with
the scant correlation data we have on the column.

As for whether this is a bug or something that's missing from the documents:
the documents do mention:

"BRIN stands for Block Range Index. BRIN is designed for handling very
large tables in which certain columns have some natural correlation
with their physical location within the table."

https://www.postgresql.org/docs/current/brin-intro.html

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#13 Jeremy Finzel
finzelj@gmail.com
In reply to: David Rowley (#12)
Re: BRIN index which is much faster never chosen by planner

Thank you for the thorough and thoughtful reply! Please see below.

On Mon, Oct 14, 2019 at 3:48 PM David Rowley <david.rowley@2ndquadrant.com>
wrote:

Another thing which you might want to look at is the correlation
column in the pg_stats view for the rec_insert_time column. Previous
to 7e534adcd, BRIN index were costed based on the selectivity
estimate. There was no accountability towards the fact that the pages
for those records might have been spread out over the entire table.
Post 7e534adcd, we use the correlation estimate to attempt to estimate
how many pages (more specifically "ranges") we're likely to hit based
on that and the selectivity estimate. This commit intended to fix the
issue we had with BRIN indexes being selected far too often. Of
course, the correlation is based on the entire table, if there are
subsets of the table that are perhaps perfectly correlated, then the
planner is not going to know about that. It's possible that some of
your older rec_insert_times are spread out far more than the newer
ones. As a test, you could try creating a new table and copying the
records over to it in rec_insert_time order and seeing if the BRIN
index is selected for that table (after having performed an ANALYZE).

It would be interesting if you could show the pg_stats row for the
column so that we can see if the correlation is low.

So what I said originally (and light bulbs are now going off) is that the
table is insert-only, but it is **pruned (deletes) to the past year of
data**. I think this is the key to what I've missed. Once vacuum runs, we
have a bunch of old physical space being re-used by new inserts, thus
botching that good correlation between physical and logical order. So it
appears the physical order of the data is indeed well-correlated in recent
history, but not so when you go back a bit further. Here are pg_stats:

-[ RECORD 1 ]----------+---------------------------
schemaname | foo
tablename | log_table
attname | rec_insert_time
inherited | f
null_frac | 0
avg_width | 8
n_distinct | 1.89204e+06
correlation | 0.193951
most_common_elems |
most_common_elem_freqs |
elem_count_histogram |

I have not tried creating a fresh table, but if I modify my OP query to
instead take a 10-day window starting 100 days ago, the BRIN index actually
performs really badly... essentially identically to the seq scan:

Here is with a seq scan:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=26822167.70..26822170.60 rows=129 width=120) (actual
time=200730.856..200730.910 rows=62 loops=1)
-> Sort (cost=26822167.70..26822168.02 rows=129 width=104) (actual
time=200730.834..200730.837 rows=62 loops=1)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
Sort Method: quicksort Memory: 33kB
-> Subquery Scan on unique_cases (cost=26822160.60..26822163.18
rows=129 width=104) (actual time=200730.672..200730.763 rows=62 loops=1)
-> Unique (cost=26822160.60..26822161.89 rows=129
width=124) (actual time=200730.670..200730.753 rows=62 loops=1)
-> Sort (cost=26822160.60..26822160.92 rows=129
width=124) (actual time=200730.668..200730.686 rows=395 loops=1)
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
Sort Method: quicksort Memory: 80kB
-> Nested Loop (cost=0.00..26822156.08
rows=129 width=124) (actual time=200692.321..200730.121 rows=395 loops=1)
Join Filter: ((l.category)::text =
filter.category)
Rows Removed by Join Filter: 552210
-> Seq Scan on small_join_table filter
(cost=0.00..26.99 rows=1399 width=8) (actual time=0.013..0.179 rows=1399
loops=1)
-> Materialize (cost=0.00..26818970.84
rows=129 width=99) (actual time=136.942..143.440 rows=395 loops=1399)
-> Seq Scan on log_table l
(cost=0.00..26818970.20 rows=129 width=99) (actual
time=191581.193..200649.013 rows=395 loops=1)
Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name) AND (rec_insert_time >= (now() - '100
days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
Rows Removed by Filter:
315097963
Planning Time: 1.541 ms
Execution Time: 200731.273 ms
(19 rows)

Here is with the forced brin index scan:

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=26674201.49..26674204.40 rows=129 width=120) (actual
time=200303.118..200303.177 rows=62 loops=1)
-> Sort (cost=26674201.49..26674201.82 rows=129 width=104) (actual
time=200303.093..200303.096 rows=62 loops=1)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
Sort Method: quicksort Memory: 33kB
-> Subquery Scan on unique_cases (cost=26674194.39..26674196.97
rows=129 width=104) (actual time=200302.918..200303.012 rows=62 loops=1)
-> Unique (cost=26674194.39..26674195.68 rows=129
width=124) (actual time=200302.914..200302.998 rows=62 loops=1)
-> Sort (cost=26674194.39..26674194.71 rows=129
width=124) (actual time=200302.911..200302.929 rows=395 loops=1)
Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
Sort Method: quicksort Memory: 80kB
-> Nested Loop (cost=1245.13..26674189.87
rows=129 width=124) (actual time=138087.290..200301.925 rows=395 loops=1)
-> Bitmap Heap Scan on log_table l
(cost=1244.85..26673801.66 rows=129 width=99) (actual
time=138087.083..200298.259 rows=395 loops=1)
Recheck Cond: ((rec_insert_time >=
(now() - '100 days'::interval)) AND (rec_insert_time <= (now() - '90
days'::interval)))
Rows Removed by Index Recheck:
214939302
Filter: ((field1 IS NOT NULL) AND
(category = 'music'::name))
Rows Removed by Filter: 15326889
Heap Blocks: lossy=13566000
-> Bitmap Index Scan on
rec_insert_time_brin_1000 (cost=0.00..1244.81 rows=78608872 width=0)
(actual time=678.203..678.203 rows=135660000 loops=1)
Index Cond: ((rec_insert_time >= (now() - '100 days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
-> Index Only Scan using
small_join_table_pkey on small_join_table filter (cost=0.28..3.01 rows=1
width=8) (actual time=0.005..0.005 rows=1 loops=395)
Index Cond: (category =
(l.category)::text)
Heap Fetches: 395
Planning Time: 2.031 ms
Execution Time: 200304.411 ms
(23 rows)

So you have just 466 rows matching these quals, but the executor had
to scan 1.5 million pages to get those and filter out 8.1 million rows
on the recheck then 19.8 million on the filter. You've mentioned that
the table's heap is 139 GB, which is about 18 million pages. It seems
your query would perform much better if you had a btree index such as
(category, rec_insert_time) where field1 is not null;,

I agree a btree would do the trick for performance, but I was trying to avoid
the near-100G overhead of such an index. For the given example, the somewhat
improved performance of BRIN may be acceptable to me. However, as seen above,
it appears that a btree may be my only option for getting reliable performance
with this workload.

Of course, you've mentioned that you are finding when the plan uses
the BRIN index that it executes more quickly, but I think you're going
to find BRIN unreliable for tables anything other than INSERT-only
tables which the records are always inserted with an ever-increasing
or decreasing value in the BRIN indexed column. If you start
performing UPDATEs then that's going to create holes that new record
will fill and cause the correlation to start dropping resulting in the
BRIN indexes scan cost going up.

Or deletes, as in my case.

On the other hand, if you think you can do better than what was done
in 7e534adcd, then it would be good to see someone working on it. I'm
sure something better can be done. It's just not that easy to do with
the scant correlation data we have on the column.

As for is this a bug or something that's missing from the documents.
The documents do mention:

"BRIN stands for Block Range Index. BRIN is designed for handling very
large tables in which certain columns have some natural correlation
with their physical location within the table."

Yes, I was aware of this, and perhaps nothing indeed needs to change in the
docs here given my case. But perhaps it would be worth exploring whether there
could be more detailed stats on physical vs logical correlation, such as
having ANALYZE note physical locations as well as logical values when it takes
its samples, and allowing the correlation to take account of that more
detailed analysis. Of course, that sounds like a huge amount of work with
uncertain benefits. In my case, it could be said that if I am always
querying the last few days of data, a BRIN index here is perfect, and a
BTREE is way overkill. That is a real use case to consider. But more
generally, I would drop the BRIN index if I had any other query patterns
beyond the last few days of data.

However, this may be a fallacy. It might be that a few days from now, the
last 10 days of data will actually be really fragmented, depending only on
when VACUUM runs.

As the docs state, I do believe that the only use case that will work
really well for BRIN is either a truly insert-only table which is never
pruned (an idea I dislike as a DBA who wants us to keep OLTP data trim and
implement data retention policies :), or a table which is routinely
CLUSTERed!

Thanks again for the detailed feedback.

Thanks,
Jeremy

#14 Michael Lewis
mlewis@entrata.com
In reply to: Jeremy Finzel (#13)
Re: BRIN index which is much faster never chosen by planner

Thanks for closing the loop on the data correlation question. I've been
playing with BRIN indexes on a log table of sorts and this thread helped
clear up some of the behavior I have been seeing.

I am curious, would a partial btree index fit your needs? Perhaps the
maintenance overhead is too significant or this is too off-the-wall, but a
daily job to create a new index and drop the old one concurrently could give
the performance you need while still saving the extra disk space of the full
btree on the timestamp.

CREATE INDEX CONCURRENTLY log_table_rec_insert_time_partial_10_04
    ON log_table USING btree (rec_insert_time)
    WHERE rec_insert_time > '2019-10-04'::DATE;
DROP INDEX CONCURRENTLY IF EXISTS log_table_rec_insert_time_partial_10_03;

I would consider including the category column as well, but I suspect that
would increase the size of the index significantly. Of course, this depends
on the query planner evaluating that "l.rec_insert_time >= now() - interval
'10 days'" qual and determining that the partial index fulfills the need.

#15 Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Lewis (#14)
Re: BRIN index which is much faster never chosen by planner

This reminds me of an issue I reported several years ago where btree index
scans were chosen over a seq scan of a large, INSERT-only table due to very
high correlation, but performed poorly. I concluded that use of the high
"large scale" correlation on a large 50+GB table caused the planner to fail to
account for a larger number of pages being read nonsequentially (the opposite
of your issue). I think that's because we were INSERTing data which was at
least approximately sorted on record END time, and the index was on record
START time. For a large table with a week's data, the correlation of "start
time" was still very high (0.99995). But scanning the index ends up reading
pages nonsequentially, and also visiting some pages multiple times.

I eked out a patch which made "correlation" a per-index statistic rather than
a per-column one. That means the planner could distinguish between a
freshly-built btree index and a fragmented one. (At the time, there was a
hypothesis that our issue was partially due to repeated values of the index
columns.) It didn't occur to me at the time, but that would also allow
creating numerous partial BRIN indexes, each of which would have a separate
correlation computed over just its "restricted range", which *might* also
handle your case, depending on how packed your data is.

/messages/by-id/20170707234119.GN17566@telsasoft.com

On Tue, Oct 15, 2019 at 11:05:13AM -0500, Jeremy Finzel wrote:

I do believe that the only use case that will work really well for BRIN is
either a truly insert-only table which is never pruned ... or a table which
is routinely CLUSTERed!

Or a partitioned table, which I highly recommend instead of DELETE for large
data sets.

Justin

#16 David Rowley
david.rowley@2ndquadrant.com
In reply to: Jeremy Finzel (#13)
Re: BRIN index which is much faster never chosen by planner

On Wed, 16 Oct 2019 at 05:05, Jeremy Finzel <finzelj@gmail.com> wrote:

But perhaps it would be worth exploring if there could be more detailed stats on physical vs logical correlation, such as when ANALYZE takes its samples, noting physical locations as well as logical values, and allowing the correlation to take account of that more detailed analysis. Of course, sounds like a huge amount of work with uncertain benefits.

Yes. I think improving the statistics could be beneficial. It does
appear that the single value for the entire column is not fine-grained
enough for your use case.

As the docs state, I do believe that the only use case that will work really well for BRIN is either a truly insert-only table which is never pruned (an idea I dislike as a DBA who wants us to keep OLTP data trim and implement data retention policies :), or a table which is routinely CLUSTERed!

Have you considered partitioning the table so that the retention
policy could be implemented by dropping a partition rather than
performing a bulk DELETE?
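
A rough sketch of that shape (column types and names here are invented purely
for illustration):

CREATE TABLE log_table_part (
    log_id          bigint,
    brand_id        integer,
    category        text,
    field1          text,
    last_change     timestamptz,
    rec_insert_time timestamptz NOT NULL
) PARTITION BY RANGE (rec_insert_time);

-- one partition per month; each partition only ever receives rows for its own
-- window, so physical/logical correlation on rec_insert_time stays high within it
CREATE TABLE log_table_part_2019_10 PARTITION OF log_table_part
    FOR VALUES FROM ('2019-10-01') TO ('2019-11-01');

-- retention then becomes a cheap metadata operation rather than a bulk DELETE + VACUUM
DROP TABLE IF EXISTS log_table_part_2018_10;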

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17 David Rowley
david.rowley@2ndquadrant.com
In reply to: Justin Pryzby (#15)
Re: BRIN index which is much faster never chosen by planner

On Wed, 16 Oct 2019 at 11:40, Justin Pryzby <pryzby@telsasoft.com> wrote:

It didn't occur to me at the time, but that would also allow
creating numerous, partial BRIN indices, each of which would have separate
correlation computed over just their "restricted range", which *might* also
handle your case, depending on how packed your data is.

Perhaps I've misunderstood you, but the correlation that's used is per
column, not per index. The only way to have it calculate multiple
correlations would be to partition the table. There'd be a correlation
for the column on each partition that way.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services