rows estimate in explain analyze for the BRIN index

Started by Oleksii Kliukinabout 10 years ago14 messages

alexk@hintbits.com

about 10 years ago

Hi,

While experimenting with BRIN on PostgreSQL 9.5RC1 I came across the following plan (which is, btw a very good example of how BRIN rocks for the clustered data, the size of this table is around 90GB, the size of the index is around 3MB):

explain (analyze, buffers, verbose) select 1 from example where event_time = now() - interval '5 months';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on example (cost=744.44..757.64 rows=6 width=0) (actual time=73.895..73.895 rows=0 loops=1)
Output: 1
Recheck Cond: (example.event_time = (now() - '5 mons'::interval))
Rows Removed by Index Recheck: 4030
Heap Blocks: lossy=128
Buffers: shared hit=629
-> Bitmap Index Scan on example_event_time_idx1 (cost=0.00..744.41 rows=6 width=0) (actual time=70.335..70.335 rows=1280 loops=1)
Index Cond: (example.event_time = (now() - '5 mons'::interval))
Buffers: shared hit=501
Planning time: 0.125 ms
Execution time: 73.943 ms
(11 rows)

Time: 74.642 ms

Here EXPLAIN ANALYZE reports 1280 rows from the Bitmap Index Scan, but on the higher level, 4030 rows were removed by Index Recheck.

The questions are:

- how does it get 1280 rows from the BRIN index scan, given that BRIN only stores pointers to the heap blocks, not individual rows. Does it calculate the number of rows in the blocks returned?

- how comes that the subsequent recheck filters out 4030 rows, out of supposedly 1280 returned?

Kind regards,
--
Oleksii

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Oleksii Kliukin (#1)

Re: rows estimate in explain analyze for the BRIN index

Oleksii Kliukin <alexk@hintbits.com> writes:

Bitmap Heap Scan on example (cost=744.44..757.64 rows=6 width=0) (actual time=73.895..73.895 rows=0 loops=1)
Output: 1
Recheck Cond: (example.event_time = (now() - '5 mons'::interval))
Rows Removed by Index Recheck: 4030
Heap Blocks: lossy=128
Buffers: shared hit=629
-> Bitmap Index Scan on example_event_time_idx1 (cost=0.00..744.41 rows=6 width=0) (actual time=70.335..70.335 rows=1280 loops=1)
Index Cond: (example.event_time = (now() - '5 mons'::interval))
Buffers: shared hit=501

- how does it get 1280 rows from the BRIN index scan, given that BRIN only stores pointers to the heap blocks, not individual rows. Does it calculate the number of rows in the blocks returned?

It evidently returned 128 block IDs to the heapscan logic. I have not
looked at the code, but a reasonable bet is that it's just guessing that
there are 10 rows per block.

That seems like an awfully low number, though; it equates to assuming
that rows are 800 bytes wide on average. If we're going to use a fixed
number, 100 rows per block would probably be more nearly the correct
order of magnitude.

Another idea would be to use the heap's row density as calculated
by the last ANALYZE (ie, reltuples/relpages), with a fallback to 100
if relpages=0. This'd only be convenient if the bitmap scan node has
the parent heap rel open, which it might not.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Oleksii Kliukin

alexk@hintbits.com

about 10 years ago

In reply to: Tom Lane (#2)

Re: rows estimate in explain analyze for the BRIN index

On 30 Dec 2015, at 17:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oleksii Kliukin <alexk@hintbits.com> writes:

Bitmap Heap Scan on example (cost=744.44..757.64 rows=6 width=0) (actual time=73.895..73.895 rows=0 loops=1)
Output: 1
Recheck Cond: (example.event_time = (now() - '5 mons'::interval))
Rows Removed by Index Recheck: 4030
Heap Blocks: lossy=128
Buffers: shared hit=629
-> Bitmap Index Scan on example_event_time_idx1 (cost=0.00..744.41 rows=6 width=0) (actual time=70.335..70.335 rows=1280 loops=1)
Index Cond: (example.event_time = (now() - '5 mons'::interval))
Buffers: shared hit=501

- how does it get 1280 rows from the BRIN index scan, given that BRIN only stores pointers to the heap blocks, not individual rows. Does it calculate the number of rows in the blocks returned?

It evidently returned 128 block IDs to the heapscan logic. I have not
looked at the code, but a reasonable bet is that it's just guessing that
there are 10 rows per block.

You are right, this is at the end of bringetbitmap in brin.c
/*
* XXX We have an approximation of the number of *pages* that our scan
* returns, but we don't have a precise idea of the number of heap tuples
* involved.
*/
PG_RETURN_INT64(totalpages * 10);

That seems like an awfully low number, though; it equates to assuming
that rows are 800 bytes wide on average. If we're going to use a fixed
number, 100 rows per block would probably be more nearly the correct
order of magnitude.

Another idea would be to use the heap's row density as calculated
by the last ANALYZE (ie, reltuples/relpages), with a fallback to 100
if relpages=0. This'd only be convenient if the bitmap scan node has
the parent heap rel open, which it might not.

Kind regards,
--
Oleksii

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Oleksii Kliukin (#3)

1 attachment(s)

Re: rows estimate in explain analyze for the BRIN index

Oleksii Kliukin <alexk@hintbits.com> writes:

On 30 Dec 2015, at 17:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Another idea would be to use the heap's row density as calculated
by the last ANALYZE (ie, reltuples/relpages), with a fallback to 100
if relpages=0. This'd only be convenient if the bitmap scan node has
the parent heap rel open, which it might not.

+1

Any objections to the attached?

regards, tom lane

Attachments:

brin-better-tuple-estimate.patchtext/x-diff; charset=us-ascii; name=brin-better-tuple-estimate.patchDownload

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2622a7e..64bf788 100644
*** a/src/backend/access/brin/brin.c
--- b/src/backend/access/brin/brin.c
*************** bringetbitmap(PG_FUNCTION_ARGS)
*** 279,286 ****
  	Relation	heapRel;
  	BrinOpaque *opaque;
  	BlockNumber nblocks;
  	BlockNumber heapBlk;
! 	int			totalpages = 0;
  	FmgrInfo   *consistentFn;
  	MemoryContext oldcxt;
  	MemoryContext perRangeCxt;
--- 279,287 ----
  	Relation	heapRel;
  	BrinOpaque *opaque;
  	BlockNumber nblocks;
+ 	int			heap_tuples_per_page;
  	BlockNumber heapBlk;
! 	int64		totalpages = 0;
  	FmgrInfo   *consistentFn;
  	MemoryContext oldcxt;
  	MemoryContext perRangeCxt;
*************** bringetbitmap(PG_FUNCTION_ARGS)
*** 291,301 ****
  
  	/*
  	 * We need to know the size of the table so that we know how long to
! 	 * iterate on the revmap.
  	 */
  	heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
  	heapRel = heap_open(heapOid, AccessShareLock);
  	nblocks = RelationGetNumberOfBlocks(heapRel);
  	heap_close(heapRel, AccessShareLock);
  
  	/*
--- 292,309 ----
  
  	/*
  	 * We need to know the size of the table so that we know how long to
! 	 * iterate on the revmap.  While we have it open, estimate the number of
! 	 * tuples per heap page for use later.
  	 */
  	heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
  	heapRel = heap_open(heapOid, AccessShareLock);
  	nblocks = RelationGetNumberOfBlocks(heapRel);
+ 	if (heapRel->rd_rel->relpages != 0 && heapRel->rd_rel->reltuples > 0)
+ 		heap_tuples_per_page = (int)
+ 			((double) heapRel->rd_rel->reltuples /
+ 			 (BlockNumber) heapRel->rd_rel->relpages);
+ 	else	/* if no info, assume 100-byte tuples */
+ 		heap_tuples_per_page = BLCKSZ / 100;
  	heap_close(heapRel, AccessShareLock);
  
  	/*
*************** bringetbitmap(PG_FUNCTION_ARGS)
*** 447,457 ****
  		ReleaseBuffer(buf);
  
  	/*
! 	 * XXX We have an approximation of the number of *pages* that our scan
! 	 * returns, but we don't have a precise idea of the number of heap tuples
! 	 * involved.
  	 */
! 	PG_RETURN_INT64(totalpages * 10);
  }
  
  /*
--- 455,465 ----
  		ReleaseBuffer(buf);
  
  	/*
! 	 * We have an approximation of the number of pages that our scan returns,
! 	 * but we don't have a precise idea of the number of heap tuples involved.
! 	 * We have to estimate based on average tuple density.
  	 */
! 	PG_RETURN_INT64(totalpages * heap_tuples_per_page);
  }
  
  /*

Oleksii Kliukin

alexk@hintbits.com

about 10 years ago

In reply to: Tom Lane (#4)

Re: rows estimate in explain analyze for the BRIN index

On 30 Dec 2015, at 17:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oleksii Kliukin <alexk@hintbits.com> writes:

On 30 Dec 2015, at 17:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Another idea would be to use the heap's row density as calculated
by the last ANALYZE (ie, reltuples/relpages), with a fallback to 100
if relpages=0. This'd only be convenient if the bitmap scan node has
the parent heap rel open, which it might not.

+1

Any objections to the attached?

Looks good to me. On my sample system with 100K rows, the new version gives me:

— CREATE TABLE test AS SELECT id FROM generate_series(1,100000) id;
— CREATE INDEX ON test USING brin(id);

postgres=# explain analyze select 1 from test where id = 500;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=12.01..16.02 rows=1 width=0) (actual time=0.199..4.220 rows=1 loops=1)
Recheck Cond: (id = 500)
Rows Removed by Index Recheck: 28927
Heap Blocks: lossy=128
-> Bitmap Index Scan on test_id_idx (cost=0.00..12.01 rows=1 width=0) (actual time=0.072..0.072 rows=28800 loops=1)
Index Cond: (id = 500)
Planning time: 0.433 ms
Execution time: 4.323 ms
(8 rows)

which is much closer to the actual number of rows removed by the index recheck + the one left.

--
Oleksii

Emre Hasegeli

emre@hasegeli.com

about 10 years ago

In reply to: Oleksii Kliukin (#5)

Re: rows estimate in explain analyze for the BRIN index

which is much closer to the actual number of rows removed by the index
recheck + the one left.

Is it better to be closer? We are saying those are the "actual"
values not the estimates. If we cannot provide the actual rows, I
think it is better to provide nothing. Something closer to the
reality would create more confusion. Maybe, we just just return the
number of blocks, and put somewhere a note about it. The actual row
count is already available on the upper part of the plan.

By the way, the estimation is a bigger problem than that. Please see
my patch [1]/messages/by-id/CAE2gYzzJvzPy-1cSgZjJyH69izSa13SEgFC4i4r2z0qBQ2P8Uw@mail.gmail.com about it.

[1]: /messages/by-id/CAE2gYzzJvzPy-1cSgZjJyH69izSa13SEgFC4i4r2z0qBQ2P8Uw@mail.gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Emre Hasegeli (#6)

Re: rows estimate in explain analyze for the BRIN index

Emre Hasegeli <emre@hasegeli.com> writes:

which is much closer to the actual number of rows removed by the index
recheck + the one left.

Is it better to be closer? We are saying those are the "actual"
values not the estimates. If we cannot provide the actual rows, I
think it is better to provide nothing.

Return zero you mean? Good point; there's precedent for that elsewhere
in EXPLAIN ANALYZE, IIRC. If we do what I propose in my patch, it would
possibly just lead to more questions like Oleksii's, because people would
be even more likely to believe that the number is accurate.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Oleksii Kliukin

alexk@hintbits.com

about 10 years ago

In reply to: Emre Hasegeli (#6)

Re: rows estimate in explain analyze for the BRIN index

On 30 Dec 2015, at 18:38, Emre Hasegeli <emre@hasegeli.com> wrote:

which is much closer to the actual number of rows removed by the index
recheck + the one left.

Is it better to be closer? We are saying those are the "actual"
values not the estimates. If we cannot provide the actual rows, I
think it is better to provide nothing. Something closer to the
reality would create more confusion. Maybe, we just just return the
number of blocks, and put somewhere a note about it. The actual row
count is already available on the upper part of the plan.

I don’t see how to solve this problem without changing explain analyze output to accommodate for “unknown” value. I don’t think “0” is a non-confusing representation of “unknown” for most people, and from the practical standpoint, a “best effort” estimate is better than 0 (i.e. I will be able to estimate how efficient BRIN index is for my tables in terms of the number of tuples retrieved/thrown away)

We might still reflect in the documentation that the BRIN index cannot produce the exact number of rows during the bitmap scan and point people asking similar questions there.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Emre Hasegeli

emre@hasegeli.com

about 10 years ago

In reply to: Oleksii Kliukin (#8)

Re: rows estimate in explain analyze for the BRIN index

I don’t see how to solve this problem without changing explain analyze output to accommodate for “unknown” value. I don’t think “0” is a non-confusing representation of “unknown” for most people, and from the practical standpoint, a “best effort” estimate is better than 0 (i.e. I will be able to estimate how efficient BRIN index is for my tables in terms of the number of tuples retrieved/thrown away)

The number of retrieved and thrown away rows are already available on
the upper part of the plan. Bitmap Index Scan should provide the rows
that matched the index. Another alternative would be just returning
the number of matching pages (by not multiplying with 10). It might
be better understood. The users who can understand the EXPLAIN
ANALYZE output shouldn't be expecting BRIN to return rows.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Emre Hasegeli (#9)

Re: rows estimate in explain analyze for the BRIN index

Emre Hasegeli <emre@hasegeli.com> writes:

I don’t see how to solve this problem without changing explain analyze output to accommodate for “unknown” value. I don’t think “0” is a non-confusing representation of “unknown” for most people, and from the practical standpoint, a “best effort” estimate is better than 0 (i.e. I will be able to estimate how efficient BRIN index is for my tables in terms of the number of tuples retrieved/thrown away)

We do already have a nearby precedent for returning zero when we don't
have an accurate answer: that's what BitmapAnd and BitmapOr plan nodes
do. (This is documented btw, at the bottom of section 14.1.)

The number of retrieved and thrown away rows are already available on
the upper part of the plan. Bitmap Index Scan should provide the rows
that matched the index.

It doesn't have that information.

Another alternative would be just returning
the number of matching pages (by not multiplying with 10). It might
be better understood.

No, it would not, at least not unless we found a way to explicitly mark
the output as being blocks not rows (which would doubtless break a lot of
existing client-side code). Zero is fairly clearly an impossible value,
whereas anything that's not zero is going to be taken at face value by
many users.

On balance I think likely the best thing to do is return zero, and
document that behavior.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Oleksii Kliukin

alexk@hintbits.com

about 10 years ago

In reply to: Tom Lane (#10)

Re: rows estimate in explain analyze for the BRIN index

On 30 Dec 2015, at 21:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Emre Hasegeli <emre@hasegeli.com> writes:

I don’t see how to solve this problem without changing explain analyze output to accommodate for “unknown” value. I don’t think “0” is a non-confusing representation of “unknown” for most people, and from the practical standpoint, a “best effort” estimate is better than 0 (i.e. I will be able to estimate how efficient BRIN index is for my tables in terms of the number of tuples retrieved/thrown away)

We do already have a nearby precedent for returning zero when we don't
have an accurate answer: that's what BitmapAnd and BitmapOr plan nodes
do. (This is documented btw, at the bottom of section 14.1.)

+1 for following a precedent.

The number of retrieved and thrown away rows are already available on
the upper part of the plan. Bitmap Index Scan should provide the rows
that matched the index.

It doesn't have that information.

Another alternative would be just returning
the number of matching pages (by not multiplying with 10). It might
be better understood.

No, it would not, at least not unless we found a way to explicitly mark
the output as being blocks not rows (which would doubtless break a lot of
existing client-side code). Zero is fairly clearly an impossible value,
whereas anything that's not zero is going to be taken at face value by
many users.

But is it? Is it impossible for the BRIN bitmap index scan to return 0 rows (say, if the value being matched is outside the min/max boundary for every block range?) Granted, if we document that it always returns 0 and should be ignored, then confusing the actual 0 with the 0 as a representation of “unknown” would be less a problem.

--
Oleksii

#12

Alvaro Herrera

alvherre@2ndquadrant.com

about 10 years ago

In reply to: Tom Lane (#10)

Re: rows estimate in explain analyze for the BRIN index

Tom Lane wrote:

Emre Hasegeli <emre@hasegeli.com> writes:

I don’t see how to solve this problem without changing explain analyze output to accommodate for “unknown” value. I don’t think “0” is a non-confusing representation of “unknown” for most people, and from the practical standpoint, a “best effort” estimate is better than 0 (i.e. I will be able to estimate how efficient BRIN index is for my tables in terms of the number of tuples retrieved/thrown away)

We do already have a nearby precedent for returning zero when we don't
have an accurate answer: that's what BitmapAnd and BitmapOr plan nodes
do. (This is documented btw, at the bottom of section 14.1.)

Hmm, but amgetbitmap is documented thusly:

The number of tuples fetched is returned (this might be just an
approximate count, for instance some AMs do not detect
duplicates).
http://www.postgresql.org/docs/9.5/static/index-functions.html

so I'm not sure we're actually violating an expectation that the number
of rows is exact. I mean, if we zero out this one, shouldn't we set it
to zero for these other documented cases too?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Alvaro Herrera (#12)

Re: rows estimate in explain analyze for the BRIN index

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Tom Lane wrote:

We do already have a nearby precedent for returning zero when we don't
have an accurate answer: that's what BitmapAnd and BitmapOr plan nodes
do. (This is documented btw, at the bottom of section 14.1.)

Hmm, but amgetbitmap is documented thusly:

The number of tuples fetched is returned (this might be just an
approximate count, for instance some AMs do not detect
duplicates).
http://www.postgresql.org/docs/9.5/static/index-functions.html

so I'm not sure we're actually violating an expectation that the number
of rows is exact. I mean, if we zero out this one, shouldn't we set it
to zero for these other documented cases too?

Well, the code comments may say that, but the user-facing docs don't ...

More generally, people are accustomed to the idea that the planner's
estimated rowcounts are just estimates, but they're not accustomed to
the idea that the runtime "actual" rowcounts might be just estimates.
That seems like it's moving EXPLAIN ANALYZE's contract quite a bit,
even if there are some documents for internal APIs that say something
else.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Emre Hasegeli

emre@hasegeli.com

about 10 years ago

In reply to: Oleksii Kliukin (#11)

Re: rows estimate in explain analyze for the BRIN index

But is it? Is it impossible for the BRIN bitmap index scan to return 0 rows
(say, if the value being matched is outside the min/max boundary for every
block range?) Granted, if we document that it always returns 0 and should be
ignored, then confusing the actual 0 with the 0 as a representation of
“unknown” would be less a problem.

How about -1 ? It is an impossible value for sure. Maybe we should
change BitmapAnd and BitmapOr nodes, too. It is better to make it
obvious that it is not the correct value. I don't think many people
would notice the note on the documentation.

On the other hand, returning -1 broke parser of explain.depesz.com [1]http://explain.depesz.com/s/tAkd.

[1]: http://explain.depesz.com/s/tAkd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers