Index-only scan performance regression

Started by Dean Rasheed · almost 14 years ago · 10 messages
#1 Dean Rasheed
dean.a.rasheed@gmail.com
1 attachment

Given a freshly created table (not vacuumed), and a query that uses an
index-only scan, for example:

CREATE TABLE foo(a int PRIMARY KEY);
INSERT INTO foo SELECT * FROM generate_series(1,1000000);
ANALYSE foo;

EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;

                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual time=23.646..23.646 rows=1 loops=1)
   ->  Index Only Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975 width=0) (actual time=0.027..22.291 rows=10000 loops=1)
         Index Cond: (a <= 10000)
         Heap Fetches: 10000
 Total runtime: 23.673 ms
(5 rows)

the actual performance is much worse than the equivalent index scan,
as used in 9.1 and earlier:

SET enable_indexonlyscan = off;
EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;

                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual time=3.219..3.219 rows=1 loops=1)
   ->  Index Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975 width=0) (actual time=0.014..2.302 rows=10000 loops=1)
         Index Cond: (a <= 10000)
 Total runtime: 3.240 ms
(4 rows)

Obviously this is the worst-case for an index-only scan, since there
is no visibility map, and it has to fetch each tuple from the heap,
but ideally this should perform around the same as an ordinary index
scan, since it's doing pretty much the same work.

Digging around, it looks like the additional cost is coming from
visibilitymap_test(), which is calling smgrexists() for each tuple, to
see if the visibility map file has been created. So it's doing a file
access check for each row, while the visibility map doesn't exist.

I'm not really familiar with this code, but a possible fix seems to be
to send an invalidation message in vm_extend() when it creates or
extends the visibility map, so that vm_readbuf() only has to re-check
the visibility map file if smgr_vm_nblocks has been invalidated. With
this change (attached) the worst-case index-only scan time becomes
around the same as the index scan time.

Regards,
Dean

Attachments:

index-only-scan.patch (application/octet-stream)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
new file mode 100644
index 76b3a46..472b780
*** a/src/backend/access/heap/visibilitymap.c
--- b/src/backend/access/heap/visibilitymap.c
***************
*** 88,93 ****
--- 88,94 ----
  #include "storage/bufmgr.h"
  #include "storage/lmgr.h"
  #include "storage/smgr.h"
+ #include "utils/inval.h"
  
  
  /*#define TRACE_VISIBILITYMAP */
*************** vm_readbuf(Relation rel, BlockNumber blk
*** 486,497 ****
  
  	/*
  	 * If we haven't cached the size of the visibility map fork yet, check it
! 	 * first.  Also recheck if the requested block seems to be past end, since
! 	 * our cached value might be stale.  (We send smgr inval messages on
! 	 * truncation, but not on extension.)
  	 */
! 	if (rel->rd_smgr->smgr_vm_nblocks == InvalidBlockNumber ||
! 		blkno >= rel->rd_smgr->smgr_vm_nblocks)
  	{
  		if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
  			rel->rd_smgr->smgr_vm_nblocks = smgrnblocks(rel->rd_smgr,
--- 487,497 ----
  
  	/*
  	 * If we haven't cached the size of the visibility map fork yet, check it
! 	 * first.  This will also catch any changes to the visibility map's size
! 	 * made by other backends, since we send smgr inval messages on create,
! 	 * extend and truncate.
  	 */
! 	if (rel->rd_smgr->smgr_vm_nblocks == InvalidBlockNumber)
  	{
  		if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
  			rel->rd_smgr->smgr_vm_nblocks = smgrnblocks(rel->rd_smgr,
*************** vm_extend(Relation rel, BlockNumber vm_n
*** 560,565 ****
--- 560,575 ----
  
  	vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
  
+ 	/*
+ 	 * Send a shared-inval message to force other backends to close any smgr
+ 	 * references they may have for this rel, which we are about to change.
+ 	 * This is a useful optimization because it means that backends don't have
+ 	 * to keep checking for creation or extension of the file, which happens
+ 	 * infrequently.
+ 	 */
+ 	CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);
+ 
+ 	/* Now extend the file */
  	while (vm_nblocks_now < vm_nblocks)
  	{
  		smgrextend(rel->rd_smgr, VISIBILITYMAP_FORKNUM, vm_nblocks_now,
#2 Robert Haas
robertmhaas@gmail.com
In reply to: Dean Rasheed (#1)
Re: Index-only scan performance regression

On Sun, Jan 29, 2012 at 2:47 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

> Obviously this is the worst-case for an index-only scan, since there
> is no visibility map, and it has to fetch each tuple from the heap,
> but ideally this should perform around the same as an ordinary index
> scan, since it's doing pretty much the same work.

Yeah.

> Digging around, it looks like the additional cost is coming from
> visibilitymap_test(), which is calling smgrexists() for each tuple, to
> see if the visibility map file has been created. So it's doing a file
> access check for each row, while the visibility map doesn't exist.
>
> I'm not really familiar with this code, but a possible fix seems to be
> to send an invalidation message in vm_extend() when it creates or
> extends the visibility map, so that vm_readbuf() only has to re-check
> the visibility map file if smgr_vm_nblocks has been invalidated. With
> this change (attached) the worst-case index-only scan time becomes
> around the same as the index scan time.

I think that the additional sinval message might not do much good,
because it won't necessarily be seen until the next time any given
other backend does AcceptInvalidationMessages(), which might be a
while. That's not an issue for truncation because it requires
AccessExclusiveLock, so we know that any subsequent access to the
table will involve taking a heavyweight lock, which will
AcceptInvalidationMessages(). Making vismap extension require
AccessExclusiveLock would be a cure worse than the disease.

In the case of an index-only scan, it wouldn't be the end of the world
if we failed to notice that the relation had been extended. The worst
case would be that we wouldn't read the visibility map buffer and, as
you say, that shouldn't be any worse than a normal index scan. The
only time we had better not fail to notice that the visibility map has
been extended is when we're asked to clear a bit, because failure to
notice that the new page is there - with a set bit that must now be
cleared - could result in queries returning wrong answers. So one
possibility would be to tweak the relcache stuff to have a flag
indicating whether the value is cached, and make that separate from
the cached value. Then, if the value is cached, and we're doing
anything other than clearing a bit, we just say, eh, sorry, page isn't
there. If we're clearing a bit then we go to the trouble of
revalidating.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Robert Haas (#2)
Re: Index-only scan performance regression

On 31 January 2012 17:35, Robert Haas <robertmhaas@gmail.com> wrote:

> In the case of an index-only scan, it wouldn't be the end of the world
> if we failed to notice that the relation had been extended.  The worst
> case would be that we wouldn't read the visibility map buffer and, as
> you say, that shouldn't be any worse than a normal index scan.  The
> only time we had better not fail to notice that the visibility map has
> been extended is when we're asked to clear a bit, because failure to
> notice that the new page is there - with a set bit that must now be
> cleared - could result in queries returning wrong answers.  So one
> possibility would be to tweak the relcache stuff to have a flag
> indicating whether the value is cached, and make that separate from
> the cached value.  Then, if the value is cached, and we're doing
> anything other than clearing a bit, we just say, eh, sorry, page isn't
> there.  If we're clearing a bit then we go to the trouble of
> revalidating.
>
> Thoughts?

In the case when we're asked to clear a bit, it would first try to pin
the relevant page, which would go through vm_readbuf() with extend set
to true. Then vm_extend() would notice that the visibility map had
already been extended, and it would read in the new page with the set
bit. So this case would continue to work, wouldn't it?

Regards,
Dean

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Dean Rasheed (#3)
Re: Index-only scan performance regression

On Tue, Jan 31, 2012 at 2:15 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

> In the case when we're asked to clear a bit, it would first try to pin
> the relevant page, which would go through vm_readbuf() with extend set
> to true. Then vm_extend() would notice that the visibility map had
> already been extended, and it would read in the new page with the set
> bit. So this case would continue to work, wouldn't it?

Ah, maybe. I haven't tried to code it up. Do you want to take a crack at it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Robert Haas (#4)
Re: Index-only scan performance regression

On 31 January 2012 19:49, Robert Haas <robertmhaas@gmail.com> wrote:

> Ah, maybe.  I haven't tried to code it up.  Do you want to take a crack at it?

Well, the patch upthread works when it comes to reducing the cost of
testing bits in the visibility map, by not bothering to look for pages
which appear to be past the end of the file. And when pinning pages
for setting or clearing bits, it still checks for new pages created by
other backends. The thing I'm unsure about is whether sending sinval
messages when the visibility map is extended is a good idea. It could
work without that, but then a backend would never notice that the
visibility map had been created or extended by another process, so
its index-only scans would continue to perform like normal index
scans. The sinval messages would cure that (eventually) but I'm not
sure what other side-effects that might have.

Regards,
Dean

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#5)
Re: Index-only scan performance regression

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

> The thing I'm unsure about is whether sending sinval
> messages when the visibility map is extended is a good idea.

Seems perfectly reasonable to me. They'd occur so seldom as to be
more than repaid if we can scrape some cost out of the mainline paths.

The real objection to this probably is that if it only saves anything
for tables that don't have a VM yet, it's dubious whether it's worth
doing. But if we can avoid wasted checks for VM extension as well,
then I think it's probably a no-brainer.

regards, tom lane

#7 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#6)
Re: Index-only scan performance regression

On 31 January 2012 23:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> The thing I'm unsure about is whether sending sinval
>> messages when the visibility map is extended is a good idea.
>
> Seems perfectly reasonable to me.  They'd occur so seldom as to be
> more than repaid if we can scrape some cost out of the mainline paths.

OK, thanks. That's good.

> The real objection to this probably is that if it only saves anything
> for tables that don't have a VM yet, it's dubious whether it's worth
> doing.  But if we can avoid wasted checks for VM extension as well,
> then I think it's probably a no-brainer.

Yes, it applies in the same way to VM extension - if the table has
grown and the VM has not yet been extended - but I don't see why that
is any worse than the case of not having a VM yet.

Actually, I think that this is not such an uncommon case: for a table
which has only had data inserted - no deletes or updates - it is
tempting to think that ANALYSE is sufficient, and that there is no
need to VACUUM. If it were simply the case that this caused an
index-only scan to have no real benefit, you might be willing to live
with normal index scan performance. But actually it causes a very
significant performance regression beyond that, to well below 9.1
performance.

Regards,
Dean

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Dean Rasheed (#7)
Re: Index-only scan performance regression

On Wed, Feb 1, 2012 at 4:09 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

> Actually I think that this is not such an uncommon case - for a table
> which has only had data inserted - no deletes or updates - it is
> tempting to think that ANALYSE is sufficient, and that there is no
> need to VACUUM. If it were simply the case that this caused an
> index-only scan to have no real benefit, you might be willing to live
> with normal index scan performance. But actually it causes a very
> significant performance regression beyond that, to well below 9.1
> performance.

So, I guess the trade-off here is that, since sinval messages aren't
processed immediately, we often won't notice the VM extension until
the next statement starts, whereas with the current implementation, we
notice it right away. On the other hand, noticing it right away is
costing us a system call or two per tuple. So on further thought, I
think we should do this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#8)
Re: Index-only scan performance regression

On Wed, Feb 1, 2012 at 7:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> So, I guess the trade-off here is that, since sinval messages aren't
> processed immediately, we often won't notice the VM extension until
> the next statement starts, whereas with the current implementation, we
> notice it right away.  On the other hand, noticing it right away is
> costing us a system call or two per tuple.  So on further thought, I
> think we should do this.

Patch committed. I moved the smgr inval to after the actual extension
is done, which seems superior, and adjusted the comments slightly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Robert Haas (#9)
Re: Index-only scan performance regression

On 2 February 2012 01:40, Robert Haas <robertmhaas@gmail.com> wrote:

>> So, I guess the trade-off here is that, since sinval messages aren't
>> processed immediately, we often won't notice the VM extension until
>> the next statement starts, whereas with the current implementation, we
>> notice it right away.  On the other hand, noticing it right away is
>> costing us a system call or two per tuple.  So on further thought, I
>> think we should do this.

Yes, that's a nice summary.

> Patch committed.  I moved the smgr inval to after the actual extension
> is done, which seems superior, and adjusted the comments slightly.

Thanks.

Regards,
Dean