Index vacuum improvements

Started by Heikki Linnakangas on 2006-03-29, 7 messages
#1 Heikki Linnakangas
hlinnaka@iki.fi

Hi,

As we know, index vacuum could be sped up significantly if it didn't have
to lock every page in left-to-right order because of the integrity issue
described in nbtree/README. We could then scan the index in physical
order, and AFAICS combine the btbulkdelete and btvacuumcleanup logic into
just one scan.

Several approaches have been proposed before; see the prior discussion:

http://archives.postgresql.org/pgsql-hackers/2004-04/msg00829.php
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00927.php

The proposals this far have tried to solve the problem by changing the
split and/or vacuum algorithms to make sure that a tuple isn't deleted if
an index scan is stopped on it. That has proven to be hard, so I'm
proposing two alternatives that change the index scan instead:

1. Instead of stopping on the first matching tuple, scan the whole index
page for all matching entries at once. Then move one page to the right,
and logically stop just before the first real item (not the high key) on
that page. Since the scan is not stopped on any specific tuple, no tuple
can be deleted or moved out from under it. When the scan continues, it
can just start scanning from the beginning of the page.

This relies on the fact that items are never moved, except on page split
when they're always moved to the right and to a new page. Page deletion is
already done in two phases ensuring that the page doesn't get deleted
while a scan is stopped on it.
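To make that concrete, here's a rough C sketch of what the scan state and
fetch loop might look like. The names (PageScanState, step_right,
collect_matches_from_page) are made up for illustration; only the basic
types and MaxIndexTuplesPerPage are real:

    /*
     * Hypothetical per-scan state: all matches from the current page
     * are copied out while the page lock is held, then the lock is
     * released before anything is returned.
     */
    typedef struct PageScanState
    {
        BlockNumber     currPage;       /* page whose matches we hold */
        int             nitems;         /* matches collected from it */
        int             itemIndex;      /* next match to hand back */
        ItemPointerData items[MaxIndexTuplesPerPage];
    } PageScanState;

    static bool
    scan_getnext(PageScanState *so, ItemPointer result)
    {
        /* refill from pages to the right until we have a match */
        while (so->itemIndex >= so->nitems)
        {
            BlockNumber next = step_right(so->currPage);

            if (next == InvalidBlockNumber)
                return false;           /* ran off the end of the scan */
            /* lock 'next', copy out all matching items (high key
             * excluded), set nitems/itemIndex, release the lock */
            collect_matches_from_page(so, next);
        }
        *result = so->items[so->itemIndex++];
        return true;
    }

Since no lock or pin is held between calls, vacuum is free to delete or
move anything we've already copied out.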

2. Alternatively, the index scan could store the location of the last
known non-deletable tuple it has encountered, in addition to the tuple it
stops on. When a stopped scan continues, it checks whether the tuple it
was stopped on is still on the same page. If it's not, then instead of
moving right to find it, re-locate the last known non-deletable tuple and
continue the scan from there. There can't be any visible tuples between
the last known non-deletable tuple and the tuple the scan stopped on,
because we would already have encountered any such tuple and would know
by now that it's non-deletable.
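In rough pseudo-C, the restore step might look like the fragment below;
currTuple, lastGood and the helper functions are invented names, not
existing scan-state fields:

    /*
     * Resuming a stopped scan.  scan->currTuple is the tuple we
     * stopped on; scan->lastGood is the last tuple already judged
     * non-deletable.  Both are hypothetical fields.
     */
    if (item_still_on_page(scan->currPage, scan->currTuple))
        resume_at(scan, scan->currPage, scan->currTuple);  /* common case */
    else
    {
        /*
         * The stopped-on tuple was moved by a split or removed by
         * vacuum.  Rather than chasing right-links, re-find the last
         * known non-deletable tuple by key and scan forward from it;
         * everything between it and the old stop point is known to be
         * of no interest to this scan.
         */
        resume_after(scan, refind_by_key(scan, scan->lastGood));
    }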

How do these ideas sound? Is there anything I've missed?

- Heikki

#2 Simon Riggs
simon@2ndquadrant.com
In reply to: Heikki Linnakangas (#1)
Re: Index vacuum improvements

On Wed, 2006-03-29 at 21:48 +0300, Heikki Linnakangas wrote:

> As we know, index vacuum could be sped up significantly if it didn't
> have to lock every page in left-to-right order because of the integrity
> issue described in nbtree/README. We could then scan the index in
> physical order, and AFAICS combine the btbulkdelete and btvacuumcleanup
> logic into just one scan.

First off, we need some good timings that show this effect. I believe
it, but we need some publicly discussable performance test cases to show
the effect and then show how much we've improved upon it, repeatably.

Initially, I'd suggest just trying to improve this situation by
pre-scanning the physical index files into the OS filesystem cache (only)
- i.e. don't lock the files at all. That way, all I/O is sequential into
memory, and after that all random I/O will be logical. But it would
*all* need to fit in cache.
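For illustration, the pre-read could be as dumb as the following plain
POSIX sketch, run against the index's underlying relation file (this is
not backend code, and it assumes we can name the file directly):

    #include <fcntl.h>
    #include <unistd.h>

    /* Read a file sequentially so its blocks land in the OS cache
     * (only); no PostgreSQL buffers or locks are involved. */
    static void
    prewarm_file(const char *path)
    {
        char    buf[8192];              /* one 8 kB block at a time */
        int     fd = open(path, O_RDONLY);

        if (fd < 0)
            return;
    #ifdef POSIX_FADV_WILLNEED
        /* hint the kernel to start readahead over the whole file */
        posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    #endif
        while (read(fd, buf, sizeof(buf)) > 0)
            ;                           /* discard; caching is the point */
        close(fd);
    }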

We might be able to improve the index FSM allocation algorithm so that
we improve the locality of logically adjacent blocks. That way a
larger-than-memory index could be read with a limited cache. We could
then replace the full pre-read with just a limited sequential
read-ahead.

Maybe effective_cache_size could be a real parameter after all?

The existing FSM allocation scheme provides this for certain kinds of
tables, but not others.

Best Regards, Simon Riggs

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Index vacuum improvements

Heikki Linnakangas <hlinnaka@iki.fi> writes:

> 1. Instead of stopping on the first matching tuple, scan the whole
> index page for all matching entries at once.

That loses the ability to reflect tuple deadness back into LP_DELETE
flags, no? Which is a problem already for bitmap indexscans, but I don't
wish to give it up for regular indexscans too. With a solution for that
it might be workable, but I don't see one offhand.

> 2. Alternatively, the index scan could store the location of the last
> known non-deletable tuple it has encountered, in addition to the
> tuple it stops on. When a stopped scan continues, it checks whether
> the tuple it was stopped on is still on the same page. If it's not,
> then instead of moving right to find it, re-locate the last known
> non-deletable tuple and continue the scan from there. There can't be
> any visible tuples between the last known non-deletable tuple and the
> tuple the scan stopped on, because we would already have encountered
> any such tuple and would know by now that it's non-deletable.

This one appears to be assuming MVCC visibility semantics, which means
it will break system catalog operations, and probably foreign-key checks
too.

regards, tom lane

#4 Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: Index vacuum improvements

On Wed, 2006-03-29 at 16:49 -0500, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>
>> 1. Instead of stopping on the first matching tuple, scan the whole
>> index page for all matching entries at once.
>
> That loses the ability to reflect tuple deadness back into LP_DELETE
> flags, no? Which is a problem already for bitmap indexscans, but I
> don't wish to give it up for regular indexscans too. With a solution
> for that it might be workable, but I don't see one offhand.

OK, I was thinking this would mean we'd need to scan the whole page,
making this less efficient for nearly-unique index access. But it
doesn't have to - we can still probe to the starting position and scan
from there.

Sequential access within the block would mean hardware cache prefetch
would kick in for many scans.

If we did do this, the access would be so much more efficient that we
would have enough time to take additional actions to record LP_DELETE
flags when dead tuples exist. Certainly it would be better to make a
single return visit to the block and record *all* LP_DELETEs found in
one go, rather than dirty the block once for each index_getnext and
potentially have it written out more than once as we scan it. Perhaps
use a heuristic: make a return visit to the block to set them only if
more than 3 LP_DELETEs were found.
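In pseudo-C, something like this (ndead, deadOffsets and
_bt_set_lp_deletes are invented names, and the threshold is a guess that
would need testing):

    /*
     * After the heap visibility checks for everything we took from
     * this index page, revisit the page at most once and set all the
     * LP_DELETE hints together, dirtying the page a single time.
     */
    if (ndead > 3)          /* a return trip only pays off if several
                             * tuples turned out to be dead */
        _bt_set_lp_deletes(rel, blkno, deadOffsets, ndead);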

Best Regards, Simon Riggs

#5 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Tom Lane (#3)
Re: Index vacuum improvements

On Wed, 29 Mar 2006, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>
>> 1. Instead of stopping on the first matching tuple, scan the whole
>> index page for all matching entries at once.
>
> That loses the ability to reflect tuple deadness back into LP_DELETE
> flags, no? Which is a problem already for bitmap indexscans, but I
> don't wish to give it up for regular indexscans too. With a solution
> for that it might be workable, but I don't see one offhand.

At first glance, it doesn't look so hard: index_getmulti could mark the
tids that turn out to be dead, and btgetmulti would then rescan the
index page and set LP_DELETE on all such tuples that are still there.

We don't have to care about splits; if an index tuple is no longer where
it used to be, just ignore it. Right?
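Roughly like the following, say, where buf is the pinned index page and
deadTids[] is what the caller reported back; find_tid_on_page,
mark_item_deleted and mark_buffer_dirty_once are invented helpers, though
the buffer-locking calls are real:

    /*
     * Revisit the page and set LP_DELETE on the tids found dead by
     * the heap checks.  Tuples that have moved (split) or vanished
     * (vacuum) are simply skipped; we only lose the hint for them.
     */
    LockBuffer(buf, BT_READ);
    page = BufferGetPage(buf);
    for (i = 0; i < ndead; i++)
    {
        OffsetNumber off = find_tid_on_page(page, &deadTids[i]);

        if (off != InvalidOffsetNumber)
            mark_item_deleted(page, off);   /* hint flag only */
    }
    /* dirty the page once for all the hints we set */
    mark_buffer_dirty_once(buf);
    UnlockReleaseBuffer(buf);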

>> 2. Alternatively, the index scan could store the location of the
>> last known non-deletable tuple it has encountered, in addition to
>> the tuple it stops on. When a stopped scan continues, it checks
>> whether the tuple it was stopped on is still on the same page. If
>> it's not, then instead of moving right to find it, re-locate the
>> last known non-deletable tuple and continue the scan from there.
>> There can't be any visible tuples between the last known
>> non-deletable tuple and the tuple the scan stopped on, because we
>> would already have encountered any such tuple and would know by now
>> that it's non-deletable.
>
> This one appears to be assuming MVCC visibility semantics, which means
> it will break system catalog operations, and probably foreign-key
> checks too.

True. Of course, we could special-case the system catalogs. I don't know
about foreign keys, though; I've never looked at how they work.

- Heikki

#6 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Simon Riggs (#2)
Re: Index vacuum improvements

On Wed, 29 Mar 2006, Simon Riggs wrote:

> First off, we need some good timings that show this effect. I believe
> it, but we need some publicly discussable performance test cases to
> show the effect and then show how much we've improved upon it,
> repeatably.

Yeah, a good vacuum benchmark would be nice, not so much for this specific
case but in general.

> Initially, I'd suggest just trying to improve this situation by
> pre-scanning the physical index files into the OS filesystem cache
> (only) - i.e. don't lock the files at all. That way, all I/O is
> sequential into memory, and after that all random I/O will be logical.
> But it would *all* need to fit in cache.

If the index is small enough to fit in memory, it's not so much of a
problem anyway...

> We might be able to improve the index FSM allocation algorithm so that
> we improve the locality of logically adjacent blocks. That way a
> larger-than-memory index could be read with a limited cache. We could
> then replace the full pre-read with just a limited sequential
> read-ahead.

That would be a good thing for index scan performance too.

> Maybe effective_cache_size could be a real parameter after all?
>
> The existing FSM allocation scheme provides this for certain kinds of
> tables, but not others.

Can you elaborate, please? I couldn't find any evidence of that.

- Heikki

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#5)
Re: Index vacuum improvements

Heikki Linnakangas <hlinnaka@iki.fi> writes:

> On Wed, 29 Mar 2006, Tom Lane wrote:
>
>> That loses the ability to reflect tuple deadness back into LP_DELETE
>> flags, no?
>
> At first glance, it doesn't look so hard: index_getmulti could mark
> the tids that turn out to be dead, and btgetmulti would then rescan
> the index page and set LP_DELETE on all such tuples that are still
> there.
>
> We don't have to care about splits; if an index tuple is no longer
> where it used to be, just ignore it. Right?

True --- as long as there's even a reasonable probability of the tuple
getting marked, we'll get the performance benefit. I don't see a way to
make it work for bitmap indexscans though --- by the time we visit the
heap, the index has long since forgotten where those index entries were.

I think this may be worth doing even disregarding any possible vacuum
speedup, simply because it'll reduce the number of index page lock/unlock
cycles needed during a regular indexscan.

regards, tom lane