Brain dump: btree collapsing

Started by Tom Lane · about 23 years ago · 44 messages · pgsql-hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

I've been thinking hard for the last few days about how to do space
reclamation in b-tree indexes, i.e., recycle pages that are in
no-longer-useful portions of the tree structure. We know we need this to
solve the "index bloat" problem that occurs when the distribution of keys
changes over time. I feel that it's critical that the reclamation be doable
by plain VACUUM, ie, without acquiring exclusive lock on the index as a
whole. This discussion therefore assumes that page deletion must be able
to operate in parallel with insertions and searches.

Issues
------

We need to get rid of parent links in btree pages; otherwise removal of a
non-leaf page implies we must find and update all parent links that lead
to it. This is messy enough that it would be better to do without. The
only thing the parent link is really needed for is to find the new parent
level after a root split, and we can handle that (very infrequent) case by
re-descending from the new root.

Instead of storing parent links, label all pages with level (counting
levels up from zero = leaf, so that a page's level doesn't change in a
root split). Then, if we find ourselves needing to re-descend, we can be
sure of finding the correct parent level, one above where we were, even if
there have been multiple root splits meanwhile. The level will also make for
a useful cross-check on link validity: we will always know the level of
the page we expect to arrive at when following a link, so we can check
that the page has the right level.
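For illustration, the level cross-check might look like this in C (all names invented; a sketch of the idea, not the actual nbtree page header):

```c
/* Hypothetical page header carrying the level field described above
 * (0 = leaf, counting up; a page's level never changes, even across
 * root splits). */
typedef struct BtPageHeader
{
    unsigned level;             /* 0 for leaf pages */
    /* ... sidelinks, flags, items ... */
} BtPageHeader;

/* Following a downlink should land us exactly one level below the
 * page we came from; a mismatch means a corrupt or stale link, and
 * the caller should error out or re-descend from the root. */
int
bt_check_descent(const BtPageHeader *parent, const BtPageHeader *child)
{
    return child->level + 1 == parent->level;
}
```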

Unfortunately this means tossing out most of the FixBtree code Vadim wrote
2 years ago, because it seems critically dependent on having parent links.
But I don't really see why we need it if we rely on WAL to maintain btree
consistency. That will require some improvements in WAL-logging for
btrees, however. (Details below.)

When a page is deleted, it can't actually be recycled until there are no
more potentially in-flight references to it (ie, readers that intend to
visit the page but have not yet acquired a lock on it). Those readers
must be able to find the page, realize it's dead, and follow the correct
sidelink from it. [Lanin&Shasha86] describe the "drain technique", which
they define as "delay freeing the empty page until the termination of all
processes whose locate phase began when pointers to the page still
existed". We can conveniently implement this by reference to
transactions: after removing all links to the page, label the now-dead
page with the current contents of the next-transaction-ID counter. VACUUM
can recycle the page when this is older than the oldest open transaction.
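A minimal sketch of that recyclability test (hypothetical helper name; XID wraparound deliberately ignored):

```c
typedef unsigned int TransactionId;

/* At deletion, the dead page is stamped with the then-current value of
 * the next-transaction-ID counter.  It may be recycled only once that
 * XID is older than the oldest transaction still running: by then,
 * every scan whose descent began while links to the page existed has
 * terminated, so no in-flight reader can still arrive at the page. */
int
bt_page_is_recyclable(TransactionId deleted_xid, TransactionId oldest_running_xid)
{
    /* naive comparison; real transaction IDs wrap around and would
     * need modulo comparison, ignored here for clarity */
    return deleted_xid < oldest_running_xid;
}
```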

Instead of an actively maintained freelist on disk as per Alvaro Herrera's
patch, I plan to use the FSM to remember where recyclable pages are, much
as we do for tables. The FSM space requirements would be small, since
we'd not be needing to enter any data about partially-full pages; only
truly empty, recyclable pages would need to be stored. (Is it worth
having an alternate representation in the FSM for indexes, so that we only
store page numbers and not the useless amount-free statistic?)

Without a freelist on disk, VACUUM would need to scan indexes linearly to
find dead pages, but that seems okay; I'm thinking of doing that anyway to
look for empty pages to turn into dead ones.

Restructuring the tree during page deletion
-------------------------------------------

We will delete only completely-empty pages. If we were to merge nearly-empty
pages by moving data items from one page to an adjacent one, this would
imply changing the parent's idea of the bounding key between them ---
which is okay if we are just deleting an internal key in the parent, but
what if the pages have different parent pages? We'd have to adjust the
parents' own bounding key, meaning the parents' parent changes, perhaps
all the way to the root. (Not to mention that with variable-size keys,
there's no guarantee we can make such changes without splitting the
upper-level pages.) And, since we support both forward and backward
index scans, we can't move leaf items in either direction without risking
having a concurrent scan miss them. This is way too messy, especially for
something that has only minimal return according to the literature
[Johnson89]. So, no merging.

Deletion of an empty page only requires removal of the parent's item
linking to it (plus fixing side pointers, which is pretty trivial). We
also remove the next higher key in the parent, which is the parent's upper
bound for data that would have belonged on the target page. Therefore,
the page's right sibling becomes responsible for storing the key range
that used to belong on the deleted page.

What if there is no next-higher key, you ask? Well, I'm going to punt.
It is critical that the key space associated with a parent page match the key
space associated with its children (eg, the high key of the rightmost child
must match the parent's high key). There is no way to atomically modify a
parent's key space --- this would mean simultaneously changing keys in upper
levels, perhaps all the way up to the root. To avoid needing to do that, we
must put a couple of restrictions on deletion:

1. The rightmost page in any tree level is never deleted, period. (This rule
also simplifies traversal algorithms, as we'll see below.)

2. The rightmost child of a non-rightmost parent page can't be deleted, either,
unless it is the last child of that parent. If it is the last child then the
parent is *immediately* marked as half-dead, meaning it can never acquire any
new children; its key space implicitly transfers to its right sibling. (The
parent can then be deleted in a later, separate atomic action. Note that if
the parent is itself a rightmost child of a non-rightmost parent, it will have
to stay in the half-dead state until it becomes the only child; then it can be
deleted and its parent becomes half-dead.)

(Note that while leaf pages can be empty and still alive, upper pages
can't: they must have children to delegate their key range to.)
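The two restrictions could be summarized in code roughly as follows (all names hypothetical; a sketch of the decision logic only):

```c
#include <stdbool.h>

/* Hypothetical summary of a page's position in the tree, just enough
 * to express restrictions 1 and 2. */
typedef struct PagePos
{
    bool is_rightmost_on_level; /* rightmost page of its tree level */
    bool is_rightmost_child;    /* last downlink of a non-rightmost parent */
    bool is_only_child;         /* parent has no other children */
} PagePos;

/* Decide whether an empty page may be deleted; *make_parent_half_dead
 * reports whether the deletion leaves the parent childless, in which
 * case the parent must immediately be marked half-dead. */
bool
bt_deletion_allowed(const PagePos *p, bool *make_parent_half_dead)
{
    *make_parent_half_dead = false;
    if (p->is_rightmost_on_level)
        return false;           /* restriction 1: never deleted */
    if (p->is_rightmost_child && !p->is_only_child)
        return false;           /* restriction 2: must wait */
    if (p->is_only_child)
        *make_parent_half_dead = true;  /* parent's key space moves right */
    return true;
}
```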

With these restrictions, there is no need to alter the "high key" fields of
either the parent or the siblings of a page being deleted. The key space of
the page itself transfers to its right sibling, and the key space of the
parent does not change (except in the case where the parent loses all its key
space to its right sibling and becomes half-dead).

Restriction 1 is not a significant one in terms of space wastage. Restriction
2 is more annoying, but it should not increase overhead drastically. The
parent would not be deleted or merged anyway as long as it has other children,
so the worst-case overhead from this restriction is one extra page per
parent page --- and parent levels are normally much smaller than child levels.

The notion of a half-dead page means that the key space relationship between
the half-dead page's level and its parent's level may be a little out of
whack: key space that appears to belong to the half-dead page's parent on the
parent level may really belong to its right sibling. We can tolerate this,
however, because insertions and deletions on upper tree levels are always
done by reference to child page numbers, not keys. The only cost is that
searches may sometimes descend to the half-dead page and then have to move
right, rather than going directly to the sibling page.

Page deletion procedure
-----------------------

Assume we know the target page, but hold no locks. The locks must be
gotten in this order to avoid deadlock against inserters:

1. Obtain W lock on left sibling of target page (this may require searching
right if left sibling splits concurrently; same as for backwards scan case).
Skip this if target is leftmost.
2. Obtain super-exclusive lock on target page; check it is still empty,
else abandon deletion.
3. Obtain W lock on right sibling (there must be one, else we can't delete).
4. Find and W-lock the target's current parent page. Check that target is
not rightmost child of a parent with other children, else abandon deletion.
5. Update all four pages at once, writing a single WAL record describing all
updates. (I believe we can extend WAL to allow 4 page references in a
single record; if not, it may be okay to cheat and not store all of the
adjacent pages. We certainly need not record the contents of the target page
itself, so 3 is enough.) Note the parent is marked half-dead here if this
was its last child. Release locks.
6. The update to the target page makes it empty and marked dead, but preserves
its sibling links. It can't actually be recycled until later (see above).
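Purely to illustrate the ordering, here is a toy version that just records the lock sequence taken by steps 1-4 (names invented; the real lock modes, moving-right logic, and rechecks are omitted):

```c
/* The order in which the four pages are write-locked, matching steps
 * 1-4 above; inserters also lock left-to-right within a level before
 * locking upward, which is what makes this order deadlock-free.  This
 * sketch merely records the sequence. */
enum { LEFT_SIB, TARGET, RIGHT_SIB, PARENT };

int
bt_delete_lock_order(int taken[], int has_left_sibling)
{
    int n = 0;

    if (has_left_sibling)
        taken[n++] = LEFT_SIB;  /* step 1: may require searching right */
    taken[n++] = TARGET;        /* step 2: super-exclusive; recheck empty */
    taken[n++] = RIGHT_SIB;     /* step 3: must exist, or no deletion */
    taken[n++] = PARENT;        /* step 4: recheck rightmost-child rule */
    return n;                   /* number of locks acquired */
}
```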

The deletion procedure could be triggered immediately upon removal of the
last item in a page, or when the next VACUUM scan finds an empty page.
Not sure yet which way is better.

Having to exclusive-lock four pages is annoying, but I think we must do
this procedure as a single atomic operation to ensure consistency.

Search/scan rules in presence of deletion
-----------------------------------------

Since page deletion takes a super-exclusive lock, a stopped scan will never
find the page it's on deleted when it resumes. What we do have to worry about
is arriving at a dead page after following a link. There are several cases:

During the initial search for a key, if we arrive at a dead or half-dead page, chain
right. (The key space must have moved to the right.)

During forward scan: likewise chain right. (Note: because scans operate
only on the leaf level, they should never see half-dead pages.)

During backwards scan: chain right until a live page is found, and step left
from there in the usual way. We cannot use the dead page's left-link because
its left neighbor may have split since the page was deleted. (There is
certain to be at least one live page to the right, since we never delete the
rightmost page of a level.)
Also, "step left in the usual way" used to mean this:
A backwards scan has one additional bit of complexity: after
following the left-link we must account for the possibility that the
left sibling page got split before we could read it. So, we have to
move right until we find a page whose right-link matches the page we
came from.
But if the "page we came from" is now dead, perhaps there is no
page with a matching right-link (if updater got to it before we did)!
Probably the best policy, if we fail to find a match, is to return to
the page we were previously on, verify that it's now dead (else error),
then chain right and left as in the case where we've linked to a dead page
(see above). This means a backwards scan must always remember the last
live page it was on. Again, this is greatly simplified by the fact that
there must be a live page somewhere to the right.
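The chain-right recovery step can be sketched as a toy simulation (hypothetical Page struct, no locking or splitting modeled):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Page
{
    bool dead;
    struct Page *left;
    struct Page *right;
} Page;

/* After landing on a dead page during a backwards scan, chain right to
 * the first live page (one must exist: the rightmost page of a level
 * is never deleted); the caller then resumes stepping left from there.
 * The dead page's left-link is deliberately not used, since the left
 * neighbor may have split since the page was deleted. */
Page *
bt_backward_scan_recover(Page *p)
{
    while (p->dead)
        p = p->right;
    return p;
}
```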

Insertion changes
-----------------

The basic insertion algorithm doesn't change (but see above notes about
the search part).

_bt_getstackbuf is a little trickier in the presence of deletion. The first
issue is that removal of an earlier downlink in the returned-to parent
page may cause the target item to move left by some number of slots
(though not into a previous page). We can deal with this by searching
left after we fail to find the item by searching right, before we move on
to the next page. Next is that the whole parent page may be dead or
half-dead --- but it can't be physically deleted, so we can recover by
following its rightlink. The nasty part is that the target item could
perhaps not be there; this would imply that someone deleted the page we
are trying to split, or hasn't fully finished inserting it. But L&Y
already require holding a lock on that page until we have re-located and
locked the parent item, so this is not possible. (Note this implies that
we must search for exactly the downlink to the page we are splitting, but
that's true anyway.)

If we need to split on a level that was the root when we descended, but
is no longer the root, then our stack doesn't tell us how to find the next
level up. As discussed above, handle this case by re-descending and
checking level fields, rather than depending on parent links as before.
We will simply descend on the left edge of the tree and scan the whole
parent level to find where to insert --- it's highly unlikely that the
parent level has yet grown large enough for this to be slow, so there's
not much value in trying to use a keyed search.

WAL issues
----------

A page split operation can't really be atomic. We can handle the actual split
("half-split") operation as an atomic update:

1. We have W lock on page that needs to be split.
2. Find a free page (from FSM, or make it by extending rel), W-lock it.
3. W-lock page's right sibling (if any), so that we can fix its left-link.
4. Update all three pages, make WAL log entry describing same, write pages.
5. We can now release the W-lock on the right sibling page, but we must keep
the W-locks on the two split pages until we have made the parent entry;
else it's conceivable for someone else to try to split or delete these
pages and get confused because the parent link isn't in place yet.
(See L&Y.)

(Note that since this is driven by an insert, the half-split WAL log entry
may as well include the new key insertion step; so this entry substitutes
for the plain insertion WAL entry we'd otherwise make.)

(So far this is the same as the existing code.)

But now we must find the parent page and enter the new key and downlink
(possibly causing a split at that level). If crash occurs before we can
do so, how to recover? The existing code has a lot of ad-hoc logic that
tries to reconstruct the tree on-the-fly when inconsistency is detected.
But it would be a lot better if we could expect WAL playback to fix things.

The WAL entry describing the half-split includes enough info to re-execute
the parent key insertion. While replaying log, we can remember this
insertion as pending. If we see a matching insertion event in the log,
discard the remembered pending insertion. At end of log, execute all
still-pending insertions. (There should not be very many pending
insertions at any one time, so the resources for this are not onerous.)
Note however that this implies making more log entries to describe these
additional updates; might complicate WAL, but I see no fundamental
difficulty.
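A toy model of that replay bookkeeping (invented names; a fixed-size array stands in for a real hash table, and pending insertions are keyed by the new child's block number purely for illustration):

```c
/* Each half-split record remembers a pending parent insertion; a later
 * matching insertion record cancels it, and whatever remains at end of
 * log must be executed then. */
#define MAX_PENDING 16

typedef struct ReplayState
{
    int pending[MAX_PENDING];
    int npending;
} ReplayState;

void
replay_half_split(ReplayState *s, int child_blkno)
{
    if (s->npending < MAX_PENDING)
        s->pending[s->npending++] = child_blkno;
}

void
replay_parent_insert(ReplayState *s, int child_blkno)
{
    int i;

    for (i = 0; i < s->npending; i++)
        if (s->pending[i] == child_blkno)
        {
            /* matching insertion seen: discard the pending entry */
            s->pending[i] = s->pending[--s->npending];
            return;
        }
}

/* Number of parent insertions still to be completed at end of log. */
int
replay_end_of_log(const ReplayState *s)
{
    return s->npending;
}
```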

Also, if we are splitting a root page, it seems best to merge creation of
the new root page and updating of the metapage into the same atomic action
that does the split. This won't make the WAL entry materially larger, and
it will avoid problems with concurrent root splits. Need to look at
deadlock issues for trying to update metapage --- is it okay to grab W
lock on meta while holding it on old root? (Sure, why not? Nothing will
go the other way.) We do need to update metapage as part of the atomic
root-split operation, else other updaters trying to find the new root may
fail because link not there yet.

VACUUM mechanics
----------------

We will institute an explicit "vacuuming an index" phase in VACUUM (this
is distinct from deleting index entries during vacuuming of a table, and
will be done after we've finished vacuuming the associated table). In
this phase, we can scan the index linearly (ie, in storage order not
index order) to take advantage of sequential I/O optimizations. We scan
looking for pages that are empty (if leaf pages) or half-dead (if parent
pages), as well as pages that are already dead and have been so long
enough to be recycled. The latter are simply reported to the FSM.
Empty and half-dead pages are deleted per the previously-described
mechanism. (It seems best to gather the page IDs of target pages in
memory, and do the deletions only after we've finished the seqscan, so
as not to confuse the kernel about whether it should read-ahead.)

One step that's not explicit above is how to find the parent of a page we
intend to delete. It seems most efficient to perform a search for the
page's high key (it must have one, since we don't delete rightmost pages)
stopping when we reach the level above the page itself. This will give
us a good pointer to the parent. We should do this before starting to
acquire exclusive locks for the deletion.

In theory, if we find recyclable page(s) at the physical end of the index,
we could truncate the file (ie, give the space back to the filesystem)
instead of reporting these pages to FSM. I am not sure if this is worth
doing --- in most cases it's likely that little space can be released this
way, and there may be some tricky locking issues.

Tree depth
----------

Because we never relabel pages' levels, the tree depth cannot be reduced (we'd
have to do that by removing the current root page, which seems impractical
without an exclusive lock on the whole index). So after massive shrinkage
we could end up with a "thin" tree in which there are levels below the root
with only one page apiece. The space wastage doesn't seem like a big issue,
but the extra time to traverse these useless levels could be annoying.

This could be ignored in first implementation (there's always REINDEX).
Later, possibly handle it via Lanin&Shasha's notion of a critic (think
VACUUM) that sets a fast pointer to the current effective root level.
(Actually I think we wouldn't need a separate critic process; split and
delete steps could be programmed to update the fast pointer for
themselves, in a separate atomic action, when they split a one-page level
or delete the next-to-last page of a level.)

Any comments before I plunge into implementing this?

regards, tom lane

#2 Justin Clift
justin@postgresql.org
In reply to: Tom Lane (#1)
Re: Brain dump: btree collapsing

Tom Lane wrote:
<snip>

The deletion procedure could be triggered immediately upon removal of the
last item in a page, or when the next VACUUM scan finds an empty page.
Not sure yet which way is better.

Having it triggered immediately upon removal of the last item in a page
would make for a more "self maintaining" system, wouldn't it? That
sounds nice. :)

<snip>

In theory, if we find recyclable page(s) at the physical end of the index,
we could truncate the file (ie, give the space back to the filesystem)
instead of reporting these pages to FSM. I am not sure if this is worth
doing --- in most cases it's likely that little space can be released this
way, and there may be some tricky locking issues.

Sounds like this would be beneficial for environments with high
update/delete transaction volumes, perhaps on smaller amounts of
live/valid data.

<snip>

This could be ignored in first implementation (there's always REINDEX).
Later, possibly handle it via Lanin&Shasha's notion of a critic (think
VACUUM) that sets a fast pointer to the current effective root level.
(Actually I think we wouldn't need a separate critic process; split and
delete steps could be programmed to update the fast pointer for
themselves, in a separate atomic action, when they split a one-page level
or delete the next-to-last page of a level.)

This really sounds like good initial thoughts too.

:-)

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Clift (#2)
Re: Brain dump: btree collapsing

Justin Clift <justin@postgresql.org> writes:

Tom Lane wrote:

The deletion procedure could be triggered immediately upon removal of the
last item in a page, or when the next VACUUM scan finds an empty page.
Not sure yet which way is better.

Having it triggered immediately upon removal of the last item in a page
would make for a more "self maintaining" system wouldn't it? That
sounds nice. :)

Maybe. This isn't about getting rid of VACUUM --- there's still a need
for routine maintenance vacuums. So the question really comes down to
whether it's more efficient to do it "in bulk" during routine
maintenance sweeps, or "retail". I'm not sold yet, but am leaning to
the bulk side.

In theory, if we find recyclable page(s) at the physical end of the index,
we could truncate the file (ie, give the space back to the filesystem)
instead of reporting these pages to FSM. I am not sure if this is worth
doing --- in most cases it's likely that little space can be released this
way, and there may be some tricky locking issues.

Sounds like this would be beneficial for environments with high
update/delete transaction volumes, perhaps on smaller amounts of
live/valid data.

It would only really be worth doing if you made a substantial reduction
in the number of rows in a table, and had no near-term intention of
loading the table back up again. Seems like it might be sufficient to
tell people to REINDEX if they want the index size to drop in that
scenario. I will look at physically truncating the index during VACUUM,
but I don't think it's worth getting real tense about...

regards, tom lane

#4 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#3)
Re: Brain dump: btree collapsing

I think having VACUUM record free index pages just like free heap pages
makes perfect sense, and is consistent.

This brings up one item it would be nice to address at the same time.
It would be nice if VACUUM FULL would be able to compress the actual
index file and return unused space to the operating system. REINDEX
does this, but I was thinking of something a little lighter that could
be done automatically as part of VACUUM FULL. If we can do that, it
would make consistent behavior for vacuum and heap/index files.

---------------------------------------------------------------------------

Tom Lane wrote:

Justin Clift <justin@postgresql.org> writes:

Tom Lane wrote:

The deletion procedure could be triggered immediately upon removal of the
last item in a page, or when the next VACUUM scan finds an empty page.
Not sure yet which way is better.

Having it triggered immediately upon removal of the last item in a page
would make for a more "self maintaining" system wouldn't it? That
sounds nice. :)

Maybe. This isn't about getting rid of VACUUM --- there's still a need
for routine maintenance vacuums. So the question really comes down to
whether it's more efficient to do it "in bulk" during routine
maintenance sweeps, or "retail". I'm not sold yet, but am leaning to
the bulk side.

In theory, if we find recyclable page(s) at the physical end of the index,
we could truncate the file (ie, give the space back to the filesystem)
instead of reporting these pages to FSM. I am not sure if this is worth
doing --- in most cases it's likely that little space can be released this
way, and there may be some tricky locking issues.

Sounds like this would be beneficial for environments with high
update/delete transaction volumes, perhaps on smaller amounts of
live/valid data.

It would only really be worth doing if you made a substantial reduction
in the number of rows in a table, and had no near-term intention of
loading the table back up again. Seems like it might be sufficient to
tell people to REINDEX if they want the index size to drop in that
scenario. I will look at physically truncating the index during VACUUM,
but I don't think it's worth getting real tense about...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#4)
Re: Brain dump: btree collapsing

Bruce Momjian <pgman@candle.pha.pa.us> writes:

It would be nice if VACUUM FULL would be able to compress the actual
index file and return unused space to the operating system. REINDEX
does this, but I was thinking of something a little lighter that could
be done automatically as part of VACUUM FULL.

But indexes tend to be very dependent on physical layout. You can't
just shove stuff around without thinking about the consequences.
Tables (heaps) are *much* more forgiving about that.

My feeling is that what we need to fix now is index bloat during normal
operation. If you want the indexes to actually *shrink*, that's a job
for REINDEX. Perhaps someday we can improve on that --- but let's not
blur our focus on the immediate problem.

regards, tom lane

#6 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#5)
Re: Brain dump: btree collapsing

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

It would be nice if VACUUM FULL would be able to compress the actual
index file and return unused space to the operating system. REINDEX
does this, but I was thinking of something a little lighter that could
be done automatically as part of VACUUM FULL.

But indexes tend to be very dependent on physical layout. You can't
just shove stuff around without thinking about the consequences.
Tables (heaps) are *much* more forgiving about that.

My feeling is that what we need to fix now is index bloat during normal
operation. If you want the indexes to actually *shrink*, that's a job
for REINDEX. Perhaps someday we can improve on that --- but let's not
blur our focus on the immediate problem.

My point is only that while we need VACUUM and VACUUM FULL to match all
heap needs, we need a VACUUM FULL capability for indexes too. REINDEX
may be that capability, but it would be nice if we could compress out
some index space during VACUUM FULL.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#7 Daniel Kalchev
daniel@digsys.bg
In reply to: Tom Lane (#1)
Re: Brain dump: btree collapsing

Tom,

Sounds excellent. Index growth has been something that always bothered me (not
the disk space usage, but the slow searches).

I believe it's best to have pages marked dead at the time the last key
contained in the page is deleted (you didn't discuss how efficient this is),
because this will somehow improve the tree depth. The same functionality
should be available in VACUUM (just in case). Thus we should 'free' the index
pages with one VACUUM run, instead of two.

In the spirit of my ramblings about automatic statistics/suggestions by
PostgreSQL for optimizations, could you also implement a NOTICE when the index
becomes too 'thin'? I believe this will help avoid severe performance
degradation if the process of removing the dead tuples becomes automatic.

It also occurs to me, that if such statistics are available, PostgreSQL might
run VACUUM automatically, on specific tables/indexes - all this controlled by
a GUC variable.

Daniel

#8 Daniel Kalchev
daniel@digsys.bg
In reply to: Justin Clift (#2)
Re: Brain dump: btree collapsing

Justin Clift said:

<snip>

In theory, if we find recyclable page(s) at the physical end of the index,
we could truncate the file (ie, give the space back to the filesystem)
instead of reporting these pages to FSM. I am not sure if this is worth
doing --- in most cases it's likely that little space can be released this
way, and there may be some tricky locking issues.

Sounds like this would be beneficial for environments with high
update/delete transaction volumes, perhaps on smaller amounts of
live/valid data.

But if dead pages are removed (returned to the FSM?) as soon as the last item is
removed from the page, page usage will remain small. Or perhaps not?

That is, we're unlikely to collect a large number of dead/free pages at the end
of the physical storage, except when doing all this in a single VACUUM session.

Daniel

#9 Daniel Kalchev
daniel@digsys.bg
In reply to: Bruce Momjian (#4)
Re: Brain dump: btree collapsing

Bruce Momjian said:

This brings up one item it would be nice to address at the same time.
It would be nice if VACUUM FULL would be able to compress the actual
index file and return unused space to the operating system. REINDEX
does this, but I was thinking of something a little lighter that could
be done automatically as part of VACUUM FULL. If we can do that, it
would make consistent behavior for vacuum and heap/index files.

Since lazy VACUUM exists, I always wondered why VACUUM FULL doesn't run
REINDEX on a table with a significant number of deleted tuples. VACUUM knows
those numbers - I always run REINDEX on larger tables that had a huge number of
index entries deleted during VACUUM...

Daniel

#10 Curtis Faith
curtis@galtcapital.com
In reply to: Daniel Kalchev (#9)
Re: Brain dump: btree collapsing

tom lane initially wrote:

Restructuring the tree during page deletion
-------------------------------------------

We will delete only completely-empty pages. If we were to
merge nearly-empty pages by moving data items from one page
to an adjacent one, this would imply changing the parent's
idea of the bounding key between them --- which is okay if we
are just deleting an internal key in the parent, but what if
the pages have different parent pages?

and a bit later wrote:

My feeling is that what we need to fix now is index bloat during
normal operation.

How about doing deletion of partial pages with reorganization among
sibling pages only (where the parent pages are the same)? This avoids
the "messiness" of propagating the deletes to differing parent pages but
gets most of the value of reorganization.

ISTM, that a VACUUM that only reclaims empty pages will be helpful in
certain cases but won't help much at all in many other common "normal
operation" cases which would be helped by partial reorganization.

- Curtis

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Kalchev (#7)
Re: Brain dump: btree collapsing

Daniel Kalchev <daniel@digsys.bg> writes:

... Thus we should 'free' the index
pages with one VACUUM run, instead of two.

Bear in mind that the only thing that deletes index entries is ... VACUUM.
Thus what we're really discussing is whether VACUUM should delete an
index page at the same time it deletes the last entry thereon, or in a
separate pass over the index. This is a minor implementation detail,
not something the user will really notice in terms of having to run
VACUUM multiple times to get the same effect.

Also, it will definitely take two VACUUM runs (minimum) before a
formerly-live page can enter FSM. Once the page is marked dead (in one
run) we still need to wait for the drain interval before we can give it
to FSM (in a later run). This is more or less analogous to the fact
that VACUUM can't remove recently-deleted heap tuples, even though they
are committed dead. You have to wait till there's no one left that
might want to access that tuple (or index page).

regards, tom lane

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Curtis Faith (#10)
Re: Brain dump: btree collapsing

"Curtis Faith" <curtis@galtcapital.com> writes:

ISTM, that a VACUUM that only reclaims empty pages will be helpful in
certain cases but won't help much at all in many other common "normal
operation" cases which would be helped by partial reorganization.

Got any evidence to back that up? I was relying on

[Johnson89] Johnson, T. and Shasha, D. Utilization of B-trees with
Inserts, Deletes and Modifies ACM Symp. on PODS, 235-246, 1989.

which provides a ton of math and simulations leading up to the
conclusion that collapsing btree pages before they are fully empty
doesn't really gain anything meaningful in terms of storage utilization,
in most scenarios.

regards, tom lane

#13 Curtis Faith
curtis@galtcapital.com
In reply to: Tom Lane (#12)
Re: Brain dump: btree collapsing

tom lane wrote:

Got any evidence to back that up? I was relying on

[Johnson89] Johnson, T. and Shasha, D., "Utilization of
B-trees with Inserts, Deletes and Modifies," ACM Symp. on
PODS, 235-246, 1989.

which provides a ton of math and simulations leading up to
the conclusion that collapsing btree pages before they are
fully empty doesn't really gain anything meaningful in terms
of storage utilization, in most scenarios.

Well, I don't have that specific paper handy but I do have [JS93]
Theodore Johnson, Dennis Shasha, "B-trees with inserts and deletes: why
free-at-empty is better than merge-at-half", which appears to be their
later thinking on the same subject.

Note the following:

"A merge-at-half B-tree will always have a space utilization of at least
50%. When all operations are modify operations, or when the number of
insert operations is the same as the number of delete operations, then
the utilization will be about 60%. In contrast, a free-at-empty B-tree
has a 0% lower bound on its space utilization, and will have about 39%
utilization on a pure-modify instruction mix. However, the space
utilization of a free-at-empty B-tree remains high if there are just a
few more insert operations than delete operations. Thus, merge-at-half
usually buys little in terms of space utilization.

In Figure 6, we showed that the restructuring rate of a merge-at-half
B-tree is significantly larger than the restructuring rate of a
free-at-empty B-tree for all values of q * :1. For many concurrent
B-tree algorithms used in practice [4, 13], restructuring causes a
serialization bottleneck. Thus, one simple but important way to increase
concurrency in B-tree operations is to reduce the probability of
restructuring. Since merge-at-half buys little space utilization but is
expensive in terms of restructuring, we recommend that B-trees
(especially concurrently accessed ones) use free-at-empty."
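The two policies contrasted in that passage reduce to a one-line difference in when a page becomes a candidate for restructuring. A minimal sketch (the `Page` fields here are hypothetical, not the actual btree page layout):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative contrast between the two deletion policies discussed in
 * [JS93].  'nentries' and 'capacity' are hypothetical fields. */
typedef struct
{
    int nentries;               /* live entries currently on the page */
    int capacity;               /* entries the page can hold */
} Page;

/* free-at-empty: a page is removed only once its last entry is gone */
bool
should_remove_free_at_empty(const Page *p)
{
    return p->nentries == 0;
}

/* merge-at-half: a page is merged with a sibling as soon as it drops
 * below half full -- better worst-case utilization, but it restructures
 * the tree far more often, which is what costs concurrency */
bool
should_merge_at_half(const Page *p)
{
    return p->nentries < p->capacity / 2;
}
```

The concurrency argument in the quote follows directly: every `true` from `should_merge_at_half` is a restructuring event that free-at-empty would simply never trigger.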

I don't dispute their conclusions in that context and under the
circumstances they outline of random distribution of deletion and
insertion values for the index keys.

However, as [Jannink95] ("Implementing Deletion in B+-trees," SIGMOD
RECORD, v.24, n.1, p.33-38, 1995) points out, that assumption doesn't
hold under other possibly common circumstances, specifically
circumstances where the deletes are taking place in significant sections
of the index at a much higher rate than inserts.

Consider from [Jannink95]:

"There has been some research on the acceptability of relaxing the
constraint of minimum node size to reduce the number of so-called unsafe
tree operations, i.e., those which contain node splitting and merging
[ZH89].

The debate has culminated in analysis of a weaker form of the deletion
algorithm which we call lazy deletion, that imposes no constraints on
the number of entries left in the nodes, allowing them to empty
completely before simply removing them. According to [GR93], most
database system implementations of B+-trees have adopted this approach.
Its most effective use is when it is vital to allow concurrent access to
the tree [JS93b], and excessive splitting and merging of nodes would
restrict concurrency. [JS89] derives some analytic solutions calculating
memory utilization for B+-trees under a mix of insertions and lazy
deletions, based on previous research which considered insertions only
[BY89]. The simulations in [JS89] support its analysis to show that in
typical situations, where deletions don't outnumber insertions in the
mix of operations, the tree nodes will contain acceptable percentages of

entries.

One of the work's assumptions [JS93a] is that the keys and tree
operations are chosen uniformly from a random distribution. This
assumption is unreasonable in certain realistic situations such as one
described below. Allowing interior nodes with only a single pointer to
exist in a B+-tree creates the possibility for arbitrarily unbalanced
trees of any height, which are virtually empty, and in which access
times have degenerated from the logarithmic bound B+-trees are meant to
guarantee to a worst case unbounded access time. Since nodes are not
removed until they are completely empty, the lazy deletion algorithm
does not regulate tree height effectively."

Jannink then illustrates an example where an index is created based on a
timestamp, where the basic assumption of Johnson and Shasha does not hold
since the key is not a random distribution but a monotonically increasing
value. His example is an extreme one but I believe there are many
instances where a timestamp, sequence or some other monotonically
increasing value is used in an index and where deletes are taking place
much more frequently for largely older values.

Since sequences are often used as foreign keys a significant number of
indexes fit into this category.

Consider a web site that tracked users and that deleted inactive
accounts. There are many real-world scenarios where the number of
inactive accounts is very high as a percentage of all accounts. Consider
an example where 50% of the accounts are inactive and deleted after 90
days of disuse and 90% are inactive after 180 days.

If the Account is implemented with an ID based on a sequence, any tables
that referenced the inactive account's sequence based ID via foreign
keys would have index entries that would be sparsely populated for the
older entries but might not have a high percentage of completely empty
index pages.

The older, but still active accounts would keep the index pages from
being completely empty in many cases.
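A toy model makes the point concrete. Suppose keys come from a sequence, so each leaf page covers a contiguous key range; if only some fraction of the old accounts in a range survive, the page stays partly full forever, and free-at-empty never reclaims it. (All names and numbers here are illustrative assumptions, not measurements.)

```c
#include <assert.h>

/* Toy model of the scenario above: sequence-assigned keys mean each old
 * leaf page holds a contiguous key range.  Even if most accounts in the
 * range are deleted, the few surviving active accounts keep the page
 * from ever emptying completely, so free-at-empty cannot reclaim it. */
#define KEYS_PER_PAGE 100

/* Returns how many entries survive on an old page if only every
 * 'keep_every'-th account in its key range is still active. */
int
surviving_entries(int keep_every)
{
    int live = 0;

    for (int key = 0; key < KEYS_PER_PAGE; key++)
        if (key % keep_every == 0)      /* this account is still active */
            live++;
    return live;
}
```

With 90% of old accounts deleted the page still holds ten entries: far below merge-at-half's threshold, yet never eligible for free-at-empty removal.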

Further, Johnson and Shasha make the claim that "free-at-empty is better"
in the specific context of restructuring causing a serialization
bottleneck. I don't think this applies to the specific case I
recommended where the parent is the same for both leaves, especially
during a VACUUM process where presumably we can optimize for concurrency
rather than the speed of VACUUM itself.

- Curtis

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Curtis Faith (#13)
Re: Brain dump: btree collapsing

"Curtis Faith" <curtis@galtcapital.com> writes:

I don't dispute their conclusions in that context and under the
circumstances they outline of random distribution of deletion and
insertion values for the index keys. [But the random-distribution
assumption doesn't always hold.]

That's a fair point. Nonetheless, we have little choice: we cannot
move keys around during concurrent operations. If we push keys to
the right, we may cause an indexscan moving left to miss them, and
vice versa. So we can only eliminate empty pages.

We could possibly allow VACUUM FULL to collapse partly-full pages,
since in that case we know there are no concurrent scans.

regards, tom lane

#15Curtis Faith
curtis@galtcapital.com
In reply to: Tom Lane (#14)
Re: Brain dump: btree collapsing

tom lane wrote:

"Curtis Faith" <curtis@galtcapital.com> writes:

I don't dispute their conclusions in that context and under the
circumstances they outline of random distribution of deletion and
insertion values for the index keys. [But the random-distribution
assumption doesn't always hold.]

That's a fair point. Nonetheless, we have little choice: we
cannot move keys around during concurrent operations. If we
push keys to the right, we may cause an indexscan moving left
to miss them, and vice versa. So we can only eliminate empty pages.

We could possibly allow VACUUM FULL to collapse partly-full
pages, since in that case we know there are no concurrent scans.

Couldn't we do an exclusive lock call on both leaf pages and the parent
using a new LockBuffer-type function, named something like
LockBufferNoWait, that uses LWLockConditionalAcquire instead of
LWLockAcquire? In the event that all three exclusive locks cannot be
obtained, release all three locks, sleep, and try again, for up to N
retries.

(Actually, this would probably be four locks, since the sibling pointer
of one of the siblings would have to be updated to point to the new
merged page instead of the to-be-deleted page.)

Having exclusive locks on all three pages prior to a merge would ensure
that no scans were interrupted during that merge.

Releasing all the exclusive locks in the event of failure to obtain any
of the locks will eliminate the possibility of creating deadlocks.

After N retries, VACUUM could abort, leaving the merge to be picked up
in a future VACUUM run.

I would think that this would be fairly easy to implement and that
except for very heavily scanned indexes, most of the time the locks
could be acquired and the merge would take place without causing any
concurrency problems.
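The all-or-nothing acquisition loop described here is straightforward to sketch. The following is only an illustration of the retry protocol, with a bool standing in for a real lightweight lock (LWLockConditionalAcquire exists in PostgreSQL, but LockBufferNoWait and these helpers are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

#define MERGE_RETRIES 3

/* Stand-in for an LWLock: true means "held".  Real code would use
 * LWLockConditionalAcquire / LWLockRelease on buffer content locks. */
typedef bool LWLock;

static bool
cond_acquire(LWLock *lk)
{
    if (*lk)
        return false;           /* someone else holds it: don't wait */
    *lk = true;
    return true;
}

static void
release(LWLock *lk)
{
    *lk = false;
}

/* All-or-nothing acquisition: conditionally take every page lock needed
 * for the merge; on any failure release everything we hold, back off,
 * and retry.  Never blocking while holding a partial set is what makes
 * deadlock impossible. */
bool
try_merge_locks(LWLock *locks[], int nlocks)
{
    for (int attempt = 0; attempt < MERGE_RETRIES; attempt++)
    {
        int got = 0;

        while (got < nlocks && cond_acquire(locks[got]))
            got++;
        if (got == nlocks)
            return true;        /* all locks held: safe to merge */
        while (got-- > 0)       /* back out and try again later */
            release(locks[got]);
        /* real code would sleep here before the next attempt */
    }
    return false;               /* give up; a later VACUUM retries */
}
```

On success the caller performs the merge and releases all four locks; on failure nothing is held, so the worst outcome is a merge postponed to a later VACUUM.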

- Curtis

#16Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#14)
Re: Brain dump: btree collapsing

Tom Lane wrote on Thu, 13.02.2003 at 20:10:

"Curtis Faith" <curtis@galtcapital.com> writes:

I don't dispute their conclusions in that context and under the
circumstances they outline of random distribution of deletion and
insertion values for the index keys. [But the random-distribution
assumption doesn't always hold.]

That's a fair point. Nonetheless, we have little choice: we cannot
move keys around during concurrent operations. If we push keys to
the right, we may cause an indexscan moving left to miss them, and
vice versa. So we can only eliminate empty pages.

But if we would allow the scans to find the same keys twice without ill
effects (as was suggested earlier, for using btrees to index arrays),
then we could possibly 1) copy the key to the right 2) wait for all
right-to-left scans that have fallen between old and new values to pass
and only then 3) delete the "old left" key.

This could solve the concurrency issue as well.


We could possibly allow VACUUM FULL to collapse partly-full pages,
since in that case we know there are no concurrent scans.

regards, tom lane


#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#16)
Re: Brain dump: btree collapsing

Hannu Krosing <hannu@tm.ee> writes:

But if we would allow the scans to find the same keys twice without ill
effects (as was suggested earlier, for using btrees to index arrays),

How is returning the same data twice not an "ill effect"?

then we could possibly 1) copy the key to the right 2) wait for all
right-to-left scans that have fallen between old and new values to pass
and only then 3) delete the "old left" key.

How will you wait for scans that you know nothing of to go past?
Especially when they are going to be blocked by your own write lock
on the left page?

regards, tom lane

#18Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#17)
Re: Brain dump: btree collapsing

Tom Lane wrote:

Hannu Krosing <hannu@tm.ee> writes:

But if we would allow the scans to find the same keys twice without ill
effects (as was suggested earlier, for using btrees to index arrays),

How is returning the same data twice not an "ill effect"?

then we could possibly 1) copy the key to the right 2) wait for all
right-to-left scans that have fallen between old and new values to pass
and only then 3) delete the "old left" key.

How will you wait for scans that you know nothing of to go past?
Especially when they are going to be blocked by your own write lock
on the left page?

I think we should skip any pages where we can't get a lock.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#19Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#17)
Re: Brain dump: btree collapsing

Tom Lane wrote on Fri, 14.02.2003 at 01:13:

Hannu Krosing <hannu@tm.ee> writes:

But if we would allow the scans to find the same keys twice without ill
effects (as was suggested earlier, for using btrees to index arrays),

How is returning the same data twice not an "ill effect"?

From earlier discussions I understood that there had been some work done
on using btrees for indexing arrays by storing each separate element in
a leaf node. Surely that work must deal with not returning the same
tuple twice.

then we could possibly 1) copy the key to the right 2) wait for all
right-to-left scans that have fallen between old and new values to pass
and only then 3) delete the "old left" key.

How will you wait for scans that you know nothing of to go past?
Especially when they are going to be blocked by your own write lock
on the left page?

Could we just not lock the page (beyond what is needed to ensure atomic
writes), but instead increment a "page version" on each write to detect
changes?
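The version-counter idea amounts to optimistic reading: record the version, scan, and re-check afterwards, retrying on a mismatch instead of blocking writers. A minimal single-threaded sketch with hypothetical names (a real implementation would also need memory barriers and torn-read protection):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page with a version counter bumped on every write. */
typedef struct
{
    unsigned version;
    int      payload;           /* stands in for the page contents */
} VersionedPage;

void
page_write(VersionedPage *p, int value)
{
    p->payload = value;
    p->version++;               /* every write bumps the version */
}

/* Optimistic read: succeeds, storing the value in *out, only if the
 * page was not modified while we were reading it; otherwise the caller
 * should simply retry. */
bool
page_read_consistent(const VersionedPage *p, int *out)
{
    unsigned v = p->version;
    int value = p->payload;     /* "scan" the page */

    if (p->version != v)
        return false;           /* concurrent write detected: retry */
    *out = value;
    return true;
}
```

Readers never block writers under this scheme; the cost is that a reader racing a writer must redo its scan, which is the concurrency question Tom raises below.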

------------
Hannu

#20Michael Paesold
mpaesold@gmx.at
In reply to: Curtis Faith (#13)
Re: Brain dump: btree collapsing

Hannu Krosing <hannu@tm.ee> wrote:

Could we just not lock the page (beyond what is needed to ensure atomic
writes), but instead increment a "page version" on each write to detect
changes?

Sounds like Index MVCC..., very nice. ;-)
(Of course I have no clue about feasibility, just liked the idea)

Best Regards,
Michael Paesold

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#19)
#22Curtis Faith
curtis@galtcapital.com
In reply to: Tom Lane (#21)
#23Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#1)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Curtis Faith (#22)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#23)
#26Curtis Faith
curtis@galtcapital.com
In reply to: Tom Lane (#21)
#27Curtis Faith
curtis@galtcapital.com
In reply to: Tom Lane (#24)
#28Curtis Faith
curtis@galtcapital.com
In reply to: Curtis Faith (#26)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Curtis Faith (#26)
#30Curtis Faith
curtis@galtcapital.com
In reply to: Tom Lane (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Curtis Faith (#30)
#32Curtis Faith
curtis@galtcapital.com
In reply to: Tom Lane (#31)
#33Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#1)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#33)
#35Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#1)
#36Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#1)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#35)
#38Mark Woodward
pgsql@mohawksoft.com
In reply to: Tom Lane (#1)
#39Oleg Bartunov
oleg@sai.msu.su
In reply to: Tom Lane (#37)
#40Itai Zukerman
zukerman@math-hat.com
In reply to: Tom Lane (#37)
#41scott.marlowe
scott.marlowe@ihs.com
In reply to: Christopher Kings-Lynne (#36)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: scott.marlowe (#41)
#43Bruce Momjian
bruce@momjian.us
In reply to: Christopher Kings-Lynne (#36)
#44Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Bruce Momjian (#43)