Block B-Tree concept

Started by Heikki Linnakangasover 19 years ago39 messageshackers
Jump to latest
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

I've been experimenting with the idea of a so-called Block B-Tree. The
basic idea is that instead of storing an index tuple for each heap
tuple, we store an index tuple for each heap block. This dramatically
reduces the size of an index, leading to savings on I/O. This idea was
briefly discussed in January:
http://archives.postgresql.org/pgsql-hackers/2006-01/msg00565.php

To make it actually work, the semantics of the B-Tree has been modified
so that every index tuple represents 1 or more heap tuples that fall
within some range of values, and are on the same heap page. The range
that an index tuple represents is from X, inclusive, to Y, exclusive,
where X is the key of the index tuple and Y is the key of the *next*
index tuple in the index. If the heap is in index order (as after
CLUSTER), we get a very compact index this way, effectively eliminating
the leaf level of the B-tree.

To locate the actual matching items on the heap page, we have to scan
the heap page because we don't have the item ids, so this is a tradeoff
between CPU and I/O. However, we could have a hybrid where we initially
store heap tuple pointers like we do now, and when there's more than X
consecutive pointers to the same page, we collapse them to just one
pointer to the whole page. This would potentially give us the best of
both worlds.

This design is more flexible and less invasive than true
index-organized-tables, because it doesn't require perfect ordering of
the heap or moving heap tuples around. You can also define than one
Block B-Tree on a table, though you wouldn't get much benefit from using
Block B-Tree instead of normal B-Tree if there's no correlation between
the index order and the heap order.

It also fits in nicely with the bitmap scans, since there's already
support for "lossy" bitmap pages. Doing normal ordered index scans
requires some coding, but can be done.

Thoughts? I'm thinking of getting this in early in the 8.3 cycle. We'll
have to take a look at the indexam API to support both bitmap indexes
and block B-trees nicely.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Block B-Tree concept

Heikki Linnakangas <heikki@enterprisedb.com> writes:

I've been experimenting with the idea of a so-called Block B-Tree. The
basic idea is that instead of storing an index tuple for each heap
tuple, we store an index tuple for each heap block. This dramatically
reduces the size of an index, leading to savings on I/O.

VACUUM?

regards, tom lane

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#2)
Re: Block B-Tree concept

Tom Lane wrote:

Heikki Linnakangas <heikki@enterprisedb.com> writes:

I've been experimenting with the idea of a so-called Block B-Tree. The
basic idea is that instead of storing an index tuple for each heap
tuple, we store an index tuple for each heap block. This dramatically
reduces the size of an index, leading to savings on I/O.

VACUUM?

There's a few options that I've thought of this far:

1. Whenever a tuple is found dead on page X, vacuum of the index will
have to go to that page again to see if there's any matching tuples left.

2. Have a reference counter on index tuple that's increased on insert
and decreased by vacuum.

3. Do nothing. Let index scans mark the index tuple as dead when it's
convenient. There's no correctness problem with just leaving dead index
tuples there, because you have to check the index quals on each heap
tuple anyway when you scan.

3. probably isn't an option in itself, but we might want to do some kind
of a mixture of 1 and 3.

I'm thinking that Block B-Tree is not a candidate for non-MVCC system
catalogs, but I think that's OK.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#3)
Re: Block B-Tree concept

Heikki Linnakangas wrote:

Tom Lane wrote:

Heikki Linnakangas <heikki@enterprisedb.com> writes:

I've been experimenting with the idea of a so-called Block B-Tree. The
basic idea is that instead of storing an index tuple for each heap
tuple, we store an index tuple for each heap block. This dramatically
reduces the size of an index, leading to savings on I/O.

VACUUM?

There's a few options that I've thought of this far:

1. Whenever a tuple is found dead on page X, vacuum of the index will
have to go to that page again to see if there's any matching tuples left.

Right now, if an index entry points to a dead tuple, we set a bit in
the index so future lookups do not access the heap. We could set a bit
for block index entries that point to a page that has no live rows, and
have vacuum remove the index entry later.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#5Teodor Sigaev
teodor@sigaev.ru
In reply to: Bruce Momjian (#4)
Re: Block B-Tree concept

Right now, if an index entry points to a dead tuple, we set a bit in
the index so future lookups do not access the heap. We could set a bit
for block index entries that point to a page that has no live rows, and
have vacuum remove the index entry later.

GIN don't support this feature...
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#6Bruce Momjian
bruce@momjian.us
In reply to: Teodor Sigaev (#5)
Re: Block B-Tree concept

Teodor Sigaev wrote:

Right now, if an index entry points to a dead tuple, we set a bit in
the index so future lookups do not access the heap. We could set a bit
for block index entries that point to a page that has no live rows, and
have vacuum remove the index entry later.

GIN don't support this feature...

I think block indexes would only be for btrees.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Teodor Sigaev (#5)
Re: Block B-Tree concept

Teodor Sigaev wrote:

Right now, if an index entry points to a dead tuple, we set a bit in
the index so future lookups do not access the heap. We could set a
bit for block index entries that point to a page that has no live
rows, and
have vacuum remove the index entry later.

GIN don't support this feature...

I'm only talking about B-trees at this stage. ISTM that you could do the
same thing with hash indexes, but I haven't given it much thought.

Anyway, I think you'd usually want to use bitmap scans with a Block
B-tree, unless you need sorted output. And bitmap scans don't set the
LP_DELETE flag either. We might want to do something about that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#3)
Re: Block B-Tree concept

Heikki Linnakangas <heikki@enterprisedb.com> writes:

Tom Lane wrote:

VACUUM?

There's a few options that I've thought of this far:

1. Whenever a tuple is found dead on page X, vacuum of the index will
have to go to that page again to see if there's any matching tuples left.

Anything that involves having VACUUM re-evaluate index expressions is a
nonstarter ... or have you already forgotten the optimizations we put
into 8.2 that assume, eg, no sub-transactions within a VACUUM?

2. Have a reference counter on index tuple that's increased on insert
and decreased by vacuum.

The "increase on insert" part I understand, the "decrease by vacuum"
part seems to have the same problem as #1. How do you tell which index
entries should be changed?

3. Do nothing. Let index scans mark the index tuple as dead when it's
convenient. There's no correctness problem with just leaving dead index
tuples there, because you have to check the index quals on each heap
tuple anyway when you scan.

And we're back to routine REINDEX I guess :-(. This doesn't seem like a
satisfactory answer.

regards, tom lane

#9Csaba Nagy
nagy@ecircle-ag.com
In reply to: Tom Lane (#8)
Re: Block B-Tree concept

And we're back to routine REINDEX I guess :-(. This doesn't seem like a
satisfactory answer.

If the reindex works online, it could be a satisfactory solution.

Cheers,
Csaba.

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#8)
Re: Block B-Tree concept

Tom Lane wrote:

Anything that involves having VACUUM re-evaluate index expressions is a
nonstarter ... or have you already forgotten the optimizations we put
into 8.2 that assume, eg, no sub-transactions within a VACUUM?

Umm, I'm afraid I have. Could you give me a clue?

3. Do nothing. Let index scans mark the index tuple as dead when it's
convenient. There's no correctness problem with just leaving dead index
tuples there, because you have to check the index quals on each heap
tuple anyway when you scan.

And we're back to routine REINDEX I guess :-(. This doesn't seem like a
satisfactory answer.

In general, it isn't.

Though there are interesting use cases where it would be fine. For
example, if you remove old data by dropping a partition, there's never
really need to vacuum. Or if all of the data is accessed during normal
operation, the index scans set the LP_DELETE flags and no additional
vacuum is really needed.

Also, now that we have concurrent CREATE INDEX, we could implement
concurrent REINDEX as well, I believe.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#10)
Re: Block B-Tree concept

Heikki Linnakangas wrote:

Tom Lane wrote:

Anything that involves having VACUUM re-evaluate index expressions is a
nonstarter ... or have you already forgotten the optimizations we put
into 8.2 that assume, eg, no sub-transactions within a VACUUM?

Umm, I'm afraid I have. Could you give me a clue?

The patch by Hannu Krossing that allows a lazy vacuum to ignore other
lazy vacuums, when computing the OldestXmin.

The point is you can only ignore other lazy vacuums if they are not
going to visit the heap. Otherwise you'd risk removing a tuple that the
other vacuum wants to see.

You could here shout that surely no two vacuums could be trying to clean
the same table at the same time, but it turns out that you can have
functional indexes that scan other tables. Sure, it's a bad idea, but ...

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#10)
Re: Block B-Tree concept

Heikki Linnakangas wrote:

Tom Lane wrote:

Anything that involves having VACUUM re-evaluate index expressions is a
nonstarter ... or have you already forgotten the optimizations we put
into 8.2 that assume, eg, no sub-transactions within a VACUUM?

Umm, I'm afraid I have. Could you give me a clue?

I think I found it. Is this what you're talking about (in
commands/vacuum.c):

/*
* During a lazy VACUUM we do not run any user-supplied functions,
* and so it should be safe to not create a transaction snapshot.
*
* We can furthermore set the inVacuum flag, which lets other
* concurrent VACUUMs know that they can ignore this one while
* determining their OldestXmin. (The reason we don't set inVacuum
* during a full VACUUM is exactly that we may have to run user-
* defined functions for functional indexes, and we want to make
* sure that if they use the snapshot set above, any tuples it
* requires can't get removed from other tables. An index function
* that depends on the contents of other tables is arguably broken,
* but we won't break it here by violating transaction semantics.)
*
* Note: the inVacuum flag remains set until CommitTransaction or
* AbortTransaction. We don't want to clear it until we reset
* MyProc->xid/xmin, else OldestXmin might appear to go backwards,
* which is probably Not Good.
*/
MyProc->inVacuum = true;

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#12)
Re: Block B-Tree concept

Heikki Linnakangas <heikki@enterprisedb.com> writes:

Heikki Linnakangas wrote:

Tom Lane wrote:

Anything that involves having VACUUM re-evaluate index expressions is a
nonstarter ... or have you already forgotten the optimizations we put
into 8.2 that assume, eg, no sub-transactions within a VACUUM?

I think I found it. Is this what you're talking about (in
commands/vacuum.c):

That's part of it, but I seem to recall other things too --- in
particular, the point about subtransactions troubles me. Whatever you
think about an index function looking at other tables, it is perfectly
legitimate to have an exception block in an index function, and that
requires subtransactions (at least as plpgsql is now implemented).

regards, tom lane

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#10)
Re: Block B-Tree concept

Heikki Linnakangas <heikki@enterprisedb.com> writes:

Also, now that we have concurrent CREATE INDEX, we could implement
concurrent REINDEX as well, I believe.

That's probably more easily said than done --- in particular, I don't
understand what the committed state after the first transaction would
look like. CREATE INDEX can get away with it because nothing need be
depending on the new index, but you can't say that for an existing index
(esp. if it's UNIQUE).

regards, tom lane

#15Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Heikki Linnakangas (#1)
Re: Block B-Tree concept

On Tue, Sep 26, 2006 at 11:16:54AM +0100, Heikki Linnakangas wrote:

To locate the actual matching items on the heap page, we have to scan
the heap page because we don't have the item ids, so this is a tradeoff
between CPU and I/O. However, we could have a hybrid where we initially
store heap tuple pointers like we do now, and when there's more than X
consecutive pointers to the same page, we collapse them to just one
pointer to the whole page. This would potentially give us the best of
both worlds.

This design is more flexible and less invasive than true
index-organized-tables, because it doesn't require perfect ordering of
the heap or moving heap tuples around. You can also define than one
Block B-Tree on a table, though you wouldn't get much benefit from using
Block B-Tree instead of normal B-Tree if there's no correlation between
the index order and the heap order.

No, but I think there's scenarios where you may not have extremely high
correlation but you'd still benefit, especially with the hybrid
approach. If you have a field with rather skewed histogram, for example,
where you're likely to have a lot of tuples with one value on any given
page. Of course, you probably would want to exclude that value from the
index entirely if it's on *every* page, but anything where you see
paterns of data wouldn't be like that.
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#16Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Heikki Linnakangas (#12)
Re: Block B-Tree concept

On Tue, Sep 26, 2006 at 05:27:56PM +0100, Heikki Linnakangas wrote:

Heikki Linnakangas wrote:

Tom Lane wrote:

Anything that involves having VACUUM re-evaluate index expressions is a
nonstarter ... or have you already forgotten the optimizations we put
into 8.2 that assume, eg, no sub-transactions within a VACUUM?

Umm, I'm afraid I have. Could you give me a clue?

I think I found it. Is this what you're talking about (in
commands/vacuum.c):

/*
* During a lazy VACUUM we do not run any user-supplied functions,
* and so it should be safe to not create a transaction snapshot.
*
* We can furthermore set the inVacuum flag, which lets other
* concurrent VACUUMs know that they can ignore this one while
* determining their OldestXmin. (The reason we don't set inVacuum
* during a full VACUUM is exactly that we may have to run user-
* defined functions for functional indexes, and we want to make
* sure that if they use the snapshot set above, any tuples it
* requires can't get removed from other tables. An index function
* that depends on the contents of other tables is arguably broken,
* but we won't break it here by violating transaction semantics.)
*
* Note: the inVacuum flag remains set until CommitTransaction or
* AbortTransaction. We don't want to clear it until we reset
* MyProc->xid/xmin, else OldestXmin might appear to go backwards,
* which is probably Not Good.
*/
MyProc->inVacuum = true;

Do I understand that to mean that you can no longer lazy vacuum a
functional index?
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#17Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#8)
Re: Block B-Tree concept

On Tue, Sep 26, 2006 at 08:51:10AM -0400, Tom Lane wrote:

3. Do nothing. Let index scans mark the index tuple as dead when it's
convenient. There's no correctness problem with just leaving dead index
tuples there, because you have to check the index quals on each heap
tuple anyway when you scan.

And we're back to routine REINDEX I guess :-(. This doesn't seem like a
satisfactory answer.

Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do
that anyway right now?

Granted, you'd want to periodically ensure that you scan the entire
index, but that shouldn't be horribly hard to set up.
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jim Nasby (#16)
Re: Block B-Tree concept

Jim C. Nasby wrote:

Do I understand that to mean that you can no longer lazy vacuum a
functional index?

With the normal B-trees we have now, there's no problem vacuuming a
functional index. But it would be a problem with the Block B-tree,
because the proposed way of vacuuming a Block B-tree involves
re-evaluating index expressions.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jim Nasby (#17)
Re: Block B-Tree concept

Jim C. Nasby wrote:

Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do
that anyway right now?

You mean _index_ tuples marked dead? Sure, no problem there.

Granted, you'd want to periodically ensure that you scan the entire
index, but that shouldn't be horribly hard to set up.

Well, it seems to be. A vacuum can't evaluate index expressions because
it's not in a real transaction.

The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0"
etc. with enable_seqscan=false? That would work, but we can't depend on
an additional administrative task like. And we might as well just
disable the optimization that's causing us problems.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#20Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#19)
Re: Block B-Tree concept

Heikki Linnakangas wrote:

Jim C. Nasby wrote:

Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do
that anyway right now?

You mean _index_ tuples marked dead? Sure, no problem there.

Granted, you'd want to periodically ensure that you scan the entire
index, but that shouldn't be horribly hard to set up.

Well, it seems to be. A vacuum can't evaluate index expressions because
it's not in a real transaction.

The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0"
etc. with enable_seqscan=false? That would work, but we can't depend on
an additional administrative task like. And we might as well just
disable the optimization that's causing us problems.

Why can't the C code just do a full index scan that touches the heap,
sets those expired bits, and then do a vacuum? My point is that the
bits can be set outside the normal vacuum process, so you don't have to
be doing heap lookups from the index inside vacuum.

Assuming the heap is mostly in index order, the full index scan
shouldn't take much longer than a heap scan, and if the heap doesn't
match index order, a block index shouldn't be used anyway.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
#24Csaba Nagy
nagy@ecircle-ag.com
In reply to: Tom Lane (#23)
#25Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#20)
#26Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#14)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#22)
#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#27)
#29Martijn van Oosterhout
kleptog@svana.org
In reply to: Heikki Linnakangas (#28)
#30Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Martijn van Oosterhout (#29)
#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#28)
#32Bruce Momjian
bruce@momjian.us
In reply to: Csaba Nagy (#24)
#33Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#31)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#28)
#35Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#33)
#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#34)
#37Jan de Visser
jdevisser@digitalfairway.com
In reply to: Heikki Linnakangas (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#36)
#39Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#38)