RE: btree split logic is fragile in the presence of lar ge index items

Started by Mikheev, Vadimover 25 years ago12 messages

vmikheev@SECTORBASE.COM

over 25 years ago

1. The PG 4.2 code used an OID appended to the key to ensure that
btree items are unique. This fixes the Lehman and Yao algorithm's
assumption of unique keys *for insertions*, since the item being
inserted must have an OID different from existing items even if the
key values are equal. However there is still an issue for *lookups*,
since we are necessarily doing a lookup with just a key value, and
we want to find all index items with that same key value regardless
of OID.

This is not right. OIDs were used *only* to find parent tuple in
_bt_getstackbuf, where only *per level* uniqueness is required.
I removed OIDs because of on any level there are no two (or more)
tuples pointing to the same place - i.e. TID may be used.
BTW, there were only single-key indices in Postgres-95 (and 4.2 too?) -
i.e. OID could not be used in key.

3. Awhile back Vadim removed the added-OID code and added a bunch of
logic for explicit management of chains of duplicate keys. In
retrospect this change was probably a mistake. For btree items that
point to heap tuples we can use the tuple TID as tie-breaker (since
an index should never contain two items with the same key and TID).
Btree internal pages need an added field since they have to
consider the TID as part of the key (and the field that is the TID in a
leaf page is used as the down-link pointer in non-leaf pages).

While implementing multi-key btree-s for 6.1 I found problems with
duplicates handling and this is why extra logic was added. But I never
was happy with this logic -:)

Note that using TID as part of key would give us additional feature:
fast heap tuple --> index tuple look up. With this feature vacuum wouldn't
have to read entire index to delete a few items... and this will be required
for space re-using without vacuum...

Vadim

Tom Lane

tgl@sss.pgh.pa.us

over 25 years ago

In reply to: Mikheev, Vadim (#1)

Re: btree split logic is fragile in the presence of lar ge index items

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

While implementing multi-key btree-s for 6.1 I found problems with
duplicates handling and this is why extra logic was added. But I never
was happy with this logic -:)

Do you not like the proposal I was suggesting? I thought it was pretty
much what you said yourself a few months ago...

regards, tom lane

Mikheev, Vadim

vmikheev@SECTORBASE.COM

over 25 years ago

In reply to: Mikheev, Vadim (#1)

While implementing multi-key btree-s for 6.1 I found problems with
duplicates handling and this is why extra logic was added.
But I never was happy with this logic -:)

Do you not like the proposal I was suggesting? I thought it
was pretty much what you said yourself a few months ago...

Do not add TID to key but store key anywhere in duplicate chain and
just read lefter child page while positioning index scan, as we do
right now for partial keys?

This will result in additional reads but I like it much more than
current "logic"... especially keeping in mind that we'll have to
implement redo/undo for btree very soon and this would be nice
to simplify things -:) But if you're going to change btree then
please do it asap - I hope to begin btree redo/undo implementation
in 2-3 days, just after heap...

Vadim

Import Notes

Resolved by subject fallback

Tom Lane

tgl@sss.pgh.pa.us

over 25 years ago

In reply to: Mikheev, Vadim (#3)

Re: btree split logic is fragile in the presence of lar ge index items

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

Do you not like the proposal I was suggesting? I thought it
was pretty much what you said yourself a few months ago...

Do not add TID to key but store key anywhere in duplicate chain and
just read lefter child page while positioning index scan, as we do
right now for partial keys?

This will result in additional reads but I like it much more than
current "logic"...

Offhand it seems good to me too. For the normal case where there are
many keys per page and not so many duplicates, an unneeded read should
be rare anyway.

I will need to study Lehman & Yao a little more to ensure they don't
have a problem with it, but if not I'll do it that way. (I was
surprised to realize that Lehman is the same Phil Lehman I knew in
grad school ... in fact he was probably working on this paper when
I knew him. Small world ain't it.)

But if you're going to change btree then
please do it asap - I hope to begin btree redo/undo implementation
in 2-3 days, just after heap...

Slavedriver ;-) ... I'll see what I can do ...

regards, tom lane

Hiroshi Inoue

Inoue@tpf.co.jp

over 25 years ago

In reply to: Tom Lane (#4)

RE: btree split logic is fragile in the presence of lar ge index items

-----Original Message-----
From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
Behalf Of Tom Lane

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

Do you not like the proposal I was suggesting? I thought it
was pretty much what you said yourself a few months ago...

Do not add TID to key but store key anywhere in duplicate chain and
just read lefter child page while positioning index scan, as we do
right now for partial keys?

This will result in additional reads but I like it much more than
current "logic"...

What about unique key insertions ?

Regards.

Hiroshi Inoue

Mikheev, Vadim

vmikheev@SECTORBASE.COM

over 25 years ago

In reply to: Mikheev, Vadim (#3)

Do not add TID to key but store key anywhere in duplicate
chain and just read lefter child page while positioning index scan,
as we do right now for partial keys?

This will result in additional reads but I like it much more than
current "logic"...

What about unique key insertions ?

We'll have to find leftmost key in this case and do what we do now.

Vadim

Import Notes

Resolved by subject fallback

Tom Lane

tgl@sss.pgh.pa.us

over 25 years ago

In reply to: Hiroshi Inoue (#5)

Re: btree split logic is fragile in the presence of lar ge index items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Do not add TID to key but store key anywhere in duplicate chain and
just read lefter child page while positioning index scan, as we do
right now for partial keys?

This will result in additional reads but I like it much more than
current "logic"...

What about unique key insertions ?

What about 'em? Doesn't seem to make any difference as far as I can
see. You still need to be prepared to scan right to see all of the
potential matches.

I have been digging in the index code and am thinking of reinstating
another aspect of the older implementation. Once upon a time, the code
was set up to treat the leftmost key on any non-leaf tree level as
minus-infinity. That seems to me to agree with the data structure
Lehman and Yao have in mind: in their drawings, each down-link pointer
on a non-leaf level is "between" two keys, except that the leftmost
downlink has no key to its left. Their drawings all show n+1 downlinks
in an n-key non-leaf node. We didn't match that representation, so we
need to fake it with a dummy key associated with the first downlink.
The original code did that, but it got taken out at some point and
replaced with pretty ugly code to propagate minimum-key changes back up
the tree when the leftmost child node has to be split. But I think the
only thing wrong with it was that not all the comparison routines knew
they needed to do that. btree seems to be suffering from an abundance
of comparison routines ... I'm going to try to eliminate some of the
redundancy.

Actually, we could shave some storage by not storing a key in the
first data item of any non-leaf page, whether leftmost on its level
or not. That would correspond exactly to L&Y's drawings.

regards, tom lane

Hiroshi Inoue

Inoue@tpf.co.jp

over 25 years ago

In reply to: Mikheev, Vadim (#6)

RE: btree split logic is fragile in the presence of lar ge index items

-----Original Message-----
From: Mikheev, Vadim

Do not add TID to key but store key anywhere in duplicate
chain and just read lefter child page while positioning index scan,
as we do right now for partial keys?

This will result in additional reads but I like it much more than
current "logic"...

What about unique key insertions ?

We'll have to find leftmost key in this case and do what we do now.

Currently the page contains the leftmost key is the target page of
insertion and is locked exclusively but it may be different in extra
TID implementation. There may be a very rare deadlock possibility.

Hiroshi Inoue
Inoue@tpf.co.jp

Mikheev, Vadim

vmikheev@SECTORBASE.COM

over 25 years ago

In reply to: Mikheev, Vadim (#6)

What about unique key insertions ?

We'll have to find leftmost key in this case and do what we do now.

Currently the page contains the leftmost key is the target page of
insertion and is locked exclusively but it may be different in extra
TID implementation. There may be a very rare deadlock possibility.

First, Tom is not going to do TID implementation now...
But anyway while we hold lock on a page we are able to go right
and lock pages (and we do this now). There is no possibility for
deadlock here: backward scans always unlock page before reading/locking
left page.

Vadim

Import Notes

Resolved by subject fallback

#10

Tom Lane

tgl@sss.pgh.pa.us

over 25 years ago

In reply to: Mikheev, Vadim (#9)

Re: btree split logic is fragile in the presence of lar ge index items

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

Currently the page contains the leftmost key is the target page of
insertion and is locked exclusively but it may be different in extra
TID implementation. There may be a very rare deadlock possibility.

But anyway while we hold lock on a page we are able to go right
and lock pages (and we do this now). There is no possibility for
deadlock here: backward scans always unlock page before reading/locking
left page.

Readers can only hold one page read lock at a time (they unlock before
trying to lock next page). Writers can hold more than one page lock
at a time, but they always step in the same direction (right or up)
so no deadlock is possible. It looks fine to me.

I have been studying the btree code all day today and finally understand
that the equal-key performance problems don't really have anything to do
with the Lehman-Yao assumption of no equal keys. The L-Y algorithm
really only needs unique keys so that a writer who's split a page can
return to the parent page and find the right place to insert the link
for the new righthand page. As Vadim says, you can use the downlinks
themselves to determine which entry is for the page you split; at worst
you have to do some linear instead of binary search when there are lots
of equal keys. That should be OK, since we only have to do this when we
split a page.

I believe that the equal-key performance problem largely comes from the
bogus way we've been handling duplicates, in particular the fact that
findsplitloc has to worry about choosing a "legal split point" for a
series of duplicates. Once we get rid of that, I think findsplitloc
can use a binary search.

regards, tom lane

#11

Mikheev, Vadim

vmikheev@SECTORBASE.COM

over 25 years ago

In reply to: Mikheev, Vadim (#9)

I believe that the equal-key performance problem largely
comes from the bogus way we've been handling duplicates, in particular

It comes from the way we look for page for new tuple - _bt_insertonpg tries
to avoid duplicate page splitting, i.e. if there is no space for new tuple
on the leftmost page, it reads next right page... etc...

the fact that findsplitloc has to worry about choosing a "legal split
point" for a series of duplicates. Once we get rid of that, I think
findsplitloc can use a binary search.

Vadim

Import Notes

Resolved by subject fallback

#12

Bruce Momjian

pgman@candle.pha.pa.us

over 25 years ago

In reply to: Mikheev, Vadim (#1)

Re: btree split logic is fragile in the presence of lar ge index items

Note that using TID as part of key would give us additional feature:
fast heap tuple --> index tuple look up. With this feature vacuum wouldn't
have to read entire index to delete a few items... and this will be required
for space re-using without vacuum...

Wow, this seems like a huge win.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026