Faster inserts with mostly-monotonically increasing values

Started by Pavan Deolasee, over 8 years ago, 50 messages, hackers
#1 Pavan Deolasee
pavan.deolasee@gmail.com

Hello All,

On a per-session basis, we cache the last heap block used for inserts and
try to use the same block for subsequent inserts. We don't do that for
indexes because the target block in the index is determined by the overall
structure of the index and the index key being inserted and hence caching
is usually not very useful. But for certain common usage patterns we can
make a similar optimisation. For example, consider a btree index on a
"timestamp" column or on a "serial" column. In such cases, each new index
entry typically sorts after the rightmost entry in the index. If we cache
the rightmost block and check it first, we can avoid the cost of descending
the index and searching it from the top.

Here is a patch that implements the idea. If the last insert happens to be
in the rightmost block of an index, then we cache the block and check that
first for the next insert. For the cached block to be usable for the
insert, it should still be the rightmost, the to-be-inserted key should go
into the cached block, and there should be enough free space in the block.
If any of these conditions are not met, we fall back to the regular code
path, i.e. descending the index from the top.

I benchmarked the patch by creating a simple table with just one bigint
column and a btree index on it. Here are the results:

Monotonically Increasing 10M Inserts (time in ms)
=================================================
Run    Patched       Master
1      17656.222     25737.593
2      18765.002     26843.314
3      20629.567     27058.358
4      21174.998     26680.003
5      21118.865     26660.557

Avg    19868.9308    26595.965     (33.86% improvement)

Monotonically Increasing 100M Inserts (time in ms)
==================================================
Run    Patched       Master
1      159861.58     248885.763
2      160138.432    256438.625
3      160426.135    250379.589
4      163218.405    249474.695
5      161125.523    247805.675

Avg    160954.015    250596.8694   (55.69% improvement)

So as the size of the index increases, the benefits of the patch also tend
to increase. This is most likely because as the index becomes taller and
taller, the costs associated with index descent become higher.

Patch credit: this work is based on Simon Riggs's original ideas and
research.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

pg_btree_target_block_v1.patch (application/octet-stream, +82/-1)
#2 Peter Geoghegan
pg@bowt.ie
In reply to: Pavan Deolasee (#1)
Re: Faster inserts with mostly-monotonically increasing values

On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

Here is a patch that implements the idea. If the last insert happens to be
in the rightmost block of an index, then we cache the block and check that
first for the next insert. For the cached block to be usable for the
insert, it should still be the rightmost, the to-be-inserted key should go
into the cached block, and there should be enough free space in the block.
If any of these conditions are not met, we fall back to the regular code
path, i.e. descending the index from the top.

I'd be particularly concerned about page deletion/page recycling still
being correct with the patch, especially because it's hard to test
anything there. The P_ISLEAF() part of your fastpath's test hints at
this -- why should it not *always* be a leaf page (surely you should
be sure that the page cannot be recycled anyway)? I also have my
doubts about unique index enforcement remaining correct with the patch
when there are many physical duplicates, to the extent that more than
a single leaf page is needed for a single value.

Maybe you should do something with page LSN here -- cache LSN
alongside block number, and have a non-matching LSN invalidate the
test.

How many clients are involved with your benchmark?

So as the size of the index increases, the benefits of the patch also tend
to increase. This is most likely because as the index becomes taller and
taller, the costs associated with index descent become higher.

FWIW, I think that int4 indexes have to be extremely large before they
exceed 4 levels (where the leaf level counts as a level). My
conservative back-of-an-envelope estimate is 9 billion index tuples.

--
Peter Geoghegan

#3 Andrey Borodin
amborodin@acm.org
In reply to: Pavan Deolasee (#1)
Re: Faster inserts with mostly-monotonically increasing values

Hi!

On 31 Dec 2017, at 11:44, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
On a per-session basis, we cache the last heap block used for inserts and try to use the same block for subsequent inserts.

+1, from my POV this is a good idea, and it's cool that it already has an
implementation.
I'd suggest not doing a 1/1 split for these pages, to keep the tree
shorter; a fillfactor/(1-fillfactor) split, say. Also, this will make
splits less frequent.

Best regards, Andrey Borodin.

#4 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Peter Geoghegan (#2)
Re: Faster inserts with mostly-monotonically increasing values

On Sun, Dec 31, 2017 at 4:36 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

Here is a patch that implements the idea. If the last insert happens to be
in the rightmost block of an index, then we cache the block and check that
first for the next insert. For the cached block to be usable for the
insert, it should still be the rightmost, the to-be-inserted key should go
into the cached block, and there should be enough free space in the block.
If any of these conditions are not met, we fall back to the regular code
path, i.e. descending the index from the top.

I'd be particularly concerned about page deletion/page recycling still
being correct with the patch, especially because it's hard to test
anything there. The P_ISLEAF() part of your fastpath's test hints at
this -- why should it not *always* be a leaf page (surely you should
be sure that the page cannot be recycled anyway)?

I thought we could throw in a bunch of checks there to ensure that we are
still looking at the correct rightmost page, or else fall back to the
regular path. Do you see something missing there? I don't claim that I've
got everything correct, but since we're holding an exclusive lock on the
page at that point, wouldn't we be able to guard against any concurrent
splits/deletions? I will reread the documentation/code again, but I must
admit, it's not particularly easy to understand all the concurrency issues.

I also have my
doubts about unique index enforcement remaining correct with the patch
when there are many physical duplicates, to the extent that more than
a single leaf page is needed for a single value.

Why would that happen? We are checking if the to-be-inserted key is
strictly greater than the last key in the rightmost block and only then
proceeding with the fast path. And we do this while holding an exclusive
lock on the page. Are you suggesting that someone can concurrently still
split the page?

Maybe you should do something with page LSN here -- cache LSN
alongside block number, and have a non-matching LSN invalidate the
test.

This is a good idea, but we might need to track the block/LSN in shared
memory. Otherwise, with multiple backends doing inserts, we might never be
able to match LSNs (i.e. they will mostly vary, thus falling back to the
slow path the majority of the time).

How many clients are involved with your benchmark?

Just a single client for these benchmarks.

So as the size of the index increases, the benefits of the patch also tend
to increase. This is most likely because as the index becomes taller and
taller, the costs associated with index descent become higher.

FWIW, I think that int4 indexes have to be extremely large before they
exceed 4 levels (where the leaf level counts as a level). My
conservative back-of-an-envelope estimate is 9 billion index tuples.

Hmm, ok. I am not entirely sure whether it's the depth itself, or just the
binary searches through 3-4 index pages and/or the buffer pinning and
unpinning, that result in much of the overhead. I will run some more tests
and collect evidence.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#5 Tels
nospam-pg-abuse@bloodgate.com
In reply to: Pavan Deolasee (#4)
Re: Faster inserts with mostly-monotonically increasing values

Moin,

On Tue, January 2, 2018 7:51 am, Pavan Deolasee wrote:

On Sun, Dec 31, 2017 at 4:36 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

Here is a patch that implements the idea. If the last insert happens to be
in the rightmost block of an index, then we cache the block and check that
first for the next insert. For the cached block to be usable for the
insert, it should still be the rightmost, the to-be-inserted key should go
into the cached block, and there should be enough free space in the block.
If any of these conditions are not met, we fall back to the regular code
path, i.e. descending the index from the top.

[snip]

So as the size of the index increases, the benefits of the patch also tend
to increase. This is most likely because as the index becomes taller and
taller, the costs associated with index descent become higher.

FWIW, I think that int4 indexes have to be extremely large before they
exceed 4 levels (where the leaf level counts as a level). My
conservative back-of-an-envelope estimate is 9 billion index tuples.

Hmm, ok. I am not entirely sure whether it's the depth itself, or just the
binary searches through 3-4 index pages and/or the buffer pinning and
unpinning, that result in much of the overhead. I will run some more tests
and collect evidence.

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing values, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height on every insert and create a "tall stack"?

@Peter: Could you please share your back-of-the-envelope calculation, I'd
love to get some insights into the innards.

All the best,

Tels

#6 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tels (#5)
Re: Faster inserts with mostly-monotonically increasing values

On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:

Moin,

Hmm, ok. I am not entirely sure whether it's the depth itself, or just the
binary searches through 3-4 index pages and/or the buffer pinning and
unpinning, that result in much of the overhead. I will run some more tests
and collect evidence.

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing values, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height on every insert and create a "tall stack"?

AFAIK all pages will be half-filled in this case. Imagine that one index
page can accommodate N entries; the page will be split when the (N+1)th
entry is added. The left page will have half the entries and the right page
will get the remainder. The next split will happen when the right page is
full, i.e. when another N/2 entries are added. Thus there will be a split
every N/2 inserts, creating an index with half-filled pages.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Pavan Deolasee (#6)
Re: Faster inserts with mostly-monotonically increasing values

Pavan Deolasee wrote:

On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing values, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height on every insert and create a "tall stack"?

AFAIK all pages will be half-filled in this case. Imagine that one index
page can accommodate N entries; the page will be split when the (N+1)th
entry is added. The left page will have half the entries and the right page
will get the remainder. The next split will happen when the right page is
full, i.e. when another N/2 entries are added. Thus there will be a split
every N/2 inserts, creating an index with half-filled pages.

Not really -- quoth _bt_findsplitloc():

* If the page is the rightmost page on its level, we instead try to arrange
* to leave the left split page fillfactor% full. In this way, when we are
* inserting successively increasing keys (consider sequences, timestamps,
* etc) we will end up with a tree whose pages are about fillfactor% full,
* instead of the 50% full result that we'd get without this special case.

To answer Tels question: the tree is balanced by splitting pages and
propagating the splits upwards to the parent pages, all the way up to
the root. When the root page gets full, it is also split and a new root
is created on top of the old root plus its new sibling page, which is
the only point at which the tree grows in depth.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8 Tels
nospam-pg-abuse@bloodgate.com
In reply to: Alvaro Herrera (#7)
Re: Faster inserts with mostly-monotonically increasing values

Hello Alvaro,

On Thu, January 4, 2018 7:35 am, Alvaro Herrera wrote:

Pavan Deolasee wrote:

On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing values, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height on every insert and create a "tall stack"?

AFAIK all pages will be half-filled in this case. Imagine that one index
page can accommodate N entries; the page will be split when the (N+1)th
entry is added. The left page will have half the entries and the right page
will get the remainder. The next split will happen when the right page is
full, i.e. when another N/2 entries are added. Thus there will be a split
every N/2 inserts, creating an index with half-filled pages.

Not really -- quoth _bt_findsplitloc():

* If the page is the rightmost page on its level, we instead try to arrange
* to leave the left split page fillfactor% full. In this way, when we are
* inserting successively increasing keys (consider sequences, timestamps,
* etc) we will end up with a tree whose pages are about fillfactor% full,
* instead of the 50% full result that we'd get without this special case.

To answer Tels question: the tree is balanced by splitting pages and
propagating the splits upwards to the parent pages, all the way up to
the root. When the root page gets full, it is also split and a new root
is created on top of the old root plus its new sibling page, which is
the only point at which the tree grows in depth.

Thank you for the explanation! You learn something new every time :)

All the best,

Tels

#9 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Alvaro Herrera (#7)
Re: Faster inserts with mostly-monotonically increasing values

On Thu, Jan 4, 2018 at 6:05 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Pavan Deolasee wrote:

On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing values, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height on every insert and create a "tall stack"?

AFAIK all pages will be half-filled in this case. Imagine that one index
page can accommodate N entries; the page will be split when the (N+1)th
entry is added. The left page will have half the entries and the right page
will get the remainder. The next split will happen when the right page is
full, i.e. when another N/2 entries are added. Thus there will be a split
every N/2 inserts, creating an index with half-filled pages.

Not really -- quoth _bt_findsplitloc():

* If the page is the rightmost page on its level, we instead try to arrange
* to leave the left split page fillfactor% full. In this way, when we are
* inserting successively increasing keys (consider sequences, timestamps,
* etc) we will end up with a tree whose pages are about fillfactor% full,
* instead of the 50% full result that we'd get without this special case.

Ah ok. Thanks for pointing that out. Makes a lot more sense.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10 Simon Riggs
simon@2ndQuadrant.com
In reply to: Peter Geoghegan (#2)
Re: Faster inserts with mostly-monotonically increasing values

On 31 December 2017 at 11:06, Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

Here is a patch that implements the idea. If the last insert happens to be
in the rightmost block of an index, then we cache the block and check that
first for the next insert. For the cached block to be usable for the
insert, it should still be the rightmost, the to-be-inserted key should go
into the cached block, and there should be enough free space in the block.
If any of these conditions are not met, we fall back to the regular code
path, i.e. descending the index from the top.

I'd be particularly concerned about page deletion/page recycling still
being correct with the patch, especially because it's hard to test
anything there. The P_ISLEAF() part of your fastpath's test hints at
this -- why should it not *always* be a leaf page (surely you should
be sure that the page cannot be recycled anyway)?

I don't see any issues related to these points.

We wouldn't be inserting into a deleted/recycled page, so no issues to
discuss. We're not allocating a new page, so no recycling issues.

The block is tested for whether it is an incomplete split and also
locked, so it cannot split while locked.

I also have my
doubts about unique index enforcement remaining correct with the patch
when there are many physical duplicates, to the extent that more than
a single leaf page is needed for a single value.

This isn't a problem either.

The patch only uses the cache when the value is higher than the
rightmost value. So by definition there are no duplicates in that
case.

Page splitting preserves the ordering property, so we are certain that
the rightmost block has the highest value.

We do need to check that the index is sorted ASC rather than DESC,
possibly with special action for NULL sorting.

Maybe you should do something with page LSN here -- cache LSN
alongside block number, and have a non-matching LSN invalidate the
test.

Don't think we need that.

How many clients are involved with your benchmark?

Just one, since it is demonstrating the reduced path length of doing that.

So as the size of the index increases, the benefits of the patch also tend
to increase. This is most likely because as the index becomes taller and
taller, the costs associated with index descent become higher.

FWIW, I think that int4 indexes have to be extremely large before they
exceed 4 levels (where the leaf level counts as a level). My
conservative back-of-an-envelope estimate is 9 billion index tuples.

If they did, the gains would be even bigger.

But as you say, a unique index with many duplicates could be a
problem, and index fragmentation would make the situation worse.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11 Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#2)
Re: Faster inserts with mostly-monotonically increasing values

On Sun, Dec 31, 2017 at 8:06 AM, Peter Geoghegan <pg@bowt.ie> wrote:

I also have my
doubts about unique index enforcement remaining correct with the patch
when there are many physical duplicates, to the extent that more than
a single leaf page is needed for a single value.

given...

+        if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
+            !P_INCOMPLETE_SPLIT(lpageop) &&
+            !P_IGNORE(lpageop) &&
+            (PageGetFreeSpace(page) > itemsz) &&
+            _bt_compare(rel, natts, itup_scankey, page,
+                        PageGetMaxOffsetNumber(page)) > 0)

The key there is strictly greater than the rightmost key. As such, it
must precede the first page with an equal key, and since it's the
rightmost page... there's no such key. But if there was, the check for
uniqueness only needs the insert point to point to the first such
page, and this guarantees it.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

That would allow this case to be applied not only to perfectly
monotonic sequences, but for nearly monotonic ones as well (think
timestamps).

#12 Peter Geoghegan
pg@bowt.ie
In reply to: Claudio Freire (#11)
Re: Faster inserts with mostly-monotonically increasing values

On Thu, Mar 1, 2018 at 3:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

given...

+        if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
+            !P_INCOMPLETE_SPLIT(lpageop) &&
+            !P_IGNORE(lpageop) &&
+            (PageGetFreeSpace(page) > itemsz) &&
+            _bt_compare(rel, natts, itup_scankey, page,
+                        PageGetMaxOffsetNumber(page)) > 0)

One simple problem with this code is that it assumes that there must
be at least one item on a leaf page, which isn't guaranteed. There has
to be a highkey on non-rightmost pages, but rightmost leaf pages can be
totally empty. While btvacuumpage() is very unlikely to not be able to
begin deleting the page there and then, it can happen (see remarks at
the end of btvacuumpage() about recursing).

Other issues spotted:

* The variable name "lpageop" seems like it was lifted from somewhere
dealing with a page to the left of some other page, which is not the
case here.

* P_INCOMPLETE_SPLIT() checks a bit that can only be set on a
non-rightmost page -- a page that has a sibling to its right that
doesn't have a downlink in parent. The bit is only ever set on the
"target" of a (still unfinished) page split. This is why
_bt_moveright() doesn't bother with completing a page split when it
reaches a rightmost page. I don't see why you need this part of the
test at all.

The key there is strictly greater than the rightmost key. As such, it
must precede the first page with an equal key, and since it's the
rightmost page... there's no such key. But if there was, the check for
uniqueness only needs the insert point to point to the first such
page, and this guarantees it.

This code assumes that it can use block number as a stable way of
finding the same point in the B-Tree over time, without any interlock.
That may actually work out in practice, since even if the page is
recycled there can only ever be one rightmost leaf page, but it seems
like a brittle approach to me. The page image may be recycled by the
FSM already, and we really shouldn't be depending on the page not
looking a particular way once that has happened. I guess that you can
say the same thing about using page LSN to determine cached block
staleness instead, but that still seems a lot safer.

Basically, I'm worried that we could do something entirely
unpredictable when a page has actually been recycled, since we're
entirely opting out of the RecentGlobalXmin interlock business on the
assumption that we can be sure that recycling hasn't occurred in some
other way. Imagine what could happen if we ever teach the FSM to share
freespace among small relations, which seems quite possible to me.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

The whole way that this patch opts out of using an exclusive buffer
lock on the "first page the value could be on" (the _bt_check_unique()
+ _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
terribly concrete.

--
Peter Geoghegan

#13 Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#12)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 9:12 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Mar 1, 2018 at 3:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

The key there is strictly greater than the rightmost key. As such, it
must precede the first page with an equal key, and since it's the
rightmost page... there's no such key. But if there was, the check for
uniqueness only needs the insert point to point to the first such
page, and this guarantees it.

This code assumes that it can use block number as a stable way of
finding the same point in the B-Tree over time, without any interlock.
That may actually work out in practice, since even if the page is
recycled there can only ever be one rightmost leaf page, but it seems
like a brittle approach to me. The page image may be recycled by the
FSM already, and we really shouldn't be depending on the page not
looking a particular way once that has happened. I guess that you can
say the same thing about using page LSN to determine cached block
staleness instead, but that still seems a lot safer.

Basically, I'm worried that we could do something entirely
unpredictable when a page has actually been recycled, since we're
entirely opting out of the RecentGlobalXmin interlock business on the
assumption that we can be sure that recycling hasn't occurred in some
other way. Imagine what could happen if we ever teach the FSM to share
freespace among small relations, which seems quite possible to me.

Correct me if I'm wrong, but there's lots of code in nbtree already
that risks reading recycled pages for various reasons. Presumably,
checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
!P_IGNORE implies.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

The whole way that this patch opts out of using an exclusive buffer
lock on the "first page the value could be on" (the _bt_check_unique()
+ _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
terribly concrete.

Assuming the rightmost page is the first page the value could be on,
it already does get an exclusive buffer lock.

And it seems that assumption is correct in the patch as-is. In fact,
the current patch checks for a much stronger situation than needed to
apply this optimization, so it can even skip checking for conflicting
concurrent transactions. With an exclusive lock on the buffer, and
being the rightmost page, it means there can be no conflicting key
since the check is that the scan key is strictly greater than the
rightmost key (lets forget for a moment empty rightmost pages). Unless
there can be a new rightmost page in construction somehow (which I
don't believe it can happen, new rightmost pages would be created by
splitting the rightmost page, and that would conflict with the
exclusive lock), I don't see how this can fail.

If one wanted to relax the condition as I proposed, that uniqueness
check is necessary, so that's why I propose short-circuiting the search
only, and not the actual insertion on the page. I believe it's more
important to try to maximize the conditions under which the
optimization can be applied.

#14 Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#13)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 9:37 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

Assuming the rightmost page is the first page the value could be on,
it already does get an exclusive buffer lock.

That made me check, and:

+        /*
+         * Acquire exclusive lock on the buffer before doing any checks. This
+         * ensures that the index state cannot change, as far as the rightmost
+         * part of the index is concerned.
+         */
+        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

BTree code uses BT_READ and BT_WRITE instead, so that should be:

LockBuffer(buf, BT_WRITE)

#15 Peter Geoghegan
pg@bowt.ie
In reply to: Claudio Freire (#13)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 4:37 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

Correct me if I'm wrong, but there's lots of code in nbtree already
that risks reading recycled pages for various reasons. Presumably,
checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
!P_IGNORE implies.

You're mistaken. Read the nbtree README on page deletion, and the
RecentGlobalXmin interlock. Note also that there are 3 distinct page
states in play as far as deletion/recycling is concerned:

1. The first phase of deletion
2. The second phase of deletion.
3. Post-recycling, when in general the page could look like just about anything.

It's page deletion's job to make sure that an index scan cannot land
on a page after it was recycled. The corollary is that it is not
everyone else's job to make sure that that didn't happen -- how could
that possibly work?

Quite a few places know a bit about, and must reason about, the first
phase of deletion; a few know about the second, but that's it.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

The whole way that this patch opts out of using an exclusive buffer
lock on the "first page the value could be on" (the _bt_check_unique()
+ _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
terribly concrete.

Assuming the rightmost page is the first page the value could be on,
it already does get an exclusive buffer lock.

This can be quite fiddly. The downlink in the parent can be equal to
the value in the target page, meaning that we actually lock the
target's left sibling (see _bt_binsrch() special case for internal
pages).

I didn't say that the patch is necessarily wrong here. Just that it
makes me nervous.

If one wanted to relax the condition as I proposed, that uniqueness
check is necessary, so that's why I propose shortcircuiting the search
only, and not the actual insertion on the page. I believe IMO it's
more important to try and maximize the conditions under which the
optimization can be applied.

I didn't get what the point of checking the first item on the page
(your proposal) was.

--
Peter Geoghegan

#16 Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#15)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 9:52 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Mar 5, 2018 at 4:37 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

Correct me if I'm wrong, but there's lots of code in nbtree already
that risks reading recycled pages for various reasons. Presumably,
checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
!P_IGNORE implies.

You're mistaken. Read the nbtree README on page deletion, and the
RecentGlobalXmin interlock. Note also that there are 3 distinct page
states in play as far as deletion/recycling is concerned:

1. The first phase of deletion
2. The second phase of deletion.
3. Post-recycling, when in general the page could look like just about anything.

It's page deletion's job to make sure that an index scan cannot land
on a page after it was recycled. The corollary is that it is not
everyone else's job to make sure that that didn't happen -- how could
that possibly work?

From what I read, both phases 1 & 2 are covered by the !P_IGNORE check.

For the third phase:

A deleted page can only be reclaimed once there is no scan or search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed.

That's because:

A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
parent, or scans moving right or left from one of the siblings). These
processes must observe that the page is marked dead and recover
accordingly. Searches and forward scans simply follow the right-link
until they find a non-dead page --- this will be where the deleted page's
key-space moved to.

But in this case there's no right link to follow, so it's a non-issue.

BTree doesn't truncate AFAIK, so the cached block number can't be
nonexistent (beyond the end of the relation). It's probably fragile to
assume BTree will never truncate, so maybe it would be wise to check
for that. But if the block exists, I believe it contains a page that
can be safely interpreted by that condition. As long as there can be
no other pages with the same flags on the index while the cached page
is write-locked, that code will be safe.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

The whole way that this patch opts out of using an exclusive buffer
lock on the "first page the value could be on" (the _bt_check_unique()
+ _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
terribly concrete.

Assuming the rightmost page is the first page the value could be on,
it already does get an exclusive buffer lock.

This can be quite fiddly. The downlink in the parent can be equal to
the value in the target page, meaning that we actually lock the
target's left sibling (see _bt_binsrch() special case for internal
pages).

I'm not sure which special case you're referring to; I don't see
anything relevant in _bt_binsrch.

Yes, if the scankey is equal to the leftmost key, the first occurrence
of that key could be on the left, so the check would have to be for
strictly greater. But that's about as complex as it would get.

I didn't say that the patch is necessarily wrong here. Just that it
makes me nervous.

Any change to btree would make anyone nervous ;-)

If one wanted to relax the condition as I proposed, that uniqueness
check is necessary, so that's why I propose short-circuiting the search
only, and not the actual insertion on the page. IMO it's more
important to try and maximize the conditions under which the
optimization can be applied.

I didn't get what the point of checking the first item on the page
(your proposal) was.

Right now, if you've got concurrent transactions, there's absolutely
no chance this optimization would kick in. For it to happen,
insertions on the index need to happen in order, each insertion a key
greater than any other key (visible or not) on the index.

Concurrent inserts on PKs are unlikely to arrive in order: a slight
disorder is to be expected, as serial sequence numbers get generated a
significant amount of time before the insert actually takes place.
Different backends can reach the index insertion at different times,
causing unordered inserts.

I believe PKs are a prime candidate for this optimization, and
expecting it to apply only when no concurrency is involved is severely
dumbing down the optimization.

Consider 3 transactions on a table like:

CREATE TABLE test ( id serial, d integer );

T1: INSERT INTO test (d) VALUES (3), (4), (5), (6);
T2: INSERT INTO test (d) SELECT generate_series(1,10);
T3: INSERT INTO test (d) VALUES (1);

Each transaction will take values from the sequence at some point of
the evaluation, far from actual insertion. This could cause the
following insertions:

-- (id, d)
T1: (5, 3)
T2: (10, 1)
T2: (11, 2)
T1: (6, 4)
T3: (1, 1)
... etc

No insert would be optimized in this case, and I don't think it's an
uncommon case.

So what I propose is that the optimization only needs to check whether
"the rightmost page" is the right page to insert on. And for that
check, all you have to assert is that:

* The page is actually the currently valid rightmost leaf
* The **first** (data) key on that page is strictly less than the scan key

If both are true, the insertion point will be somewhere on that page.
Maybe not at the end, maybe at the middle, but somewhere in that page,
no need to go looking any further in the b-tree.

If the page is empty, the optimization can't be applied, simple as
that. That check is missing from the patch, true enough.
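The two-part check proposed above can be sketched in isolation. The
following is a hedged, self-contained model: the struct, its fields, and
the `fastpath_ok` helper are all invented for illustration, not the
patch's code; the real conditions would be expressed with nbtree's
P_RIGHTMOST/P_ISLEAF/P_IGNORE machinery under a write lock.

```c
#include <stdbool.h>

/* Simplified stand-in for a btree leaf page; the real nbtree page
 * header and flag macros are far more involved. This only models
 * the conditions discussed above. */
typedef struct
{
    bool is_rightmost;  /* no right sibling (cf. P_RIGHTMOST) */
    bool is_leaf;       /* leaf page (cf. P_ISLEAF) */
    bool ignore;        /* deleted or half-dead (cf. P_IGNORE) */
    int  nkeys;         /* number of data items on the page */
    int  keys[16];      /* item keys, in sorted order */
    int  free_slots;    /* room left on the page */
} FakeLeafPage;

/* Hypothetical fast-path test: can `newkey` be inserted on the cached
 * rightmost page without descending the tree?  Under the relaxed
 * condition, the *first* key need only be strictly less than the new
 * key; the insert may then land anywhere on the page. */
static bool
fastpath_ok(const FakeLeafPage *page, int newkey)
{
    if (!page->is_rightmost || !page->is_leaf || page->ignore)
        return false;
    if (page->nkeys == 0)       /* empty page: fall back to descent */
        return false;
    if (page->free_slots <= 0)  /* no room: fall back to descent */
        return false;
    return page->keys[0] < newkey;  /* strictly greater than first key */
}
```

Note that a key falling in the middle of the page still passes, which
is exactly what lets slightly out-of-order serial values keep using the
fast path.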

#17Peter Geoghegan
pg@bowt.ie
In reply to: Claudio Freire (#16)
Re: Faster inserts with mostly-monotonically increasing values
On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

From what I read, both phases 1 & 2 are served by the !P_IGNORE check.

True.

For the third phase:

A deleted page can only be reclaimed once there is no scan or search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed.

That's because:

A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
parent, or scans moving right or left from one of the siblings). These
processes must observe that the page is marked dead and recover
accordingly. Searches and forward scans simply follow the right-link
until they find a non-dead page --- this will be where the deleted page's
key-space moved to.

But in this case there's no right link to follow, so it's a non-issue.

BTree doesn't truncate AFAIK, so the cached block number can't be
nonexistent (beyond the end of the relation). It's probably fragile to
assume BTree will never truncate, so maybe it would be wise to check
for that. But if the block exists, I believe it contains a page that
can be safely interpreted by that condition. As long as there can be
no other pages with the same flags on the index while the cached page
is write-locked, that code will be safe.

Introducing any case that allows us to land on a recycled page, and
reason that it must at least not be the page we were looking for based
on *any* criteria about the page itself seems very brittle. Yeah, it
probably won't break in practice, but it's a bad design.

Yes, if the scankey is equal to the leftmost key, the first occurrence
of that key could be on the left, so the check would have to be for
strictly greater. But that's about as complex as it would get.

That's pretty complicated.

I didn't say that the patch is necessarily wrong here. Just that it
makes me nervous.

Any change to btree would make anyone nervous ;-)

True. Anyway, this isn't the thing that I'm most concerned about right now.

I didn't get what the point of checking the first item on the page
(your proposal) was.

Right now, if you've got concurrent transactions, there's absolutely
no chance this optimization would kick in. For it to happen,
insertions on the index need to happen in order, each insertion a key
greater than any other key (visible or not) on the index.

Concurrent inserts on PKs are unlikely to arrive in order: a slight
disorder is to be expected, as serial sequence numbers get generated a
significant amount of time before the insert actually takes place.
Different backends can reach the index insertion at different times,
causing unordered inserts.

I believe PKs are a prime candidate for this optimization, and
expecting it to apply only when no concurrency is involved is severely
dumbing down the optimization.

Pavan justified the patch using a benchmark that only involved a
single client -- hardly typical for a patch that changes the B-Tree
code. If the benefits with many clients can be shown to matter, that
will make this much more interesting to me.

--
Peter Geoghegan

#18Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#17)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 10:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

But in this case there's no right link to follow, so it's a non-issue.

BTree doesn't truncate AFAIK, so the cached block number can't be
nonexistent (beyond the end of the relation). It's probably fragile to
assume BTree will never truncate, so maybe it would be wise to check
for that. But if the block exists, I believe it contains a page that
can be safely interpreted by that condition. As long as there can be
no other pages with the same flags on the index while the cached page
is write-locked, that code will be safe.

Introducing any case that allows us to land on a recycled page, and
reason that it must at least not be the page we were looking for based
on *any* criteria about the page itself seems very brittle. Yeah, it
probably won't break in practice, but it's a bad design.

How is this any different from btvacuumscan?

I don't think it can be argued that somehow btvacuumscan has
permission to be brittle and the rest of the code doesn't.

Point is, it's not brittle. The on-disk format of the tree is such
that any block can be catalogued by inspecting its header;
btvacuumscan depends on that. Questions that can be answered by looking
at a page header alone are:

* Is it deleted or new?
* Is it a leaf?
* Is it rightmost?

Only question remaining is whether it's the *only* rightmost leaf, and
that can be guaranteed by locking.

So, can a flag check result in a random outcome? No. That would also
cause a random outcome for btvacuumscan and then vacuum would corrupt
the index just as well.
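As a hedged illustration of answering those three questions from the
header alone, here is a toy model. The flag bits, names, and `toy_`
helpers are simplified inventions that only loosely follow what nbtree
keeps in its page opaque area (where, notably, "rightmost" is inferred
from a zero right-link rather than from a dedicated flag):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy page "opaque" area: a simplified stand-in for nbtree's real
 * special space.  Flag values here are made up for illustration. */
#define TOY_BTP_LEAF      (1 << 0)  /* page is a leaf */
#define TOY_BTP_DELETED   (1 << 1)  /* second phase of deletion done */
#define TOY_BTP_HALF_DEAD (1 << 2)  /* first phase of deletion done */
#define TOY_P_NONE        0         /* "no sibling" block number */

typedef struct
{
    uint32_t btpo_next;             /* right sibling, or TOY_P_NONE */
    uint16_t btpo_flags;
} ToyPageOpaque;

/* Is it deleted or half-dead, i.e. should it be ignored? */
static bool
toy_ignore(const ToyPageOpaque *o)
{
    return (o->btpo_flags & (TOY_BTP_DELETED | TOY_BTP_HALF_DEAD)) != 0;
}

/* Is it a leaf? */
static bool
toy_isleaf(const ToyPageOpaque *o)
{
    return (o->btpo_flags & TOY_BTP_LEAF) != 0;
}

/* Is it rightmost, i.e. does it have no right sibling? */
static bool
toy_rightmost(const ToyPageOpaque *o)
{
    return o->btpo_next == TOY_P_NONE;
}
```

The one question this cannot answer is whether the page is the *only*
live rightmost leaf for the duration of the insert, which is what the
write lock is for.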

And now that we mention this... why is the code not using _bt_getbuf?
It already does quite a bit of sanity checking that is IMO quite
desirable here.

#19Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Peter Geoghegan (#17)
Re: Faster inserts with mostly-monotonically increasing values

On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com>
wrote:

I believe PKs are a prime candidate for this optimization, and
expecting it to apply only when no concurrency is involved is severely
dumbing down the optimization.

Pavan justified the patch using a benchmark that only involved a
single client -- hardly typical for a patch that changes the B-Tree
code. If the benefits with many clients can be shown to matter, that
will make this much more interesting to me.

Ok. I will repeat those tests with a larger number of clients and report back.

Regarding your suggestion about using the page LSN to detect intermediate
activity, my concern is that unless we store that value in shared memory,
concurrent backends, even when inserting values in order, will invalidate
each other's backend-local cached page LSN and the optimisation will not
kick in.

I am yet to digest the entire conversation between you and Claudio; you
guys clearly understand b-tree internals better than me. It seems while
you're worried about missing out on something, Claudio feels that we can
find a safe way just looking at the information available in the current
page. I feel the same way, but will need to re-read the discussion
carefully again.

Simon had raised concerns about DESC indexes and whether we need to do the
checks for the leftmost page in that case. I haven't yet figured out whether
DESC indexes are actually stored in reverse order. I am going to look at
that too.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In reply to: Claudio Freire (#18)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 7:10 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

Introducing any case that allows us to land on a recycled page, and
reason that it must at least not be the page we were looking for based
on *any* criteria about the page itself seems very brittle. Yeah, it
probably won't break in practice, but it's a bad design.

How is this any different from btvacuumscan?

I don't think it can be argued that somehow btvacuumscan has
permission to be brittle and the rest of the code doesn't.

VACUUM doesn't have to worry about concurrent page recycling because
it is already the only thing that performs page deletion. It's already
the process that has the exclusive right to give index pages back to
the FSM.

--
Peter Geoghegan

#21Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#20)
#22Simon Riggs
simon@2ndQuadrant.com
In reply to: Pavan Deolasee (#19)
#23Claudio Freire
klaussfreire@gmail.com
In reply to: Simon Riggs (#22)
#24Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Pavan Deolasee (#19)
#25Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#24)
#26Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#25)
#27Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#26)
#28Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#27)
#29Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#28)
#30Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#29)
#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Pavan Deolasee (#28)
#32Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Simon Riggs (#31)
#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Pavan Deolasee (#32)
#34Claudio Freire
klaussfreire@gmail.com
In reply to: Simon Riggs (#33)
#35Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Simon Riggs (#33)
#36Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#28)
#37Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#36)
#38Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#37)
#39Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#38)
#40Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#39)
#41Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#40)
#42Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#41)
#43Andrew Dunstan
andrew@dunslane.net
In reply to: Claudio Freire (#42)
#44Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Andrew Dunstan (#43)
#45Andrew Dunstan
andrew@dunslane.net
In reply to: Pavan Deolasee (#44)
#46Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Andrew Dunstan (#45)
#47Andrew Dunstan
andrew@dunslane.net
In reply to: Pavan Deolasee (#46)
#48Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrew Dunstan (#47)
#49Claudio Freire
klaussfreire@gmail.com
In reply to: Heikki Linnakangas (#48)
#50Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#49)