Faster inserts with mostly-monotonically increasing values

Started by Pavan Deolasee, over 8 years ago, 50 messages, hackers
#1 Pavan Deolasee
pavan.deolasee@gmail.com

Hello All,

On a per-session basis, we cache the last heap block used for inserts and
try to use the same block for subsequent inserts. We don't do that for
indexes because the target block in the index is determined by the overall
structure of the index and the index key being inserted and hence caching
is usually not very useful. But for certain common usage patterns we can
make a similar optimisation. For example, consider a btree index on a
"timestamp" column or on a "serial" column. In such cases, each new index
entry typically sorts after the rightmost entry in the index. If we cache
the rightmost block and check it first, we can avoid the cost of descending
the index and searching it from the top.

Here is a patch that implements the idea. If the last insert happens to be
in the rightmost block of an index, then we cache the block and check that
first for the next insert. For the cached block to be usable for the
insert, it should still be the rightmost, the to-be-inserted key should go
into the cached block, and there should be enough free space in the block.
If any of these conditions are not met, we fall back to the regular code
path, i.e. descending the index from the top.

I benchmarked the patch by creating a simple table with just one bigint
column and a btree index on it. Here are the results:

Monotonically Increasing 10M Inserts (time in ms)
=================================================
Run    Patched       Master
1      17656.222     25737.593
2      18765.002     26843.314
3      20629.567     27058.358
4      21174.998     26680.003
5      21118.865     26660.557

Avg    19868.9308    26595.965     (33.86% improvement)

Monotonically Increasing 100M Inserts (time in ms)
==================================================
Run    Patched       Master
1      159861.58     248885.763
2      160138.432    256438.625
3      160426.135    250379.589
4      163218.405    249474.695
5      161125.523    247805.675

Avg    160954.015    250596.8694   (55.69% improvement)

So as the size of the index increases, the benefits of the patch also tend
to increase. This is most likely because as the index becomes taller and
taller, the costs associated with index descent become higher.

Patch credit: this work is based on Simon Riggs's original ideas and
research.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

pg_btree_target_block_v1.patch (application/octet-stream, +82/-1)
#2 Peter Geoghegan
pg@bowt.ie
In reply to: Pavan Deolasee (#1)
Re: Faster inserts with mostly-monotonically increasing values

On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

Here is a patch that implements the idea. If the last insert happens to be
in the rightmost block of an index, then we cache the block and check that
first for the next insert. For the cached block to be usable for the
insert, it should still be the rightmost, the to-be-inserted key should go
into the cached block, and there should be enough free space in the block.
If any of these conditions are not met, we fall back to the regular code
path, i.e. descending the index from the top.

I'd be particularly concerned about page deletion/page recycling still
being correct with the patch, especially because it's hard to test
anything there. The P_ISLEAF() part of your fastpath's test hints at
this -- why should it not *always* be a leaf page (surely you should
be sure that the page cannot be recycled anyway)? I also have my
doubts about unique index enforcement remaining correct with the patch
when there are many physical duplicates, to the extent that more than
a single leaf page is needed for a single value.

Maybe you should do something with page LSN here -- cache LSN
alongside block number, and have a non-matching LSN invalidate the
test.

How many clients are involved with your benchmark?

So as the size of the index increases, the benefits of the patch also tend
to increase. This is most likely because as the index becomes taller and
taller, the costs associated with index descent become higher.

FWIW, I think that int4 indexes have to be extremely large before they
exceed 4 levels (where the leaf level counts as a level). My
conservative back-of-an-envelope estimate is 9 billion index tuples.

--
Peter Geoghegan

#3 Andrey Borodin
amborodin@acm.org
In reply to: Pavan Deolasee (#1)
Re: Faster inserts with mostly-monotonically increasing values

Hi!

On 31 Dec 2017, at 11:44, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
On a per-session basis, we cache the last heap block used for inserts and try to use the same block for subsequent inserts.

+1, from my POV this is a good idea, and it's cool that it already has an
implementation.
I'd suggest not doing a 1/1 split for these pages, to keep the tree
shorter; a fillfactor/(1-fillfactor) split, say. Also, this will make
splits less frequent.

Best regards, Andrey Borodin.

#4 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Peter Geoghegan (#2)
Re: Faster inserts with mostly-monotonically increasing values

On Sun, Dec 31, 2017 at 4:36 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

Here is a patch that implements the idea. If the last insert happens to be
in the rightmost block of an index, then we cache the block and check that
first for the next insert. For the cached block to be usable for the
insert, it should still be the rightmost, the to-be-inserted key should go
into the cached block, and there should be enough free space in the block.
If any of these conditions are not met, we fall back to the regular code
path, i.e. descending the index from the top.

I'd be particularly concerned about page deletion/page recycling still
being correct with the patch, especially because it's hard to test
anything there. The P_ISLEAF() part of your fastpath's test hints at
this -- why should it not *always* be a leaf page (surely you should
be sure that the page cannot be recycled anyway)?

I thought we could throw in a bunch of checks there to ensure that we are
still looking at the correct rightmost page, or else fall back to the
regular path. Do you see something missing there? I don't claim that I've
got everything correct, but since we're holding an exclusive lock on the
page at that point, wouldn't we be able to guard against any concurrent
splits/deletions? I will reread the documentation/code again, but I must
admit, it's not particularly easy to understand all the concurrency issues.

I also have my
doubts about unique index enforcement remaining correct with the patch
when there are many physical duplicates, to the extent that more than
a single leaf page is needed for a single value.

Why would that happen? We are checking if the to-be-inserted key is
strictly greater than the last key in the rightmost block and only then
proceeding with the fast path. And we do this while holding an exclusive
lock on the page. Are you suggesting that someone can concurrently still
split the page?

Maybe you should do something with page LSN here -- cache LSN
alongside block number, and have a non-matching LSN invalidate the
test.

This is a good idea, but we might need to track the block/LSN in shared
memory. Otherwise, with multiple backends doing inserts, we might never be
able to match LSNs (i.e. they will mostly vary, thus falling back to the
slow path the majority of the time).

How many clients are involved with your benchmark?

Just a single client for these benchmarks.

So as the size of the index increases, the benefits of the patch also tend
to increase. This is most likely because as the index becomes taller and
taller, the costs associated with index descent become higher.

FWIW, I think that int4 indexes have to be extremely large before they
exceed 4 levels (where the leaf level counts as a level). My
conservative back-of-an-envelope estimate is 9 billion index tuples.

Hmm, ok. I am not entirely sure whether it's the depth itself, or just the
binary searches through 3-4 index pages and/or the buffer pinning and
unpinning, that result in much of the overhead. I will run some more tests
and collect evidence.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#5 Tels
nospam-pg-abuse@bloodgate.com
In reply to: Pavan Deolasee (#4)
Re: Faster inserts with mostly-monotonically increasing values

Moin,

On Tue, January 2, 2018 7:51 am, Pavan Deolasee wrote:

On Sun, Dec 31, 2017 at 4:36 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

Here is a patch that implements the idea. If the last insert happens to be
in the rightmost block of an index, then we cache the block and check that
first for the next insert. For the cached block to be usable for the
insert, it should still be the rightmost, the to-be-inserted key should go
into the cached block, and there should be enough free space in the block.
If any of these conditions are not met, we fall back to the regular code
path, i.e. descending the index from the top.

[snip]

So as the size of the index increases, the benefits of the patch also tend
to increase. This is most likely because as the index becomes taller and
taller, the costs associated with index descent become higher.

FWIW, I think that int4 indexes have to be extremely large before they
exceed 4 levels (where the leaf level counts as a level). My
conservative back-of-an-envelope estimate is 9 billion index tuples.

Hmm, ok. I am not entirely sure whether it's the depth itself, or just the
binary searches through 3-4 index pages and/or the buffer pinning and
unpinning, that result in much of the overhead. I will run some more tests
and collect evidence.

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing values, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height on every insert and create a "tall stack"?

@Peter: Could you please share your back-of-the-envelope calculation, I'd
love to get some insights into the innards.

All the best,

Tels

#6 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tels (#5)
Re: Faster inserts with mostly-monotonically increasing values

On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:

Moin,

Hmm, ok. I am not entirely sure whether it's the depth itself, or just the
binary searches through 3-4 index pages and/or the buffer pinning and
unpinning, that result in much of the overhead. I will run some more tests
and collect evidence.

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing values, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height on every insert and create a "tall stack"?

AFAIK all pages will be half-filled in this case. Imagine that one index
page can accommodate N entries; the page will be split when the (N+1)th
entry is added. The left page will have half the entries and the right page
will get the remainder. The next split will happen when the right page is
full, i.e. when another N/2 entries are added. Thus there will be a split
every N/2 inserts, creating an index with half-filled pages.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Pavan Deolasee (#6)
Re: Faster inserts with mostly-monotonically increasing values

Pavan Deolasee wrote:

On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing values, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height on every insert and create a "tall stack"?

AFAIK all pages will be half-filled in this case. Imagine that one index
page can accommodate N entries; the page will be split when the (N+1)th
entry is added. The left page will have half the entries and the right page
will get the remainder. The next split will happen when the right page is
full, i.e. when another N/2 entries are added. Thus there will be a split
every N/2 inserts, creating an index with half-filled pages.

Not really -- quoth _bt_findsplitloc():

* If the page is the rightmost page on its level, we instead try to arrange
* to leave the left split page fillfactor% full. In this way, when we are
* inserting successively increasing keys (consider sequences, timestamps,
* etc) we will end up with a tree whose pages are about fillfactor% full,
* instead of the 50% full result that we'd get without this special case.

To answer Tels question: the tree is balanced by splitting pages and
propagating the splits upwards to the parent pages, all the way up to
the root. When the root page gets full, it is also split and a new root
is created on top of the old root plus its new sibling page, which is
the only point at which the tree grows in depth.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8 Tels
nospam-pg-abuse@bloodgate.com
In reply to: Alvaro Herrera (#7)
Re: Faster inserts with mostly-monotonically increasing values

Hello Alvaro,

On Thu, January 4, 2018 7:35 am, Alvaro Herrera wrote:

Pavan Deolasee wrote:

On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing values, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height on every insert and create a "tall stack"?

AFAIK all pages will be half-filled in this case. Imagine that one index
page can accommodate N entries; the page will be split when the (N+1)th
entry is added. The left page will have half the entries and the right page
will get the remainder. The next split will happen when the right page is
full, i.e. when another N/2 entries are added. Thus there will be a split
every N/2 inserts, creating an index with half-filled pages.

Not really -- quoth _bt_findsplitloc():

* If the page is the rightmost page on its level, we instead try to arrange
* to leave the left split page fillfactor% full. In this way, when we are
* inserting successively increasing keys (consider sequences, timestamps,
* etc) we will end up with a tree whose pages are about fillfactor% full,
* instead of the 50% full result that we'd get without this special case.

To answer Tels question: the tree is balanced by splitting pages and
propagating the splits upwards to the parent pages, all the way up to
the root. When the root page gets full, it is also split and a new root
is created on top of the old root plus its new sibling page, which is
the only point at which the tree grows in depth.

Thank you for the explanation! You learn something new every time :)

All the best,

Tels

#9 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Alvaro Herrera (#7)
Re: Faster inserts with mostly-monotonically increasing values

On Thu, Jan 4, 2018 at 6:05 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Pavan Deolasee wrote:

On Tue, Jan 2, 2018 at 7:15 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:

Just a question trying to understand how btree indexes work:

If one inserts ever-increasing values, is the tree a balanced tree with a
minimum (or at least not-as-high) number of levels, or does it increase in
height on every insert and create a "tall stack"?

AFAIK all pages will be half-filled in this case. Imagine that one index
page can accommodate N entries; the page will be split when the (N+1)th
entry is added. The left page will have half the entries and the right page
will get the remainder. The next split will happen when the right page is
full, i.e. when another N/2 entries are added. Thus there will be a split
every N/2 inserts, creating an index with half-filled pages.

Not really -- quoth _bt_findsplitloc():

* If the page is the rightmost page on its level, we instead try to arrange
* to leave the left split page fillfactor% full. In this way, when we are
* inserting successively increasing keys (consider sequences, timestamps,
* etc) we will end up with a tree whose pages are about fillfactor% full,
* instead of the 50% full result that we'd get without this special case.

Ah ok. Thanks for pointing that out. Makes a lot more sense.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10 Simon Riggs
simon@2ndQuadrant.com
In reply to: Peter Geoghegan (#2)
Re: Faster inserts with mostly-monotonically increasing values

On 31 December 2017 at 11:06, Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Dec 31, 2017 at 6:44 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

Here is a patch that implements the idea. If the last insert happens to be
in the rightmost block of an index, then we cache the block and check that
first for the next insert. For the cached block to be usable for the
insert, it should still be the rightmost, the to-be-inserted key should go
into the cached block, and there should be enough free space in the block.
If any of these conditions are not met, we fall back to the regular code
path, i.e. descending the index from the top.

I'd be particularly concerned about page deletion/page recycling still
being correct with the patch, especially because it's hard to test
anything there. The P_ISLEAF() part of your fastpath's test hints at
this -- why should it not *always* be a leaf page (surely you should
be sure that the page cannot be recycled anyway)?

I don't see any issues related to these points.

We wouldn't be inserting into a deleted/recycled page, so no issues to
discuss. We're not allocating a new page, so no recycling issues.

The block is tested for whether it is an incomplete split and also
locked, so it cannot split while locked.

I also have my
doubts about unique index enforcement remaining correct with the patch
when there are many physical duplicates, to the extent that more than
a single leaf page is needed for a single value.

This isn't a problem either.

The patch only uses the cache when the value is higher than the
rightmost value. So by definition there are no duplicates in that
case.

Page splitting preserves the ordering property, so we are certain that
the rightmost block has the highest value.

We do need to check that the index is sorted ASC rather than DESC,
possibly with special action for NULL sorting.

Maybe you should do something with page LSN here -- cache LSN
alongside block number, and have a non-matching LSN invalidate the
test.

Don't think we need that.

How many clients are involved with your benchmark?

Just one, since it is demonstrating the reduced path length of doing that.

So as the size of the index increases, the benefits of the patch also tend
to increase. This is most likely because as the index becomes taller and
taller, the costs associated with index descent become higher.

FWIW, I think that int4 indexes have to be extremely large before they
exceed 4 levels (where the leaf level counts as a level). My
conservative back-of-an-envelope estimate is 9 billion index tuples.

If they did, the gains would be even bigger.

But as you say, a unique index with many duplicates could be a
problem, and index fragmentation would make the situation worse.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11 Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#2)
Re: Faster inserts with mostly-monotonically increasing values

On Sun, Dec 31, 2017 at 8:06 AM, Peter Geoghegan <pg@bowt.ie> wrote:

I also have my
doubts about unique index enforcement remaining correct with the patch
when there are many physical duplicates, to the extent that more than
a single leaf page is needed for a single value.

given...

+        if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
+            !P_INCOMPLETE_SPLIT(lpageop) &&
+            !P_IGNORE(lpageop) &&
+            (PageGetFreeSpace(page) > itemsz) &&
+            _bt_compare(rel, natts, itup_scankey, page,
+                        PageGetMaxOffsetNumber(page)) > 0)

The key there is strictly greater than the rightmost key. As such, it
must precede the first page with an equal key, and since it's the
rightmost page... there's no such key. But if there was, the check for
uniqueness only needs the insert point to point to the first such
page, and this guarantees it.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

That would allow this case to be applied not only to perfectly
monotonic sequences, but for nearly monotonic ones as well (think
timestamps).

#12 Peter Geoghegan
pg@bowt.ie
In reply to: Claudio Freire (#11)
Re: Faster inserts with mostly-monotonically increasing values

On Thu, Mar 1, 2018 at 3:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

given...

+        if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
+            !P_INCOMPLETE_SPLIT(lpageop) &&
+            !P_IGNORE(lpageop) &&
+            (PageGetFreeSpace(page) > itemsz) &&
+            _bt_compare(rel, natts, itup_scankey, page,
+                        PageGetMaxOffsetNumber(page)) > 0)

One simple problem with this code is that it assumes that there must
be at least one item on a leaf page, which isn't guaranteed. There has
to be a highkey on non-rightmost pages, but rightmost leaf pages can be
totally empty. While btvacuumpage() is very unlikely to not be able to
begin deleting the page there and then, it can happen (see remarks at
the end of btvacuumpage() about recursing).

Other issues spotted:

* The variable name "lpageop" seems like it was lifted from somewhere
dealing with a page to the left of some other page, which is not the
case here.

* P_INCOMPLETE_SPLIT() checks a bit that can only be set on a
non-rightmost page -- a page that has a sibling to its right that
doesn't have a downlink in parent. The bit is only ever set on the
"target" of a (still unfinished) page split. This is why
_bt_moveright() doesn't bother with completing a page split when it
reaches a rightmost page. I don't see why you need this part of the
test at all.

The key there is strictly greater than the rightmost key. As such, it
must precede the first page with an equal key, and since it's the
rightmost page... there's no such key. But if there was, the check for
uniqueness only needs the insert point to point to the first such
page, and this guarantees it.

This code assumes that it can use block number as a stable way of
finding the same point in the B-Tree over time, without any interlock.
That may actually work out in practice, since even if the page is
recycled there can only ever be one rightmost leaf page, but it seems
like a brittle approach to me. The page image may be recycled by the
FSM already, and we really shouldn't be depending on the page not
looking a particular way once that has happened. I guess that you can
say the same thing about using page LSN to determine cached block
staleness instead, but that still seems a lot safer.

Basically, I'm worried that we could do something entirely
unpredictable when a page has actually been recycled, since we're
entirely opting out of the RecentGlobalXmin interlock business on the
assumption that we can be sure that recycling hasn't occurred in some
other way. Imagine what could happen if we ever teach the FSM to share
freespace among small relations, which seems quite possible to me.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

The whole way that this patch opts out of using an exclusive buffer
lock on the "first page the value could be on" (the _bt_check_unique()
+ _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
terribly concrete.

--
Peter Geoghegan

#13 Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#12)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 9:12 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Mar 1, 2018 at 3:18 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

The key there is strictly greater than the rightmost key. As such, it
must precede the first page with an equal key, and since it's the
rightmost page... there's no such key. But if there was, the check for
uniqueness only needs the insert point to point to the first such
page, and this guarantees it.

This code assumes that it can use block number as a stable way of
finding the same point in the B-Tree over time, without any interlock.
That may actually work out in practice, since even if the page is
recycled there can only ever be one rightmost leaf page, but it seems
like a brittle approach to me. The page image may be recycled by the
FSM already, and we really shouldn't be depending on the page not
looking a particular way once that has happened. I guess that you can
say the same thing about using page LSN to determine cached block
staleness instead, but that still seems a lot safer.

Basically, I'm worried that we could do something entirely
unpredictable when a page has actually been recycled, since we're
entirely opting out of the RecentGlobalXmin interlock business on the
assumption that we can be sure that recycling hasn't occurred in some
other way. Imagine what could happen if we ever teach the FSM to share
freespace among small relations, which seems quite possible to me.

Correct me if I'm wrong, but there's lots of code in nbtree already
that risks reading recycled pages for various reasons. Presumably,
checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
!P_IGNORE implies.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

The whole way that this patch opts out of using an exclusive buffer
lock on the "first page the value could be on" (the _bt_check_unique()
+ _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
terribly concrete.

Assuming the rightmost page is the first page the value could be on,
it already does get an exclusive buffer lock.

And it seems that assumption is correct in the patch as-is. In fact,
the current patch checks for a much stronger situation than needed to
apply this optimization, so it can even skip checking for conflicting
concurrent transactions. With an exclusive lock on the buffer, and
being the rightmost page, it means there can be no conflicting key
since the check is that the scan key is strictly greater than the
rightmost key (lets forget for a moment empty rightmost pages). Unless
there can be a new rightmost page in construction somehow (which I
don't believe it can happen, new rightmost pages would be created by
splitting the rightmost page, and that would conflict with the
exclusive lock), I don't see how this can fail.

If one wanted to relax the condition as I proposed, that uniqueness
check is necessary, so that's why I propose short-circuiting the search
only, and not the actual insertion on the page. I believe it's more
important to try to maximize the conditions under which the
optimization can be applied.

#14 Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#13)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 9:37 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

Assuming the rightmost page is the first page the value could be on,
it already does get an exclusive buffer lock.

That made me check, and:

+        /*
+         * Acquire exclusive lock on the buffer before doing any checks. This
+         * ensures that the index state cannot change, as far as the rightmost
+         * part of the index is concerned.
+         */
+        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

BTree code uses BT_READ and BT_WRITE instead, so that should be:

LockBuffer(buf, BT_WRITE)

#15 Peter Geoghegan
pg@bowt.ie
In reply to: Claudio Freire (#13)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 4:37 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

Correct me if I'm wrong, but there's lots of code in nbtree already
that risks reading recycled pages for various reasons. Presumably,
checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
!P_IGNORE implies.

You're mistaken. Read the nbtree README on page deletion, and the
RecentGlobalXmin interlock. Note also that there are 3 distinct page
states in play as far as deletion/recycling is concerned:

1. The first phase of deletion
2. The second phase of deletion.
3. Post-recycling, when in general the page could look like just about anything.

It's page deletion's job to make sure that an index scan cannot land
on a page after it was recycled. The corollary is that it is not
everyone else's job to make sure that that didn't happen -- how could
that possibly work?

Quite a few places know a bit about, and must reason about, the first
phase of deletion; a few know about the second, but that's it.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

The whole way that this patch opts out of using an exclusive buffer
lock on the "first page the value could be on" (the _bt_check_unique()
+ _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
terribly concrete.

Assuming the rightmost page is the first page the value could be on,
it already does get an exclusive buffer lock.

This can be quite fiddly. The downlink in the parent can be equal to
the value in the target page, meaning that we actually lock the
target's left sibling (see _bt_binsrch() special case for internal
pages).

I didn't say that the patch is necessarily wrong here. Just that it
makes me nervous.

If one wanted to relax the condition as I proposed, that uniqueness
check is necessary, so that's why I propose shortcircuiting the search
only, and not the actual insertion on the page. I believe IMO it's
more important to try and maximize the conditions under which the
optimization can be applied.

I didn't get what the point of checking the first item on the page
(your proposal) was.

--
Peter Geoghegan

#16 Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#15)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 9:52 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Mar 5, 2018 at 4:37 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

Correct me if I'm wrong, but there's lots of code in nbtree already
that risks reading recycled pages for various reasons. Presumably,
checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
!P_IGNORE implies.

You're mistaken. Read the nbtree README on page deletion, and the
RecentGlobalXmin interlock. Note also that there are 3 distinct page
states in play as far as deletion/recycling is concerned:

1. The first phase of deletion
2. The second phase of deletion.
3. Post-recycling, when in general the page could look like just about anything.

It's page deletion's job to make sure that an index scan cannot land
on a page after it was recycled. The corollary is that it is not
everyone else's job to make sure that that didn't happen -- how could
that possibly work?

From what I read, both phases 1 & 2 are covered by the !P_IGNORE check.

For the third phase:

A deleted page can only be reclaimed once there is no scan or search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed.

That's because:

A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
parent, or scans moving right or left from one of the siblings). These
processes must observe that the page is marked dead and recover
accordingly. Searches and forward scans simply follow the right-link
until they find a non-dead page --- this will be where the deleted page's
key-space moved to.

But in this case there's no right link to follow, so it's a non-issue.

BTree doesn't truncate AFAIK, so the cached block number can't be
nonexistent (beyond the end of the relation). It's probably fragile to
assume BTree will never truncate, so maybe it would be wise to check
for that. But if the block exists, I believe it contains a page that
can be safely interpreted by that condition. As long as there can be
no other pages with the same flags on the index while the cached page
is write-locked, that code will be safe.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

The whole way that this patch opts out of using an exclusive buffer
lock on the "first page the value could be on" (the _bt_check_unique()
+ _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
terribly concrete.

Assuming the rightmost page is the first page the value could be on,
it already does get an exclusive buffer lock.

This can be quite fiddly. The downlink in the parent can be equal to
the value in the target page, meaning that we actually lock the
target's left sibling (see _bt_binsrch() special case for internal
pages).

I'm not sure which special case you're referring to; I don't see
anything relevant in _bt_binsrch.

Yes, if the scankey is equal to the leftmost key, the first occurrence
of that key could be on the left, so the check would have to be for
strictly greater. But that's about as complex as it would get.

I didn't say that the patch is necessarily wrong here. Just that it
makes me nervous.

Any change to btree would make anyone nervous ;-)

If one wanted to relax the condition as I proposed, that uniqueness
check is necessary, so that's why I propose short-circuiting the search
only, and not the actual insertion on the page. IMO it's more
important to try and maximize the conditions under which the
optimization can be applied.

I didn't get what the point of checking the first item on the page
(your proposal) was.

Right now, if you've got concurrent transactions, there's absolutely
no chance this optimization would kick in. For it to happen,
insertions on the index need to happen in order, each insertion a key
greater than any other key (visible or not) on the index.

Concurrent inserts on PKs are unlikely to arrive in order: a slight
disorder is to be expected, as serial sequence numbers get generated a
significant amount of time before the insert actually takes place.
Different backends can reach the index insertion at different times,
causing unordered inserts.

I believe PKs are a prime candidate for this optimization, and
expecting it to apply only when no concurrency is involved is severely
dumbing down the optimization.

Consider 3 transactions on a table like:

CREATE TABLE test ( id serial, d integer );

T1: INSERT INTO test (d) VALUES (3), (4), (5), (6);
T2: INSERT INTO test (d) SELECT generate_series(1,10);
T3: INSERT INTO test (d) VALUES (1);

Each transaction will take values from the sequence at some point of
the evaluation, far from actual insertion. This could cause the
following insertions:

-- (id, d)
T1: (5, 3)
T2: (10, 1)
T2: (11, 2)
T1: (6, 4)
T3: (1, 1)
... etc

No insert would be optimized in this case, and I don't think it's an
uncommon case.

So what I propose is that the optimization only needs to check whether
"the rightmost page" is the right page to insert on. And for that
check, all you have to assert is that:

* The page is actually the currently valid rightmost leaf
* The **first** (data) key on that page is strictly less than the scan key

If both are true, the insertion point will be somewhere on that page.
Maybe not at the end, maybe at the middle, but somewhere in that page,
no need to go looking any further in the b-tree.

If the page is empty, the optimization can't be applied, simple as
that. That check is missing from the patch, true enough.
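The two-part check proposed above can be sketched in isolation. The
following is a hedged, self-contained model: the struct, its fields, and
the `fastpath_ok` helper are all invented for illustration, not the
patch's code; the real conditions would be expressed with nbtree's
P_RIGHTMOST/P_ISLEAF/P_IGNORE machinery under a write lock.

```c
#include <stdbool.h>

/* Simplified stand-in for a btree leaf page; the real nbtree page
 * header and flag macros are far more involved. This only models
 * the conditions discussed above. */
typedef struct
{
    bool is_rightmost;  /* no right sibling (cf. P_RIGHTMOST) */
    bool is_leaf;       /* leaf page (cf. P_ISLEAF) */
    bool ignore;        /* deleted or half-dead (cf. P_IGNORE) */
    int  nkeys;         /* number of data items on the page */
    int  keys[16];      /* item keys, in sorted order */
    int  free_slots;    /* room left on the page */
} FakeLeafPage;

/* Hypothetical fast-path test: can `newkey` be inserted on the cached
 * rightmost page without descending the tree?  Under the relaxed
 * condition, the *first* key need only be strictly less than the new
 * key; the insert may then land anywhere on the page. */
static bool
fastpath_ok(const FakeLeafPage *page, int newkey)
{
    if (!page->is_rightmost || !page->is_leaf || page->ignore)
        return false;
    if (page->nkeys == 0)       /* empty page: fall back to descent */
        return false;
    if (page->free_slots <= 0)  /* no room: fall back to descent */
        return false;
    return page->keys[0] < newkey;  /* strictly greater than first key */
}
```

Note that a key falling in the middle of the page still passes, which
is exactly what lets slightly out-of-order serial values keep using the
fast path.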

#17Peter Geoghegan
pg@bowt.ie
In reply to: Claudio Freire (#16)
Re: Faster inserts with mostly-monotonically increasing values
On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

From what I read, both phases 1 & 2 are served by the !P_IGNORE check.

True.

For the third phase:

A deleted page can only be reclaimed once there is no scan or search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed.

That's because:

A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
parent, or scans moving right or left from one of the siblings). These
processes must observe that the page is marked dead and recover
accordingly. Searches and forward scans simply follow the right-link
until they find a non-dead page --- this will be where the deleted page's
key-space moved to.

But in this case there's no right link to follow, so it's a non-issue.

BTree doesn't truncate AFAIK, so the cached block number can't be
nonexistent (beyond the end of the relation). It's probably fragile to
assume BTree will never truncate, so maybe it would be wise to check
for that. But if the block exists, I believe it contains a page that
can be safely interpreted by that condition. As long as there can be
no other pages with the same flags on the index while the cached page
is write-locked, that code will be safe.

Introducing any case that allows us to land on a recycled page, and
reason that it must at least not be the page we were looking for based
on *any* criteria about the page itself seems very brittle. Yeah, it
probably won't break in practice, but it's a bad design.

Yes, if the scankey is equal to the leftmost key, the first occurrence
of that key could be on the left, so the check would have to be for
strictly greater. But that's about as complex as it would get.

That's pretty complicated.

I didn't say that the patch is necessarily wrong here. Just that it
makes me nervous.

Any change to btree would make anyone nervous ;-)

True. Anyway, this isn't the thing that I'm most concerned about right now.

I didn't get what the point of checking the first item on the page
(your proposal) was.

Right now, if you've got concurrent transactions, there's absolutely
no chance this optimization would kick in. For it to happen,
insertions on the index need to happen in order, each insertion a key
greater than any other key (visible or not) on the index.

Concurrent inserts on PKs are unlikely to arrive in order: a slight
disorder is to be expected, as serial sequence numbers get generated a
significant amount of time before the insert actually takes place.
Different backends can reach the index insertion at different times,
causing unordered inserts.

I believe PKs are a prime candidate for this optimization, and
expecting it to apply only when no concurrency is involved is severely
dumbing down the optimization.

Pavan justified the patch using a benchmark that only involved a
single client -- hardly typical for a patch that changes the B-Tree
code. If the benefits with many clients can be shown to matter, that
will make this much more interesting to me.

--
Peter Geoghegan

#18Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#17)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 10:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

But in this case there's no right link to follow, so it's a non-issue.

BTree doesn't truncate AFAIK, so the cached block number can't be
nonexistent (beyond the end of the relation). It's probably fragile to
assume BTree will never truncate, so maybe it would be wise to check
for that. But if the block exists, I believe it contains a page that
can be safely interpreted by that condition. As long as there can be
no other pages with the same flags on the index while the cached page
is write-locked, that code will be safe.

Introducing any case that allows us to land on a recycled page, and
reason that it must at least not be the page we were looking for based
on *any* criteria about the page itself seems very brittle. Yeah, it
probably won't break in practice, but it's a bad design.

How is this any different from btvacuumscan?

I don't think it can be argued that somehow btvacuumscan has
permission to be brittle and the rest of the code doesn't.

Point is, it's not brittle. The on-disk format of the tree is such
that any block can be catalogued by inspecting its header;
btvacuumscan depends on that. Questions that can be answered by looking
at a page header alone are:

* Is it deleted or new?
* Is it a leaf?
* Is it rightmost?

Only question remaining is whether it's the *only* rightmost leaf, and
that can be guaranteed by locking.

So, can a flag check result in a random outcome? No. That would also
cause a random outcome for btvacuumscan and then vacuum would corrupt
the index just as well.
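As a hedged illustration of answering those three questions from the
header alone, here is a toy model. The flag bits, names, and `toy_`
helpers are simplified inventions that only loosely follow what nbtree
keeps in its page opaque area (where, notably, "rightmost" is inferred
from a zero right-link rather than from a dedicated flag):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy page "opaque" area: a simplified stand-in for nbtree's real
 * special space.  Flag values here are made up for illustration. */
#define TOY_BTP_LEAF      (1 << 0)  /* page is a leaf */
#define TOY_BTP_DELETED   (1 << 1)  /* second phase of deletion done */
#define TOY_BTP_HALF_DEAD (1 << 2)  /* first phase of deletion done */
#define TOY_P_NONE        0         /* "no sibling" block number */

typedef struct
{
    uint32_t btpo_next;             /* right sibling, or TOY_P_NONE */
    uint16_t btpo_flags;
} ToyPageOpaque;

/* Is it deleted or half-dead, i.e. should it be ignored? */
static bool
toy_ignore(const ToyPageOpaque *o)
{
    return (o->btpo_flags & (TOY_BTP_DELETED | TOY_BTP_HALF_DEAD)) != 0;
}

/* Is it a leaf? */
static bool
toy_isleaf(const ToyPageOpaque *o)
{
    return (o->btpo_flags & TOY_BTP_LEAF) != 0;
}

/* Is it rightmost, i.e. does it have no right sibling? */
static bool
toy_rightmost(const ToyPageOpaque *o)
{
    return o->btpo_next == TOY_P_NONE;
}
```

The one question this cannot answer is whether the page is the *only*
live rightmost leaf for the duration of the insert, which is what the
write lock is for.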

And now that we mention this... why is the code not using _bt_getbuf?
It already does quite a bit of sanity checking that is IMO quite
desirable here.

#19Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Peter Geoghegan (#17)
Re: Faster inserts with mostly-monotonically increasing values

On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfreire@gmail.com>
wrote:

I believe PKs are a prime candidate for this optimization, and
expecting it to apply only when no concurrency is involved is severely
dumbing down the optimization.

Pavan justified the patch using a benchmark that only involved a
single client -- hardly typical for a patch that changes the B-Tree
code. If the benefits with many clients can be shown to matter, that
will make this much more interesting to me.

Ok. I will repeat those tests with a larger number of clients and report back.

Regarding your suggestion about using the page LSN to detect intermediate
activity, my concern is that unless we store that value in shared memory,
concurrent backends, even when inserting values in order, will invalidate
each other's backend-local cached page LSN and the optimisation will not
kick in.

I am yet to digest the entire conversation between you and Claudio; you
guys clearly understand b-tree internals better than me. It seems while
you're worried about missing out on something, Claudio feels that we can
find a safe way just looking at the information available in the current
page. I feel the same way, but will need to re-read the discussion
carefully again.

Simon had raised concerns about DESC indexes and whether we need to do the
checks for the leftmost page in that case. I haven't yet figured out whether
DESC indexes are actually stored in reverse order. I am going to look at
that too.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In reply to: Claudio Freire (#18)
Re: Faster inserts with mostly-monotonically increasing values

On Mon, Mar 5, 2018 at 7:10 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

Introducing any case that allows us to land on a recycled page, and
reason that it must at least not be the page we were looking for based
on *any* criteria about the page itself seems very brittle. Yeah, it
probably won't break in practice, but it's a bad design.

How is this any different from btvacuumscan?

I don't think it can be argued that somehow btvacuumscan has
permission to be brittle and the rest of the code doesn't.

VACUUM doesn't have to worry about concurrent page recycling because
it is already the only thing that performs page deletion. It's already
the process that has the exclusive right to give index pages back to
the FSM.

--
Peter Geoghegan

#21Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#20)
#22Simon Riggs
simon@2ndQuadrant.com
In reply to: Pavan Deolasee (#19)
#23Claudio Freire
klaussfreire@gmail.com
In reply to: Simon Riggs (#22)
#24Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Pavan Deolasee (#19)
#25Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#24)
#26Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#25)
#27Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#26)
#28Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#27)
#29Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#28)
#30Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#29)
#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Pavan Deolasee (#28)
#32Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Simon Riggs (#31)
#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Pavan Deolasee (#32)
#34Claudio Freire
klaussfreire@gmail.com
In reply to: Simon Riggs (#33)
#35Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Simon Riggs (#33)
#36Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#28)
#37Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#36)
#38Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#37)
#39Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#38)
#40Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#39)
#41Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#40)
#42Claudio Freire
klaussfreire@gmail.com
In reply to: Pavan Deolasee (#41)
#43Andrew Dunstan
andrew@dunslane.net
In reply to: Claudio Freire (#42)
#44Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Andrew Dunstan (#43)
#45Andrew Dunstan
andrew@dunslane.net
In reply to: Pavan Deolasee (#44)
#46Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Andrew Dunstan (#45)
#47Andrew Dunstan
andrew@dunslane.net
In reply to: Pavan Deolasee (#46)
#48Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrew Dunstan (#47)
#49Claudio Freire
klaussfreire@gmail.com
In reply to: Heikki Linnakangas (#48)
#50Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Claudio Freire (#49)