btree: downlink right separator/HIKEY optimization

Started by Matthias van de Meentabout 2 years ago8 messages

boekewurm+postgres@gmail.com

about 2 years ago

1 attachment(s)

(now really to -hackers)
Hi,

Over at [0]/messages/by-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8snuw@mail.gmail.com I'd implemented an optimization that allows us to skip
calling _bt_compare in _bt_moveright in many common cases. This patch,
when stacked on top of the prefix truncation patch, improves INSERT
performance by an additional 2-9%pt, with an extreme case of 45% in
the worscase index tests at [0]/messages/by-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8snuw@mail.gmail.com.

The optimization is that we now recognze that our page split algorithm
all but guarantees that the HIKEY matches this page's downlink's right
separator key bytewise, excluding the data stored in the
IndexTupleData struct.

By caching the right separator index tuple in _bt_search, we can
compare the downlink's right separator and the HIKEY, and when they
are equal (memcmp() == 0) we don't have to call _bt_compare - the
HIKEY is known to be larger than the scan key, because our key is
smaller than the right separator, and thus transitively also smaller
than the HIKEY because it contains the same data. As _bt_compare can
call expensive user-provided functions, this can be a large
performance boon, especially when there are only a small number of
column getting compared on each page (e.g. index tuples of many 100s
of bytes, or dynamic prefix truncation is enabled).

By adding this, the number of _bt_compare calls per _bt_search is
often reduced by one per btree level.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

PS. Best served with dynamic prefix truncation [1]/messages/by-id/CAEze2Wh-h20DmPSMXp4qHR0-ykh9=Z3ejX8MSsbikbOqaYe_OQ@mail.gmail.com and btree specialization [0]/messages/by-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8snuw@mail.gmail.com.

[0]: /messages/by-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8snuw@mail.gmail.com
[1]: /messages/by-id/CAEze2Wh-h20DmPSMXp4qHR0-ykh9=Z3ejX8MSsbikbOqaYe_OQ@mail.gmail.com

Attachments:

v1-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patchapplication/x-patch; name=v1-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patchDownload

From 55a2d06037f530b6d79bc73ed21bd27b78a1cc53 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 29 Oct 2023 21:39:23 +0100
Subject: [PATCH v1] btree: optimize _bt_moveright using downlink's right
 separator

Due to the inner workings of the btree, it is extremely likely that the
right separator of the btree page's downlink on the parent page matches
the downlinked page's HIKEY: Only when the page was split after we
accessed its parent page (and this split wasn't completed yet) the HIKEY
will not match.

So, instead of doing _bt_compare in _bt_moveright, we can store the
right separator key of the downlink, and check if it matches the HIKEY of
the linked page. If they match, we don't have to call the relatively
expensive _bt_compare, which allows us to reuse the _bt_compare result of
the right separator key.

This reduces the number of all-attribute _bt_compare operations
we need to do on a page by 1 flat, thus increasing performance
of indexes that have expensive compare operations.

In passing, move the declaration of _bt_moveright into nbtsearch.c - I
found no user of the function anywhere but in nbtsearch.c.
---
 src/backend/access/nbtree/README      | 20 +++++++++
 src/backend/access/nbtree/nbtsearch.c | 65 +++++++++++++++++++++++++--
 src/include/access/nbtree.h           |  3 --
 3 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..c75793da5a 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,26 @@ large groups of duplicates, maximizing space utilization.  Note also that
 deduplication more efficient.  Deduplication can be performed infrequently,
 without merging together existing posting list tuples too often.
 
+Notes about the right separator/HIKEY moveright optimization
+------------------------------------------------------------
+
+Page splits consistently cause the HIKEY of the split page to be inserted
+into the parent page as the right separator of the page's downlink (or,
+as the split page's new right sibling's left link), with the only
+difference between the HIKEY and the separator being the contents of the
+IndexTupleData struct of these tuples: the payloads are bit-identical.
+This allows us to reuse the _bt_compare result of the right separator key
+for the downlinked page's HIKEY, if we can determine that those are indeed
+bit-identical: Concurrent page splits and deletions may have caused the
+downlinked page to get a different HIKEY, which could have a different
+_bt_compare result.  To make this work, in _bt_search we cache the
+current downlink's right separator key, to then in the _bt_moveright
+phase of the layer below use memcmp() to validate our assumptions about
+the HIKEY matching the downlink's right separator key.  Only if the
+assumption is proven wrong (memcmp(HIKEY, right_sep) != 0), we call
+_bt_compare(), otherwise we can be certain that the parent page's result
+is unchanged, i.e. that _bt_compare would return "<".
+
 Notes about deduplication
 -------------------------
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index efc5284e5b..602a0f45e1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -27,6 +27,9 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+							Buffer buf, bool forupdate, BTStack stack,
+							int access, IndexTuple rightsep);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +101,16 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
+	union {
+		char	data[MAXALIGN_DOWN(BLCKSZ / 3)];
+		IndexTupleData tuple;
+		double	force_align_d;
+	}			rsepbuf
+#ifdef pg_attribute_aligned
+	pg_attribute_aligned(MAXIMUM_ALIGNOF)
+#endif
+	;
+	IndexTuple	rightsep = NULL;
 
 	/* heaprel must be set whenever _bt_allocbuf is reachable */
 	Assert(access == BT_READ || access == BT_WRITE);
@@ -134,7 +147,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * opportunity to finish splits of internal pages too.
 		 */
 		*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
-							  stack_in, page_access);
+							  stack_in, page_access, rightsep);
 
 		/* if this is a leaf page, we're done */
 		page = BufferGetPage(*bufP);
@@ -152,6 +165,25 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		child = BTreeTupleGetDownLink(itup);
 
+		if (offnum < PageGetMaxOffsetNumber(page))
+		{
+			ItemId	rightsepitem = PageGetItemId(page, offnum + 1);
+			IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
+			memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
+			rightsep = &rsepbuf.tuple;
+		}
+		else if (!P_RIGHTMOST(opaque))
+		{
+			/*
+			 * The rightmost data tuple on inner page has P_HIKEY as its right
+			 * separator.
+			 */
+			ItemId	rightsepitem = PageGetItemId(page, P_HIKEY);
+			IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
+			memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
+			rightsep = &rsepbuf.tuple;
+		}
+
 		/*
 		 * We need to save the location of the pivot tuple we chose in a new
 		 * stack entry for this page/level.  If caller ends up splitting a
@@ -194,7 +226,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * but before we acquired a write lock.  If it has, we may need to
 		 * move right to its new sibling.  Do that.
 		 */
-		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE);
+		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+							  rightsep);
 	}
 
 	return stack_in;
@@ -238,7 +271,8 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  bool forupdate,
 			  BTStack stack,
-			  int access)
+			  int access,
+			  IndexTuple rightsep)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -297,7 +331,30 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque))
+		{
+			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+			continue;
+		}
+
+		if (PointerIsValid(rightsep))
+		{
+			/*
+			 * Note: we're not in the rightmost page (see branchpoint earlier in
+			 * the loop), so we always have a HIKEY on this page.
+			 */
+			ItemId	hikeyid = PageGetItemId(page, P_HIKEY);
+			IndexTuple highkey = (IndexTuple) PageGetItem(page, hikeyid);
+
+			if (ItemIdGetLength(hikeyid) == IndexTupleSize(rightsep) &&
+				memcmp(&highkey[1], &rightsep[1],
+					   IndexTupleSize(rightsep) - sizeof(IndexTupleData)) == 0)
+			{
+				break;
+			}
+		}
+
+		if (_bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7bfbf3086c..fa7fc03362 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1237,9 +1237,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
  */
 extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
-							Buffer buf, bool forupdate, BTStack stack,
-							int access);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-- 
2.40.1

Import Notes

Reply to msg id not found: CAEze2Wh9b9iu0APbrpAcBFuVSfcAzgO665cObS+WZgn3tWD5VA@mail.gmail.comReference msg id not found: CAEze2Wh9b9iu0APbrpAcBFuVSfcAzgO665cObS+WZgn3tWD5VA@mail.gmail.com

Heikki Linnakangas

hlinnaka@iki.fi

about 2 years ago

In reply to: Matthias van de Meent (#1)

Re: btree: downlink right separator/HIKEY optimization

On 01/11/2023 00:08, Matthias van de Meent wrote:

calling _bt_compare in _bt_moveright in many common cases. This patch,
when stacked on top of the prefix truncation patch, improves INSERT
performance by an additional 2-9%pt, with an extreme case of 45% in
the worscase index tests at [0].

The optimization is that we now recognze that our page split algorithm
all but guarantees that the HIKEY matches this page's downlink's right
separator key bytewise, excluding the data stored in the
IndexTupleData struct.

Good observation.

By caching the right separator index tuple in _bt_search, we can
compare the downlink's right separator and the HIKEY, and when they
are equal (memcmp() == 0) we don't have to call _bt_compare - the
HIKEY is known to be larger than the scan key, because our key is
smaller than the right separator, and thus transitively also smaller
than the HIKEY because it contains the same data. As _bt_compare can
call expensive user-provided functions, this can be a large
performance boon, especially when there are only a small number of
column getting compared on each page (e.g. index tuples of many 100s
of bytes, or dynamic prefix truncation is enabled).

What would be the worst case scenario for this? One situation where the
memcmp() would not match is when there is a concurrent page split. I
think it's OK to pessimize that case. Are there any other situations?
When the memcmp() matches, I think this is almost certainly not slower
than calling the datatype's comparison function.

if (offnum < PageGetMaxOffsetNumber(page))
{
ItemId rightsepitem = PageGetItemId(page, offnum + 1);
IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
rightsep = &rsepbuf.tuple;
}
else if (!P_RIGHTMOST(opaque))
{
/*
* The rightmost data tuple on inner page has P_HIKEY as its right
* separator.
*/
ItemId rightsepitem = PageGetItemId(page, P_HIKEY);
IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
rightsep = &rsepbuf.tuple;
}

This could use a one-line comment above this, something like "Remember
the right separator of the downlink we follow, to speed up the next
_bt_moveright call".

Should there be an "else rightsep = NULL;" here? Is it possible that we
follow the non-rightmost downlink on a higher level and rightmost
downlink on next level? Concurrent page deletion?

Please update the comment above _bt_moveright to describe the new
argument. Perhaps the text from README should go there, this feels like
a detail specific to _bt_search and _bt_moveright.

--
Heikki Linnakangas
Neon (https://neon.tech)

vignesh C

vignesh21@gmail.com

about 2 years ago

In reply to: Matthias van de Meent (#1)

Re: btree: downlink right separator/HIKEY optimization

On Wed, 1 Nov 2023 at 03:38, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

(now really to -hackers)
Hi,

Over at [0] I'd implemented an optimization that allows us to skip
calling _bt_compare in _bt_moveright in many common cases. This patch,
when stacked on top of the prefix truncation patch, improves INSERT
performance by an additional 2-9%pt, with an extreme case of 45% in
the worscase index tests at [0].

The optimization is that we now recognze that our page split algorithm
all but guarantees that the HIKEY matches this page's downlink's right
separator key bytewise, excluding the data stored in the
IndexTupleData struct.

By caching the right separator index tuple in _bt_search, we can
compare the downlink's right separator and the HIKEY, and when they
are equal (memcmp() == 0) we don't have to call _bt_compare - the
HIKEY is known to be larger than the scan key, because our key is
smaller than the right separator, and thus transitively also smaller
than the HIKEY because it contains the same data. As _bt_compare can
call expensive user-provided functions, this can be a large
performance boon, especially when there are only a small number of
column getting compared on each page (e.g. index tuples of many 100s
of bytes, or dynamic prefix truncation is enabled).

By adding this, the number of _bt_compare calls per _bt_search is
often reduced by one per btree level.

CFBot shows the following compilation error at [1]https://cirrus-ci.com/task/4634619035779072:
[16:56:22.153] FAILED:
src/backend/postgres_lib.a.p/access_nbtree_nbtsearch.c.obj
[16:56:22.153] "cl" "-Isrc\backend\postgres_lib.a.p" "-Isrc\include"
"-I..\src\include" "-Ic:\openssl\1.1\include"
"-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc"
"/MDd" "/FIpostgres_pch.h" "/Yupostgres_pch.h"
"/Fpsrc\backend\postgres_lib.a.p\postgres_pch.pch" "/nologo"
"/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/DWIN32" "/DWINDOWS"
"/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE"
"/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101"
"/wd4102" "/wd4090" "/wd4267" "-DBUILDING_DLL" "/FS"
"/FdC:\cirrus\build\src\backend\postgres_lib.pdb"
/Fosrc/backend/postgres_lib.a.p/access_nbtree_nbtsearch.c.obj "/c"
../src/backend/access/nbtree/nbtsearch.c
[16:56:22.153] ../src/backend/access/nbtree/nbtsearch.c(112): error
C2143: syntax error: missing ';' before 'type'
[16:56:22.280] ../src/backend/access/nbtree/nbtsearch.c(112): warning
C4091: ' ': ignored on left of 'int' when no variable is declared

[1]: https://cirrus-ci.com/task/4634619035779072

Regards,
Vignesh

Matthias van de Meent

boekewurm+postgres@gmail.com

almost 2 years ago

In reply to: Heikki Linnakangas (#2)

1 attachment(s)

Re: btree: downlink right separator/HIKEY optimization

On Tue, 5 Dec 2023 at 08:43, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 01/11/2023 00:08, Matthias van de Meent wrote:

By caching the right separator index tuple in _bt_search, we can
compare the downlink's right separator and the HIKEY, and when they
are equal (memcmp() == 0) we don't have to call _bt_compare - the
HIKEY is known to be larger than the scan key, because our key is
smaller than the right separator, and thus transitively also smaller
than the HIKEY because it contains the same data. As _bt_compare can
call expensive user-provided functions, this can be a large
performance boon, especially when there are only a small number of
column getting compared on each page (e.g. index tuples of many 100s
of bytes, or dynamic prefix truncation is enabled).

What would be the worst case scenario for this? One situation where the
memcmp() would not match is when there is a concurrent page split. I
think it's OK to pessimize that case. Are there any other situations?

There is also concurrent page deletion which can cause downlinked
pages to get removed from the set of accessible pages, but that's
quite rare, too: arguably even more rare than page splits.

When the memcmp() matches, I think this is almost certainly not slower
than calling the datatype's comparison function.

if (offnum < PageGetMaxOffsetNumber(page))
[...]
else if (!P_RIGHTMOST(opaque))
[...]
}

This could use a one-line comment above this, something like "Remember
the right separator of the downlink we follow, to speed up the next
_bt_moveright call".

Done.

Should there be an "else rightsep = NULL;" here? Is it possible that we
follow the non-rightmost downlink on a higher level and rightmost
downlink on next level? Concurrent page deletion?

While possible, the worst this could do is be less efficient in those
fringe cases: The cached right separator is a key that is known to
compare larger than the search key and thus always correct to use as
an optimization for "is this HIKEY larger than my search key", as long
as we don't clobber the data in that cache (which we don't).
Null-ing the argument, while not incorrect, could be argued to be
worse than useless here, as the only case where NULL may match an
actual highkey is on the rightmost page, which we already special-case
in _bt_moveright before hitting the 'compare the highkey' code.
Removal of the value would thus remove any chance of using the
optimization after hitting the rightmost page in a layer below.

I've added a comment to explain this in an empty else block in the
attached version 2 of the patch.

Please update the comment above _bt_moveright to describe the new
argument. Perhaps the text from README should go there, this feels like
a detail specific to _bt_search and _bt_moveright.

Done.

Thank you for the review.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachments:

v2-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patchapplication/octet-stream; name=v2-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patchDownload

From 45be6783cd2748c2427ca50ae2b38236a9c60a48 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 29 Oct 2023 21:39:23 +0100
Subject: [PATCH v2] btree: optimize _bt_moveright using downlink's right
 separator

Due to the inner workings of the btree, it is extremely likely that the
right separator of the btree page's downlink on the parent page matches
the downlinked page's HIKEY: Only when the page was split after we
accessed its parent page (and this split wasn't completed yet) the HIKEY
will not match.

So, instead of doing _bt_compare in _bt_moveright, we can store the
right separator key of the downlink, and check if it matches the HIKEY of
the linked page. If they match, we don't have to call the relatively
expensive _bt_compare, which allows us to reuse the _bt_compare result of
the right separator key.

This reduces the number of all-attribute _bt_compare operations
we need to do on a page by 1 flat, thus increasing performance
of indexes that have expensive compare operations.

In passing, move the declaration of _bt_moveright into nbtsearch.c - I
found no user of the function anywhere but in nbtsearch.c.
---
 src/backend/access/nbtree/README      | 20 ++++++
 src/backend/access/nbtree/nbtsearch.c | 91 +++++++++++++++++++++++++--
 src/include/access/nbtree.h           |  3 -
 3 files changed, 107 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..c75793da5a 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,26 @@ large groups of duplicates, maximizing space utilization.  Note also that
 deduplication more efficient.  Deduplication can be performed infrequently,
 without merging together existing posting list tuples too often.
 
+Notes about the right separator/HIKEY moveright optimization
+------------------------------------------------------------
+
+Page splits consistently cause the HIKEY of the split page to be inserted
+into the parent page as the right separator of the page's downlink (or,
+as the split page's new right sibling's left link), with the only
+difference between the HIKEY and the separator being the contents of the
+IndexTupleData struct of these tuples: the payloads are bit-identical.
+This allows us to reuse the _bt_compare result of the right separator key
+for the downlinked page's HIKEY, if we can determine that those are indeed
+bit-identical: Concurrent page splits and deletions may have caused the
+downlinked page to get a different HIKEY, which could have a different
+_bt_compare result.  To make this work, in _bt_search we cache the
+current downlink's right separator key, to then in the _bt_moveright
+phase of the layer below use memcmp() to validate our assumptions about
+the HIKEY matching the downlink's right separator key.  Only if the
+assumption is proven wrong (memcmp(HIKEY, right_sep) != 0), we call
+_bt_compare(), otherwise we can be certain that the parent page's result
+is unchanged, i.e. that _bt_compare would return "<".
+
 Notes about deduplication
 -------------------------
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 63ee9ba225..65ea4f295c 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -27,6 +27,9 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+							Buffer buf, bool forupdate, BTStack stack,
+							int access, IndexTuple rightsep);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +101,16 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
+	union {
+		char	data[MAXALIGN_DOWN(BLCKSZ / 3)];
+		IndexTupleData tuple;
+		double	force_align_d;
+	}			rsepbuf
+#ifdef pg_attribute_aligned
+	pg_attribute_aligned(MAXIMUM_ALIGNOF)
+#endif
+	;
+	IndexTuple	rightsep = NULL;
 
 	/* heaprel must be set whenever _bt_allocbuf is reachable */
 	Assert(access == BT_READ || access == BT_WRITE);
@@ -134,7 +147,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * opportunity to finish splits of internal pages too.
 		 */
 		*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
-							  stack_in, page_access);
+							  stack_in, page_access, rightsep);
 
 		/* if this is a leaf page, we're done */
 		page = BufferGetPage(*bufP);
@@ -152,6 +165,44 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		child = BTreeTupleGetDownLink(itup);
 
+		/*
+		 * Remember the right separator to speed up _moveright calls in the
+		 * next iteration.
+		 */
+		if (offnum < PageGetMaxOffsetNumber(page))
+		{
+			ItemId	rightsepitem = PageGetItemId(page, offnum + 1);
+			IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
+			memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
+			rightsep = &rsepbuf.tuple;
+		}
+		else if (!P_RIGHTMOST(opaque))
+		{
+			/*
+			 * The rightmost data tuple on inner page has P_HIKEY as its right
+			 * separator.
+			 */
+			ItemId	rightsepitem = PageGetItemId(page, P_HIKEY);
+			IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
+			memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
+			rightsep = &rsepbuf.tuple;
+		}
+		else
+		{
+			/*
+			 * The rightmost page has no high key/right separator for the
+			 * downlink to its rightmost child page. While we _could_ set
+			 * rightsep to NULL, it would be of marginal gain:
+			 * We've likely gotten to this page through a chain of this page's
+			 * parents (all rightmost pages) so the value is still NULL.
+			 * Alternatively, concurrent page deletions got us onto this page,
+			 * in which case we have a valid rightsep; in which case we can
+			 * reuse it in further downlinks (whenever applicable): We do
+			 * not rely on the rightsep value on rightmost pages at any
+			 * level.
+			 */
+		}
+
 		/*
 		 * We need to save the location of the pivot tuple we chose in a new
 		 * stack entry for this page/level.  If caller ends up splitting a
@@ -194,7 +245,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * but before we acquired a write lock.  If it has, we may need to
 		 * move right to its new sibling.  Do that.
 		 */
-		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE);
+		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+							  rightsep);
 	}
 
 	return stack_in;
@@ -227,6 +279,13 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
  * insertion, because we don't allow inserting on a page before the split is
  * completed.  'heaprel' and 'stack' are only used if forupdate is true.
  *
+ * If rightsep is non-NULL, it contains a buffered IndexTuple that was the
+ * right separator for the downlink used to get to this level in the btree.
+ * With the current split logic and without a concurrent split, this should
+ * have the same tuple data payload as the HIKEY of the page in the buffer,
+ * in which case we can stop moving right for the scankey:
+ * HIKEY == rightsep, and scankey < rightsep, thus scankey < HIKEY.
+ *
  * On entry, we have the buffer pinned and a lock of the type specified by
  * 'access'.  If we move right, we release the buffer and lock and acquire
  * the same on the right sibling.  Return value is the buffer we stop at.
@@ -238,7 +297,8 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  bool forupdate,
 			  BTStack stack,
-			  int access)
+			  int access,
+			  IndexTuple rightsep)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -297,7 +357,30 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque))
+		{
+			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+			continue;
+		}
+
+		if (PointerIsValid(rightsep))
+		{
+			/*
+			 * Note: we're not in the rightmost page (see branchpoint earlier in
+			 * the loop), so we always have a HIKEY on this page.
+			 */
+			ItemId	hikeyid = PageGetItemId(page, P_HIKEY);
+			IndexTuple highkey = (IndexTuple) PageGetItem(page, hikeyid);
+
+			if (ItemIdGetLength(hikeyid) == IndexTupleSize(rightsep) &&
+				memcmp(&highkey[1], &rightsep[1],
+					   IndexTupleSize(rightsep) - sizeof(IndexTupleData)) == 0)
+			{
+				break;
+			}
+		}
+
+		if (_bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6eb162052e..de524ed188 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1229,9 +1229,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
  */
 extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
-							Buffer buf, bool forupdate, BTStack stack,
-							int access);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-- 
2.40.1

Matthias van de Meent

boekewurm+postgres@gmail.com

almost 2 years ago

In reply to: vignesh C (#3)

1 attachment(s)

Re: btree: downlink right separator/HIKEY optimization

On Sat, 6 Jan 2024 at 16:40, vignesh C <vignesh21@gmail.com> wrote:

CFBot shows the following compilation error at [1]:
[16:56:22.153] FAILED:
src/backend/postgres_lib.a.p/access_nbtree_nbtsearch.c.obj
[...]
../src/backend/access/nbtree/nbtsearch.c
[16:56:22.153] ../src/backend/access/nbtree/nbtsearch.c(112): error
C2143: syntax error: missing ';' before 'type'
[16:56:22.280] ../src/backend/access/nbtree/nbtsearch.c(112): warning
C4091: ' ': ignored on left of 'int' when no variable is declared

I forgot to address this in the previous patch, so here's v3 which
fixes the issue warning.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachments:

v3-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patchapplication/octet-stream; name=v3-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patchDownload

From 7b8814d1bdab38a3871e20c98a4b23bbc1fe23fd Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 29 Oct 2023 21:39:23 +0100
Subject: [PATCH v3] btree: optimize _bt_moveright using downlink's right
 separator

Due to the inner workings of the btree, it is extremely likely that the
right separator of the btree page's downlink on the parent page matches
the downlinked page's HIKEY: Only when the page was split after we
accessed its parent page (and this split wasn't completed yet) the HIKEY
will not match.

So, instead of doing _bt_compare in _bt_moveright, we can store the
right separator key of the downlink, and check if it matches the HIKEY of
the linked page. If they match, we don't have to call the relatively
expensive _bt_compare, which allows us to reuse the _bt_compare result of
the right separator key.

This reduces the number of all-attribute _bt_compare operations
we need to do on a page by 1 flat, thus increasing performance
of indexes that have expensive compare operations.

In passing, move the declaration of _bt_moveright into nbtsearch.c - I
found no user of the function anywhere but in nbtsearch.c.
---
 src/backend/access/nbtree/README      | 20 +++++++++
 src/backend/access/nbtree/nbtsearch.c | 64 +++++++++++++++++++++++++--
 src/include/access/nbtree.h           |  3 --
 3 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..c75793da5a 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,26 @@ large groups of duplicates, maximizing space utilization.  Note also that
 deduplication more efficient.  Deduplication can be performed infrequently,
 without merging together existing posting list tuples too often.
 
+Notes about the right separator/HIKEY moveright optimization
+------------------------------------------------------------
+
+Page splits consistently cause the HIKEY of the split page to be inserted
+into the parent page as the right separator of the page's downlink (or,
+as the split page's new right sibling's left link), with the only
+difference between the HIKEY and the separator being the contents of the
+IndexTupleData struct of these tuples: the payloads are bit-identical.
+This allows us to reuse the _bt_compare result of the right separator key
+for the downlinked page's HIKEY, if we can determine that those are indeed
+bit-identical: Concurrent page splits and deletions may have caused the
+downlinked page to get a different HIKEY, which could have a different
+_bt_compare result.  To make this work, in _bt_search we cache the
+current downlink's right separator key, to then in the _bt_moveright
+phase of the layer below use memcmp() to validate our assumptions about
+the HIKEY matching the downlink's right separator key.  Only if the
+assumption is proven wrong (memcmp(HIKEY, right_sep) != 0), we call
+_bt_compare(), otherwise we can be certain that the parent page's result
+is unchanged, i.e. that _bt_compare would return "<".
+
 Notes about deduplication
 -------------------------
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 63ee9ba225..25037b7671 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -27,6 +27,9 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+							Buffer buf, bool forupdate, BTStack stack,
+							int access, IndexTuple rightsep);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +101,15 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
+	union {
+#if defined(pg_attribute_aligned)
+		pg_attribute_aligned(MAXIMUM_ALIGNOF)
+#endif
+		char	data[MAXALIGN_DOWN(BLCKSZ / 3)];
+		IndexTupleData tuple;
+		double	force_align_d;
+	}			rsepbuf;
+	IndexTuple	rightsep = NULL;
 
 	/* heaprel must be set whenever _bt_allocbuf is reachable */
 	Assert(access == BT_READ || access == BT_WRITE);
@@ -134,7 +146,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * opportunity to finish splits of internal pages too.
 		 */
 		*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
-							  stack_in, page_access);
+							  stack_in, page_access, rightsep);
 
 		/* if this is a leaf page, we're done */
 		page = BufferGetPage(*bufP);
@@ -152,6 +164,25 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		child = BTreeTupleGetDownLink(itup);
 
+		if (offnum < PageGetMaxOffsetNumber(page))
+		{
+			ItemId	rightsepitem = PageGetItemId(page, offnum + 1);
+			IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
+			memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
+			rightsep = &rsepbuf.tuple;
+		}
+		else if (!P_RIGHTMOST(opaque))
+		{
+			/*
+			 * The rightmost data tuple on inner page has P_HIKEY as its right
+			 * separator.
+			 */
+			ItemId	rightsepitem = PageGetItemId(page, P_HIKEY);
+			IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
+			memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
+			rightsep = &rsepbuf.tuple;
+		}
+
 		/*
 		 * We need to save the location of the pivot tuple we chose in a new
 		 * stack entry for this page/level.  If caller ends up splitting a
@@ -194,7 +225,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * but before we acquired a write lock.  If it has, we may need to
 		 * move right to its new sibling.  Do that.
 		 */
-		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE);
+		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+							  rightsep);
 	}
 
 	return stack_in;
@@ -238,7 +270,8 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  bool forupdate,
 			  BTStack stack,
-			  int access)
+			  int access,
+			  IndexTuple rightsep)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -297,7 +330,30 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque))
+		{
+			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+			continue;
+		}
+
+		if (PointerIsValid(rightsep))
+		{
+			/*
+			 * Note: we're not in the rightmost page (see branchpoint earlier in
+			 * the loop), so we always have a HIKEY on this page.
+			 */
+			ItemId	hikeyid = PageGetItemId(page, P_HIKEY);
+			IndexTuple highkey = (IndexTuple) PageGetItem(page, hikeyid);
+
+			if (ItemIdGetLength(hikeyid) == IndexTupleSize(rightsep) &&
+				memcmp(&highkey[1], &rightsep[1],
+					   IndexTupleSize(rightsep) - sizeof(IndexTupleData)) == 0)
+			{
+				break;
+			}
+		}
+
+		if (_bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6eb162052e..de524ed188 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1229,9 +1229,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
  */
 extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
-							Buffer buf, bool forupdate, BTStack stack,
-							int access);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-- 
2.40.1

Peter Geoghegan

pg@bowt.ie

almost 2 years ago

In reply to: Matthias van de Meent (#5)

Re: btree: downlink right separator/HIKEY optimization

On Thu, Feb 22, 2024 at 10:42 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

I forgot to address this in the previous patch, so here's v3 which
fixes the issue warning.

What benchmarking have you done here?

Have you tried just reordering things in _bt_search() instead? If we
delay the check until after the binary search, then the result of the
binary search is usually proof enough that we cannot possibly need to
move right. That has the advantage of not requiring that we copy
anything to the stack.

Admittedly, it's harder to make the "binary search first" approach
work on the leaf level, under the current code structure. But maybe
that doesn't matter very much. And even if it does matter, maybe we
should just move the call to _bt_binsrch() that currently takes place
in _bt_first into _bt_search() itself -- so that _bt_binsrch() is
strictly under the control of _bt_search() (obviously not doable for
non-_bt_first callers, which need to call _bt_binsrch_insert instead).
This whole approach will have been made easier by the refactoring I
did late last year, in commit c9c0589fda.

--
Peter Geoghegan

Peter Geoghegan

pg@bowt.ie

almost 2 years ago

In reply to: Peter Geoghegan (#6)

Re: btree: downlink right separator/HIKEY optimization

On Fri, Mar 8, 2024 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote:

What benchmarking have you done here?

I think that the memcmp() test is subtly wrong:

+       if (PointerIsValid(rightsep))
+       {
+           /*
+            * Note: we're not in the rightmost page (see branchpoint earlier in
+            * the loop), so we always have a HIKEY on this page.
+            */
+           ItemId  hikeyid = PageGetItemId(page, P_HIKEY);
+           IndexTuple highkey = (IndexTuple) PageGetItem(page, hikeyid);
+
+           if (ItemIdGetLength(hikeyid) == IndexTupleSize(rightsep) &&
+               memcmp(&highkey[1], &rightsep[1],
+                      IndexTupleSize(rightsep) - sizeof(IndexTupleData)) == 0)
+           {
+               break;
+           }
+       }

Unlike amcheck's bt_pivot_tuple_identical, your memcmp() does not
compare relevant metadata fields from struct IndexTupleData. It
wouldn't make sense for it to compare the block number, of course (if
it did then the optimization would simply never work), but ISTM that
you still need to compare ItemPointerData.ip_posid.

Suppose, for example, that you had two very similar pivot tuples from
a multicolumn index on (a int4, b int2) columns. The first/earlier
tuple is (a,b) = "(42, -inf)", due to the influence of suffix
truncation. The second/later tuple is (a,b) = "(42, 0)", since suffix
truncation couldn't take place when the second pivot tuple was
created. (Actually, suffix truncation would have been possible, but it
would have only managed to truncate-away the tiebreak heap TID
attribute value in the case of our second tuple).

There'll be more alignment padding (more zero padding bytes) in the
second tuple, compared to the first. But the tuples are still the same
size. When you go to you memcmp() this pair of tuples using the
approach in v3, the bytes that are actually compared will be
identical, despite the fact that these are really two distinct tuples,
with distinct values. (As I said, you'd have to actually compare the
ItemPointerData.ip_posid metadata to notice this small difference.)

--
Peter Geoghegan

Matthias van de Meent

boekewurm+postgres@gmail.com

almost 2 years ago

In reply to: Peter Geoghegan (#6)

1 attachment(s)

Re: btree: downlink right separator/HIKEY optimization

On Fri, 8 Mar 2024 at 20:14, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Feb 22, 2024 at 10:42 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

I forgot to address this in the previous patch, so here's v3 which
fixes the issue warning.

What benchmarking have you done here?

I have benchmarked this atop various versions of master when it was
part of the btree specialization patchset, where it showed a 2-9%
increase in btree insert performance over the previous patch in the
patchset on the various index types in that set.
More recently, on an unlogged pgbench with foreign keys enabled (-s400
-j4 -c8) I can't find any obvious regressions (it gains 0-0.7% on
master across 5-minute runs), while being 4.5% faster on inserting
data on a table with an excessively bad index shape (single index of
10 columns of empty strings with the non-default "nl-BE-x-icu"
collation followed by 1 random uuid column, inserted from a 10M row
dataset. Extrapolation indicates this could indeed get over 7%
improvement when the index shape is 31 nondefault -collated nonnull
text columns and a single random ID index column).

Have you tried just reordering things in _bt_search() instead? If we
delay the check until after the binary search, then the result of the
binary search is usually proof enough that we cannot possibly need to
move right. That has the advantage of not requiring that we copy
anything to the stack.

I've not tried that, because it would makes page-level prefix
truncation more expensive by ~50%: With this patch, we need only 2
full tuple _bt_compares per page before we can establish a prefix,
while without this patch (i.e. if we did a binsrch-first approach)
we'd need 3 on average (assuming linearly randomly distributed
accesses). Because full-key compares can be arbitrarily more expensive
than normal attribute compares, I'd rather not have that 50% overhead.

On Fri, Mar 8, 2024 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
What benchmarking have you done here?

I think that the memcmp() test is subtly wrong:

Good catch, it's been fixed in the attached version, using a new function.

Kind regards,

Matthias van de Meent.

Attachments:

v3-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patchapplication/octet-stream; name=v3-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patchDownload

From e23e25c294d43679684919326f2ed3bc13f76b34 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 29 Oct 2023 21:39:23 +0100
Subject: [PATCH v3] btree: optimize _bt_moveright using downlink's right
 separator

Due to the inner workings of the btree, it is extremely likely that the
right separator of the btree page's downlink on the parent page matches
the downlinked page's HIKEY: Only when the page was split after we
accessed its parent page (and this split wasn't completed yet) the HIKEY
will not match.

So, instead of doing _bt_compare in _bt_moveright, we can store the
right separator key of the downlink, and check if it matches the HIKEY of
the linked page. If they match, we don't have to call the relatively
expensive _bt_compare, which allows us to reuse the _bt_compare result of
the right separator key.

This reduces the number of all-attribute _bt_compare operations
we need to do on a page by 1 flat, thus increasing performance
of indexes that have expensive compare operations.

In passing, move the declaration of _bt_moveright into nbtsearch.c - I
found no user of the function anywhere but in nbtsearch.c.
---
 src/backend/access/nbtree/README      | 20 +++++++
 src/backend/access/nbtree/nbtsearch.c | 82 +++++++++++++++++++++++++--
 src/include/access/nbtree.h           |  3 -
 3 files changed, 98 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..c75793da5a 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,26 @@ large groups of duplicates, maximizing space utilization.  Note also that
 deduplication more efficient.  Deduplication can be performed infrequently,
 without merging together existing posting list tuples too often.
 
+Notes about the right separator/HIKEY moveright optimization
+------------------------------------------------------------
+
+Page splits consistently cause the HIKEY of the split page to be inserted
+into the parent page as the right separator of the page's downlink (or,
+as the split page's new right sibling's left link), with the only
+difference between the HIKEY and the separator being the contents of the
+IndexTupleData struct of these tuples: the payloads are bit-identical.
+This allows us to reuse the _bt_compare result of the right separator key
+for the downlinked page's HIKEY, if we can determine that those are indeed
+bit-identical: Concurrent page splits and deletions may have caused the
+downlinked page to get a different HIKEY, which could have a different
+_bt_compare result.  To make this work, in _bt_search we cache the
+current downlink's right separator key, to then in the _bt_moveright
+phase of the layer below use memcmp() to validate our assumptions about
+the HIKEY matching the downlink's right separator key.  Only if the
+assumption is proven wrong (memcmp(HIKEY, right_sep) != 0), we call
+_bt_compare(), otherwise we can be certain that the parent page's result
+is unchanged, i.e. that _bt_compare would return "<".
+
 Notes about deduplication
 -------------------------
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 63ee9ba225..e0979f0c81 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -27,6 +27,9 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+							Buffer buf, bool forupdate, BTStack stack,
+							int access, IndexTuple rightsep);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -47,6 +50,28 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
+static inline bool
+BTreeTupleDownlinkMatchesHikey(IndexTuple highkey, IndexTuple rightsep)
+{
+	int16		size = IndexTupleSize(highkey);
+	int			alt_tid;
+
+	if (size != IndexTupleSize(rightsep))
+		return false;
+
+	alt_tid = highkey->t_info & INDEX_ALT_TID_MASK;
+	if (alt_tid != (rightsep->t_info & INDEX_ALT_TID_MASK))
+		return false;
+
+	if (alt_tid && (ItemPointerGetOffsetNumberNoCheck(&highkey->t_tid) !=
+					ItemPointerGetOffsetNumberNoCheck(&rightsep->t_tid)))
+		return false;
+
+	if (memcmp(highkey + 1, rightsep + 1, size - sizeof(IndexTupleData)) != 0)
+		return false;
+
+	return true;
+}
 
 /*
  *	_bt_drop_lock_and_maybe_pin()
@@ -98,6 +123,15 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
+	union {
+#if defined(pg_attribute_aligned)
+		pg_attribute_aligned(MAXIMUM_ALIGNOF)
+#endif
+		char	data[MAXALIGN_DOWN(BLCKSZ / 3)];
+		IndexTupleData tuple;
+		double	force_align_d;
+	}			rsepbuf;
+	IndexTuple	rightsep = NULL;
 
 	/* heaprel must be set whenever _bt_allocbuf is reachable */
 	Assert(access == BT_READ || access == BT_WRITE);
@@ -134,7 +168,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * opportunity to finish splits of internal pages too.
 		 */
 		*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
-							  stack_in, page_access);
+							  stack_in, page_access, rightsep);
 
 		/* if this is a leaf page, we're done */
 		page = BufferGetPage(*bufP);
@@ -152,6 +186,25 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		child = BTreeTupleGetDownLink(itup);
 
+		if (offnum < PageGetMaxOffsetNumber(page))
+		{
+			ItemId	rightsepitem = PageGetItemId(page, offnum + 1);
+			IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
+			memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
+			rightsep = &rsepbuf.tuple;
+		}
+		else if (!P_RIGHTMOST(opaque))
+		{
+			/*
+			 * The rightmost data tuple on inner page has P_HIKEY as its right
+			 * separator.
+			 */
+			ItemId	rightsepitem = PageGetItemId(page, P_HIKEY);
+			IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
+			memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
+			rightsep = &rsepbuf.tuple;
+		}
+
 		/*
 		 * We need to save the location of the pivot tuple we chose in a new
 		 * stack entry for this page/level.  If caller ends up splitting a
@@ -194,7 +247,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * but before we acquired a write lock.  If it has, we may need to
 		 * move right to its new sibling.  Do that.
 		 */
-		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE);
+		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+							  rightsep);
 	}
 
 	return stack_in;
@@ -238,7 +292,8 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  bool forupdate,
 			  BTStack stack,
-			  int access)
+			  int access,
+			  IndexTuple rightsep)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -297,7 +352,26 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque))
+		{
+			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+			continue;
+		}
+
+		if (PointerIsValid(rightsep))
+		{
+			/*
+			 * Note: we're not in the rightmost page (see branchpoint earlier in
+			 * the loop), so we always have a HIKEY on this page.
+			 */
+			ItemId	hikeyid = PageGetItemId(page, P_HIKEY);
+			IndexTuple highkey = (IndexTuple) PageGetItem(page, hikeyid);
+
+			if (BTreeTupleDownlinkMatchesHikey(highkey, rightsep))
+				break;
+		}
+
+		if (_bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6eb162052e..de524ed188 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1229,9 +1229,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
  */
 extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
-							Buffer buf, bool forupdate, BTStack stack,
-							int access);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-- 
2.40.1