btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

Started by Matthias van de Meent about 2 years ago · 11 messages
#1 Matthias van de Meent
boekewurm+postgres@gmail.com
1 attachment(s)

Hi,

Currently, nbtree code compares each and every column of an index
tuple during the binary search on the index page. With large indexes
that have many duplicate prefix column values (e.g. an index on (bool,
bool, uuid)) that means a lot of wasted time getting to the right
column.

The attached patch improves on that by doing per-page dynamic prefix
truncation: If we know that on both the right and left side there are
index tuples where the first two attributes are equal to the scan key,
we skip comparing those attributes at the current index tuple and
start with comparing attribute 3, saving two attribute compares. We
gain performance whenever comparing prefixing attributes is expensive
and when there are many tuples with a shared prefix - in unique
indexes this doesn't gain much, but we also don't lose much in this
case.
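
To sketch what this looks like in code, here is the binary search
loop condensed from the attached patch's _bt_binsrch(). The *prefix
argument tells _bt_compare() at which attribute to start comparing,
and on return it holds the first attribute that was found (or could
be) unequal:

    AttrNumber  highkeyprefix = 1,  /* no known prefix at either bound */
                lowkeyprefix = 1;

    while (high > low)
    {
        OffsetNumber mid = low + ((high - low) / 2);
        /* both bounds are known to match the key up to this attribute */
        AttrNumber  prefix = Min(highkeyprefix, lowkeyprefix);

        result = _bt_compare(rel, key, page, mid, &prefix);

        if (result >= cmpval)
        {
            low = mid + 1;
            lowkeyprefix = prefix;   /* remember prefix of new low bound */
        }
        else
        {
            high = mid;
            highkeyprefix = prefix;  /* remember prefix of new high bound */
        }
    }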

This patch was originally suggested at [0], but it was mentioned that
it could be pulled out into its own thread. Earlier, the
performance gains were not clearly there for just this patch, but
after further benchmarking this patch stands on its own for
performance: it sees no obvious degradation of performance, while
gaining 0-5% across various normal indexes on the cc-complete sample
dataset, with the current worst-case index shape getting a 60%+
improved performance on INSERTs in the tests at [0].

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

PS. Best served with the downlink right separator/HIKEY optimization
(separate patch to be submitted soon), and specialization over at [0].

[0]: https://www.postgresql.org/message-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8snuw@mail.gmail.com

Attachments:

v14-0001-btree-Implement-dynamic-prefix-compression.patch (application/octet-stream)
From 22f7c69552bcd6b89a586554a1361926925f4332 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 29 Oct 2023 21:39:23 +0100
Subject: [PATCH v14 1/7] btree: Implement dynamic prefix compression

Because tuples in the btree are ordered, if some prefix of the
scan attributes on both sides of the compared tuple are equal
to the scankey, then the current tuple that is being compared
must also contain the same prefix that equals the scankey.

Due to concurrent page splits and deletes, the key range of a
page may move while we travel down the btree. This means we can't
directly reuse left- and right-key prefixes we established on the
parent page while we descend the btree: the data on the page
may not match the parent page's separator keys anymore (see [0]).

So, we can't always descend the btree while keeping the discovered
prefixes. However, on each page we only see a snapshot of the
page, so we can use dynamic prefix truncation at the page level.
This still improves the worst-case number of compare operations
during btree descent from depth * ceil(log(tuples_per_page)) * natts
to depth * (ceil(log(tuples_per_page)) + 3 * natts).

In passing, move _bt_moveright from btree.h to nbtsearch.c: there
was no usage of the function left outside nbtsearch.c.
---
 contrib/amcheck/verify_nbtree.c       |  25 +++++--
 src/backend/access/nbtree/README      |  34 +++++++++
 src/backend/access/nbtree/nbtinsert.c |  34 ++++++---
 src/backend/access/nbtree/nbtsearch.c | 101 +++++++++++++++++++++++---
 src/include/access/nbtree.h           |   9 +--
 5 files changed, 168 insertions(+), 35 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 7282cf7fc8..40fc0a8af7 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1706,6 +1706,8 @@ bt_target_page_check(BtreeCheckState *state)
 		if (state->checkunique && state->indexinfo->ii_Unique &&
 			P_ISLEAF(topaque) && OffsetNumberNext(offset) <= max)
 		{
+			AttrNumber cmpattr = 1;
+
 			/* Save current scankey tid */
 			scantid = skey->scantid;
 
@@ -1722,7 +1724,7 @@ bt_target_page_check(BtreeCheckState *state)
 			 * treated as different to any other key.
 			 */
 			if (_bt_compare(state->rel, skey, state->target,
-							OffsetNumberNext(offset)) != 0 || skey->anynullkeys)
+							OffsetNumberNext(offset), &cmpattr) != 0 || skey->anynullkeys)
 			{
 				lVis_i = -1;
 				lVis_tid = NULL;
@@ -1801,6 +1803,8 @@ bt_target_page_check(BtreeCheckState *state)
 			if (state->checkunique && state->indexinfo->ii_Unique &&
 				rightkey && P_ISLEAF(topaque) && rightblock_number != P_NONE)
 			{
+				AttrNumber cmpattr = 1;
+
 				elog(DEBUG2, "check cross page unique condition");
 
 				/*
@@ -1811,7 +1815,7 @@ bt_target_page_check(BtreeCheckState *state)
 				rightkey->scantid = NULL;
 
 				/* The first key on the next page is the same */
-				if (_bt_compare(state->rel, rightkey, state->target, max) == 0 && !rightkey->anynullkeys)
+				if (_bt_compare(state->rel, rightkey, state->target, max, &cmpattr) == 0 && !rightkey->anynullkeys)
 				{
 					elog(DEBUG2, "cross page equal keys");
 					state->target = palloc_btree_page(state,
@@ -3012,6 +3016,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		BTInsertStateData insertstate;
 		OffsetNumber offnum;
 		Page		page;
+		AttrNumber	compareattr = 1;
 
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -3021,13 +3026,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.buf = lbuf;
 
 		/* Get matching tuple on leaf page */
-		offnum = _bt_binsrch_insert(state->rel, &insertstate);
+		offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
 		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
 			insertstate.postingoff <= 0 &&
-			_bt_compare(state->rel, key, page, offnum) == 0)
+			_bt_compare(state->rel, key, page, offnum, &compareattr) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
 	}
@@ -3089,6 +3094,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	ItemId		itemid;
 	int32		cmp;
+	AttrNumber	compareattr = 1;
 
 	Assert(key->pivotsearch);
 
@@ -3099,7 +3105,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
 	if (!key->heapkeyspace)
 		return invariant_leq_offset(state, key, upperbound);
 
-	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound, &compareattr);
 
 	/*
 	 * _bt_compare() is capable of determining that a scankey with a
@@ -3151,10 +3157,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
 	int32		cmp;
+	AttrNumber	compareattr = 1;
 
 	Assert(key->pivotsearch);
 
-	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound, &compareattr);
 
 	return cmp <= 0;
 }
@@ -3174,10 +3181,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
 				   OffsetNumber lowerbound)
 {
 	int32		cmp;
+	AttrNumber	compareattr = 1;
 
 	Assert(key->pivotsearch);
 
-	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound, &compareattr);
 
 	/* pg_upgrade'd indexes may legally have equal sibling tuples */
 	if (!key->heapkeyspace)
@@ -3212,13 +3220,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	ItemId		itemid;
 	int32		cmp;
+	AttrNumber	compareattr = 1;
 
 	Assert(key->pivotsearch);
 
 	/* Verify line pointer before checking tuple */
 	itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
 								  upperbound);
-	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound, &compareattr);
 
 	/* pg_upgrade'd indexes may legally have equal sibling tuples */
 	if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..61601aca61 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,40 @@ large groups of duplicates, maximizing space utilization.  Note also that
 deduplication more efficient.  Deduplication can be performed infrequently,
 without merging together existing posting list tuples too often.
 
+Notes about dynamic prefix truncation
+-------------------------------------
+
+When we do a binary search on a sorted set (such as a BTree), we know that a
+tuple will be smaller than its left neighbour, and larger than its right
+neighbour.  If we know that both the left and right neighbour share a prefix
+with the scan key, we deduce that the current tuple must also share this
+prefix with the scan key, which allows us to skip this shared prefix during
+tuple compare operations when we descend the btree.
+
+However, the BTree is not a static sorted list: It is a dynamic data
+structure which changes over time.  Because of this, we cannot just reuse the
+established prefix from the parent page: concurrent page deletions may have
+moved in new tuples from the keyspace left of the left separator of this
+page's downlink on the parent page, and page splits may have moved tuples
+that were previously in this page's indicated keyspace away from the page.
+This means that we have to re-establish the prefix that is shared with the
+scan key at every level of the btree.
+
+Note that this is most effective on indexes where a prefix of key columns has
+many duplicate values.  Compare operations on full index tuples stop at the
+first non-equal attribute: it is quite unlikely that an index on
+(unique, 1, 1, 1) will ever compare the latter columns, while one on
+(1, 1, 1, unique) will essentially always have to compare all attributes to
+hit an attribute that isn't equal to that of the scan key.  The dynamic
+prefix truncation allows PostgreSQL to skip most compare operations on the
+(1, 1, 1 ...) prefix, but it cannot improve the (unique, ...) case because
+that has no "wasteful" prefix that we can skip.
+
+With the above optimizations, dynamic prefix truncation improves the worst
+case complexity of indexing from O(tree_height * natts * log(tups_per_page))
+to O(tree_height * (3*natts + log(tups_per_page))) attribute compare
+operations, which can improve performance significantly.
+
 Notes about deduplication
 -------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 9cff4f2931..b9f2580c1d 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -328,6 +328,7 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
 		{
 			Page		page;
 			BTPageOpaque opaque;
+			AttrNumber	prefix = 1;
 
 			_bt_checkpage(rel, insertstate->buf);
 			page = BufferGetPage(insertstate->buf);
@@ -346,7 +347,8 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
 				!P_IGNORE(opaque) &&
 				PageGetFreeSpace(page) > insertstate->itemsz &&
 				PageGetMaxOffsetNumber(page) >= P_HIKEY &&
-				_bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
+				_bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+							&prefix) > 0)
 			{
 				/*
 				 * Caller can use the fastpath optimization because cached
@@ -440,7 +442,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	 * in the fastpath below, but also in the _bt_findinsertloc() call later.
 	 */
 	Assert(!insertstate->bounds_valid);
-	offset = _bt_binsrch_insert(rel, insertstate);
+	offset = _bt_binsrch_insert(rel, insertstate, 1);
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
@@ -450,6 +452,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
+		AttrNumber	cmpatt = 1;
+
 		/*
 		 * Each iteration of the loop processes one heap TID, not one index
 		 * tuple.  Current offset number for page isn't usually advanced on
@@ -485,7 +489,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				Assert(insertstate->bounds_valid);
 				Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
 				Assert(insertstate->low <= insertstate->stricthigh);
-				Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+				Assert(_bt_compare(rel, itup_key, page, offset, &cmpatt) < 0);
 				break;
 			}
 
@@ -510,7 +514,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				if (!inposting)
 				{
 					/* Plain tuple, or first TID in posting list tuple */
-					if (_bt_compare(rel, itup_key, page, offset) != 0)
+					if (_bt_compare(rel, itup_key, page, offset, &cmpatt) != 0)
 						break;	/* we're past all the equal tuples */
 
 					/* Advanced curitup */
@@ -720,11 +724,12 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 		else
 		{
 			int			highkeycmp;
+			cmpatt = 1;
 
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY, &cmpatt);
 			Assert(highkeycmp <= 0);
 			if (highkeycmp != 0)
 				break;
@@ -867,6 +872,8 @@ _bt_findinsertloc(Relation rel,
 
 			for (;;)
 			{
+				AttrNumber	cmpatt = 1;
+
 				/*
 				 * Does the new tuple belong on this page?
 				 *
@@ -884,7 +891,7 @@ _bt_findinsertloc(Relation rel,
 
 				/* Test '<=', not '!=', since scantid is set now */
 				if (P_RIGHTMOST(opaque) ||
-					_bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
+					_bt_compare(rel, itup_key, page, P_HIKEY, &cmpatt) <= 0)
 					break;
 
 				_bt_stepright(rel, heapRel, insertstate, stack);
@@ -937,6 +944,8 @@ _bt_findinsertloc(Relation rel,
 		 */
 		while (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
+			AttrNumber	cmpatt = 1;
+
 			/*
 			 * Before considering moving right, see if we can obtain enough
 			 * space by erasing LP_DEAD items
@@ -967,7 +976,7 @@ _bt_findinsertloc(Relation rel,
 				break;
 
 			if (P_RIGHTMOST(opaque) ||
-				_bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
+				_bt_compare(rel, itup_key, page, P_HIKEY, &cmpatt) != 0 ||
 				pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
 				break;
 
@@ -982,10 +991,13 @@ _bt_findinsertloc(Relation rel,
 	 * We should now be on the correct page.  Find the offset within the page
 	 * for the new tuple. (Possibly reusing earlier search bounds.)
 	 */
-	Assert(P_RIGHTMOST(opaque) ||
-		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+	{
+		AttrNumber	cmpatt PG_USED_FOR_ASSERTS_ONLY = 1;
+		Assert(P_RIGHTMOST(opaque) ||
+			   _bt_compare(rel, itup_key, page, P_HIKEY, &cmpatt) <= 0);
+	}
 
-	newitemoff = _bt_binsrch_insert(rel, insertstate);
+	newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
 
 	if (insertstate->postingoff == -1)
 	{
@@ -1004,7 +1016,7 @@ _bt_findinsertloc(Relation rel,
 		 */
 		Assert(!insertstate->bounds_valid);
 		insertstate->postingoff = 0;
-		newitemoff = _bt_binsrch_insert(rel, insertstate);
+		newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
 		Assert(insertstate->postingoff == 0);
 	}
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index efc5284e5b..71fd658d15 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -27,6 +27,9 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+							Buffer buf, bool forupdate, BTStack stack,
+							int access, AttrNumber *prefix);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +101,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
+	AttrNumber	highkeyprefix = 1;
 
 	/* heaprel must be set whenever _bt_allocbuf is reachable */
 	Assert(access == BT_READ || access == BT_WRITE);
@@ -134,7 +138,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * opportunity to finish splits of internal pages too.
 		 */
 		*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
-							  stack_in, page_access);
+							  stack_in, page_access, &highkeyprefix);
 
 		/* if this is a leaf page, we're done */
 		page = BufferGetPage(*bufP);
@@ -185,6 +189,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 	 */
 	if (access == BT_WRITE && page_access == BT_READ)
 	{
+		highkeyprefix = 1;
+
 		/* trade in our read lock for a write lock */
 		_bt_unlockbuf(rel, *bufP);
 		_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -194,7 +200,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * but before we acquired a write lock.  If it has, we may need to
 		 * move right to its new sibling.  Do that.
 		 */
-		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE);
+		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+							  &highkeyprefix);
 	}
 
 	return stack_in;
@@ -230,6 +237,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
  * On entry, we have the buffer pinned and a lock of the type specified by
  * 'access'.  If we move right, we release the buffer and lock and acquire
  * the same on the right sibling.  Return value is the buffer we stop at.
+ *
+ * On exit, we've updated *comparecol with the prefix that the high key shares
  */
 Buffer
 _bt_moveright(Relation rel,
@@ -238,7 +247,8 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  bool forupdate,
 			  BTStack stack,
-			  int access)
+			  int access,
+			  AttrNumber *prefix)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -262,16 +272,36 @@ _bt_moveright(Relation rel,
 	 *
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
+	 *
+	 * Note: It is important that *comparecol is set to an accurate value at
+	 * return. Rightmost pages have no prefix, thus set it to 1.  In all other
+	 * cases, it must be updated with the prefix result of this page's
+	 * HIGHKEY's full _bt_compare.
 	 */
 	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
+		/*
+		 * We explicitly don't reuse the prefix argument of this function,
+		 * as we may need to move multiple times, and we don't want to
+		 * overwrite the value stored in *prefix if we could reuse it.
+		 * Additionally, its value will be useless for any of our own
+		 * _bt_compare calls: only if the downlink right separator/HIKEY
+		 * optimization is applicable we can use the value, and when it is
+		 * applicable then we don't have to call _bt_compare.
+		 */
+		AttrNumber	prefixatt = 1;
+
 		page = BufferGetPage(buf);
 		opaque = BTPageGetOpaque(page);
 
 		if (P_RIGHTMOST(opaque))
+		{
+			/* set the compare column when breaking out of the loop */
+			*prefix = 1;
 			break;
+		}
 
 		/*
 		 * Finish any incomplete splits we encounter along the way.
@@ -297,14 +327,19 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) ||
+			_bt_compare(rel, key, page, P_HIKEY, &prefixatt) >= cmpval)
 		{
-			/* step right one page */
+			/* set the compare column when breaking out of the loop */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
 			continue;
 		}
 		else
+		{
+			/* make sure to set the compare column */
+			*prefix = prefixatt;
 			break;
+		}
 	}
 
 	if (P_IGNORE(opaque))
@@ -345,6 +380,13 @@ _bt_binsrch(Relation rel,
 				high;
 	int32		result,
 				cmpval;
+	/*
+	 * Prefix bounds, for the high/low offset's compare columns.
+	 * "highkeyprefix" is the value for this page's high key (if any) or 1
+	 * (no established shared prefix)
+	 */
+	AttrNumber	highkeyprefix = 1,
+				lowkeyprefix = 1;
 
 	page = BufferGetPage(buf);
 	opaque = BTPageGetOpaque(page);
@@ -377,6 +419,10 @@ _bt_binsrch(Relation rel,
 	 * For nextkey=true (cmpval=0), the loop invariant is: all slots before
 	 * 'low' are <= scan key, all slots at or after 'high' are > scan key.
 	 *
+	 * We maintain highkeyprefix and lowkeyprefix to keep track of prefixes
+	 * that tuples share with the scan key, potentially allowing us to skip a
+	 * prefix in the midpoint comparison.
+	 *
 	 * We can fall out when high == low.
 	 */
 	high++;						/* establish the loop invariant for high */
@@ -386,15 +432,22 @@ _bt_binsrch(Relation rel,
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
+		AttrNumber	prefix = Min(highkeyprefix, lowkeyprefix); /* update prefix bounds */
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare(rel, key, page, mid, &prefix);
 
 		if (result >= cmpval)
+		{
 			low = mid + 1;
+			lowkeyprefix = prefix;
+		}
 		else
+		{
 			high = mid;
+			highkeyprefix = prefix;
+		}
 	}
 
 	/*
@@ -439,7 +492,8 @@ _bt_binsrch(Relation rel,
  * list split).
  */
 OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+				   AttrNumber highkeyprefix)
 {
 	BTScanInsert key = insertstate->itup_key;
 	Page		page;
@@ -449,6 +503,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 				stricthigh;
 	int32		result,
 				cmpval;
+	AttrNumber	lowkeyprefix = 1;
 
 	page = BufferGetPage(insertstate->buf);
 	opaque = BTPageGetOpaque(page);
@@ -499,16 +554,22 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
+		AttrNumber	prefix = Min(highkeyprefix, lowkeyprefix);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare(rel, key, page, mid, &prefix);
 
 		if (result >= cmpval)
+		{
 			low = mid + 1;
+			lowkeyprefix = prefix;
+		}
 		else
 		{
 			high = mid;
+			highkeyprefix = prefix;
+
 			if (result != 0)
 				stricthigh = high;
 		}
@@ -643,6 +704,13 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
  * matching TID in the posting tuple, which caller must handle
  * themselves (e.g., by splitting the posting list tuple).
  *
+ * NOTE: The "comparecol" argument must refer to the first attribute of the
+ * index tuple of which the caller knows that it does not match the scan key:
+ * this means 1 for "no known matching attributes", up to the number of key
+ * attributes + 1 if the caller knows that all key attributes of the index
+ * tuple match those of the scan key.  See backend/access/nbtree/README for
+ * details.
+ *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
  * scankey.  The actual key value stored is explicitly truncated to 0
@@ -656,7 +724,8 @@ int32
 _bt_compare(Relation rel,
 			BTScanInsert key,
 			Page page,
-			OffsetNumber offnum)
+			OffsetNumber offnum,
+			AttrNumber *compareattr)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -696,8 +765,9 @@ _bt_compare(Relation rel,
 	ncmpkey = Min(ntupatts, key->keysz);
 	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
-	scankey = key->scankeys;
-	for (int i = 1; i <= ncmpkey; i++)
+
+	scankey = key->scankeys + ((*compareattr) - 1);
+	for (int i = *compareattr; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -741,11 +811,20 @@ _bt_compare(Relation rel,
 
 		/* if the keys are unequal, return the difference */
 		if (result != 0)
+		{
+			*compareattr = (AttrNumber) i;
 			return result;
+		}
 
 		scankey++;
 	}
 
+	/*
+	 * All tuple attributes are equal to the scan key, only later attributes
+	 * could potentially not equal the scan key.
+	 */
+	*compareattr = ntupatts + 1;
+
 	/*
 	 * All non-truncated attributes (other than heap TID) were found to be
 	 * equal.  Treat truncated attributes as minus infinity when scankey has a
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7bfbf3086c..ae4e902ad0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1237,11 +1237,10 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
  */
 extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
-							Buffer buf, bool forupdate, BTStack stack,
-							int access);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+									   AttrNumber highkeyprefix);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+						 OffsetNumber offnum, AttrNumber *compareattr);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
-- 
2.40.1

#2 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Matthias van de Meent (#1)
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

Hi

On Tue, 31 Oct 2023 at 22:12, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

Hi,

Currently, nbtree code compares each and every column of an index
tuple during the binary search on the index page. With large indexes
that have many duplicate prefix column values (e.g. an index on (bool,
bool, uuid) ) that means a lot of wasted time getting to the right
column.

The attached patch improves on that by doing per-page dynamic prefix
truncation: If we know that on both the right and left side there are
index tuples where the first two attributes are equal to the scan key,
we skip comparing those attributes at the current index tuple and
start with comparing attribute 3, saving two attribute compares. We
gain performance whenever comparing prefixing attributes is expensive
and when there are many tuples with a shared prefix - in unique
indexes this doesn't gain much, but we also don't lose much in this
case.

This patch was originally suggested at [0], but it was mentioned that
it could be pulled out into its own thread. Earlier, the
performance gains were not clearly there for just this patch, but
after further benchmarking this patch stands on its own for
performance: it sees no obvious degradation of performance, while
gaining 0-5% across various normal indexes on the cc-complete sample
dataset, with the current worst-case index shape getting a 60%+
improved performance on INSERTs in the tests at [0].

+1

This can be nice functionality. I had a customer with a very slow index
scan - the main problem was a long common prefix like prg010203xxxx.

Regards

Pavel


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

PS. Best served with the downlink right separator/HIKEY optimization
(separate patch to be submitted soon), and specialization over at [0].

[0]: https://www.postgresql.org/message-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8snuw@mail.gmail.com

#3 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Pavel Stehule (#2)
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

On Wed, 1 Nov 2023 at 07:47, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Hi

út 31. 10. 2023 v 22:12 odesílatel Matthias van de Meent <boekewurm+postgres@gmail.com> napsal:

This patch was originally suggested at [0], but it was mentioned that
it could be pulled out into its own thread. Earlier, the
performance gains were not clearly there for just this patch, but
after further benchmarking this patch stands on its own for
performance: it sees no obvious degradation of performance, while
gaining 0-5% across various normal indexes on the cc-complete sample
dataset, with the current worst-case index shape getting a 60%+
improved performance on INSERTs in the tests at [0].

+1

Thanks for showing interest.

This can be nice functionality. I had a customer with a very slow index scan - the main problem was a long common prefix like prg010203xxxx.

I'll have to note that this patch doesn't cover cases where e.g. text
attributes have large shared prefixes, but are still unique: the
dynamic prefix compression in this patch is only implemented at the
tuple attribute level; it doesn't implement type aware dynamic prefix
compression inside the attributes. So, a unique index on a column of
int32 formatted like '%0100i' would not materially benefit from this
patch.

While it would certainly be possible to add some type-level prefix
truncation in the framework of this patch, adding that would require
significant code churn in btree compare operators, because we'd need
an additional return argument to contain a numerical "shared prefix",
and that is not something I was planning to implement in this patch.
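
For illustration only, that "additional return argument" could take
roughly the following shape (a hypothetical sketch with made-up names
and types; nothing like this exists in PostgreSQL today, and this
patch does not introduce it):

    /*
     * Hypothetical type-aware compare support function: like a btree
     * ORDER proc, but it additionally reports how much of the two
     * values was found equal, so a later comparison against a similar
     * value could resume inside the attribute.
     */
    typedef struct PrefixCmpResult
    {
        int32   cmp;            /* <0 / 0 / >0, as today */
        int32   shared_prefix;  /* e.g. bytes of text found equal */
    } PrefixCmpResult;

    extern PrefixCmpResult prefix_cmp(Datum a, Datum b);

Every opclass's compare routine would have to grow such an extra
output, which is the code churn referred to above.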

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#4 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Matthias van de Meent (#3)
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

On Wed, 1 Nov 2023 at 11:32, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

On Wed, 1 Nov 2023 at 07:47, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Hi

On Tue, 31 Oct 2023 at 22:12, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

This patch was originally suggested at [0], but it was mentioned that
it could be pulled out into its own thread. Earlier, the
performance gains were not clearly there for just this patch, but
after further benchmarking this patch stands on its own for
performance: it sees no obvious degradation of performance, while
gaining 0-5% across various normal indexes on the cc-complete sample
dataset, with the current worst-case index shape getting a 60%+
improved performance on INSERTs in the tests at [0].

+1

Thanks for showing interest.

This can be nice functionality. I had a customer with a very slow index
scan - the main problem was a long common prefix like prg010203xxxx.

I'll have to note that this patch doesn't cover cases where e.g. text
attributes have large shared prefixes, but are still unique: the
dynamic prefix compression in this patch is only implemented at the
tuple attribute level; it doesn't implement type aware dynamic prefix
compression inside the attributes. So, a unique index on a column of
int32 formatted like '%0100i' would not materially benefit from this
patch.

While it would certainly be possible to add some type-level prefix
truncation in the framework of this patch, adding that would require
significant code churn in btree compare operators, because we'd need
an additional return argument to contain a numerical "shared prefix",
and that is not something I was planning to implement in this patch.

Thanks for the explanation.

Pavel


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#5 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Matthias van de Meent (#1)
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

On Wed, Nov 1, 2023 at 2:42 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Hi,

Currently, nbtree code compares each and every column of an index
tuple during the binary search on the index page. With large indexes
that have many duplicate prefix column values (e.g. an index on (bool,
bool, uuid) ) that means a lot of wasted time getting to the right
column.

The attached patch improves on that by doing per-page dynamic prefix
truncation: If we know that on both the right and left side there are
index tuples where the first two attributes are equal to the scan key,
we skip comparing those attributes at the current index tuple and
start with comparing attribute 3, saving two attribute compares. We
gain performance whenever comparing prefixing attributes is expensive
and when there are many tuples with a shared prefix - in unique
indexes this doesn't gain much, but we also don't lose much in this
case.

This patch was originally suggested at [0], but it was mentioned that
it could be pulled out into its own thread. Earlier, the
performance gains were not clearly there for just this patch, but
after further benchmarking this patch stands on its own for
performance: it sees no obvious degradation of performance, while
gaining 0-5% across various normal indexes on the cc-complete sample
dataset, with the current worst-case index shape getting a 60%+
improved performance on INSERTs in the tests at [0].

+1 for the idea, I have some initial comments while reading through the patch.

1.
Commit message refers to a non-existing reference '(see [0])'.

2.
+When we do a binary search on a sorted set (such as a BTree), we know that a
+tuple will be smaller than its left neighbour, and larger than its right
+neighbour.

I think this should be 'larger than left neighbour and smaller than
right neighbour' instead of the other way around.

3.
+With the above optimizations, dynamic prefix truncation improves the worst
+case complexity of indexing from O(tree_height * natts * log(tups_per_page))
+to O(tree_height * (3*natts + log(tups_per_page)))

Where do the 3*natts come from? Is it related to setting up the
dynamic prefix at each level?

4.
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *compareattr = ntupatts + 1;

Can you elaborate on this more? If all tuple attributes are equal to
the scan key then what do those 'later attributes' mean?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#6 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Dilip Kumar (#5)
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

On Fri, 19 Jan 2024 at 05:55, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Nov 1, 2023 at 2:42 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Hi,

Currently, nbtree code compares each and every column of an index
tuple during the binary search on the index page. With large indexes
that have many duplicate prefix column values (e.g. an index on (bool,
bool, uuid) ) that means a lot of wasted time getting to the right
column.

The attached patch improves on that by doing per-page dynamic prefix
truncation: If we know that on both the right and left side there are
index tuples where the first two attributes are equal to the scan key,
we skip comparing those attributes at the current index tuple and
start with comparing attribute 3, saving two attribute compares. We
gain performance whenever comparing prefixing attributes is expensive
and when there are many tuples with a shared prefix - in unique
indexes this doesn't gain much, but we also don't lose much in this
case.

This patch was originally suggested at [0], but it was mentioned that
it could be pulled out into its own thread. Earlier, the
performance gains were not clearly there for just this patch, but
after further benchmarking this patch stands on its own for
performance: it sees no obvious degradation of performance, while
gaining 0-5% across various normal indexes on the cc-complete sample
dataset, with the current worst-case index shape getting a 60%+
improved performance on INSERTs in the tests at [0].

+1 for the idea, I have some initial comments while reading through the patch.

Thank you for the review.

1.
Commit message refers to a non-existing reference '(see [0])'.

Noted, I'll update that.

2.
+When we do a binary search on a sorted set (such as a BTree), we know that a
+tuple will be smaller than its left neighbour, and larger than its right
+neighbour.

I think this should be 'larger than left neighbour and smaller than
right neighbour' instead of the other way around.

Noted, will be fixed, too.

3.
+With the above optimizations, dynamic prefix truncation improves the worst
+case complexity of indexing from O(tree_height * natts * log(tups_per_page))
+to O(tree_height * (3*natts + log(tups_per_page)))

Where do the 3*natts come from? Is it related to setting up the
dynamic prefix at each level?

Yes: We need to establish prefixes at both a tuple that sorts before
the to-be-compared tuple and one that sorts after it. Assuming a
homogeneous random distribution of scan key accesses across the page
(not always the case, but IMO a reasonable starting point), this
averages out to 3 unprefixed compares before both a higher and a
lower prefix have been established.
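
Spelling that expectation out (my arithmetic, under the same
assumption): the first compare always runs with no prefix and
establishes one bound; every later compare lands on the
already-established side with probability 1/2, so the wait until the
other bound is also established is geometric with mean 2. That gives
1 + 2 = 3 full-width compares in expectation, each costing up to
natts attribute compares, while the remaining ~log2(tups_per_page)
compares can skip the shared prefix - hence the
3*natts + log(tups_per_page) shape per level.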

4.
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *compareattr = ntupatts + 1;

Can you elaborate on this more? If all tuple attributes are equal to
the scan key then what do those 'later attributes' mean?

In inner pages, tuples may not have all key attributes, as some may
have been truncated away in page splits. So, tuples that have at least
the same prefix as this (potentially truncated) tuple will need to be
compared starting at the first missing attribute of this tuple, i.e.
ntupatts + 1.
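
A concrete (made-up) example: for an index on (a, b, c, d), a pivot
tuple on an inner page may have been truncated to just (a, b), so
ntupatts = 2. If both of those attributes compare equal to the scan
key, _bt_compare() returns with *compareattr = ntupatts + 1 = 3: the
caller now knows attributes 1 and 2 match, while the "later
attributes" 3 and 4 are exactly the ones the truncated pivot can say
nothing about, so any subsequent compare within these bounds starts
at attribute 3.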

Kind regards,

Matthias van de Meent

#7 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Matthias van de Meent (#6)
1 attachment(s)
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

On Wed, 24 Jan 2024 at 13:02, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

1.
Commit message refers to a non-existing reference '(see [0])'.

Noted, I'll update that.

2.
+When we do a binary search on a sorted set (such as a BTree), we know that a
+tuple will be smaller than its left neighbour, and larger than its right
+neighbour.

I think this should be 'larger than left neighbour and smaller than
right neighbour' instead of the other way around.

Noted, will be fixed, too.

Attached is version 15 of this patch, with the above issues fixed.
It's also rebased on top of 655dc310 of this morning, so it should
stay good for some time again.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachments:

v15-0001-btree-Implement-dynamic-prefix-compression.patch (application/octet-stream)
From 0217442ecfc846de8d213373d2d3f99d702f855d Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 29 Oct 2023 21:39:23 +0100
Subject: [PATCH v15] btree: Implement dynamic prefix compression

Because tuples in the btree are ordered, if some prefix of the
scan attributes on both sides of the compared tuple are equal
to the scankey, then the current tuple that is being compared
must also contain the same prefix that equals the scankey.

Due to concurrent page splits and deletes, the key range of a
page may move while we travel down the btree.  This means we
can't directly reuse left- and right-key prefixes we
established on the parent page while we descend the btree: the
data on the page may not match the parent page's separator
keys anymore (see [0]).

So, we can't always descend the btree while keeping the
discovered prefixes.  However, on each page we only see a
snapshot of the page, so we can use dynamic prefix truncation
at the page level.  This still improves the worst-case number
of compare operations during btree descent from  depth *
ceil(log(tuples_per_page)) * natts  to  depth *
(ceil(log(tuples_per_page)) + 3 * natts).

In passing, move _bt_moveright from btree.h to nbtsearch.c:
there was no user of the function left outside nbtsearch.c.

[0] https://www.postgresql.org/message-id/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
---
 contrib/amcheck/verify_nbtree.c       |  25 +++++--
 src/backend/access/nbtree/README      |  34 +++++++++
 src/backend/access/nbtree/nbtinsert.c |  34 ++++++---
 src/backend/access/nbtree/nbtsearch.c | 101 +++++++++++++++++++++++---
 src/include/access/nbtree.h           |   9 +--
 5 files changed, 168 insertions(+), 35 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 91caa53dd8..e9f80f3472 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1780,6 +1780,8 @@ bt_target_page_check(BtreeCheckState *state)
 		if (state->checkunique && state->indexinfo->ii_Unique &&
 			P_ISLEAF(topaque) && OffsetNumberNext(offset) <= max)
 		{
+			AttrNumber cmpattr = 1;
+
 			/* Save current scankey tid */
 			scantid = skey->scantid;
 
@@ -1796,7 +1798,7 @@ bt_target_page_check(BtreeCheckState *state)
 			 * treated as different to any other key.
 			 */
 			if (_bt_compare(state->rel, skey, state->target,
-							OffsetNumberNext(offset)) != 0 || skey->anynullkeys)
+							OffsetNumberNext(offset), &cmpattr) != 0 || skey->anynullkeys)
 			{
 				lVis_i = -1;
 				lVis_tid = NULL;
@@ -1875,6 +1877,8 @@ bt_target_page_check(BtreeCheckState *state)
 			if (state->checkunique && state->indexinfo->ii_Unique &&
 				rightkey && P_ISLEAF(topaque) && rightblock_number != P_NONE)
 			{
+				AttrNumber cmpattr = 1;
+
 				elog(DEBUG2, "check cross page unique condition");
 
 				/*
@@ -1885,7 +1889,7 @@ bt_target_page_check(BtreeCheckState *state)
 				rightkey->scantid = NULL;
 
 				/* The first key on the next page is the same */
-				if (_bt_compare(state->rel, rightkey, state->target, max) == 0 && !rightkey->anynullkeys)
+				if (_bt_compare(state->rel, rightkey, state->target, max, &cmpattr) == 0 && !rightkey->anynullkeys)
 				{
 					elog(DEBUG2, "cross page equal keys");
 					state->target = palloc_btree_page(state,
@@ -3087,6 +3091,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		BTInsertStateData insertstate;
 		OffsetNumber offnum;
 		Page		page;
+		AttrNumber	compareattr = 1;
 
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -3096,13 +3101,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.buf = lbuf;
 
 		/* Get matching tuple on leaf page */
-		offnum = _bt_binsrch_insert(state->rel, &insertstate);
+		offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
 		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
 			insertstate.postingoff <= 0 &&
-			_bt_compare(state->rel, key, page, offnum) == 0)
+			_bt_compare(state->rel, key, page, offnum, &compareattr) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
 	}
@@ -3164,6 +3169,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	ItemId		itemid;
 	int32		cmp;
+	AttrNumber	compareattr = 1;
 
 	Assert(!key->nextkey && key->backward);
 
@@ -3174,7 +3180,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
 	if (!key->heapkeyspace)
 		return invariant_leq_offset(state, key, upperbound);
 
-	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound, &compareattr);
 
 	/*
 	 * _bt_compare() is capable of determining that a scankey with a
@@ -3226,10 +3232,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
 	int32		cmp;
+	AttrNumber	compareattr = 1;
 
 	Assert(!key->nextkey && key->backward);
 
-	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound, &compareattr);
 
 	return cmp <= 0;
 }
@@ -3249,10 +3256,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
 				   OffsetNumber lowerbound)
 {
 	int32		cmp;
+	AttrNumber	compareattr = 1;
 
 	Assert(!key->nextkey && key->backward);
 
-	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound, &compareattr);
 
 	/* pg_upgrade'd indexes may legally have equal sibling tuples */
 	if (!key->heapkeyspace)
@@ -3287,13 +3295,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	ItemId		itemid;
 	int32		cmp;
+	AttrNumber	compareattr = 1;
 
 	Assert(!key->nextkey && key->backward);
 
 	/* Verify line pointer before checking tuple */
 	itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
 								  upperbound);
-	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound, &compareattr);
 
 	/* pg_upgrade'd indexes may legally have equal sibling tuples */
 	if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..7f05aff44a 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,40 @@ large groups of duplicates, maximizing space utilization.  Note also that
 deduplication more efficient.  Deduplication can be performed infrequently,
 without merging together existing posting list tuples too often.
 
+Notes about dynamic prefix truncation
+-------------------------------------
+
+When we do a binary search on a sorted set (such as a BTree), we know that a
+tuple will be smaller than its right neighbour, and larger than its left
+neighbour.  If we know that both the left and right neighbour share a prefix
+with the scan key, we deduce that the current tuple must also share this
+prefix with the scan key, which allows us to skip this shared prefix during
+tuple compare operations when we descend the btree.
+
+However, the BTree is not a static sorted list: It is a dynamic data
+structure which changes over time.  Because of this, we cannot just reuse the
+established prefix from the parent page: concurrent page deletions may have
+moved in new tuples from the keyspace left of the left separator of this
+page's downlink on the parent page, and page splits may have moved tuples
+that were previously in this page's indicated keyspace away from the page.
+This means that we have to re-establish the prefix that is shared with the
+scan key at every level of the btree.
+
+Note that this is most effective on indexes where a prefix of key columns has
+many duplicate values.  Compare operations on full index tuples stop at the
+first non-equal attribute: it is quite unlikely that an index on
+(unique, 1, 1, 1) will ever compare the latter columns, while one on
+(1, 1, 1, unique) will essentially always have to compare all attributes to
+hit an attribute that isn't equal to that of the scan key.  The dynamic
+prefix truncation allows PostgreSQL to skip most compare operations on the
+(1, 1, 1 ...) prefix, but it cannot improve the (unique, ...) case because
+that has no "wasteful" prefix that we can skip.
+
+With the above optimizations, dynamic prefix truncation improves the worst
+case complexity of indexing from O(tree_height * natts * log(tups_per_page))
+to O(tree_height * (3*natts + log(tups_per_page))) attribute compare
+operations, which can improve performance significantly.
+
 Notes about deduplication
 -------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e9cfc13604..12e0f9fd5e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -329,6 +329,7 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
 		{
 			Page		page;
 			BTPageOpaque opaque;
+			AttrNumber	prefix = 1;
 
 			_bt_checkpage(rel, insertstate->buf);
 			page = BufferGetPage(insertstate->buf);
@@ -347,7 +348,8 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
 				!P_IGNORE(opaque) &&
 				PageGetFreeSpace(page) > insertstate->itemsz &&
 				PageGetMaxOffsetNumber(page) >= P_HIKEY &&
-				_bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
+				_bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+							&prefix) > 0)
 			{
 				/*
 				 * Caller can use the fastpath optimization because cached
@@ -441,7 +443,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	 * in the fastpath below, but also in the _bt_findinsertloc() call later.
 	 */
 	Assert(!insertstate->bounds_valid);
-	offset = _bt_binsrch_insert(rel, insertstate);
+	offset = _bt_binsrch_insert(rel, insertstate, 1);
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
@@ -451,6 +453,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
+		AttrNumber	cmpatt = 1;
+
 		/*
 		 * Each iteration of the loop processes one heap TID, not one index
 		 * tuple.  Current offset number for page isn't usually advanced on
@@ -486,7 +490,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				Assert(insertstate->bounds_valid);
 				Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
 				Assert(insertstate->low <= insertstate->stricthigh);
-				Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+				Assert(_bt_compare(rel, itup_key, page, offset, &cmpatt) < 0);
 				break;
 			}
 
@@ -511,7 +515,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				if (!inposting)
 				{
 					/* Plain tuple, or first TID in posting list tuple */
-					if (_bt_compare(rel, itup_key, page, offset) != 0)
+					if (_bt_compare(rel, itup_key, page, offset, &cmpatt) != 0)
 						break;	/* we're past all the equal tuples */
 
 					/* Advanced curitup */
@@ -721,11 +725,12 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 		else
 		{
 			int			highkeycmp;
+			cmpatt = 1;
 
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY, &cmpatt);
 			Assert(highkeycmp <= 0);
 			if (highkeycmp != 0)
 				break;
@@ -868,6 +873,8 @@ _bt_findinsertloc(Relation rel,
 
 			for (;;)
 			{
+				AttrNumber	cmpatt = 1;
+
 				/*
 				 * Does the new tuple belong on this page?
 				 *
@@ -885,7 +892,7 @@ _bt_findinsertloc(Relation rel,
 
 				/* Test '<=', not '!=', since scantid is set now */
 				if (P_RIGHTMOST(opaque) ||
-					_bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
+					_bt_compare(rel, itup_key, page, P_HIKEY, &cmpatt) <= 0)
 					break;
 
 				_bt_stepright(rel, heapRel, insertstate, stack);
@@ -938,6 +945,8 @@ _bt_findinsertloc(Relation rel,
 		 */
 		while (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
+			AttrNumber	cmpatt = 1;
+
 			/*
 			 * Before considering moving right, see if we can obtain enough
 			 * space by erasing LP_DEAD items
@@ -968,7 +977,7 @@ _bt_findinsertloc(Relation rel,
 				break;
 
 			if (P_RIGHTMOST(opaque) ||
-				_bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
+				_bt_compare(rel, itup_key, page, P_HIKEY, &cmpatt) != 0 ||
 				pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
 				break;
 
@@ -983,10 +992,13 @@ _bt_findinsertloc(Relation rel,
 	 * We should now be on the correct page.  Find the offset within the page
 	 * for the new tuple. (Possibly reusing earlier search bounds.)
 	 */
-	Assert(P_RIGHTMOST(opaque) ||
-		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+	{
+		AttrNumber	cmpatt PG_USED_FOR_ASSERTS_ONLY = 1;
+		Assert(P_RIGHTMOST(opaque) ||
+			   _bt_compare(rel, itup_key, page, P_HIKEY, &cmpatt) <= 0);
+	}
 
-	newitemoff = _bt_binsrch_insert(rel, insertstate);
+	newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
 
 	if (insertstate->postingoff == -1)
 	{
@@ -1005,7 +1017,7 @@ _bt_findinsertloc(Relation rel,
 		 */
 		Assert(!insertstate->bounds_valid);
 		insertstate->postingoff = 0;
-		newitemoff = _bt_binsrch_insert(rel, insertstate);
+		newitemoff = _bt_binsrch_insert(rel, insertstate, 1);
 		Assert(insertstate->postingoff == 0);
 	}
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 63ee9ba225..fac363aa20 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -27,6 +27,9 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+							Buffer buf, bool forupdate, BTStack stack,
+							int access, AttrNumber *prefix);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +101,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
+	AttrNumber	highkeyprefix = 1;
 
 	/* heaprel must be set whenever _bt_allocbuf is reachable */
 	Assert(access == BT_READ || access == BT_WRITE);
@@ -134,7 +138,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * opportunity to finish splits of internal pages too.
 		 */
 		*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
-							  stack_in, page_access);
+							  stack_in, page_access, &highkeyprefix);
 
 		/* if this is a leaf page, we're done */
 		page = BufferGetPage(*bufP);
@@ -185,6 +189,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 	 */
 	if (access == BT_WRITE && page_access == BT_READ)
 	{
+		highkeyprefix = 1;
+
 		/* trade in our read lock for a write lock */
 		_bt_unlockbuf(rel, *bufP);
 		_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -194,7 +200,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * but before we acquired a write lock.  If it has, we may need to
 		 * move right to its new sibling.  Do that.
 		 */
-		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE);
+		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+							  &highkeyprefix);
 	}
 
 	return stack_in;
@@ -230,6 +237,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
  * On entry, we have the buffer pinned and a lock of the type specified by
  * 'access'.  If we move right, we release the buffer and lock and acquire
  * the same on the right sibling.  Return value is the buffer we stop at.
+ *
+ * On exit, we've updated *comparecol with the prefix that the high key shares
  */
 Buffer
 _bt_moveright(Relation rel,
@@ -238,7 +247,8 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  bool forupdate,
 			  BTStack stack,
-			  int access)
+			  int access,
+			  AttrNumber *prefix)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -262,16 +272,36 @@ _bt_moveright(Relation rel,
 	 *
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
+	 *
+	 * Note: It is important that *comparecol is set to an accurate value at
+	 * return. Rightmost pages have no prefix, thus set it to 1.  In all other
+	 * cases, it must be updated with the prefix result of this page's
+	 * HIGHKEY's full _bt_compare.
 	 */
 	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
+		/*
+		 * We explicitly don't reuse the prefix argument of this function,
+		 * as we may need to move multiple times, and we don't want to
+		 * overwrite the value stored in *prefix if we could reuse it.
+		 * Additionally, its value will be useless for any of our own
+		 * _bt_compare calls: only if the downlink right separator/HIKEY
+		 * optimization is applicable we can use the value, and when it is
+		 * applicable then we don't have to call _bt_compare.
+		 */
+		AttrNumber	prefixatt = 1;
+
 		page = BufferGetPage(buf);
 		opaque = BTPageGetOpaque(page);
 
 		if (P_RIGHTMOST(opaque))
+		{
+			/* rightmost page: no high key, so no known shared prefix */
+			*prefix = 1;
 			break;
+		}
 
 		/*
 		 * Finish any incomplete splits we encounter along the way.
@@ -297,14 +327,19 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) ||
+			_bt_compare(rel, key, page, P_HIKEY, &prefixatt) >= cmpval)
 		{
-			/* step right one page */
+			/* step right one page; prefixatt was only valid for this page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
 			continue;
 		}
 		else
+		{
+			/* report the high key's established prefix to the caller */
+			*prefix = prefixatt;
 			break;
+		}
 	}
 
 	if (P_IGNORE(opaque))
@@ -344,6 +379,13 @@ _bt_binsrch(Relation rel,
 				high;
 	int32		result,
 				cmpval;
+	/*
+	 * Prefix bounds for the tuples at the low/high offsets: the first
+	 * attribute that is not already known to be equal to the scan key.
+	 * A value of 1 means no established shared prefix.
+	 */
+	AttrNumber	highkeyprefix = 1,
+				lowkeyprefix = 1;
 
 	page = BufferGetPage(buf);
 	opaque = BTPageGetOpaque(page);
@@ -376,6 +418,10 @@ _bt_binsrch(Relation rel,
 	 * For nextkey=true (cmpval=0), the loop invariant is: all slots before
 	 * 'low' are <= scan key, all slots at or after 'high' are > scan key.
 	 *
+	 * We maintain highkeyprefix and lowkeyprefix to keep track of prefixes
+	 * that tuples share with the scan key, potentially allowing us to skip a
+	 * prefix in the midpoint comparison.
+	 *
 	 * We can fall out when high == low.
 	 */
 	high++;						/* establish the loop invariant for high */
@@ -385,15 +431,22 @@ _bt_binsrch(Relation rel,
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
+		AttrNumber	prefix = Min(highkeyprefix, lowkeyprefix); /* smaller of both prefix bounds */
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare(rel, key, page, mid, &prefix);
 
 		if (result >= cmpval)
+		{
 			low = mid + 1;
+			lowkeyprefix = prefix;
+		}
 		else
+		{
 			high = mid;
+			highkeyprefix = prefix;
+		}
 	}
 
 	/*
@@ -465,7 +518,8 @@ _bt_binsrch(Relation rel,
  * list split).
  */
 OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+				   AttrNumber highkeyprefix)
 {
 	BTScanInsert key = insertstate->itup_key;
 	Page		page;
@@ -475,6 +529,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 				stricthigh;
 	int32		result,
 				cmpval;
+	AttrNumber	lowkeyprefix = 1;
 
 	page = BufferGetPage(insertstate->buf);
 	opaque = BTPageGetOpaque(page);
@@ -525,16 +580,22 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
+		AttrNumber	prefix = Min(highkeyprefix, lowkeyprefix);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare(rel, key, page, mid, &prefix);
 
 		if (result >= cmpval)
+		{
 			low = mid + 1;
+			lowkeyprefix = prefix;
+		}
 		else
 		{
 			high = mid;
+			highkeyprefix = prefix;
+
 			if (result != 0)
 				stricthigh = high;
 		}
@@ -669,6 +730,13 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
  * matching TID in the posting tuple, which caller must handle
  * themselves (e.g., by splitting the posting list tuple).
  *
+ * NOTE: The "compareattr" argument must refer to the first attribute of the
+ * index tuple that is not already known to match the scan key:
+ * this means 1 for "no known matching attributes", up to the number of key
+ * attributes + 1 if the caller knows that all key attributes of the index
+ * tuple match those of the scan key.  See backend/access/nbtree/README for
+ * details.
+ *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
  * scankey.  The actual key value stored is explicitly truncated to 0
@@ -682,7 +750,8 @@ int32
 _bt_compare(Relation rel,
 			BTScanInsert key,
 			Page page,
-			OffsetNumber offnum)
+			OffsetNumber offnum,
+			AttrNumber *compareattr)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -722,8 +791,9 @@ _bt_compare(Relation rel,
 	ncmpkey = Min(ntupatts, key->keysz);
 	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
-	scankey = key->scankeys;
-	for (int i = 1; i <= ncmpkey; i++)
+
+	scankey = key->scankeys + ((*compareattr) - 1);
+	for (int i = *compareattr; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -767,11 +837,20 @@ _bt_compare(Relation rel,
 
 		/* if the keys are unequal, return the difference */
 		if (result != 0)
+		{
+			*compareattr = (AttrNumber) i;
 			return result;
+		}
 
 		scankey++;
 	}
 
+	/*
+	 * All of the tuple's attributes are equal to the scan key; only
+	 * attributes that follow them could still differ from the scan key.
+	 */
+	*compareattr = ntupatts + 1;
+
 	/*
 	 * All non-truncated attributes (other than heap TID) were found to be
 	 * equal.  Treat truncated attributes as minus infinity when scankey has a
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6eb162052e..3c76963d03 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1229,11 +1229,10 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
  */
 extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
-							Buffer buf, bool forupdate, BTStack stack,
-							int access);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+									   AttrNumber highkeyprefix);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+						 OffsetNumber offnum, AttrNumber *compareattr);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
-- 
2.40.1

#8Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Matthias van de Meent (#7)
1 attachment(s)
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

On Fri, 1 Mar 2024 at 14:48, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

> Attached is version 15 of this patch, with the above issues fixed.
> It's also rebased on top of 655dc310 of this morning, so that should
> keep good for some time again.

Attached is version 16 now. Relevant changes from previous patch versions:

- Move from 1-indexed AttributeNumber to 0-indexed ints for prefixes,
and use "prefix" as naming scheme, rather than cmpcol. A lack of
prefix, previously indicated with a cmpcol value of 1, is now a prefix
value of 0.
- Adjusted README
- Improved the efficiency of the insertion path in some cases where
we've previously compared the page's highkey.

As always, why we need this:

Currently, btrees are quite inefficient when they have many key
attributes but low attribute cardinality in the prefix, e.g. an index
on ("", "", "", uuid). This is not just an inefficient use of disk
space, with the high repetition of duplicate prefix values in pages;
it is also a computational overhead when we're descending the tree in
e.g. _bt_first() or btinsert(): the code we use to search a page
currently compares the full search key against the full index tuple,
at a cost of O(n_equal_attributes + 1) for every tuple compared, or
O(log(page_ntups) * (n_equal_attributes + 1)) attribute compares per
page during descent.

This patch fixes that part of the computational complexity by applying
dynamic prefix compression, reducing the average computational
complexity of random index lookups to O(3 * n_equal_attributes +
log(page_ntups)) attribute compares per page (assuming at least 4
non-hikey tuples on each page). For example, with ~8 midpoint compares
per page (log2 of ~256 tuples) and 3 equal prefix attributes, that is
roughly 9 + 8 = 17 attribute compares per page rather than 8 * 4 = 32.
In practice, this makes indexes with 3+ attributes and low-selectivity
prefixes (such as the example above) much more viable computationally,
as we spend much less effort on comparing known-equal index attributes
during descent.

Note that this does _not_ reuse prefix bounds across pages - it
re-establishes the left and right prefixes on every page during the
binary search. See the README as modified in the patch for specific
implementation details and considerations.
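
To make the mechanism concrete outside of the patch, here is a minimal,
self-contained sketch of a prefix-aware binary search over a sorted
array. All names in it (binsrch_with_prefix, compare_att) are invented
for illustration, and the per-attribute compare callback is assumed;
the real implementation lives in _bt_binsrch() and _bt_compare():

/*
 * Sketch only: dynamic prefix truncation during a binary search.
 * Invented names; not the actual PostgreSQL implementation.
 */
#include <stddef.h>

/*
 * Assumed helper: compare the scan key's attribute "att" (0-based)
 * against the same attribute of "tuple"; returns <0, 0, or >0.
 */
extern int	compare_att(const void *scankey, const void *tuple, int att);

/*
 * Return the first slot whose tuple sorts strictly after the scan key
 * (the nextkey=true case).  "natts" is the number of key attributes.
 */
static size_t
binsrch_with_prefix(const void *scankey, const void **tuples,
					size_t ntuples, int natts)
{
	size_t		low = 0,
				high = ntuples;
	int			lowprefix = 0;	/* attrs shared with the tuple at low - 1 */
	int			highprefix = 0; /* attrs shared with the tuple at high */

	while (high > low)
	{
		size_t		mid = low + (high - low) / 2;

		/*
		 * tuples[low - 1] <= tuples[mid] <= tuples[high], so the smaller
		 * of both prefix bounds also holds for tuples[mid].
		 */
		int			prefix = lowprefix < highprefix ? lowprefix : highprefix;
		int			result = 0;

		/* start comparing at the first attribute not known to be equal */
		while (prefix < natts &&
			   (result = compare_att(scankey, tuples[mid], prefix)) == 0)
			prefix++;

		if (result >= 0)		/* scan key >= tuple: search right half */
		{
			low = mid + 1;
			lowprefix = prefix;
		}
		else					/* scan key < tuple: search left half */
		{
			high = mid;
			highprefix = prefix;
		}
	}
	return low;
}

Taking Min() over both bounds before every midpoint compare is what the
patch does at the top of each _bt_binsrch() iteration, and resetting
both bounds to 0 on every page is what keeps the approach safe under
concurrent page splits and deletions.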

This patch synergizes with the highkey optimization used in [0]: when
combined, the number of attribute compare function calls could be
further reduced to O(2 * n_equal_atts + log(page_ntups)), a reduction
of n_equal_atts per page; in certain wide index shapes that could be
over 25% of the attribute compare function calls that remain on the
page after dynamic prefix truncation. However, both optimizations are
separately useful and reduce the amount of work done on most pages.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: /messages/by-id/CAEze2WijWhCQy_nZVP4Ye5Hwj=YW=3rqv+hbMJGcOHtrYQmyKw@mail.gmail.com

Attachments:

v16-0001-btree-Implement-dynamic-prefix-truncation.patchapplication/octet-stream; name=v16-0001-btree-Implement-dynamic-prefix-truncation.patchDownload
From ceeebf4636d2c4b1b570ba1333632940acaf5945 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 6 Aug 2024 22:05:51 +0200
Subject: [PATCH v16] btree: Implement dynamic prefix truncation

Because tuples in the btree are ordered, if some prefix of the
scan attributes on both sides of the compared tuple are equal
to the scankey, then the current tuple that is being compared
must also contain the same prefix that equals the scankey.

By applying this knowledge we can save a lot of attribute
compares while we descend a tree: We only have to establish
that a tuple left and right of the midpoint during binary
search share a prefix, and then we can use that to skip
comparing that prefix for further compares in the binary
search phase of index descent.

Due to concurrent page splits and deletes, the key range of a
page may move while we travel down the btree.  This means we
can't reuse left- and right-key prefixes we established on the
parent page while we descend the btree: the data on the page
may not match the parent page's separator keys anymore
(see [0]).

So, we can't always descend the btree while keeping the
discovered prefixes.  However, on each page we only see a
snapshot of the page, so we can use dynamic prefix truncation
at the page level.  This still improves the worst-case number
of compare operations during btree descent from  depth *
ceil(log(tuples_per_page)) * natts  to  depth *
(ceil(log(tuples_per_page)) + 3 * natts).

In passing, move _bt_moveright from nbtree.h to nbtsearch.c:
there was no user of the function left outside nbtsearch.c,
and there doesn't seem to be a use case for external usage.
This also allows the compiler to optimize a bit more than it
could previously.

[0] https://www.postgresql.org/message-id/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
---
 contrib/amcheck/verify_nbtree.c       |  24 ++++--
 src/backend/access/nbtree/README      |  37 ++++++++++
 src/backend/access/nbtree/nbtinsert.c |  41 ++++++++---
 src/backend/access/nbtree/nbtsearch.c | 102 +++++++++++++++++++++++---
 src/include/access/nbtree.h           |   9 +--
 5 files changed, 178 insertions(+), 35 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 7cfb136763..c7dc6725a4 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1798,6 +1798,8 @@ bt_target_page_check(BtreeCheckState *state)
 		if (state->checkunique && state->indexinfo->ii_Unique &&
 			P_ISLEAF(topaque) && OffsetNumberNext(offset) <= max)
 		{
+			int		sprefix = 0;
+
 			/* Save current scankey tid */
 			scantid = skey->scantid;
 
@@ -1817,7 +1819,7 @@ bt_target_page_check(BtreeCheckState *state)
 			 * bt_entry_unique_check() call if it was postponed.
 			 */
 			if (_bt_compare(state->rel, skey, state->target,
-							OffsetNumberNext(offset)) != 0 || skey->anynullkeys)
+							OffsetNumberNext(offset), &sprefix) != 0 || skey->anynullkeys)
 			{
 				lVis.blkno = InvalidBlockNumber;
 				lVis.offset = InvalidOffsetNumber;
@@ -1900,6 +1902,7 @@ bt_target_page_check(BtreeCheckState *state)
 				rightkey && P_ISLEAF(topaque) && !P_RIGHTMOST(topaque))
 			{
 				BlockNumber rightblock_number = topaque->btpo_next;
+				int		sprefix = 0;
 
 				elog(DEBUG2, "check cross page unique condition");
 
@@ -1911,7 +1914,7 @@ bt_target_page_check(BtreeCheckState *state)
 				rightkey->scantid = NULL;
 
 				/* The first key on the next page is the same */
-				if (_bt_compare(state->rel, rightkey, state->target, max) == 0 &&
+				if (_bt_compare(state->rel, rightkey, state->target, max, &sprefix) == 0 &&
 					!rightkey->anynullkeys)
 				{
 					Page		rightpage;
@@ -3174,6 +3177,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		BTInsertStateData insertstate;
 		OffsetNumber offnum;
 		Page		page;
+		int			sprefix = 0;
 
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
@@ -3183,13 +3187,13 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.buf = lbuf;
 
 		/* Get matching tuple on leaf page */
-		offnum = _bt_binsrch_insert(state->rel, &insertstate);
+		offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
 		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
 			insertstate.postingoff <= 0 &&
-			_bt_compare(state->rel, key, page, offnum) == 0)
+			_bt_compare(state->rel, key, page, offnum, &sprefix) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
 	}
@@ -3251,6 +3255,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	ItemId		itemid;
 	int32		cmp;
+	int			sprefix = 0;
 
 	Assert(!key->nextkey && key->backward);
 
@@ -3261,7 +3266,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
 	if (!key->heapkeyspace)
 		return invariant_leq_offset(state, key, upperbound);
 
-	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound, &sprefix);
 
 	/*
 	 * _bt_compare() is capable of determining that a scankey with a
@@ -3313,10 +3318,11 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
 	int32		cmp;
+	int			sprefix = 0;
 
 	Assert(!key->nextkey && key->backward);
 
-	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound, &sprefix);
 
 	return cmp <= 0;
 }
@@ -3336,10 +3342,11 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
 				   OffsetNumber lowerbound)
 {
 	int32		cmp;
+	int			sprefix = 0;
 
 	Assert(!key->nextkey && key->backward);
 
-	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound, &sprefix);
 
 	/* pg_upgrade'd indexes may legally have equal sibling tuples */
 	if (!key->heapkeyspace)
@@ -3374,13 +3381,14 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	ItemId		itemid;
 	int32		cmp;
+	int			sprefix = 0;
 
 	Assert(!key->nextkey && key->backward);
 
 	/* Verify line pointer before checking tuple */
 	itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
 								  upperbound);
-	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound, &sprefix);
 
 	/* pg_upgrade'd indexes may legally have equal sibling tuples */
 	if (!key->heapkeyspace)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..9966d3b051 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,43 @@ large groups of duplicates, maximizing space utilization.  Note also that
 deduplication more efficient.  Deduplication can be performed infrequently,
 without merging together existing posting list tuples too often.
 
+Notes about dynamic prefix truncation
+-------------------------------------
+
+When we do a binary search on a sorted set (such as a BTree), we know that a
+tuple will be smaller than its right neighbour, and larger than its left
+neighbour.  If we know that both the left and right neighbour share a prefix
+with the scan key, we deduce that the current tuple must also share this
+prefix with the scan key, which allows us to skip this shared prefix during
+tuple compare operations when we descend the btree.
+
+However, the BTree is not a static sorted list: It is a page-based dynamic
+data structure which changes over time.  Because of this, we cannot just reuse
+the established prefix from the parent page: concurrent page deletions may
+have moved in new tuples from the keyspace left of the left separator of this
+page's downlink on the parent page, and page splits may have moved tuples
+that were previously in this page's indicated keyspace away from the page.
+This means that we have to re-establish the prefix that is shared with the
+scan key at every level of the btree: the page we're about to visit could
+contain only values that would sort left of the parent page's left downlink,
+i.e. outside the range we've built a prefix for.
+
+NOTE: This optimization is most effective on indexes where a prefix of key
+columns has many duplicate values.  Compare operations on index tuples stop at
+the first non-equal attribute, so it is quite unlikely that an index on
+(unique_values, 1, 1, 1) will ever compare the latter columns during index
+descent(_bt_search), while one on (1, 1, 1, unique_values) will essentially
+always have to compare all attributes to hit an attribute that isn't equal to
+that of the scan key's (assuming the scan key is in the index's keyspace).
+The dynamic prefix truncation allows PostgreSQL to skip most compare
+operations on the (1, 1, 1 ...) prefix, but it cannot improve the
+(unique, ...) case because that has no "wasteful" prefix that we can skip.
+
+With the above optimizations, dynamic prefix truncation improves the
+worst-case complexity of index descents from O(tree_height * natts *
+log(tups_per_page)) to O(tree_height * (3*natts + log(tups_per_page)))
+attribute compare operations, which can improve performance significantly.
+
 Notes about deduplication
 -------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 7e8902e48c..62c87e72d8 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -328,6 +328,7 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
 		{
 			Page		page;
 			BTPageOpaque opaque;
+			int			hikey_sprefix = 0;
 
 			_bt_checkpage(rel, insertstate->buf);
 			page = BufferGetPage(insertstate->buf);
@@ -346,7 +347,8 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
 				!P_IGNORE(opaque) &&
 				PageGetFreeSpace(page) > insertstate->itemsz &&
 				PageGetMaxOffsetNumber(page) >= P_HIKEY &&
-				_bt_compare(rel, insertstate->itup_key, page, P_HIKEY) > 0)
+				_bt_compare(rel, insertstate->itup_key, page, P_HIKEY,
+							&hikey_sprefix) > 0)
 			{
 				/*
 				 * Caller can use the fastpath optimization because cached
@@ -440,7 +442,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	 * in the fastpath below, but also in the _bt_findinsertloc() call later.
 	 */
 	Assert(!insertstate->bounds_valid);
-	offset = _bt_binsrch_insert(rel, insertstate);
+	offset = _bt_binsrch_insert(rel, insertstate, 0);
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
@@ -450,6 +452,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
+		int			sprefix = 0;
+
 		/*
 		 * Each iteration of the loop processes one heap TID, not one index
 		 * tuple.  Current offset number for page isn't usually advanced on
@@ -485,7 +489,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				Assert(insertstate->bounds_valid);
 				Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
 				Assert(insertstate->low <= insertstate->stricthigh);
-				Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+				Assert(_bt_compare(rel, itup_key, page, offset, &sprefix) < 0);
 				break;
 			}
 
@@ -510,7 +514,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				if (!inposting)
 				{
 					/* Plain tuple, or first TID in posting list tuple */
-					if (_bt_compare(rel, itup_key, page, offset) != 0)
+					if (_bt_compare(rel, itup_key, page, offset, &sprefix) != 0)
 						break;	/* we're past all the equal tuples */
 
 					/* Advanced curitup */
@@ -720,11 +724,12 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 		else
 		{
 			int			highkeycmp;
+			sprefix = 0;
 
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY, &sprefix);
 			Assert(highkeycmp <= 0);
 			if (highkeycmp != 0)
 				break;
@@ -823,6 +828,7 @@ _bt_findinsertloc(Relation rel,
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque opaque;
 	OffsetNumber newitemoff;
+	int			hikeysprefix = 0;
 
 	opaque = BTPageGetOpaque(page);
 
@@ -867,6 +873,8 @@ _bt_findinsertloc(Relation rel,
 
 			for (;;)
 			{
+				int		sprefix = 0;
+
 				/*
 				 * Does the new tuple belong on this page?
 				 *
@@ -884,8 +892,11 @@ _bt_findinsertloc(Relation rel,
 
 				/* Test '<=', not '!=', since scantid is set now */
 				if (P_RIGHTMOST(opaque) ||
-					_bt_compare(rel, itup_key, page, P_HIKEY) <= 0)
+					_bt_compare(rel, itup_key, page, P_HIKEY, &sprefix) <= 0)
+				{
+					hikeysprefix = sprefix;
 					break;
+				}
 
 				_bt_stepright(rel, heapRel, insertstate, stack);
 				/* Update local state after stepping right */
@@ -937,6 +948,8 @@ _bt_findinsertloc(Relation rel,
 		 */
 		while (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
+			int		sprefix = 0;
+
 			/*
 			 * Before considering moving right, see if we can obtain enough
 			 * space by erasing LP_DEAD items
@@ -967,9 +980,12 @@ _bt_findinsertloc(Relation rel,
 				break;
 
 			if (P_RIGHTMOST(opaque) ||
-				_bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
+				_bt_compare(rel, itup_key, page, P_HIKEY, &sprefix) != 0 ||
 				pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 100))
+			{
+				hikeysprefix = sprefix;
 				break;
+			}
 
 			_bt_stepright(rel, heapRel, insertstate, stack);
 			/* Update local state after stepping right */
@@ -982,10 +998,13 @@ _bt_findinsertloc(Relation rel,
 	 * We should now be on the correct page.  Find the offset within the page
 	 * for the new tuple. (Possibly reusing earlier search bounds.)
 	 */
-	Assert(P_RIGHTMOST(opaque) ||
-		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+	{
+		int			sprefix PG_USED_FOR_ASSERTS_ONLY = 0;
+		Assert(P_RIGHTMOST(opaque) ||
+			   _bt_compare(rel, itup_key, page, P_HIKEY, &sprefix) <= 0);
+	}
 
-	newitemoff = _bt_binsrch_insert(rel, insertstate);
+	newitemoff = _bt_binsrch_insert(rel, insertstate, hikeysprefix);
 
 	if (insertstate->postingoff == -1)
 	{
@@ -1004,7 +1023,7 @@ _bt_findinsertloc(Relation rel,
 		 */
 		Assert(!insertstate->bounds_valid);
 		insertstate->postingoff = 0;
-		newitemoff = _bt_binsrch_insert(rel, insertstate);
+		newitemoff = _bt_binsrch_insert(rel, insertstate, hikeysprefix);
 		Assert(insertstate->postingoff == 0);
 	}
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 57bcfc7e4c..1abb1f8231 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -27,6 +27,9 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+							Buffer buf, bool forupdate, BTStack stack,
+							int access, int *pagehikeysprefix);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +101,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
+	int			hikeysprefix = 0;
 
 	/* heaprel must be set whenever _bt_allocbuf is reachable */
 	Assert(access == BT_READ || access == BT_WRITE);
@@ -134,7 +138,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * opportunity to finish splits of internal pages too.
 		 */
 		*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
-							  stack_in, page_access);
+							  stack_in, page_access, &hikeysprefix);
 
 		/* if this is a leaf page, we're done */
 		page = BufferGetPage(*bufP);
@@ -185,6 +189,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 	 */
 	if (access == BT_WRITE && page_access == BT_READ)
 	{
+		hikeysprefix = 0;
+
 		/* trade in our read lock for a write lock */
 		_bt_unlockbuf(rel, *bufP);
 		_bt_lockbuf(rel, *bufP, BT_WRITE);
@@ -194,7 +200,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * but before we acquired a write lock.  If it has, we may need to
 		 * move right to its new sibling.  Do that.
 		 */
-		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE);
+		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+							  &hikeysprefix);
 	}
 
 	return stack_in;
@@ -230,6 +237,9 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
  * On entry, we have the buffer pinned and a lock of the type specified by
  * 'access'.  If we move right, we release the buffer and lock and acquire
  * the same on the right sibling.  Return value is the buffer we stop at.
+ *
+ * On exit, we've updated *pagehikeysprefix with the shared prefix that
+ * the returned buffer's page's highkey shares with the search key.
  */
 Buffer
 _bt_moveright(Relation rel,
@@ -238,7 +248,8 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  bool forupdate,
 			  BTStack stack,
-			  int access)
+			  int access,
+			  int *pagehikeysprefix)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -262,16 +273,34 @@ _bt_moveright(Relation rel,
 	 *
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
+	 *
+	 * Note: It is important that *pagehikeysprefix is set to an accurate
+	 * value at return.  Rightmost pages have no high key, so there is no
+	 * shared prefix; set it to 0.  In all other cases, it must be updated
+	 * with the prefix result of a full _bt_compare against this page's
+	 * P_HIKEY tuple.
 	 */
 	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
+		/*
+		 * We explicitly don't reuse this function's prefix argument here:
+		 * we may have to move right multiple times, and we must not leave
+		 * a stale value in *pagehikeysprefix.  Additionally, any previously
+		 * established prefix is useless for our own _bt_compare calls due
+		 * to potential concurrent page splits.
+		 */
+		int			hikeysprefix = 0;
+
 		page = BufferGetPage(buf);
 		opaque = BTPageGetOpaque(page);
 
 		if (P_RIGHTMOST(opaque))
+		{
+			/* rightmost page: no high key, so no known shared prefix */
+			*pagehikeysprefix = 0;
 			break;
+		}
 
 		/*
 		 * Finish any incomplete splits we encounter along the way.
@@ -297,14 +326,19 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) ||
+			_bt_compare(rel, key, page, P_HIKEY, &hikeysprefix) >= cmpval)
 		{
-			/* step right one page */
+			/* step right one page; hikeysprefix was only valid for this page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
 			continue;
 		}
 		else
+		{
+			/* make sure to communicate the established prefix */
+			*pagehikeysprefix = hikeysprefix;
 			break;
+		}
 	}
 
 	if (P_IGNORE(opaque))
@@ -344,6 +378,13 @@ _bt_binsrch(Relation rel,
 				high;
 	int32		result,
 				cmpval;
+	/*
+	 * Prefix bounds for the tuples at the low/high offsets: the number of
+	 * leading attributes already known to be equal to the scan key.
+	 * "highkeyprefix" is for this page's high key (if any); 0 means none.
+	 */
+	int			highkeyprefix = 0,
+				lowkeyprefix = 0;
 
 	page = BufferGetPage(buf);
 	opaque = BTPageGetOpaque(page);
@@ -376,6 +417,10 @@ _bt_binsrch(Relation rel,
 	 * For nextkey=true (cmpval=0), the loop invariant is: all slots before
 	 * 'low' are <= scan key, all slots at or after 'high' are > scan key.
 	 *
+	 * We maintain highkeyprefix and lowkeyprefix to keep track of prefixes
+	 * that tuples share with the scan key, potentially allowing us to skip a
+	 * prefix in the midpoint comparison.
+	 *
 	 * We can fall out when high == low.
 	 */
 	high++;						/* establish the loop invariant for high */
@@ -385,15 +430,22 @@ _bt_binsrch(Relation rel,
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
+		int			prefix = Min(highkeyprefix, lowkeyprefix); /* smaller of both prefix bounds */
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare(rel, key, page, mid, &prefix);
 
 		if (result >= cmpval)
+		{
 			low = mid + 1;
+			lowkeyprefix = prefix;
+		}
 		else
+		{
 			high = mid;
+			highkeyprefix = prefix;
+		}
 	}
 
 	/*
@@ -465,7 +517,8 @@ _bt_binsrch(Relation rel,
  * list split).
  */
 OffsetNumber
-_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+				   int highkeyprefix)
 {
 	BTScanInsert key = insertstate->itup_key;
 	Page		page;
@@ -475,6 +528,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 				stricthigh;
 	int32		result,
 				cmpval;
+	int			lowkeyprefix = 0;
 
 	page = BufferGetPage(insertstate->buf);
 	opaque = BTPageGetOpaque(page);
@@ -525,16 +579,22 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
+		int			prefix = Min(highkeyprefix, lowkeyprefix);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare(rel, key, page, mid, &prefix);
 
 		if (result >= cmpval)
+		{
 			low = mid + 1;
+			lowkeyprefix = prefix;
+		}
 		else
 		{
 			high = mid;
+			highkeyprefix = prefix;
+
 			if (result != 0)
 				stricthigh = high;
 		}
@@ -669,6 +729,13 @@ _bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
  * matching TID in the posting tuple, which caller must handle
  * themselves (e.g., by splitting the posting list tuple).
  *
+ * NOTE: The "sprefix" argument must hold the number of leading index tuple
+ * attributes that the caller already knows are equal to the scan key's
+ * attributes.  This means 0 for "no known matching attributes", up to the
+ * number of key attributes if the caller knows that all key attributes of
+ * the index tuple match those of the scan key.
+ * See also backend/access/nbtree/README for details.
+ *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
  * scankey.  The actual key value stored is explicitly truncated to 0
@@ -682,7 +749,8 @@ int32
 _bt_compare(Relation rel,
 			BTScanInsert key,
 			Page page,
-			OffsetNumber offnum)
+			OffsetNumber offnum,
+			int *sprefix)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = BTPageGetOpaque(page);
@@ -722,8 +790,11 @@ _bt_compare(Relation rel,
 	ncmpkey = Min(ntupatts, key->keysz);
 	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
-	scankey = key->scankeys;
-	for (int i = 1; i <= ncmpkey; i++)
+
+	scankey = key->scankeys + *sprefix;
+
+	/* start at the first attribute not already known to be equal */
+	for (int i = *sprefix + 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -767,11 +838,20 @@ _bt_compare(Relation rel,
 
 		/* if the keys are unequal, return the difference */
 		if (result != 0)
+		{
+			*sprefix = i - 1;
 			return result;
+		}
 
 		scankey++;
 	}
 
+	/*
+	 * All of the tuple's attributes are equal to the scan key; only
+	 * attributes that follow them could still differ from the scan key.
+	 */
+	*sprefix = ntupatts;
+
 	/*
 	 * All non-truncated attributes (other than heap TID) were found to be
 	 * equal.  Treat truncated attributes as minus infinity when scankey has a
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7493043348..95a5b46b3e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1272,11 +1272,10 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
  */
 extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
-							Buffer buf, bool forupdate, BTStack stack,
-							int access);
-extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
-extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate,
+									   int highkeyprefix);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page,
+						 OffsetNumber offnum, int *sprefix);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
-- 
2.40.1

#9Peter Geoghegan
In reply to: Matthias van de Meent (#8)
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

On Tue, Aug 6, 2024 at 5:42 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

> Attached is version 16 now.

I ran this with my old land registry benchmark, used for validating
the space utilization impact of nbtree deduplication (among other
things). This isn't obviously the best benchmark for this sort of
thing, but I seem to recall that you'd used it yourself at some point
to validate work in this area, likely including this patch. So I
decided to start there.

To be clear, this test involves bulk loading of an unlogged table (the
land registry table). The following composite index is created on the
table before we insert any rows, so most of the cycles here are in
index maintenance including _bt_search descents:

CREATE INDEX composite ON land2 USING btree (county COLLATE "C", city
COLLATE "C", locality COLLATE "C");

I wasn't able to see much of an improvement with this patch applied.
It went from ~00:51.598 to ~00:51.053. That's a little disappointing,
given that this is supposed to be a sympathetic case for the patch.
Can you suggest something else? (Granted, I understand that this patch
has some complicated relationship with other patches of yours, which I
don't understand currently.)

I'm a bit worried about side-effects for this assertion:

@@ -485,7 +489,7 @@ _bt_check_unique(Relation rel, BTInsertState
insertstate, Relation heapRel,
                Assert(insertstate->bounds_valid);
                Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
                Assert(insertstate->low <= insertstate->stricthigh);
-               Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+               Assert(_bt_compare(rel, itup_key, page, offset, &sprefix) < 0);
                break;
            }

More generally, it's not entirely clear how the code in
_bt_check_unique is supposed to work with the patch. Why should it be
safe to do what you're doing with the prefix there? It's not like
we're doing a binary search here -- it's more like a linear search.

> - Move from 1-indexed AttributeNumber to 0-indexed ints for prefixes,
> and use "prefix" as naming scheme, rather than cmpcol. A lack of
> prefix, previously indicated with a cmpcol value of 1, is now a prefix
> value of 0.

Found a likely-related bug in the changes you made to amcheck, which I
was able to fix myself like so:

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index c7dc6725a..15be61777 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -3187,7 +3187,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
         insertstate.buf = lbuf;
         /* Get matching tuple on leaf page */
-        offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
+        offnum = _bt_binsrch_insert(state->rel, &insertstate, 0);
         /* Compare first >= matching item on leaf page, if any */

--
Peter Geoghegan

#10Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Peter Geoghegan (#9)
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

On Tue, Aug 13, 2024 at 02:39:10PM GMT, Peter Geoghegan wrote:
> On Tue, Aug 6, 2024 at 5:42 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>
>> Attached is version 16 now.
>
> I ran this with my old land registry benchmark, used for validating
> the space utilization impact of nbtree deduplication (among other
> things). This isn't obviously the best benchmark for this sort of
> thing, but I seem to recall that you'd used it yourself at some point
> to validate work in this area, likely including this patch. So I
> decided to start there.
>
> To be clear, this test involves bulk loading of an unlogged table (the
> land registry table). The following composite index is created on the
> table before we insert any rows, so most of the cycles here are in
> index maintenance including _bt_search descents:
>
> CREATE INDEX composite ON land2 USING btree (county COLLATE "C", city
> COLLATE "C", locality COLLATE "C");
>
> I wasn't able to see much of an improvement with this patch applied.
> It went from ~00:51.598 to ~00:51.053. That's a little disappointing,
> given that this is supposed to be a sympathetic case for the patch.
> Can you suggest something else? (Granted, I understand that this patch
> has some complicated relationship with other patches of yours, which I
> don't understand currently.)

At the risk of showing my ignorance: what exactly is the land registry
benchmark? I think it would be useful if others could reproduce the
results as well, especially if they're somewhat surprising.

At the same time, I would like to note that dynamic prefix truncation
can be useful not only on realistic data, but also in some
weird-but-useful cases, e.g. for a partitioned B-Tree with an
artificial leading key separating partitions. That's a hypothetical
case which can't justify a feature by itself, of course, but it makes
this worth investigating.

#11Peter Geoghegan
In reply to: Dmitry Dolgov (#10)
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

On Wed, Nov 13, 2024 at 3:30 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Tue, Aug 13, 2024 at 02:39:10PM GMT, Peter Geoghegan wrote:
>> On Tue, Aug 6, 2024 at 5:42 PM Matthias van de Meent
>> To be clear, this test involves bulk loading of an unlogged table (the
>> land registry table). The following composite index is created on the
>> table before we insert any rows, so most of the cycles here are in
>> index maintenance including _bt_search descents:
>>
>> CREATE INDEX composite ON land2 USING btree (county COLLATE "C", city
>> COLLATE "C", locality COLLATE "C");
>
> At the risk of showing my ignorance: what exactly is the land registry
> benchmark? I think it would be useful if others could reproduce the
> results as well, especially if they're somewhat surprising.

It's a sample dataset that I've found useful from time to time,
particularly when testing nbtree features, usually with a composite
index like the one I described.

One slightly useful (though far from unique) property of such an index
is that it contains low cardinality (sometimes extremely low
cardinality) data across multiple variable-width (text) index columns.
That specific combination made the index a decent test of certain
issues affecting the nbtsplitloc.c split point choice logic during
work on Postgres 12 and 13. I believe that Matthias independently
found it useful on a number of other occasions, too.

See https://wiki.postgresql.org/wiki/Sample_Databases for instructions
on how to set it up for yourself. You could probably come up with a
way of generating a similar dataset, without needing to download
anything, though. The fact that I found it useful in the past is at
least somewhat arbitrary.

--
Peter Geoghegan