Optimizing "boundary cases" during backward scan B-Tree index descents

Started by Peter Geogheganover 2 years ago5 messages

pg@bowt.ie

over 2 years ago

1 attachment(s)

An important goal of the work on nbtree that went into PostgreSQL 12
(and to a lesser extent the work that went into 13) was to make sure
that index scans deal with "boundary cases" optimally. The simplest
way of explaining what this means is through a practical worked
example.

Recap, worked example:

Consider a query such as "select count(*) from tenk2 where hundred =
$1", where tenk1 is one of the Wisconsin benchmark tables left behind
by the standard regression tests. In practice this query will show
that there are either 0 or 100 rows counted for every possible $1
value. In practice query execution only ever needs to scan exactly one
leaf page when executing this query -- for any possible $1 value. In
particular, it doesn't matter if $1 is a value that happens to be a
"boundary value". By "boundary value" I mean a value that happens to
match the leftmost or the rightmost tuple on some leaf page. In other
words, a value that happens to match a key from some internal page
seen while descending the tree in _bt_search.

This effect is reliable for the tenk1_hundred index that my example
query uses (a single column index) because numerous optimizations are
in place, which work together. This starts with _bt_search, which will
reliably descend to exactly one leaf page -- the only one where we
could possibly find a match (thanks to the !pivotsearch optimization).
After that, _bt_readpage() will reliably notice that it doesn't have
to go to the page to the right at all (occasionally it'll need to
check the high key to do this, and so will use the optimization
introduced to Postgres 12 by commit 29b64d1d).

(It's easy to see this using "EXPLAIN (ANALYZE, BUFFERS)", which will
reliably break out index pages when we're using a bitmap index scan --
which happens to be the case here. You'll see one root page access and
one leaf page access for the tenk1_hundred index.)

This is just an example of a fairly general principle: we ought to
expect selective index scans to only need to access one leaf page. At
least for any index scan that only needs to access a single "natural
grouping" that is cleanly isolated onto a single leaf page.

Another example of such a "natural grouping" can also be seen in the
TPC-C orderline table's primary key. In practice we shouldn't ever
need to touch more than a single leaf page when we go to read any
individual order's entries from the order lines table/PK. There will
never be more than 15 order lines per order in practice (10 on
average), which guarantees that suffix truncation will avoid splitting
any individual order's order lines across two leaf pages (after a page
split). Plus we're careful to exploit the index structure (and basic
B-Tree index invariants) to maximum effect during index scans....as
long as they're forward scans.

(This ends my recap of "boundary case" handling.)

Patch:

We still fall short when it comes to handling boundary cases optimally
during backwards scans. This is at least true for a subset of
backwards scans that request "goback=true" processing inside
_bt_first. Attached patch improves matters here. Again, the simplest
way of explaining what this does is through a practical worked
example.

Consider an example where we don't do as well as might be expected right now:

EXPLAIN (ANALYZE, BUFFERS) select * from tenk1 where hundred < 12
order by hundred desc limit 1;

This particular example shows "Buffers: shared hit=4". We see one more
buffer hit than is truly necessary though. If I tweak the query by
replacing 12 with the adjacent value 11 (i.e. same query but with
"hundred < 11"), then I'll see "Buffers: shared hit=3" instead.
Similarly, "hundred < 13" will also show "Buffers: shared hit=3". Why
should "hundred <12" need to access an extra leaf page?

Sure enough, the problematic case shows "Buffers: shared hit=3" with
my patch applied, as expected (a buffer for the root page, a leaf
page, and a heap page). The patch makes every variant of my query
touch the same number of leaf pages/buffers, as expected.

The patch teaches _bt_search (actually, _bt_compare) about the
"goback=true" case. This allows it to exploit information about which
leaf page accesses are truly necessary. The code in question already
knows about "nextkey=true", which is a closely related concept. It
feels very natural to also teach it about "goback=true" now.

(Actually, I lied. All the patch really does is rename existing
"pivotsearch" logic, so that the symbol name "goback" is used instead
-- the insertion scankey code doesn't need to change at all. The real
behavioral change takes place in _bt_first, the higher level calling
code. It has been taught to set its insertion/initial positioning
scankey's pivotsearch/goback field to "true" in the patch. Before now,
this option was exclusively during VACUUM, for page deletion. It turns
out that so-called "pivotsearch" behavior is far more general than
currently supposed.)

Thoughts?
--
Peter Geoghegan

Attachments:

v1-0001-Add-nbtree-goback-boundary-case-optimization.patchapplication/octet-stream; name=v1-0001-Add-nbtree-goback-boundary-case-optimization.patchDownload

From ab3d7230031bfc6461b2334f564c434ae0cd4b12 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 19 Jun 2023 09:58:43 -0700
Subject: [PATCH v1] Add nbtree "goback" boundary case optimization.

---
 src/include/access/nbtree.h           |  9 ++--
 src/backend/access/nbtree/nbtpage.c   | 19 +++++--
 src/backend/access/nbtree/nbtsearch.c | 75 ++++++++++++++++-----------
 src/backend/access/nbtree/nbtutils.c  | 12 +++--
 contrib/amcheck/verify_nbtree.c       | 16 +++---
 5 files changed, 79 insertions(+), 52 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8891fa797..6cffbf6d5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -767,9 +767,10 @@ typedef BTStackData *BTStack;
  * locate the first item >= scankey.  When nextkey is true, they will locate
  * the first item > scan key.
  *
- * pivotsearch is set to true by callers that want to re-find a leaf page
- * using a scankey built from a leaf page's high key.  Most callers set this
- * to false.
+ * goback is set to true by callers that intend to back up from the first
+ * match according to the scankey once on the leaf level.  Forward scan
+ * callers always set this to false.  Some backward scan callers prefer to set
+ * this to true to avoid scanning an extra leaf page in boundary cases.
  *
  * scantid is the heap TID that is used as a final tiebreaker attribute.  It
  * is set to NULL when index scan doesn't need to find a position for a
@@ -792,7 +793,7 @@ typedef struct BTScanInsertData
 	bool		allequalimage;
 	bool		anynullkeys;
 	bool		nextkey;
-	bool		pivotsearch;
+	bool		goback;
 	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys array */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c2050656e..61e4c7dac 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1958,12 +1958,21 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
 					return;
 				}
 
-				/* we need an insertion scan key for the search, so build one */
+				/*
+				 * We need an insertion scan key for the search, so build one.
+				 * We specify goback=true because we're searching for the
+				 * location of a matching high key (not the location of the
+				 * first matching non-pivot tuple, which will be in the right
+				 * sibling of the sleafbuf page returned by _bt_search).
+				 *
+				 * Note: Unlike regular "goback" searchers, we're not prepared
+				 * to step back from the page returned by _bt_search.  That
+				 * makes specifying goback=true crucial here (for us it's not
+				 * merely about handling certain boundary cases optimally.)
+				 */
 				itup_key = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page with matching pivot/high key */
-				itup_key->pivotsearch = true;
-				stack = _bt_search(rel, NULL, itup_key, &sleafbuf, BT_READ,
-								   NULL);
+				itup_key->goback = true;
+				stack = _bt_search(rel, NULL, itup_key, &sleafbuf, BT_READ, NULL);
 				/* won't need a second lock or pin on leafbuf */
 				_bt_relbuf(rel, sleafbuf);
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7e05e5867..d7798510d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -792,11 +792,24 @@ _bt_compare(Relation rel,
 		 * doesn't have to descend left because it isn't interested in a match
 		 * that has a heap TID value of -inf.
 		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
+		 * We take a symmetrical approach in our handling of such a boundary
+		 * case during goback searches (though with the same goal in mind as
+		 * the regular/!goback case).  Here we treat -inf as a possible match
+		 * for omitted scankey attribute(s) instead.  That has the effect of
+		 * making the search avoid uselessly locking/pinning the page to the
+		 * right.  In other words, goback search should descend left directly
+		 * because it is only interested in a match that is strictly < 'foo'.
+		 * All goback searches are from backward scan callers that don't
+		 * actually need to visit the page to the right at all.
+		 *
+		 * Note: Some goback searches won't get this behavior.  Consider a
+		 * backward scan whose insertion scan key locates tuples <= 'foo'.
+		 * This search needs to be able to find matches on both sides of the
+		 * ('foo', inf) leaf level high key, so visiting at least two leaf
+		 * pages is unavoidable.  This case will use an initial insertion scan
+		 * key that is goback=true and nextkey=true.  (When nextkey=true then
+		 * goback won't influence how _bt_search descends the tree, no matter
+		 * what we do here.)
 		 *
 		 * Note: the heap TID part of the test ensures that scankey is being
 		 * compared to a pivot tuple with one or more truncated key
@@ -808,9 +821,13 @@ _bt_compare(Relation rel,
 		 * non-key attributes).  !heapkeyspace searches must always be
 		 * prepared to deal with matches on both sides of the pivot once the
 		 * leaf level is reached.
+		 *
+		 * Note: it would be safe (though far from optimal) for every search
+		 * to use goback=true; heapkeyspace searches _effectively_ do so for
+		 * every _bt_search call.
 		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		if (heapTid == NULL && key->keysz == ntupatts && !key->goback &&
+			key->heapkeyspace)
 			return 1;
 
 		/* All provided scankey arguments found to be equal */
@@ -875,8 +892,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	BTStack		stack;
 	OffsetNumber offnum;
 	StrategyNumber strat;
-	bool		nextkey;
-	bool		goback;
 	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
@@ -1283,6 +1298,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * goback = false, we will start the scan on the located item.
 	 *----------
 	 */
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
+	inskey.anynullkeys = false; /* unused */
+	inskey.scantid = NULL;
+	inskey.keysz = keysCount;
 	switch (strat_total)
 	{
 		case BTLessStrategyNumber:
@@ -1293,8 +1312,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			 * for a backward scan, so that is always the correct starting
 			 * position.)
 			 */
-			nextkey = false;
-			goback = true;
+			inskey.nextkey = false;
+			inskey.goback = true;
 			break;
 
 		case BTLessEqualStrategyNumber:
@@ -1305,8 +1324,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			 * for a backward scan, so that is always the correct starting
 			 * position.)
 			 */
-			nextkey = true;
-			goback = true;
+			inskey.nextkey = true;
+			inskey.goback = true;
 			break;
 
 		case BTEqualStrategyNumber:
@@ -1321,8 +1340,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				 * This is the same as the <= strategy.  We will check at the
 				 * end whether the found item is actually =.
 				 */
-				nextkey = true;
-				goback = true;
+				inskey.nextkey = true;
+				inskey.goback = true;
 			}
 			else
 			{
@@ -1330,8 +1349,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				 * This is the same as the >= strategy.  We will check at the
 				 * end whether the found item is actually =.
 				 */
-				nextkey = false;
-				goback = false;
+				inskey.nextkey = false;
+				inskey.goback = false;
 			}
 			break;
 
@@ -1341,8 +1360,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			 * Find first item >= scankey.  (This is only used for forward
 			 * scans.)
 			 */
-			nextkey = false;
-			goback = false;
+			inskey.nextkey = false;
+			inskey.goback = false;
 			break;
 
 		case BTGreaterStrategyNumber:
@@ -1351,8 +1370,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			 * Find first item > scankey.  (This is only used for forward
 			 * scans.)
 			 */
-			nextkey = true;
-			goback = false;
+			inskey.nextkey = true;
+			inskey.goback = false;
 			break;
 
 		default:
@@ -1361,14 +1380,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
-	/* Initialize remaining insertion scan key fields */
-	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
-	inskey.anynullkeys = false; /* unused */
-	inskey.nextkey = nextkey;
-	inskey.pivotsearch = false;
-	inskey.scantid = NULL;
-	inskey.keysz = keysCount;
-
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
@@ -1420,9 +1431,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * the last item on this page.  Adjust the starting offset if needed. (If
 	 * this results in an offset before the first item or after the last one,
 	 * _bt_readpage will report no items found, and then we'll step to the
-	 * next page as needed.)
+	 * next page as needed.  _bt_search avoids this whenever possible.)
 	 */
-	if (goback)
+	if (inskey.goback)
 		offnum = OffsetNumberPrev(offnum);
 
 	/* remember which buffer we have pinned, if any */
@@ -1614,6 +1625,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			}
 
 			itup = (IndexTuple) PageGetItem(page, iid);
+			Assert(!BTreeTupleIsPivot(itup));
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
@@ -1722,6 +1734,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				tuple_alive = true;
 
 			itup = (IndexTuple) PageGetItem(page, iid);
+			Assert(!BTreeTupleIsPivot(itup));
 
 			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
 										 &continuescan);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7da499c4d..87714aba0 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -65,8 +65,8 @@ static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
  *		When itup is a non-pivot tuple, the returned insertion scan key is
  *		suitable for finding a place for it to go on the leaf level.  Pivot
  *		tuples can be used to re-find leaf page with matching high key, but
- *		then caller needs to set scan key's pivotsearch field to true.  This
- *		allows caller to search for a leaf page with a matching high key,
+ *		then caller needs to set insertion scan key's goback field to true.
+ *		This allows caller to search for a leaf page with a matching high key,
  *		which is usually to the left of the first leaf page a non-pivot match
  *		might appear on.
  *
@@ -121,7 +121,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	}
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
-	key->pivotsearch = false;
+	key->goback = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
 	key->scantid = key->heapkeyspace && itup ?
 		BTreeTupleGetHeapTID(itup) : NULL;
@@ -725,7 +725,11 @@ _bt_restore_array_keys(IndexScanDesc scan)
  * failure of either key is indeed enough to stop the scan.  (In general, when
  * inequality keys are present, the initial-positioning code only promises to
  * position before the first possible match, not exactly at the first match,
- * for a forward scan; or after the last match for a backward scan.)
+ * for a forward scan; or after the last match for a backward scan.  Note,
+ * however, that there are various optimizations in the initial-positioning
+ * code that maximize the likelihood that _bt_search will return the leaf page
+ * whose key space makes it precisely the first/last leaf page where a match
+ * might be found.)
  *
  * As a byproduct of this work, we can detect contradictory quals such
  * as "x = 1 AND x > 2".  If we see that, we return so->qual_ok = false,
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 94a975932..e108e806e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2779,7 +2779,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
 	ItemId		itemid;
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->goback);
 
 	/* Verify line pointer before checking tuple */
 	itemid = PageGetItemIdCareful(state, state->targetblock, state->target,
@@ -2841,7 +2841,7 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->goback);
 
 	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
@@ -2864,7 +2864,7 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->goback);
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
@@ -2902,7 +2902,7 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
 	ItemId		itemid;
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->goback);
 
 	/* Verify line pointer before checking tuple */
 	itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
@@ -3128,9 +3128,9 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
  * For example, invariant_g_offset() might miss a cross-page invariant failure
  * on an internal level if the scankey built from the first item on the
  * target's right sibling page happened to be equal to (not greater than) the
- * last item on target page.  The !pivotsearch tiebreaker in _bt_compare()
- * might otherwise cause amcheck to assume (rather than actually verify) that
- * the scankey is greater.
+ * last item on target page.  The !goback tiebreaker in _bt_compare() might
+ * otherwise cause amcheck to assume (rather than actually verify) that the
+ * scankey is greater.
  */
 static inline BTScanInsert
 bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
@@ -3138,7 +3138,7 @@ bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
 	BTScanInsert skey;
 
 	skey = _bt_mkscankey(rel, itup);
-	skey->pivotsearch = true;
+	skey->goback = true;
 
 	return skey;
 }
-- 
2.40.1

Peter Geoghegan

pg@bowt.ie

over 2 years ago

In reply to: Peter Geoghegan (#1)

1 attachment(s)

Re: Optimizing "boundary cases" during backward scan B-Tree index descents

On Mon, Jun 19, 2023 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote:

We still fall short when it comes to handling boundary cases optimally
during backwards scans. This is at least true for a subset of
backwards scans that request "goback=true" processing inside
_bt_first. Attached patch improves matters here. Again, the simplest
way of explaining what this does is through a practical worked
example.

On further reflection, we should go even further in the direction of
teaching _bt_search (and related routines) about the initial leaf
level positioning requirements of backwards scans. In fact, we should
give _bt_search and friends exclusive responsibility for dealing with
initial positioning on behalf of _bt_first. Attached revision shows
how this can work.

This new direction is partly based on the observation that "goback" is
really just a synonym of "backwards scan initial positioning
behavior": all backwards scans already use "goback", while all forward
scans use "!goback". So why shouldn't we just change any "goback"
symbol names to "backward", and be done with it? Then we can move the
"step back one item on leaf level" logic from _bt_first over to
_bt_binsrch. Now _bt_search/_bt_binsrch/_bt_compare own everything to
do with initial positioning.

The main benefit of this approach is that it allows _bt_first to
describe how its various initial positioning strategies work using
high level language, while pretty much leaving the implementation
details up to _bt_search. I've always thought that it was confusing
that the "<= strategy" uses "nextkey=true" -- how can it be "next key"
while also returning keys that directly match those from the insertion
scankey? It only makes sense once you see that the "<= strategy" uses
both "nextkey=true" and "backwards/goback = true" -- something that
the structure in the patch makes clear and explicit.

This revision also adds test coverage for the aforementioned "<=
strategy" (not to be confused with the strategy that we're
optimizing), since it was missing before now. It also adds test
coverage for the "< strategy" (which *is* the strategy affected by the
new optimization). The "< strategy" already has decent enough coverage
-- it just doesn't have coverage that exercises the new optimization.
(Note that I use the term "new optimization" advisedly here -- the new
behavior is closer to "how it's really supposed to work".)

I'm happy with the way that v2 came out, since the new structure makes
a lot more sense to me. The refactoring is probably the most important
aspect of this patch. The new structure seems like it might become
important in a world with skip scan or other new MDAM techniques added
to B-Tree. The important principle here is "think in terms of logical
key space, not in terms of physical pages".

Adding this to the next CF.

--
Peter Geoghegan

Attachments:

v2-0001-Optimize-nbtree-backward-scan-boundary-cases.patchapplication/octet-stream; name=v2-0001-Optimize-nbtree-backward-scan-boundary-cases.patchDownload

From 3879722c7c0d952cfcbcfda1e86ff93982564f96 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 19 Jun 2023 09:58:43 -0700
Subject: [PATCH v2] Optimize nbtree backward scan boundary cases.

Teach _bt_search (and related helper routines like _bt_binsrch and
_bt_compare) about the initial positioning requirements of backward
scans.  This relieves _bt_first from having to deal with those details
for itself.

Now that certain implementation details are hidden from _bt_first, it's
easy to add a new optimization: backwards scans that use the < strategy
avoid uselessly scanning an extra leaf page in certain "boundary cases".

Consider the following example:

SELECT * FROM tenk1 WHERE hundred < 12 ORDER BY hundred DESC LIMIT 1;

Before this commit, nbtree would needlessly scan two leaf pages, even
though it was only really necessary to scan one leaf page (when the
relevant leaf page's high key had the value (12, -inf), as the relevant
tenk1 regression test index does).  nbtree failed to take advantage of
the fact that we can descend straight to the leaf page that contains the
(12, -inf) high key.

nbtree now descends to the left directly whenever it encounters any
query with a similar boundary case.  This relies on the fact that the
right sibling page cannot possibly contain matching non-pivot tuples.
It only applies to backward scans that use the < strategy -- it does not
apply to backward scans that use the <= strategy (since there might be
matches on both sides of the leaf page high key with the <= strategy).

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=XPzM8HzaLPq278Vms420mVSHfgs9wi5tjFKHcapZCEw@mail.gmail.com
---
 src/include/access/nbtree.h               |  11 +-
 src/backend/access/nbtree/nbtpage.c       |  16 +-
 src/backend/access/nbtree/nbtsearch.c     | 207 +++++++++++-----------
 src/backend/access/nbtree/nbtutils.c      |  18 +-
 contrib/amcheck/verify_nbtree.c           |  16 +-
 src/test/regress/expected/btree_index.out |  48 +++++
 src/test/regress/sql/btree_index.sql      |  26 +++
 7 files changed, 208 insertions(+), 134 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8891fa797..fb51924b6 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -763,13 +763,8 @@ typedef BTStackData *BTStack;
  * bit, but may not when inserting into an INCLUDE index (tuple header value
  * is affected by the NULL-ness of both key and non-key attributes).
  *
- * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
- * locate the first item >= scankey.  When nextkey is true, they will locate
- * the first item > scan key.
- *
- * pivotsearch is set to true by callers that want to re-find a leaf page
- * using a scankey built from a leaf page's high key.  Most callers set this
- * to false.
+ * See comments in _bt_first for an explanation of the nextkey and backward
+ * fields.
  *
  * scantid is the heap TID that is used as a final tiebreaker attribute.  It
  * is set to NULL when index scan doesn't need to find a position for a
@@ -792,7 +787,7 @@ typedef struct BTScanInsertData
 	bool		allequalimage;
 	bool		anynullkeys;
 	bool		nextkey;
-	bool		pivotsearch;
+	bool		backward;		/* backward index scan? */
 	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys array */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c2050656e..f84ca4d32 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1958,10 +1958,20 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
 					return;
 				}
 
-				/* we need an insertion scan key for the search, so build one */
+				/*
+				 * We need an insertion scan key for the search, so build one.
+				 *
+				 * _bt_search searches for the leaf page with the first
+				 * matching non-pivot tuple.  We intend to "search" for a
+				 * matching pivot tuple in the leaf page underdoing deletion.
+				 *
+				 * Compensate for the mismatch by specifying backward=true.
+				 * That makes _bt_search locate the page < the position where
+				 * matching non-pivot tuples go.  _bt_search handles pivot
+				 * "boundary cases" intelligently, so this does what we need.
+				 */
 				itup_key = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page with matching pivot/high key */
-				itup_key->pivotsearch = true;
+				itup_key->backward = true;
 				stack = _bt_search(rel, NULL, itup_key, &sleafbuf, BT_READ,
 								   NULL);
 				/* won't need a second lock or pin on leafbuf */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7e05e5867..ec3df1b18 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -328,9 +328,12 @@ _bt_moveright(Relation rel,
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
+ * key >= given scankey, or > scankey if nextkey is true for forward scans.
+ * _bt_binsrch() also "steps back" by one item/tuple on the leaf level in the
+ * case of backward scans.  (NOTE: this means it is possible to return a value
+ * that's 1 greater than the number of keys on the leaf page.  It also means
+ * that we can return an item 1 less than the first non-pivot tuple on any
+ * leaf page.)
  *
  * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
  * of the last key < given scankey, or last key <= given scankey if nextkey
@@ -372,7 +375,7 @@ _bt_binsrch(Relation rel,
 	 * this covers two cases: the page is really empty (no keys), or it
 	 * contains only a high key.  The latter case is possible after vacuuming.
 	 * This can never happen on an internal page, however, since they are
-	 * never empty (an internal page must have children).
+	 * never empty (an internal page must have at least one child).
 	 */
 	if (unlikely(high < low))
 		return low;
@@ -412,10 +415,26 @@ _bt_binsrch(Relation rel,
 	 * past the last slot on the page.
 	 *
 	 * On a leaf page, we always return the first key >= scan key (resp. >
-	 * scan key), which could be the last slot + 1.
+	 * scan key) for forward scan callers, which could be the last slot + 1.
 	 */
 	if (P_ISLEAF(opaque))
+	{
+		/*
+		 * Backward scans should (by definition) be initially positioned at
+		 * the end of some group of equal-to-scankey non-pivot tuples (forward
+		 * scans are initially positioned at the start of such a grouping).
+		 *
+		 * The backward scan case now has a low/offnum is "at the start of the
+		 * wrong grouping".  Step back to make low/offnum "at the end of the
+		 * correct grouping".  This is what our _bt_first caller expects.
+		 * (See comments in _bt_first for an explanation of how the backward
+		 * and nextkey flags influence the initial position of index scans.)
+		 */
+		if (unlikely(key->backward))
+			low = OffsetNumberPrev(low);
+
 		return low;
+	}
 
 	/*
 	 * On a non-leaf page, return the last key < scan key (resp. <= scan key).
@@ -792,25 +811,31 @@ _bt_compare(Relation rel,
 		 * doesn't have to descend left because it isn't interested in a match
 		 * that has a heap TID value of -inf.
 		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
+		 * We take a symmetrical approach in our handling of this boundary
+		 * case during backward scan searches (though with the same goal in
+		 * mind as the regular/!backward case).  Here we treat -inf as a
+		 * possible match for omitted scankey attribute(s).  The searcher
+		 * isn't really interested in matches with a heap TID value of -inf,
+		 * though; it's actually interested in non-pivot tuple matches to the
+		 * immediate left of the ('foo', -inf) pivot (the first leaf page
+		 * match returned by _bt_binsrch might be ('fog', '(42,7)'), for
+		 * example -- since that is "just to the left" of the insertion scan
+		 * key's attribute value(s)).  In other words, a backward scan search
+		 * should descend left directly because it is only interested in
+		 * values strictly < 'foo'.
+		 *
+		 * Note: Some backward searches are ineligible for this optimization.
+		 * It is only effective when nextkey = false and backward = true.
+		 * (When nextkey and backward are both true our behavior here cannot
+		 * and will not alter how _bt_search descends the tree.)
 		 *
 		 * Note: the heap TID part of the test ensures that scankey is being
 		 * compared to a pivot tuple with one or more truncated key
-		 * attributes.
-		 *
-		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
-		 * left here, since they have no heap TID attribute (and cannot have
-		 * any -inf key values in any case, since truncation can only remove
-		 * non-key attributes).  !heapkeyspace searches must always be
-		 * prepared to deal with matches on both sides of the pivot once the
-		 * leaf level is reached.
+		 * attributes.  The heap TID attribute is the last key attribute in
+		 * every index, of course, but other than that it isn't special.
 		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		if (heapTid == NULL && key->keysz == ntupatts && !key->backward &&
+			key->heapkeyspace)
 			return 1;
 
 		/* All provided scankey arguments found to be equal */
@@ -875,12 +900,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	BTStack		stack;
 	OffsetNumber offnum;
 	StrategyNumber strat;
-	bool		nextkey;
-	bool		goback;
 	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
-	int			keysCount = 0;
+	int			keysz = 0;
 	int			i;
 	bool		status;
 	StrategyNumber strat_total;
@@ -1015,7 +1038,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					 ScanDirectionIsBackward(dir)))
 				{
 					/* Yes, so build the key in notnullkeys[keysCount] */
-					chosen = &notnullkeys[keysCount];
+					chosen = &notnullkeys[keysz];
 					ScanKeyEntryInitialize(chosen,
 										   (SK_SEARCHNOTNULL | SK_ISNULL |
 											(impliesNN->sk_flags &
@@ -1036,7 +1059,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (chosen == NULL)
 					break;
-				startKeys[keysCount++] = chosen;
+				startKeys[keysz++] = chosen;
 
 				/*
 				 * Adjust strat_total, and quit if we have stored a > or <
@@ -1110,7 +1133,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * the tree.  Walk down that edge to the first or last key, and scan from
 	 * there.
 	 */
-	if (keysCount == 0)
+	if (keysz == 0)
 	{
 		bool		match;
 
@@ -1132,8 +1155,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * identified by startKeys[].  (Remaining insertion scankey fields are
 	 * initialized after initial-positioning strategy is finalized.)
 	 */
-	Assert(keysCount <= INDEX_MAX_KEYS);
-	for (i = 0; i < keysCount; i++)
+	Assert(keysz <= INDEX_MAX_KEYS);
+	for (i = 0; i < keysz; i++)
 	{
 		ScanKey		cur = startKeys[i];
 
@@ -1174,7 +1197,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			 * did use has to be treated as just a ">=" or "<=" condition, and
 			 * so we'd better adjust strat_total accordingly.
 			 */
-			if (i == keysCount - 1)
+			if (i == keysz - 1)
 			{
 				bool		used_all_subkeys = false;
 
@@ -1183,16 +1206,16 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				{
 					subkey++;
 					Assert(subkey->sk_flags & SK_ROW_MEMBER);
-					if (subkey->sk_attno != keysCount + 1)
+					if (subkey->sk_attno != keysz + 1)
 						break;	/* out-of-sequence, can't use it */
 					if (subkey->sk_strategy != cur->sk_strategy)
 						break;	/* wrong direction, can't use it */
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
-					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(inskey.scankeys + keysCount, subkey,
+					Assert(keysz < INDEX_MAX_KEYS);
+					memcpy(inskey.scankeys + keysz, subkey,
 						   sizeof(ScanKeyData));
-					keysCount++;
+					keysz++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
 						used_all_subkeys = true;
@@ -1273,40 +1296,26 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*----------
 	 * Examine the selected initial-positioning strategy to determine exactly
 	 * where we need to start the scan, and set flag variables to control the
-	 * code below.
-	 *
-	 * If nextkey = false, _bt_search and _bt_binsrch will locate the first
-	 * item >= scan key.  If nextkey = true, they will locate the first
-	 * item > scan key.
-	 *
-	 * If goback = true, we will then step back one item, while if
-	 * goback = false, we will start the scan on the located item.
+	 * initial descent by _bt_search (and our _bt_binsrch call for the leaf
+	 * page _bt_search returns).
 	 *----------
 	 */
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
+	inskey.anynullkeys = false; /* unused */
+	inskey.scantid = NULL;
+	inskey.keysz = keysz;
 	switch (strat_total)
 	{
 		case BTLessStrategyNumber:
 
-			/*
-			 * Find first item >= scankey, then back up one to arrive at last
-			 * item < scankey.  (Note: this positioning strategy is only used
-			 * for a backward scan, so that is always the correct starting
-			 * position.)
-			 */
-			nextkey = false;
-			goback = true;
+			inskey.nextkey = false;
+			inskey.backward = true;
 			break;
 
 		case BTLessEqualStrategyNumber:
 
-			/*
-			 * Find first item > scankey, then back up one to arrive at last
-			 * item <= scankey.  (Note: this positioning strategy is only used
-			 * for a backward scan, so that is always the correct starting
-			 * position.)
-			 */
-			nextkey = true;
-			goback = true;
+			inskey.nextkey = true;
+			inskey.backward = true;
 			break;
 
 		case BTEqualStrategyNumber:
@@ -1318,41 +1327,37 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			if (ScanDirectionIsBackward(dir))
 			{
 				/*
-				 * This is the same as the <= strategy.  We will check at the
-				 * end whether the found item is actually =.
+				 * This is the same as the <= strategy
 				 */
-				nextkey = true;
-				goback = true;
+				inskey.nextkey = true;
+				inskey.backward = true;
 			}
 			else
 			{
 				/*
-				 * This is the same as the >= strategy.  We will check at the
-				 * end whether the found item is actually =.
+				 * This is the same as the >= strategy
 				 */
-				nextkey = false;
-				goback = false;
+				inskey.nextkey = false;
+				inskey.backward = false;
 			}
 			break;
 
 		case BTGreaterEqualStrategyNumber:
 
 			/*
-			 * Find first item >= scankey.  (This is only used for forward
-			 * scans.)
+			 * Find first item >= scankey
 			 */
-			nextkey = false;
-			goback = false;
+			inskey.nextkey = false;
+			inskey.backward = false;
 			break;
 
 		case BTGreaterStrategyNumber:
 
 			/*
-			 * Find first item > scankey.  (This is only used for forward
-			 * scans.)
+			 * Find first item > scankey
 			 */
-			nextkey = true;
-			goback = false;
+			inskey.nextkey = true;
+			inskey.backward = false;
 			break;
 
 		default:
@@ -1361,18 +1366,11 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
-	/* Initialize remaining insertion scan key fields */
-	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
-	inskey.anynullkeys = false; /* unused */
-	inskey.nextkey = nextkey;
-	inskey.pivotsearch = false;
-	inskey.scantid = NULL;
-	inskey.keysz = keysCount;
-
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
+	Assert(ScanDirectionIsBackward(dir) == inskey.backward);
 	stack = _bt_search(rel, NULL, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
@@ -1403,34 +1401,26 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, &inskey, buf);
-
-	/*
-	 * If nextkey = false, we are positioned at the first item >= scan key, or
-	 * possibly at the end of a page on which all the existing items are less
-	 * than the scan key and we know that everything on later pages is greater
-	 * than or equal to scan key.
-	 *
-	 * If nextkey = true, we are positioned at the first item > scan key, or
-	 * possibly at the end of a page on which all the existing items are less
-	 * than or equal to the scan key and we know that everything on later
-	 * pages is greater than scan key.
-	 *
-	 * The actually desired starting point is either this item or the prior
-	 * one, or in the end-of-page case it's the first item on the next page or
-	 * the last item on this page.  Adjust the starting offset if needed. (If
-	 * this results in an offset before the first item or after the last one,
-	 * _bt_readpage will report no items found, and then we'll step to the
-	 * next page as needed.)
-	 */
-	if (goback)
-		offnum = OffsetNumberPrev(offnum);
-
-	/* remember which buffer we have pinned, if any */
 	Assert(!BTScanPosIsValid(so->currPos));
 	so->currPos.buf = buf;
 
 	/*
 	 * Now load data from the first page of the scan.
+	 *
+	 * If inskey.nextkey = false and inskey.backward = false, offnum is
+	 * positioned at the first non-pivot tuple >= inskey.scankeys.
+	 *
+	 * If inskey.nextkey = false and inskey.backward = true, offnum is
+	 * positioned at the last non-pivot tuple < inskey.scankeys.
+	 *
+	 * If inskey.nextkey = true and inskey.backward = false, offnum is
+	 * positioned at the first non-pivot tuple > inskey.scankeys.
+	 *
+	 * If inskey.nextkey = true and inskey.backward = true, offnum is
+	 * positioned at the last non-pivot tuple <= inskey.scankeys.
+	 *
+	 * Note: offnum will be "out of bounds for the leaf page" in certain
+	 * corner cases (e.g., page is empty).  _bt_readpage can deal with that.
 	 */
 	if (!_bt_readpage(scan, dir, offnum))
 	{
@@ -1521,6 +1511,11 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction.
  *
+ * Some callers pass us an offnum returned by _bt_binsrch, which can be either
+ * "maxoff+1" or "P_FIRSTDATAKEY-1" in certain edge cases.  When this happens
+ * we will report no items found, and then we'll step to the next page as
+ * needed.
+ *
  * In the case of a parallel scan, caller must have called _bt_parallel_seize
  * prior to calling this function; this function will invoke
  * _bt_parallel_release before returning.
@@ -1596,6 +1591,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		/* load items[] in ascending order */
 		itemIndex = 0;
 
+		/* _bt_binsrch might have returned offnum of "P_FIRSTDATAKEY-1" */
 		offnum = Max(offnum, minoff);
 
 		while (offnum <= maxoff)
@@ -1614,6 +1610,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			}
 
 			itup = (IndexTuple) PageGetItem(page, iid);
+			Assert(!BTreeTupleIsPivot(itup));
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
@@ -1688,6 +1685,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		/* load items[] in descending order */
 		itemIndex = MaxTIDsPerBTreePage;
 
+		/* _bt_binsrch might have returned offnum of "maxoff+1" */
 		offnum = Min(offnum, maxoff);
 
 		while (offnum >= minoff)
@@ -1722,6 +1720,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				tuple_alive = true;
 
 			itup = (IndexTuple) PageGetItem(page, iid);
+			Assert(!BTreeTupleIsPivot(itup));
 
 			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
 										 &continuescan);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7da499c4d..941594918 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -62,14 +62,6 @@ static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		When itup is a non-pivot tuple, the returned insertion scan key is
- *		suitable for finding a place for it to go on the leaf level.  Pivot
- *		tuples can be used to re-find leaf page with matching high key, but
- *		then caller needs to set scan key's pivotsearch field to true.  This
- *		allows caller to search for a leaf page with a matching high key,
- *		which is usually to the left of the first leaf page a non-pivot match
- *		might appear on.
- *
  *		The result is intended for use with _bt_compare() and _bt_truncate().
  *		Callers that don't need to fill out the insertion scankey arguments
  *		(e.g. they use an ad-hoc comparison routine, or only need a scankey
@@ -120,8 +112,8 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		key->allequalimage = false;
 	}
 	key->anynullkeys = false;	/* initial assumption */
-	key->nextkey = false;
-	key->pivotsearch = false;
+	key->nextkey = false;		/* usual case, required by btinsert */
+	key->backward = false;		/* usual case, required by btinsert */
 	key->keysz = Min(indnkeyatts, tupnatts);
 	key->scantid = key->heapkeyspace && itup ?
 		BTreeTupleGetHeapTID(itup) : NULL;
@@ -725,7 +717,11 @@ _bt_restore_array_keys(IndexScanDesc scan)
  * failure of either key is indeed enough to stop the scan.  (In general, when
  * inequality keys are present, the initial-positioning code only promises to
  * position before the first possible match, not exactly at the first match,
- * for a forward scan; or after the last match for a backward scan.)
+ * for a forward scan; or after the last match for a backward scan.  Note,
+ * however, that there are various optimizations in the initial-positioning
+ * code that maximize the likelihood that _bt_search will return the leaf page
+ * whose key space makes it precisely the first/last leaf page where a match
+ * might be found.)
  *
  * As a byproduct of this work, we can detect contradictory quals such
  * as "x = 1 AND x > 2".  If we see that, we return so->qual_ok = false,
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 94a975932..d06891a2d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2779,7 +2779,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
 	ItemId		itemid;
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	/* Verify line pointer before checking tuple */
 	itemid = PageGetItemIdCareful(state, state->targetblock, state->target,
@@ -2841,7 +2841,7 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
@@ -2864,7 +2864,7 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
@@ -2902,7 +2902,7 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
 	ItemId		itemid;
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	/* Verify line pointer before checking tuple */
 	itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
@@ -3128,9 +3128,9 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
  * For example, invariant_g_offset() might miss a cross-page invariant failure
  * on an internal level if the scankey built from the first item on the
  * target's right sibling page happened to be equal to (not greater than) the
- * last item on target page.  The !pivotsearch tiebreaker in _bt_compare()
- * might otherwise cause amcheck to assume (rather than actually verify) that
- * the scankey is greater.
+ * last item on target page.  The !backward tiebreaker in _bt_compare() might
+ * otherwise cause amcheck to assume (rather than actually verify) that the
+ * scankey is greater.
  */
 static inline BTScanInsert
 bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
@@ -3138,7 +3138,7 @@ bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
 	BTScanInsert skey;
 
 	skey = _bt_mkscankey(rel, itup);
-	skey->pivotsearch = true;
+	skey->backward = true;
 
 	return skey;
 }
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index 93ed5e8cc..4dcb9067a 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -142,6 +142,54 @@ SELECT b.*
   4500 | 2080851358
 (1 row)
 
+--
+-- Add coverage for optimization of backwards scan index descents
+--
+-- Here we expect _bt_search to descend straight to a leaf page containing a
+-- non-pivot tuple with the value '47', which comes last (after 11 similar
+-- non-pivot tuples).  Query execution should only need to visit a single
+-- leaf page here, even though we happen to be interested in a tuple that
+-- the nbtree implementation reasons about as a boundary case.
+--
+-- Test case relies on tenk1_hundred index having a leaf page whose high key
+-- is '(48, -inf)'.  We use a low cardinality index to make our test case less
+-- sensitive to implementation details that may change in the future.
+set enable_seqscan to false;
+set enable_indexscan to true;
+set enable_bitmapscan to false;
+explain (costs off)
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Limit
+   ->  Index Scan Backward using tenk1_hundred on tenk1
+         Index Cond: (hundred < 48)
+(3 rows)
+
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+ hundred | twenty 
+---------+--------
+      47 |      7
+(1 row)
+
+-- This variant of the query need only return a single tuple located to the immediate
+-- right of the '(48, -inf)' high key.  It also only needs to scan one single
+-- leaf page (the right sibling of the page scanned by the last test case):
+explain (costs off)
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Limit
+   ->  Index Scan Backward using tenk1_hundred on tenk1
+         Index Cond: (hundred <= 48)
+(3 rows)
+
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+ hundred | twenty 
+---------+--------
+      48 |      8
+(1 row)
+
 --
 -- Check correct optimization of LIKE (special index operator support)
 -- for both indexscan and bitmapscan cases
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 239f4a475..2075b45a6 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -110,6 +110,32 @@ SELECT b.*
    FROM bt_f8_heap b
    WHERE b.seqno = '4500'::float8;
 
+--
+-- Add coverage for optimization of backwards scan index descents
+--
+-- Here we expect _bt_search to descend straight to a leaf page containing a
+-- non-pivot tuple with the value '47', which comes last (after 11 similar
+-- non-pivot tuples).  Query execution should only need to visit a single
+-- leaf page here, even though we happen to be interested in a tuple that
+-- the nbtree implementation reasons about as a boundary case.
+--
+-- Test case relies on tenk1_hundred index having a leaf page whose high key
+-- is '(48, -inf)'.  We use a low cardinality index to make our test case less
+-- sensitive to implementation details that may change in the future.
+set enable_seqscan to false;
+set enable_indexscan to true;
+set enable_bitmapscan to false;
+explain (costs off)
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+
+-- This variant of the query need only return a single tuple located to the immediate
+-- right of the '(48, -inf)' high key.  It also only needs to scan one single
+-- leaf page (the right sibling of the page scanned by the last test case):
+explain (costs off)
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+
 --
 -- Check correct optimization of LIKE (special index operator support)
 -- for both indexscan and bitmapscan cases
-- 
2.40.1

Peter Geoghegan

pg@bowt.ie

over 2 years ago

In reply to: Peter Geoghegan (#2)

1 attachment(s)

Re: Optimizing "boundary cases" during backward scan B-Tree index descents

On Tue, Jun 20, 2023 at 5:12 PM Peter Geoghegan <pg@bowt.ie> wrote:

I'm happy with the way that v2 came out, since the new structure makes
a lot more sense to me.

Attached is v3, which is a straightforward rebase of v2. v3 is needed
to get the patch to apply cleanly against HEAD - so no real changes
here.

--
Peter Geoghegan

Attachments:

v3-0001-Optimize-nbtree-backward-scan-boundary-cases.patchapplication/octet-stream; name=v3-0001-Optimize-nbtree-backward-scan-boundary-cases.patchDownload

From 3dd679680fe85cf290e6957e1ae0f4f627d467d9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 19 Jun 2023 09:58:43 -0700
Subject: [PATCH v3] Optimize nbtree backward scan boundary cases.

Teach _bt_search (and related helper routines like _bt_binsrch and
_bt_compare) about the initial positioning requirements of backward
scans.  This relieves _bt_first from having to deal with those details
for itself.

Now that certain implementation details are hidden from _bt_first, it's
easy to add a new optimization: backwards scans that use the < strategy
avoid uselessly scanning an extra leaf page in certain "boundary cases".

Consider the following example:

SELECT * FROM tenk1 WHERE hundred < 12 ORDER BY hundred DESC LIMIT 1;

Before this commit, nbtree would needlessly scan two leaf pages, even
though it was only really necessary to scan one leaf page (when the
relevant leaf page's high key had the value (12, -inf), as the relevant
tenk1 regression test index does).  nbtree failed to take advantage of
the fact that we can descend straight to the leaf page that contains the
(12, -inf) high key.

nbtree now descends to the left directly whenever it encounters any
query with a similar boundary case.  This relies on the fact that the
right sibling page cannot possibly contain matching non-pivot tuples.
It only applies to backward scans that use the < strategy -- it does not
apply to backward scans that use the <= strategy (since there might be
matches on both sides of the leaf page high key with the <= strategy).

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=XPzM8HzaLPq278Vms420mVSHfgs9wi5tjFKHcapZCEw@mail.gmail.com
---
 src/include/access/nbtree.h               |  11 +-
 src/backend/access/nbtree/nbtpage.c       |  16 +-
 src/backend/access/nbtree/nbtsearch.c     | 223 +++++++++++-----------
 src/backend/access/nbtree/nbtutils.c      |  18 +-
 contrib/amcheck/verify_nbtree.c           |  16 +-
 src/test/regress/expected/btree_index.out |  48 +++++
 src/test/regress/sql/btree_index.sql      |  26 +++
 7 files changed, 218 insertions(+), 140 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f5c66964c..27eb3f093 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -763,13 +763,8 @@ typedef BTStackData *BTStack;
  * bit, but may not when inserting into an INCLUDE index (tuple header value
  * is affected by the NULL-ness of both key and non-key attributes).
  *
- * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
- * locate the first item >= scankey.  When nextkey is true, they will locate
- * the first item > scan key.
- *
- * pivotsearch is set to true by callers that want to re-find a leaf page
- * using a scankey built from a leaf page's high key.  Most callers set this
- * to false.
+ * See comments in _bt_first for an explanation of the nextkey and backward
+ * fields.
  *
  * scantid is the heap TID that is used as a final tiebreaker attribute.  It
  * is set to NULL when index scan doesn't need to find a position for a
@@ -792,7 +787,7 @@ typedef struct BTScanInsertData
 	bool		allequalimage;
 	bool		anynullkeys;
 	bool		nextkey;
-	bool		pivotsearch;
+	bool		backward;		/* backward index scan? */
 	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys array */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index b7660a459..86ada8a19 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1958,10 +1958,20 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
 					return;
 				}
 
-				/* we need an insertion scan key for the search, so build one */
+				/*
+				 * We need an insertion scan key for the search, so build one.
+				 *
+				 * _bt_search searches for the leaf page with the first
+				 * matching non-pivot tuple.  We intend to "search" for a
+				 * matching pivot tuple in the leaf page underdoing deletion.
+				 *
+				 * Compensate for the mismatch by specifying backward=true.
+				 * That makes _bt_search locate the page < the position where
+				 * matching non-pivot tuples go.  _bt_search handles pivot
+				 * "boundary cases" intelligently, so this does what we need.
+				 */
 				itup_key = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page with matching pivot/high key */
-				itup_key->pivotsearch = true;
+				itup_key->backward = true;
 				stack = _bt_search(rel, NULL, itup_key, &sleafbuf, BT_READ);
 				/* won't need a second lock or pin on leafbuf */
 				_bt_relbuf(rel, sleafbuf);
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 17ad89749..9ab8bea52 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -318,9 +318,12 @@ _bt_moveright(Relation rel,
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
+ * key >= given scankey, or > scankey if nextkey is true for forward scans.
+ * _bt_binsrch() also "steps back" by one item/tuple on the leaf level in the
+ * case of backward scans.  (NOTE: this means it is possible to return a value
+ * that's 1 greater than the number of keys on the leaf page.  It also means
+ * that we can return an item 1 less than the first non-pivot tuple on any
+ * leaf page.)
  *
  * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
  * of the last key < given scankey, or last key <= given scankey if nextkey
@@ -362,7 +365,7 @@ _bt_binsrch(Relation rel,
 	 * this covers two cases: the page is really empty (no keys), or it
 	 * contains only a high key.  The latter case is possible after vacuuming.
 	 * This can never happen on an internal page, however, since they are
-	 * never empty (an internal page must have children).
+	 * never empty (an internal page must have at least one child).
 	 */
 	if (unlikely(high < low))
 		return low;
@@ -402,10 +405,27 @@ _bt_binsrch(Relation rel,
 	 * past the last slot on the page.
 	 *
 	 * On a leaf page, we always return the first key >= scan key (resp. >
-	 * scan key), which could be the last slot + 1.
+	 * scan key) for forward scan callers, which could be the last slot + 1.
 	 */
 	if (P_ISLEAF(opaque))
+	{
+		/*
+		 * Backward scans should (by definition) be initially positioned after
+		 * the final non-pivot tuple from the equal-to-scankey grouping on the
+		 * leaf level (with forward scans it's before the first such tuple).
+		 * This explanation works independently of the scan's nextkey-ness
+		 * (actually, nextkey-ness makes the grouping greater-than-scankey,
+		 * but there's still a grouping to position ourselves relative to).
+		 *
+		 * The backward scan case is now "at the start of the wrong grouping".
+		 * It must now step back to "the end of the correct grouping".  This
+		 * leaves us at the correct final offnum from the correct leaf page.
+		 */
+		if (unlikely(key->backward))
+			return OffsetNumberPrev(low);
+
 		return low;
+	}
 
 	/*
 	 * On a non-leaf page, return the last key < scan key (resp. <= scan key).
@@ -767,7 +787,7 @@ _bt_compare(Relation rel,
 	if (key->scantid == NULL)
 	{
 		/*
-		 * Most searches have a scankey that is considered greater than a
+		 * Forward scans have a scankey that is considered greater than a
 		 * truncated pivot tuple if and when the scankey has equal values for
 		 * attributes up to and including the least significant untruncated
 		 * attribute in tuple.
@@ -782,25 +802,33 @@ _bt_compare(Relation rel,
 		 * doesn't have to descend left because it isn't interested in a match
 		 * that has a heap TID value of -inf.
 		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
+		 * We take a symmetrical approach in our handling of this boundary
+		 * case during backward scan searches (with the same goal in mind).
+		 * Here we treat -inf as a possible match for omitted scankey suffix
+		 * attribute(s).  The searcher isn't really interested in matches with
+		 * a heap TID value of -inf, though; it's actually interested in any
+		 * non-pivot tuple match to the immediate left of the ('foo', -inf)
+		 * pivot.  The first leaf page match returned by _bt_binsrch might be
+		 * ('fog', '(4223,7)'), for example.  In other words, backward scan
+		 * search doesn't have to descend right because it is only interested
+		 * in values strictly < 'foo'.
+		 *
+		 * Note: We can exploit invariants for the >= 'foo' and < 'foo' cases
+		 * like this because pivot tuples usually have -inf sentinel values.
+		 * This isn't possible for the <= 'foo' case because suffix truncation
+		 * will never make truncated key attributes have +inf sentinel values.
+		 * When nextkey and backward are both true, our behavior here cannot
+		 * and will not alter how _bt_search descends the tree.  In other
+		 * words, when scan is interested in <= 'foo' matches we must assume
+		 * that leaf pages on both sides of ('foo', -inf) pivot are relevant.
 		 *
 		 * Note: the heap TID part of the test ensures that scankey is being
-		 * compared to a pivot tuple with one or more truncated key
-		 * attributes.
-		 *
-		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
-		 * left here, since they have no heap TID attribute (and cannot have
-		 * any -inf key values in any case, since truncation can only remove
-		 * non-key attributes).  !heapkeyspace searches must always be
-		 * prepared to deal with matches on both sides of the pivot once the
-		 * leaf level is reached.
+		 * compared to a pivot tuple with one or more truncated -inf key
+		 * attributes.  The heap TID attribute is the last key attribute in
+		 * every index, of course, but other than that it isn't special.
 		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		if (!key->backward && key->keysz == ntupatts && heapTid == NULL &&
+			key->heapkeyspace)
 			return 1;
 
 		/* All provided scankey arguments found to be equal */
@@ -865,12 +893,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	BTStack		stack;
 	OffsetNumber offnum;
 	StrategyNumber strat;
-	bool		nextkey;
-	bool		goback;
 	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
-	int			keysCount = 0;
+	int			keysz = 0;
 	int			i;
 	bool		status;
 	StrategyNumber strat_total;
@@ -1005,7 +1031,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					 ScanDirectionIsBackward(dir)))
 				{
 					/* Yes, so build the key in notnullkeys[keysCount] */
-					chosen = &notnullkeys[keysCount];
+					chosen = &notnullkeys[keysz];
 					ScanKeyEntryInitialize(chosen,
 										   (SK_SEARCHNOTNULL | SK_ISNULL |
 											(impliesNN->sk_flags &
@@ -1026,7 +1052,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (chosen == NULL)
 					break;
-				startKeys[keysCount++] = chosen;
+				startKeys[keysz++] = chosen;
 
 				/*
 				 * Adjust strat_total, and quit if we have stored a > or <
@@ -1100,7 +1126,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * the tree.  Walk down that edge to the first or last key, and scan from
 	 * there.
 	 */
-	if (keysCount == 0)
+	if (keysz == 0)
 	{
 		bool		match;
 
@@ -1122,8 +1148,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * identified by startKeys[].  (Remaining insertion scankey fields are
 	 * initialized after initial-positioning strategy is finalized.)
 	 */
-	Assert(keysCount <= INDEX_MAX_KEYS);
-	for (i = 0; i < keysCount; i++)
+	Assert(keysz <= INDEX_MAX_KEYS);
+	for (i = 0; i < keysz; i++)
 	{
 		ScanKey		cur = startKeys[i];
 
@@ -1164,7 +1190,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			 * did use has to be treated as just a ">=" or "<=" condition, and
 			 * so we'd better adjust strat_total accordingly.
 			 */
-			if (i == keysCount - 1)
+			if (i == keysz - 1)
 			{
 				bool		used_all_subkeys = false;
 
@@ -1173,16 +1199,16 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				{
 					subkey++;
 					Assert(subkey->sk_flags & SK_ROW_MEMBER);
-					if (subkey->sk_attno != keysCount + 1)
+					if (subkey->sk_attno != keysz + 1)
 						break;	/* out-of-sequence, can't use it */
 					if (subkey->sk_strategy != cur->sk_strategy)
 						break;	/* wrong direction, can't use it */
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
-					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(inskey.scankeys + keysCount, subkey,
+					Assert(keysz < INDEX_MAX_KEYS);
+					memcpy(inskey.scankeys + keysz, subkey,
 						   sizeof(ScanKeyData));
-					keysCount++;
+					keysz++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
 						used_all_subkeys = true;
@@ -1263,40 +1289,26 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*----------
 	 * Examine the selected initial-positioning strategy to determine exactly
 	 * where we need to start the scan, and set flag variables to control the
-	 * code below.
-	 *
-	 * If nextkey = false, _bt_search and _bt_binsrch will locate the first
-	 * item >= scan key.  If nextkey = true, they will locate the first
-	 * item > scan key.
-	 *
-	 * If goback = true, we will then step back one item, while if
-	 * goback = false, we will start the scan on the located item.
+	 * initial descent by _bt_search (and our _bt_binsrch call for the leaf
+	 * page _bt_search returns).
 	 *----------
 	 */
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
+	inskey.anynullkeys = false; /* unused */
+	inskey.scantid = NULL;
+	inskey.keysz = keysz;
 	switch (strat_total)
 	{
 		case BTLessStrategyNumber:
 
-			/*
-			 * Find first item >= scankey, then back up one to arrive at last
-			 * item < scankey.  (Note: this positioning strategy is only used
-			 * for a backward scan, so that is always the correct starting
-			 * position.)
-			 */
-			nextkey = false;
-			goback = true;
+			inskey.nextkey = false;
+			inskey.backward = true;
 			break;
 
 		case BTLessEqualStrategyNumber:
 
-			/*
-			 * Find first item > scankey, then back up one to arrive at last
-			 * item <= scankey.  (Note: this positioning strategy is only used
-			 * for a backward scan, so that is always the correct starting
-			 * position.)
-			 */
-			nextkey = true;
-			goback = true;
+			inskey.nextkey = true;
+			inskey.backward = true;
 			break;
 
 		case BTEqualStrategyNumber:
@@ -1308,41 +1320,37 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			if (ScanDirectionIsBackward(dir))
 			{
 				/*
-				 * This is the same as the <= strategy.  We will check at the
-				 * end whether the found item is actually =.
+				 * This is the same as the <= strategy
 				 */
-				nextkey = true;
-				goback = true;
+				inskey.nextkey = true;
+				inskey.backward = true;
 			}
 			else
 			{
 				/*
-				 * This is the same as the >= strategy.  We will check at the
-				 * end whether the found item is actually =.
+				 * This is the same as the >= strategy
 				 */
-				nextkey = false;
-				goback = false;
+				inskey.nextkey = false;
+				inskey.backward = false;
 			}
 			break;
 
 		case BTGreaterEqualStrategyNumber:
 
 			/*
-			 * Find first item >= scankey.  (This is only used for forward
-			 * scans.)
+			 * Find first item >= scankey
 			 */
-			nextkey = false;
-			goback = false;
+			inskey.nextkey = false;
+			inskey.backward = false;
 			break;
 
 		case BTGreaterStrategyNumber:
 
 			/*
-			 * Find first item > scankey.  (This is only used for forward
-			 * scans.)
+			 * Find first item > scankey
 			 */
-			nextkey = true;
-			goback = false;
+			inskey.nextkey = true;
+			inskey.backward = false;
 			break;
 
 		default:
@@ -1351,18 +1359,11 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
-	/* Initialize remaining insertion scan key fields */
-	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
-	inskey.anynullkeys = false; /* unused */
-	inskey.nextkey = nextkey;
-	inskey.pivotsearch = false;
-	inskey.scantid = NULL;
-	inskey.keysz = keysCount;
-
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
+	Assert(ScanDirectionIsBackward(dir) == inskey.backward);
 	stack = _bt_search(rel, NULL, &inskey, &buf, BT_READ);
 
 	/* don't need to keep the stack around... */
@@ -1404,34 +1405,29 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, &inskey, buf);
-
-	/*
-	 * If nextkey = false, we are positioned at the first item >= scan key, or
-	 * possibly at the end of a page on which all the existing items are less
-	 * than the scan key and we know that everything on later pages is greater
-	 * than or equal to scan key.
-	 *
-	 * If nextkey = true, we are positioned at the first item > scan key, or
-	 * possibly at the end of a page on which all the existing items are less
-	 * than or equal to the scan key and we know that everything on later
-	 * pages is greater than scan key.
-	 *
-	 * The actually desired starting point is either this item or the prior
-	 * one, or in the end-of-page case it's the first item on the next page or
-	 * the last item on this page.  Adjust the starting offset if needed. (If
-	 * this results in an offset before the first item or after the last one,
-	 * _bt_readpage will report no items found, and then we'll step to the
-	 * next page as needed.)
-	 */
-	if (goback)
-		offnum = OffsetNumberPrev(offnum);
-
-	/* remember which buffer we have pinned, if any */
 	Assert(!BTScanPosIsValid(so->currPos));
 	so->currPos.buf = buf;
 
 	/*
 	 * Now load data from the first page of the scan.
+	 *
+	 * If inskey.nextkey = false and inskey.backward = false, offnum is
+	 * positioned at the first non-pivot tuple >= inskey.scankeys.
+	 *
+	 * If inskey.nextkey = false and inskey.backward = true, offnum is
+	 * positioned at the last non-pivot tuple < inskey.scankeys.
+	 *
+	 * If inskey.nextkey = true and inskey.backward = false, offnum is
+	 * positioned at the first non-pivot tuple > inskey.scankeys.
+	 *
+	 * If inskey.nextkey = true and inskey.backward = true, offnum is
+	 * positioned at the last non-pivot tuple <= inskey.scankeys.
+	 *
+	 * Note: if there isn't an existing non-pivot tuple that is fully equal to
+	 * inskey.scankeys according to _bt_compare, then offnum is the would-be
+	 * offset -- the offset that btinsert would have to use when inserting a
+	 * matching non-pivot tuple onto the page, backfilled with -inf values for
+	 * omitted suffix key attributes (+inf values in the backward scan case).
 	 */
 	if (!_bt_readpage(scan, dir, offnum))
 	{
@@ -1445,7 +1441,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 	else
 	{
-		/* Drop the lock, and maybe the pin, on the current page */
+		/* We have at least one item to return as scan's first item */
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
@@ -1522,6 +1518,11 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction.
  *
+ * Some callers pass us an offnum returned by _bt_binsrch, which can be an
+ * "invalid" offnum such as "maxoff+1" in certain corner cases.  We deal with
+ * that by only checking the keys from the offnum tuple after we're sure that
+ * it points to a valid non-pivot tuple (which is a good idea anyway).
+ *
  * In the case of a parallel scan, caller must have called _bt_parallel_seize
  * prior to calling this function; this function will invoke
  * _bt_parallel_release before returning.
@@ -1615,6 +1616,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			}
 
 			itup = (IndexTuple) PageGetItem(page, iid);
+			Assert(!BTreeTupleIsPivot(itup));
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
@@ -1723,6 +1725,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				tuple_alive = true;
 
 			itup = (IndexTuple) PageGetItem(page, iid);
+			Assert(!BTreeTupleIsPivot(itup));
 
 			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
 										 &continuescan);
@@ -1971,7 +1974,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 	if (!_bt_readnextpage(scan, blkno, dir))
 		return false;
 
-	/* Drop the lock, and maybe the pin, on the current page */
+	/* We have at least one item to return as scan's next item */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
 	return true;
@@ -2172,7 +2175,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
 	if (!_bt_readnextpage(scan, blkno, dir))
 		return false;
 
-	/* Drop the lock, and maybe the pin, on the current page */
+	/* We have at least one item to return as scan's next item */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
 	return true;
@@ -2461,7 +2464,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	}
 	else
 	{
-		/* Drop the lock, and maybe the pin, on the current page */
+		/* We have at least one item to return as scan's first item */
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7da499c4d..941594918 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -62,14 +62,6 @@ static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		When itup is a non-pivot tuple, the returned insertion scan key is
- *		suitable for finding a place for it to go on the leaf level.  Pivot
- *		tuples can be used to re-find leaf page with matching high key, but
- *		then caller needs to set scan key's pivotsearch field to true.  This
- *		allows caller to search for a leaf page with a matching high key,
- *		which is usually to the left of the first leaf page a non-pivot match
- *		might appear on.
- *
  *		The result is intended for use with _bt_compare() and _bt_truncate().
  *		Callers that don't need to fill out the insertion scankey arguments
  *		(e.g. they use an ad-hoc comparison routine, or only need a scankey
@@ -120,8 +112,8 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		key->allequalimage = false;
 	}
 	key->anynullkeys = false;	/* initial assumption */
-	key->nextkey = false;
-	key->pivotsearch = false;
+	key->nextkey = false;		/* usual case, required by btinsert */
+	key->backward = false;		/* usual case, required by btinsert */
 	key->keysz = Min(indnkeyatts, tupnatts);
 	key->scantid = key->heapkeyspace && itup ?
 		BTreeTupleGetHeapTID(itup) : NULL;
@@ -725,7 +717,11 @@ _bt_restore_array_keys(IndexScanDesc scan)
  * failure of either key is indeed enough to stop the scan.  (In general, when
  * inequality keys are present, the initial-positioning code only promises to
  * position before the first possible match, not exactly at the first match,
- * for a forward scan; or after the last match for a backward scan.)
+ * for a forward scan; or after the last match for a backward scan.  Note,
+ * however, that there are various optimizations in the initial-positioning
+ * code that maximize the likelihood that _bt_search will return the leaf page
+ * whose key space makes it precisely the first/last leaf page where a match
+ * might be found.)
  *
  * As a byproduct of this work, we can detect contradictory quals such
  * as "x = 1 AND x > 2".  If we see that, we return so->qual_ok = false,
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index dbb83d80f..071998bc0 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2779,7 +2779,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
 	ItemId		itemid;
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	/* Verify line pointer before checking tuple */
 	itemid = PageGetItemIdCareful(state, state->targetblock, state->target,
@@ -2841,7 +2841,7 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
@@ -2864,7 +2864,7 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
@@ -2902,7 +2902,7 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
 	ItemId		itemid;
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	/* Verify line pointer before checking tuple */
 	itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
@@ -3128,9 +3128,9 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
  * For example, invariant_g_offset() might miss a cross-page invariant failure
  * on an internal level if the scankey built from the first item on the
  * target's right sibling page happened to be equal to (not greater than) the
- * last item on target page.  The !pivotsearch tiebreaker in _bt_compare()
- * might otherwise cause amcheck to assume (rather than actually verify) that
- * the scankey is greater.
+ * last item on target page.  The !backward tiebreaker in _bt_compare() might
+ * otherwise cause amcheck to assume (rather than actually verify) that the
+ * scankey is greater.
  */
 static inline BTScanInsert
 bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
@@ -3138,7 +3138,7 @@ bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
 	BTScanInsert skey;
 
 	skey = _bt_mkscankey(rel, itup);
-	skey->pivotsearch = true;
+	skey->backward = true;
 
 	return skey;
 }
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index 93ed5e8cc..4dcb9067a 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -142,6 +142,54 @@ SELECT b.*
   4500 | 2080851358
 (1 row)
 
+--
+-- Add coverage for optimization of backwards scan index descents
+--
+-- Here we expect _bt_search to descend straight to a leaf page containing a
+-- non-pivot tuple with the value '47', which comes last (after 11 similar
+-- non-pivot tuples).  Query execution should only need to visit a single
+-- leaf page here, even though we happen to be interested in a tuple that
+-- the nbtree implementation reasons about as a boundary case.
+--
+-- Test case relies on tenk1_hundred index having a leaf page whose high key
+-- is '(48, -inf)'.  We use a low cardinality index to make our test case less
+-- sensitive to implementation details that may change in the future.
+set enable_seqscan to false;
+set enable_indexscan to true;
+set enable_bitmapscan to false;
+explain (costs off)
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Limit
+   ->  Index Scan Backward using tenk1_hundred on tenk1
+         Index Cond: (hundred < 48)
+(3 rows)
+
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+ hundred | twenty 
+---------+--------
+      47 |      7
+(1 row)
+
+-- This variant of the query need only return a single tuple located to the immediate
+-- right of the '(48, -inf)' high key.  It also only needs to scan one single
+-- leaf page (the right sibling of the page scanned by the last test case):
+explain (costs off)
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Limit
+   ->  Index Scan Backward using tenk1_hundred on tenk1
+         Index Cond: (hundred <= 48)
+(3 rows)
+
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+ hundred | twenty 
+---------+--------
+      48 |      8
+(1 row)
+
 --
 -- Check correct optimization of LIKE (special index operator support)
 -- for both indexscan and bitmapscan cases
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 239f4a475..2075b45a6 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -110,6 +110,32 @@ SELECT b.*
    FROM bt_f8_heap b
    WHERE b.seqno = '4500'::float8;
 
+--
+-- Add coverage for optimization of backwards scan index descents
+--
+-- Here we expect _bt_search to descend straight to a leaf page containing a
+-- non-pivot tuple with the value '47', which comes last (after 11 similar
+-- non-pivot tuples).  Query execution should only need to visit a single
+-- leaf page here, even though we happen to be interested in a tuple that
+-- the nbtree implementation reasons about as a boundary case.
+--
+-- Test case relies on tenk1_hundred index having a leaf page whose high key
+-- is '(48, -inf)'.  We use a low cardinality index to make our test case less
+-- sensitive to implementation details that may change in the future.
+set enable_seqscan to false;
+set enable_indexscan to true;
+set enable_bitmapscan to false;
+explain (costs off)
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+
+-- This variant of the query need only return a single tuple located to the immediate
+-- right of the '(48, -inf)' high key.  It also only needs to scan one single
+-- leaf page (the right sibling of the page scanned by the last test case):
+explain (costs off)
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+
 --
 -- Check correct optimization of LIKE (special index operator support)
 -- for both indexscan and bitmapscan cases
-- 
2.40.1

Peter Geoghegan

pg@bowt.ie

about 2 years ago

In reply to: Peter Geoghegan (#3)

1 attachment(s)

Re: Optimizing "boundary cases" during backward scan B-Tree index descents

On Mon, Sep 18, 2023 at 4:58 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v3, which is a straightforward rebase of v2. v3 is needed
to get the patch to apply cleanly against HEAD - so no real changes
here.

Attached is v4. Just to keep CFTester happy.

--
Peter Geoghegan

Attachments:

v4-0001-Optimize-nbtree-backward-scan-boundary-cases.patchapplication/octet-stream; name=v4-0001-Optimize-nbtree-backward-scan-boundary-cases.patchDownload

From 2b59785ed2605d4f37a40b5f852052e00d1ccc38 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 19 Jun 2023 09:58:43 -0700
Subject: [PATCH v4] Optimize nbtree backward scan boundary cases.

Teach _bt_search (and related helper routines like _bt_binsrch and
_bt_compare) about the initial positioning requirements of backward
scans.  This relieves _bt_first from having to deal with those details
for itself.

Now that certain implementation details are hidden from _bt_first, it's
easy to add a new optimization: backwards scans that use the < strategy
avoid uselessly scanning an extra leaf page in certain "boundary cases".

Consider the following example:

SELECT * FROM tenk1 WHERE hundred < 12 ORDER BY hundred DESC LIMIT 1;

Before this commit, nbtree would needlessly scan two leaf pages, even
though it was only really necessary to scan one leaf page (when the
relevant leaf page's high key had the value (12, -inf), as the relevant
tenk1 regression test index does).  nbtree failed to take advantage of
the fact that we can descend straight to the leaf page that contains the
(12, -inf) high key.

nbtree now descends to the left directly whenever it encounters any
query with a similar boundary case.  This relies on the fact that the
right sibling page cannot possibly contain matching non-pivot tuples.
It only applies to backward scans that use the < strategy -- it does not
apply to backward scans that use the <= strategy (since there might be
matches on both sides of the leaf page high key with the <= strategy).

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=XPzM8HzaLPq278Vms420mVSHfgs9wi5tjFKHcapZCEw@mail.gmail.com
---
 src/include/access/nbtree.h               |  11 +-
 src/backend/access/nbtree/nbtpage.c       |  16 +-
 src/backend/access/nbtree/nbtsearch.c     | 223 +++++++++++-----------
 src/backend/access/nbtree/nbtutils.c      |  18 +-
 contrib/amcheck/verify_nbtree.c           |  16 +-
 src/test/regress/expected/btree_index.out |  48 +++++
 src/test/regress/sql/btree_index.sql      |  26 +++
 7 files changed, 218 insertions(+), 140 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7bfbf3086..5e083591a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -763,13 +763,8 @@ typedef BTStackData *BTStack;
  * bit, but may not when inserting into an INCLUDE index (tuple header value
  * is affected by the NULL-ness of both key and non-key attributes).
  *
- * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
- * locate the first item >= scankey.  When nextkey is true, they will locate
- * the first item > scan key.
- *
- * pivotsearch is set to true by callers that want to re-find a leaf page
- * using a scankey built from a leaf page's high key.  Most callers set this
- * to false.
+ * See comments in _bt_first for an explanation of the nextkey and backward
+ * fields.
  *
  * scantid is the heap TID that is used as a final tiebreaker attribute.  It
  * is set to NULL when index scan doesn't need to find a position for a
@@ -792,7 +787,7 @@ typedef struct BTScanInsertData
 	bool		allequalimage;
 	bool		anynullkeys;
 	bool		nextkey;
-	bool		pivotsearch;
+	bool		backward;		/* backward index scan? */
 	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys array */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index b7660a459..86ada8a19 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1958,10 +1958,20 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
 					return;
 				}
 
-				/* we need an insertion scan key for the search, so build one */
+				/*
+				 * We need an insertion scan key for the search, so build one.
+				 *
+				 * _bt_search searches for the leaf page with the first
+				 * matching non-pivot tuple.  We intend to "search" for a
+				 * matching pivot tuple in the leaf page underdoing deletion.
+				 *
+				 * Compensate for the mismatch by specifying backward=true.
+				 * That makes _bt_search locate the page < the position where
+				 * matching non-pivot tuples go.  _bt_search handles pivot
+				 * "boundary cases" intelligently, so this does what we need.
+				 */
 				itup_key = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page with matching pivot/high key */
-				itup_key->pivotsearch = true;
+				itup_key->backward = true;
 				stack = _bt_search(rel, NULL, itup_key, &sleafbuf, BT_READ);
 				/* won't need a second lock or pin on leafbuf */
 				_bt_relbuf(rel, sleafbuf);
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index efc5284e5..73c403724 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -318,9 +318,12 @@ _bt_moveright(Relation rel,
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
- * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
- * particular, this means it is possible to return a value 1 greater than the
- * number of keys on the page, if the scankey is > all keys on the page.)
+ * key >= given scankey, or > scankey if nextkey is true for forward scans.
+ * _bt_binsrch() also "steps back" by one item/tuple on the leaf level in the
+ * case of backward scans.  (NOTE: this means it is possible to return a value
+ * that's 1 greater than the number of keys on the leaf page.  It also means
+ * that we can return an item 1 less than the first non-pivot tuple on any
+ * leaf page.)
  *
  * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
  * of the last key < given scankey, or last key <= given scankey if nextkey
@@ -362,7 +365,7 @@ _bt_binsrch(Relation rel,
 	 * this covers two cases: the page is really empty (no keys), or it
 	 * contains only a high key.  The latter case is possible after vacuuming.
 	 * This can never happen on an internal page, however, since they are
-	 * never empty (an internal page must have children).
+	 * never empty (an internal page must have at least one child).
 	 */
 	if (unlikely(high < low))
 		return low;
@@ -402,10 +405,27 @@ _bt_binsrch(Relation rel,
 	 * past the last slot on the page.
 	 *
 	 * On a leaf page, we always return the first key >= scan key (resp. >
-	 * scan key), which could be the last slot + 1.
+	 * scan key) for forward scan callers, which could be the last slot + 1.
 	 */
 	if (P_ISLEAF(opaque))
+	{
+		/*
+		 * Backward scans should (by definition) be initially positioned after
+		 * the final non-pivot tuple from the equal-to-scankey grouping on the
+		 * leaf level (with forward scans it's before the first such tuple).
+		 * This explanation works independently of the scan's nextkey-ness
+		 * (actually, nextkey-ness makes the grouping greater-than-scankey,
+		 * but there's still a grouping to position ourselves relative to).
+		 *
+		 * The backward scan case is now "at the start of the wrong grouping".
+		 * It must now step back to "the end of the correct grouping".  This
+		 * leaves us at the correct final offnum from the correct leaf page.
+		 */
+		if (unlikely(key->backward))
+			return OffsetNumberPrev(low);
+
 		return low;
+	}
 
 	/*
 	 * On a non-leaf page, return the last key < scan key (resp. <= scan key).
@@ -767,7 +787,7 @@ _bt_compare(Relation rel,
 	if (key->scantid == NULL)
 	{
 		/*
-		 * Most searches have a scankey that is considered greater than a
+		 * Forward scans have a scankey that is considered greater than a
 		 * truncated pivot tuple if and when the scankey has equal values for
 		 * attributes up to and including the least significant untruncated
 		 * attribute in tuple.
@@ -782,25 +802,33 @@ _bt_compare(Relation rel,
 		 * doesn't have to descend left because it isn't interested in a match
 		 * that has a heap TID value of -inf.
 		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
+		 * We take a symmetrical approach in our handling of this boundary
+		 * case during backward scan searches (with the same goal in mind).
+		 * Here we treat -inf as a possible match for omitted scankey suffix
+		 * attribute(s).  The searcher isn't really interested in matches with
+		 * a heap TID value of -inf, though; it's actually interested in any
+		 * non-pivot tuple match to the immediate left of the ('foo', -inf)
+		 * pivot.  The first leaf page match returned by _bt_binsrch might be
+		 * ('fog', '(4223,7)'), for example.  In other words, backward scan
+		 * search doesn't have to descend right because it is only interested
+		 * in values strictly < 'foo'.
+		 *
+		 * Note: We can exploit invariants for the >= 'foo' and < 'foo' cases
+		 * like this because pivot tuples usually have -inf sentinel values.
+		 * This isn't possible for the <= 'foo' case because suffix truncation
+		 * will never make truncated key attributes have +inf sentinel values.
+		 * When nextkey and backward are both true, our behavior here cannot
+		 * and will not alter how _bt_search descends the tree.  In other
+		 * words, when scan is interested in <= 'foo' matches we must assume
+		 * that leaf pages on both sides of ('foo', -inf) pivot are relevant.
 		 *
 		 * Note: the heap TID part of the test ensures that scankey is being
-		 * compared to a pivot tuple with one or more truncated key
-		 * attributes.
-		 *
-		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
-		 * left here, since they have no heap TID attribute (and cannot have
-		 * any -inf key values in any case, since truncation can only remove
-		 * non-key attributes).  !heapkeyspace searches must always be
-		 * prepared to deal with matches on both sides of the pivot once the
-		 * leaf level is reached.
+		 * compared to a pivot tuple with one or more truncated -inf key
+		 * attributes.  The heap TID attribute is the last key attribute in
+		 * every index, of course, but other than that it isn't special.
 		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		if (!key->backward && key->keysz == ntupatts && heapTid == NULL &&
+			key->heapkeyspace)
 			return 1;
 
 		/* All provided scankey arguments found to be equal */
@@ -865,12 +893,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	BTStack		stack;
 	OffsetNumber offnum;
 	StrategyNumber strat;
-	bool		nextkey;
-	bool		goback;
 	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
-	int			keysCount = 0;
+	int			keysz = 0;
 	int			i;
 	bool		status;
 	StrategyNumber strat_total;
@@ -1005,7 +1031,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					 ScanDirectionIsBackward(dir)))
 				{
 					/* Yes, so build the key in notnullkeys[keysCount] */
-					chosen = &notnullkeys[keysCount];
+					chosen = &notnullkeys[keysz];
 					ScanKeyEntryInitialize(chosen,
 										   (SK_SEARCHNOTNULL | SK_ISNULL |
 											(impliesNN->sk_flags &
@@ -1026,7 +1052,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (chosen == NULL)
 					break;
-				startKeys[keysCount++] = chosen;
+				startKeys[keysz++] = chosen;
 
 				/*
 				 * Adjust strat_total, and quit if we have stored a > or <
@@ -1100,7 +1126,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * the tree.  Walk down that edge to the first or last key, and scan from
 	 * there.
 	 */
-	if (keysCount == 0)
+	if (keysz == 0)
 	{
 		bool		match;
 
@@ -1122,8 +1148,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * identified by startKeys[].  (Remaining insertion scankey fields are
 	 * initialized after initial-positioning strategy is finalized.)
 	 */
-	Assert(keysCount <= INDEX_MAX_KEYS);
-	for (i = 0; i < keysCount; i++)
+	Assert(keysz <= INDEX_MAX_KEYS);
+	for (i = 0; i < keysz; i++)
 	{
 		ScanKey		cur = startKeys[i];
 
@@ -1164,7 +1190,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			 * did use has to be treated as just a ">=" or "<=" condition, and
 			 * so we'd better adjust strat_total accordingly.
 			 */
-			if (i == keysCount - 1)
+			if (i == keysz - 1)
 			{
 				bool		used_all_subkeys = false;
 
@@ -1173,16 +1199,16 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				{
 					subkey++;
 					Assert(subkey->sk_flags & SK_ROW_MEMBER);
-					if (subkey->sk_attno != keysCount + 1)
+					if (subkey->sk_attno != keysz + 1)
 						break;	/* out-of-sequence, can't use it */
 					if (subkey->sk_strategy != cur->sk_strategy)
 						break;	/* wrong direction, can't use it */
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
-					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(inskey.scankeys + keysCount, subkey,
+					Assert(keysz < INDEX_MAX_KEYS);
+					memcpy(inskey.scankeys + keysz, subkey,
 						   sizeof(ScanKeyData));
-					keysCount++;
+					keysz++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
 						used_all_subkeys = true;
@@ -1263,40 +1289,26 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*----------
 	 * Examine the selected initial-positioning strategy to determine exactly
 	 * where we need to start the scan, and set flag variables to control the
-	 * code below.
-	 *
-	 * If nextkey = false, _bt_search and _bt_binsrch will locate the first
-	 * item >= scan key.  If nextkey = true, they will locate the first
-	 * item > scan key.
-	 *
-	 * If goback = true, we will then step back one item, while if
-	 * goback = false, we will start the scan on the located item.
+	 * initial descent by _bt_search (and our _bt_binsrch call for the leaf
+	 * page _bt_search returns).
 	 *----------
 	 */
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
+	inskey.anynullkeys = false; /* unused */
+	inskey.scantid = NULL;
+	inskey.keysz = keysz;
 	switch (strat_total)
 	{
 		case BTLessStrategyNumber:
 
-			/*
-			 * Find first item >= scankey, then back up one to arrive at last
-			 * item < scankey.  (Note: this positioning strategy is only used
-			 * for a backward scan, so that is always the correct starting
-			 * position.)
-			 */
-			nextkey = false;
-			goback = true;
+			inskey.nextkey = false;
+			inskey.backward = true;
 			break;
 
 		case BTLessEqualStrategyNumber:
 
-			/*
-			 * Find first item > scankey, then back up one to arrive at last
-			 * item <= scankey.  (Note: this positioning strategy is only used
-			 * for a backward scan, so that is always the correct starting
-			 * position.)
-			 */
-			nextkey = true;
-			goback = true;
+			inskey.nextkey = true;
+			inskey.backward = true;
 			break;
 
 		case BTEqualStrategyNumber:
@@ -1308,41 +1320,37 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			if (ScanDirectionIsBackward(dir))
 			{
 				/*
-				 * This is the same as the <= strategy.  We will check at the
-				 * end whether the found item is actually =.
+				 * This is the same as the <= strategy
 				 */
-				nextkey = true;
-				goback = true;
+				inskey.nextkey = true;
+				inskey.backward = true;
 			}
 			else
 			{
 				/*
-				 * This is the same as the >= strategy.  We will check at the
-				 * end whether the found item is actually =.
+				 * This is the same as the >= strategy
 				 */
-				nextkey = false;
-				goback = false;
+				inskey.nextkey = false;
+				inskey.backward = false;
 			}
 			break;
 
 		case BTGreaterEqualStrategyNumber:
 
 			/*
-			 * Find first item >= scankey.  (This is only used for forward
-			 * scans.)
+			 * Find first item >= scankey
 			 */
-			nextkey = false;
-			goback = false;
+			inskey.nextkey = false;
+			inskey.backward = false;
 			break;
 
 		case BTGreaterStrategyNumber:
 
 			/*
-			 * Find first item > scankey.  (This is only used for forward
-			 * scans.)
+			 * Find first item > scankey
 			 */
-			nextkey = true;
-			goback = false;
+			inskey.nextkey = true;
+			inskey.backward = false;
 			break;
 
 		default:
@@ -1351,18 +1359,11 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
-	/* Initialize remaining insertion scan key fields */
-	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
-	inskey.anynullkeys = false; /* unused */
-	inskey.nextkey = nextkey;
-	inskey.pivotsearch = false;
-	inskey.scantid = NULL;
-	inskey.keysz = keysCount;
-
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
+	Assert(ScanDirectionIsBackward(dir) == inskey.backward);
 	stack = _bt_search(rel, NULL, &inskey, &buf, BT_READ);
 
 	/* don't need to keep the stack around... */
@@ -1404,35 +1405,30 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, &inskey, buf);
-
-	/*
-	 * If nextkey = false, we are positioned at the first item >= scan key, or
-	 * possibly at the end of a page on which all the existing items are less
-	 * than the scan key and we know that everything on later pages is greater
-	 * than or equal to scan key.
-	 *
-	 * If nextkey = true, we are positioned at the first item > scan key, or
-	 * possibly at the end of a page on which all the existing items are less
-	 * than or equal to the scan key and we know that everything on later
-	 * pages is greater than scan key.
-	 *
-	 * The actually desired starting point is either this item or the prior
-	 * one, or in the end-of-page case it's the first item on the next page or
-	 * the last item on this page.  Adjust the starting offset if needed. (If
-	 * this results in an offset before the first item or after the last one,
-	 * _bt_readpage will report no items found, and then we'll step to the
-	 * next page as needed.)
-	 */
-	if (goback)
-		offnum = OffsetNumberPrev(offnum);
-
-	/* remember which buffer we have pinned, if any */
 	Assert(!BTScanPosIsValid(so->currPos));
 	so->currPos.buf = buf;
 	so->firstPage = true;
 
 	/*
 	 * Now load data from the first page of the scan.
+	 *
+	 * If inskey.nextkey = false and inskey.backward = false, offnum is
+	 * positioned at the first non-pivot tuple >= inskey.scankeys.
+	 *
+	 * If inskey.nextkey = false and inskey.backward = true, offnum is
+	 * positioned at the last non-pivot tuple < inskey.scankeys.
+	 *
+	 * If inskey.nextkey = true and inskey.backward = false, offnum is
+	 * positioned at the first non-pivot tuple > inskey.scankeys.
+	 *
+	 * If inskey.nextkey = true and inskey.backward = true, offnum is
+	 * positioned at the last non-pivot tuple <= inskey.scankeys.
+	 *
+	 * Note: if there isn't an existing non-pivot tuple that is fully equal to
+	 * inskey.scankeys according to _bt_compare, then offnum is the would-be
+	 * offset -- the offset that btinsert would have to use when inserting a
+	 * matching non-pivot tuple onto the page, backfilled with -inf values for
+	 * omitted suffix key attributes (+inf values in the backward scan case).
 	 */
 	if (!_bt_readpage(scan, dir, offnum))
 	{
@@ -1446,7 +1442,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 	else
 	{
-		/* Drop the lock, and maybe the pin, on the current page */
+		/* We have at least one item to return as scan's first item */
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
@@ -1523,6 +1519,11 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction.
  *
+ * Some callers pass us an offnum returned by _bt_binsrch, which can be an
+ * "invalid" offnum such as "maxoff+1" in certain corner cases.  We deal with
+ * that by only checking the keys from the offnum tuple after we're sure that
+ * it points to a valid non-pivot tuple (which is a good idea anyway).
+ *
  * In the case of a parallel scan, caller must have called _bt_parallel_seize
  * prior to calling this function; this function will invoke
  * _bt_parallel_release before returning.
@@ -1658,6 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			}
 
 			itup = (IndexTuple) PageGetItem(page, iid);
+			Assert(!BTreeTupleIsPivot(itup));
 
 			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
 										 &continuescan, requiredMatchedByPrecheck);
@@ -1777,6 +1779,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				tuple_alive = true;
 
 			itup = (IndexTuple) PageGetItem(page, iid);
+			Assert(!BTreeTupleIsPivot(itup));
 
 			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
 										 &continuescan, requiredMatchedByPrecheck);
@@ -2034,7 +2037,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 	if (!_bt_readnextpage(scan, blkno, dir))
 		return false;
 
-	/* Drop the lock, and maybe the pin, on the current page */
+	/* We have at least one item to return as scan's next item */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
 	return true;
@@ -2235,7 +2238,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
 	if (!_bt_readnextpage(scan, blkno, dir))
 		return false;
 
-	/* Drop the lock, and maybe the pin, on the current page */
+	/* We have at least one item to return as scan's next item */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
 	return true;
@@ -2525,7 +2528,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	}
 	else
 	{
-		/* Drop the lock, and maybe the pin, on the current page */
+		/* We have at least one item to return as scan's first item */
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1510b97fb..6909c864f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -62,14 +62,6 @@ static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		When itup is a non-pivot tuple, the returned insertion scan key is
- *		suitable for finding a place for it to go on the leaf level.  Pivot
- *		tuples can be used to re-find leaf page with matching high key, but
- *		then caller needs to set scan key's pivotsearch field to true.  This
- *		allows caller to search for a leaf page with a matching high key,
- *		which is usually to the left of the first leaf page a non-pivot match
- *		might appear on.
- *
  *		The result is intended for use with _bt_compare() and _bt_truncate().
  *		Callers that don't need to fill out the insertion scankey arguments
  *		(e.g. they use an ad-hoc comparison routine, or only need a scankey
@@ -120,8 +112,8 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		key->allequalimage = false;
 	}
 	key->anynullkeys = false;	/* initial assumption */
-	key->nextkey = false;
-	key->pivotsearch = false;
+	key->nextkey = false;		/* usual case, required by btinsert */
+	key->backward = false;		/* usual case, required by btinsert */
 	key->keysz = Min(indnkeyatts, tupnatts);
 	key->scantid = key->heapkeyspace && itup ?
 		BTreeTupleGetHeapTID(itup) : NULL;
@@ -740,7 +732,11 @@ _bt_restore_array_keys(IndexScanDesc scan)
  * failure of either key is indeed enough to stop the scan.  (In general, when
  * inequality keys are present, the initial-positioning code only promises to
  * position before the first possible match, not exactly at the first match,
- * for a forward scan; or after the last match for a backward scan.)
+ * for a forward scan; or after the last match for a backward scan.  Note,
+ * however, that there are various optimizations in the initial-positioning
+ * code that maximize the likelihood that _bt_search will return the leaf page
+ * whose key space makes it precisely the first/last leaf page where a match
+ * might be found.)
  *
  * As a byproduct of this work, we can detect contradictory quals such
  * as "x = 1 AND x > 2".  If we see that, we return so->qual_ok = false,
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3e07a3e35..70e70af9d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2790,7 +2790,7 @@ invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
 	ItemId		itemid;
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	/* Verify line pointer before checking tuple */
 	itemid = PageGetItemIdCareful(state, state->targetblock, state->target,
@@ -2852,7 +2852,7 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
@@ -2875,7 +2875,7 @@ invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
@@ -2913,7 +2913,7 @@ invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
 	ItemId		itemid;
 	int32		cmp;
 
-	Assert(key->pivotsearch);
+	Assert(!key->nextkey && key->backward);
 
 	/* Verify line pointer before checking tuple */
 	itemid = PageGetItemIdCareful(state, nontargetblock, nontarget,
@@ -3139,9 +3139,9 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
  * For example, invariant_g_offset() might miss a cross-page invariant failure
  * on an internal level if the scankey built from the first item on the
  * target's right sibling page happened to be equal to (not greater than) the
- * last item on target page.  The !pivotsearch tiebreaker in _bt_compare()
- * might otherwise cause amcheck to assume (rather than actually verify) that
- * the scankey is greater.
+ * last item on target page.  The !backward tiebreaker in _bt_compare() might
+ * otherwise cause amcheck to assume (rather than actually verify) that the
+ * scankey is greater.
  */
 static inline BTScanInsert
 bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
@@ -3149,7 +3149,7 @@ bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
 	BTScanInsert skey;
 
 	skey = _bt_mkscankey(rel, itup);
-	skey->pivotsearch = true;
+	skey->backward = true;
 
 	return skey;
 }
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index 93ed5e8cc..4dcb9067a 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -142,6 +142,54 @@ SELECT b.*
   4500 | 2080851358
 (1 row)
 
+--
+-- Add coverage for optimization of backwards scan index descents
+--
+-- Here we expect _bt_search to descend straight to a leaf page containing a
+-- non-pivot tuple with the value '47', which comes last (after 11 similar
+-- non-pivot tuples).  Query execution should only need to visit a single
+-- leaf page here, even though we happen to be interested in a tuple that
+-- the nbtree implementation reasons about as a boundary case.
+--
+-- Test case relies on tenk1_hundred index having a leaf page whose high key
+-- is '(48, -inf)'.  We use a low cardinality index to make our test case less
+-- sensitive to implementation details that may change in the future.
+set enable_seqscan to false;
+set enable_indexscan to true;
+set enable_bitmapscan to false;
+explain (costs off)
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Limit
+   ->  Index Scan Backward using tenk1_hundred on tenk1
+         Index Cond: (hundred < 48)
+(3 rows)
+
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+ hundred | twenty 
+---------+--------
+      47 |      7
+(1 row)
+
+-- This variant of the query need only return a single tuple located to the immediate
+-- right of the '(48, -inf)' high key.  It also only needs to scan one single
+-- leaf page (the right sibling of the page scanned by the last test case):
+explain (costs off)
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Limit
+   ->  Index Scan Backward using tenk1_hundred on tenk1
+         Index Cond: (hundred <= 48)
+(3 rows)
+
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+ hundred | twenty 
+---------+--------
+      48 |      8
+(1 row)
+
 --
 -- Check correct optimization of LIKE (special index operator support)
 -- for both indexscan and bitmapscan cases
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 239f4a475..2075b45a6 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -110,6 +110,32 @@ SELECT b.*
    FROM bt_f8_heap b
    WHERE b.seqno = '4500'::float8;
 
+--
+-- Add coverage for optimization of backwards scan index descents
+--
+-- Here we expect _bt_search to descend straight to a leaf page containing a
+-- non-pivot tuple with the value '47', which comes last (after 11 similar
+-- non-pivot tuples).  Query execution should only need to visit a single
+-- leaf page here, even though we happen to be interested in a tuple that
+-- the nbtree implementation reasons about as a boundary case.
+--
+-- Test case relies on tenk1_hundred index having a leaf page whose high key
+-- is '(48, -inf)'.  We use a low cardinality index to make our test case less
+-- sensitive to implementation details that may change in the future.
+set enable_seqscan to false;
+set enable_indexscan to true;
+set enable_bitmapscan to false;
+explain (costs off)
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+select hundred, twenty from tenk1 where hundred < 48 order by hundred desc limit 1;
+
+-- This variant of the query need only return a single tuple located to the immediate
+-- right of the '(48, -inf)' high key.  It also only needs to scan one single
+-- leaf page (the right sibling of the page scanned by the last test case):
+explain (costs off)
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+select hundred, twenty from tenk1 where hundred <= 48 order by hundred desc limit 1;
+
 --
 -- Check correct optimization of LIKE (special index operator support)
 -- for both indexscan and bitmapscan cases
-- 
2.42.0

Matthias van de Meent

boekewurm+postgres@gmail.com

about 2 years ago

In reply to: Peter Geoghegan (#4)

Re: Optimizing "boundary cases" during backward scan B-Tree index descents

On Sun, 15 Oct 2023 at 22:56, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Sep 18, 2023 at 4:58 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v3, which is a straightforward rebase of v2. v3 is needed
to get the patch to apply cleanly against HEAD - so no real changes
here.

Attached is v4. Just to keep CFTester happy.

@@ -402,10 +405,27 @@ _bt_binsrch(Relation rel,
+        if (unlikely(key->backward))
+            return OffsetNumberPrev(low);
+
return low;

I wonder if this is (or can be) optimized to the mostly equivalent
"return low - (OffsetNumber) key->backward", as that would remove an
"unlikely" branch that isn't very unlikely during page deletion, even
if page deletion by itself is quite rare.
I'm not sure it's worth the additional cognitive overhead, or if there
are any significant performance implications for the hot path.

@@ -318,9 +318,12 @@ _bt_moveright(Relation rel,
[...]
* On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
[...]
+ * key >= given scankey, or > scankey if nextkey is true for forward scans.
+ * _bt_binsrch() also "steps back" by one item/tuple on the leaf level in the
+ * case of backward scans.  (NOTE: this means it is possible to return a value
+ * that's 1 greater than the number of keys on the leaf page.  It also means
+ * that we can return an item 1 less than the first non-pivot tuple on any
+ * leaf page.)

I think this can use a bit more wordsmithing: the use of "also" with
"steps back" implies we also step back in other cases, which aren't
mentioned. Could you update the wording to be more clear about this?

@@ -767,7 +787,7 @@ _bt_compare(Relation rel,
[...]
-         * Most searches have a scankey that is considered greater than a
+         * Forward scans have a scankey that is considered greater than a

Although it's not strictly an issue for this patch, the comment here
doesn't describe backward scans in as much detail as forward scans
here. The concepts are mostly "do the same but in reverse", but the
difference is noticable.

Apart from these comments, no further noteworthy comments. Looks good.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)