Avoiding superfluous buffer locking during nbtree backwards scans
Attached patch teaches nbtree backwards scans to avoid needlessly
relocking a previously read page/buffer at the point where we need to
consider reading the next page (the page to the left).
Currently, we do this regardless of whether or not we already decided
to end the scan back when we read the page within _bt_readpage. We
relock the page we've just read (and just unlocked), solely to follow
its left link, while making sure to deal with concurrent page splits
correctly. But why bother relocking the page, or thinking about
concurrent splits, when we can already plainly see that we cannot
possibly need to find matching tuples on the left sibling page?
The patch just adds a simple precheck, which works in the obvious way.
Seems like this was a missed opportunity for commit 2ed5b87f96.
On HEAD, the following query requires 24 buffer hits (I'm getting a
standard/forward index scan for this):
select abalance
from pgbench_accounts
where aid in (1, 500, 1000, 1500, 2000, 3000)
order by aid asc;
However, we don't see that with the backwards scan variant:
select abalance
from pgbench_accounts
where aid in (1, 500, 1000, 1500, 2000, 3000)
order by aid desc;
We actually see 30 buffer hits for this on HEAD. Whereas with the
patch, both variants get only 24 buffer hits -- there's parity, at
least in cases like these two.
Note that we only "achieve parity" here because we happen to be using
a SAOP, requiring multiple primitive index scans, each of which ends
with its own superfluous lock acquisition. Backwards scans remain at a
disadvantage with regard to buffer locks acquired in other cases --
cases that happen to involve scanning several sibling leaf pages in
sequence (no change there).
It's probably possible to fix those more complicated cases too. But
that would require a significantly more complicated approach. I
haven't touched existing comments in _bt_readnextpage that contemplate
this possibility.
--
Peter Geoghegan
Attachments:
v1-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-lock.patch
From d6d73fbbccd76920e09643ca46b097addb45fba8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 4 Jul 2024 16:11:04 -0400
Subject: [PATCH v1] Avoid unneeded nbtree backwards scan buffer lock.
Author: Peter Geoghegan <pg@bowt.ie>
---
src/backend/access/nbtree/nbtsearch.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 57bcfc7e4..56089bd53 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1916,16 +1916,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
}
}
}
+ /* When !continuescan, there can't be any more matches, so stop */
if (!pstate.continuescan)
- {
- /* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
break;
- }
offnum = OffsetNumberPrev(offnum);
}
+ /*
+ * We don't need to visit page to the left when no more matches will
+ * be found there
+ */
+ if (!pstate.continuescan || P_LEFTMOST(opaque))
+ so->currPos.moreLeft = false;
+
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
@@ -2230,6 +2234,15 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
else
{
+ /* Precheck: done if we know there are no matching keys to the left */
+ if (!so->currPos.moreLeft)
+ {
+ _bt_parallel_done(scan);
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+
/*
* Should only happen in parallel cases, when some other backend
* advanced the scan.
--
2.45.2
On Fri, 5 Jul 2024 at 18:48, Peter Geoghegan <pg@bowt.ie> wrote:
> Attached patch teaches nbtree backwards scans to avoid needlessly
> relocking a previously read page/buffer at the point where we need to
> consider reading the next page (the page to the left).
+1, LGTM.
This changes the backward scan code in _bt_readpage to have
approximately equivalent handling to the forward scan case for
end-of-scan cases, which is an improvement IMO.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)
On Tue, Aug 6, 2024 at 6:31 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> +1, LGTM.
>
> This changes the backward scan code in _bt_readpage to have
> approximately equivalent handling to the forward scan case for
> end-of-scan cases, which is an improvement IMO.
Pushed just now. Thanks for the review!
--
Peter Geoghegan
On Sun, 11 Aug 2024 at 21:44, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Aug 6, 2024 at 6:31 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > +1, LGTM.
> >
> > This changes the backward scan code in _bt_readpage to have
> > approximately equivalent handling to the forward scan case for
> > end-of-scan cases, which is an improvement IMO.
Here's a new patch that further improves the situation, so that we
don't try to re-lock the buffer we just accessed when we're stepping
backward in index scans, reducing buffer lock operations in the common
case by 1/2.
It also further decreases the number of code differences between
forward and backward scans in _bt_steppage and _bt_readnextpage, with
mostly only small differences remaining in the code paths shared
between the two scan modes.
The change in lwlock.c is to silence a warning when LWLOCK_STATS is enabled.
I've validated my results by compiling with LWLOCK_STATS enabled (i.e.
with #define LWLOCK_STATS) and testing backward index scans, e.g.:

CREATE TABLE test AS SELECT generate_series(1, 1000000) as num;
CREATE INDEX ON test (num);
VACUUM (FREEZE) test;
\c -- reconnect to get fresh lwlock stats
SELECT COUNT(num ORDER BY num DESC) FROM test;
\c -- reconnect to dump stats of previous session

Before this patch I consistently got `BufferContent 0xYYYYYYYY: shacq
2` in the logs, but with this patch that has decreased to
`BufferContent 0xYYYYYYYY: shacq 1`.

Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v1-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
From fe8385c80181b0cca2edb3740efff684334b1ee2 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 19 Aug 2024 11:40:09 +0200
Subject: [PATCH v1] Avoid unneeded nbtree backwards scan buffer locks.
Instead of accessing the last scanned buffer, we now optimistically scan the left
sibling block of the current scan, optimistically decreasing the number of buffer
accesses in backward scans.
---
src/backend/access/nbtree/nbtree.c | 15 +-
src/backend/access/nbtree/nbtsearch.c | 190 +++++++++++---------------
src/backend/storage/lmgr/lwlock.c | 1 +
src/include/access/nbtree.h | 5 +-
4 files changed, 94 insertions(+), 117 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f7..86f6a29d8e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -67,6 +67,8 @@ typedef enum
typedef struct BTParallelScanDescData
{
BlockNumber btps_scanPage; /* latest or next page to be scanned */
+ BlockNumber btps_scanPrev; /* previous page to be scanned, used in
+ * parallel backward scans. */
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
@@ -602,7 +604,7 @@ btparallelrescan(IndexScanDesc scan)
* for first=false callers that require another primitive index scan.
*/
bool
-_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
+_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, BlockNumber *prevpageno, bool first)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool exit_loop = false;
@@ -611,6 +613,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
BTParallelScanDesc btscan;
*pageno = P_NONE;
+ if (prevpageno)
+ *prevpageno = P_NONE;
if (first)
{
@@ -671,6 +675,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
so->needPrimScan = true;
so->scanBehind = false;
*pageno = InvalidBlockNumber;
+
+ if (prevpageno)
+ *prevpageno = P_NONE;
exit_loop = true;
}
}
@@ -682,6 +689,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
*/
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
*pageno = btscan->btps_scanPage;
+ if (prevpageno)
+ *prevpageno = btscan->btps_scanPrev;
exit_loop = true;
}
SpinLockRelease(&btscan->btps_mutex);
@@ -706,7 +715,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
* scan lands on scan_page).
*/
void
-_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page,
+ BlockNumber scan_prev)
{
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
@@ -716,6 +726,7 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = scan_page;
+ btscan->btps_scanPrev = scan_prev;
btscan->btps_pageStatus = BTPARALLEL_IDLE;
SpinLockRelease(&btscan->btps_mutex);
ConditionVariableSignal(&btscan->btps_cv);
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2551df8a67..cc80b45ac5 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -43,10 +43,9 @@ static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum,
ItemPointer heapTid, int tupleOffset);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
-static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
- ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, BlockNumber expectNext, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, BlockNumber nextblkno, ScanDirection dir);
+static Buffer _bt_walk_left(Relation rel, BlockNumber blkno, BlockNumber oblknum);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
@@ -926,7 +925,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (scan->parallel_scan != NULL)
{
- status = _bt_parallel_seize(scan, &blkno, true);
+ BlockNumber expectNext;
+ status = _bt_parallel_seize(scan, &blkno, &expectNext, true);
/*
* Initialize arrays (when _bt_parallel_seize didn't already set up
@@ -944,7 +944,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
else if (blkno != InvalidBlockNumber)
{
- if (!_bt_parallel_readpage(scan, blkno, dir))
+ if (!_bt_parallel_readpage(scan, blkno, expectNext, dir))
return false;
goto readcomplete;
}
@@ -1584,9 +1584,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
if (ScanDirectionIsForward(dir))
pstate.prev_scan_page = opaque->btpo_next;
else
- pstate.prev_scan_page = BufferGetBlockNumber(so->currPos.buf);
+ pstate.prev_scan_page = opaque->btpo_prev;
- _bt_parallel_release(scan, pstate.prev_scan_page);
+ _bt_parallel_release(scan, pstate.prev_scan_page, so->currPos.currPage);
}
indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
@@ -1623,10 +1623,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
/*
* we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * where to step right to after we're done with these items.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -2044,6 +2044,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BlockNumber blkno = InvalidBlockNumber;
+ BlockNumber expectNext = P_NONE;
bool status;
Assert(BTScanPosIsValid(so->currPos));
@@ -2105,7 +2106,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* Seize the scan to get the next block number; if the scan has
* ended already, bail out.
*/
- status = _bt_parallel_seize(scan, &blkno, false);
+ status = _bt_parallel_seize(scan, &blkno, NULL, false);
if (!status)
{
/* release the previous buffer, if pinned */
@@ -2128,31 +2129,34 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
}
else
{
- /* Remember we left a page with data */
- so->currPos.moreRight = true;
-
if (scan->parallel_scan != NULL)
{
/*
* Seize the scan to get the current block number; if the scan has
* ended already, bail out.
*/
- status = _bt_parallel_seize(scan, &blkno, false);
- BTScanPosUnpinIfPinned(so->currPos);
+ status = _bt_parallel_seize(scan, &blkno, &expectNext, false);
if (!status)
{
+ BTScanPosUnpinIfPinned(so->currPos);
BTScanPosInvalidate(so->currPos);
return false;
}
}
else
{
- /* Not parallel, so just use our own notion of the current page */
- blkno = so->currPos.currPage;
+ /* Not parallel, so use the previously-saved prevPage link. */
+ blkno = so->currPos.prevPage;
+ expectNext = so->currPos.currPage;
}
+
+ /* Remember we left a page with data */
+ so->currPos.moreRight = true;
+
+ BTScanPosUnpinIfPinned(so->currPos);
}
- if (!_bt_readnextpage(scan, blkno, dir))
+ if (!_bt_readnextpage(scan, blkno, expectNext, dir))
return false;
/* We have at least one item to return as scan's next item */
@@ -2172,7 +2176,8 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* locks and pins, set so->currPos.buf to InvalidBuffer, and return false.
*/
static bool
-_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, BlockNumber expectNext,
+ ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -2184,6 +2189,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
if (ScanDirectionIsForward(dir))
{
+ Assert(expectNext == P_NONE);
+
for (;;)
{
/*
@@ -2214,14 +2221,14 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
else if (scan->parallel_scan != NULL)
{
/* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, opaque->btpo_next);
+ _bt_parallel_release(scan, opaque->btpo_next, blkno);
}
/* nope, keep going */
if (scan->parallel_scan != NULL)
{
_bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
+ status = _bt_parallel_seize(scan, &blkno, NULL, false);
if (!status)
{
BTScanPosInvalidate(so->currPos);
@@ -2237,75 +2244,32 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
else
{
- /*
- * Should only happen in parallel cases, when some other backend
- * advanced the scan.
- */
- if (so->currPos.currPage != blkno)
- {
- BTScanPosUnpinIfPinned(so->currPos);
- so->currPos.currPage = blkno;
- }
-
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
- {
- BTScanPosUnpinIfPinned(so->currPos);
- _bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
-
- /*
- * Walk left to the next page with data. This is much more complex
- * than the walk-right case because of the possibility that the page
- * to our left splits while we are in flight to it, plus the
- * possibility that the page we were on gets deleted after we leave
- * it. See nbtree/README for details.
- *
- * It might be possible to rearrange this code to have less overhead
- * in pinning and locking, but that would require capturing the left
- * sibling block number when the page is initially read, and then
- * optimistically starting there (rather than pinning the page twice).
- * It is not clear that this would be worth the complexity.
- */
- if (BTScanPosIsPinned(so->currPos))
- _bt_lockbuf(rel, so->currPos.buf, BT_READ);
- else
- so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
-
for (;;)
{
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
+ /*
+ * if we're at end of scan, give up and mark parallel scan as
+ * done, so that all the workers can finish their scan
+ */
+ if (blkno == P_NONE || !so->currPos.moreLeft)
{
- _bt_relbuf(rel, so->currPos.buf);
_bt_parallel_done(scan);
BTScanPosInvalidate(so->currPos);
return false;
}
- /* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
+ CHECK_FOR_INTERRUPTS();
- /* if we're physically at end of index, return failure */
- if (so->currPos.buf == InvalidBuffer)
- {
- _bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
+ Assert(!BTScanPosIsPinned(so->currPos));
- /*
- * Okay, we managed to move left to a non-deleted page. Done if
- * it's not half-dead and contains matching tuples. Else loop back
- * and do it all again.
- */
+ /* Step left one page */
+ so->currPos.buf = _bt_walk_left(rel, blkno, expectNext);
page = BufferGetPage(so->currPos.buf);
opaque = BTPageGetOpaque(page);
+
+ /* check for deleted page */
if (!P_IGNORE(opaque))
{
- PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
+ PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), false))
@@ -2314,25 +2278,26 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
else if (scan->parallel_scan != NULL)
{
/* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ _bt_parallel_release(scan, opaque->btpo_prev, blkno);
}
- /*
- * For parallel scans, get the last page scanned as it is quite
- * possible that by the time we try to seize the scan, some other
- * worker has already advanced the scan to a different page. We
- * must continue based on the latest page scanned by any worker.
- */
+ /* nope, keep going */
if (scan->parallel_scan != NULL)
{
_bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
+ status = _bt_parallel_seize(scan, &blkno, &expectNext, false);
+
if (!status)
{
BTScanPosInvalidate(so->currPos);
return false;
}
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ }
+ else
+ {
+ expectNext = blkno;
+ blkno = opaque->btpo_prev;
+ _bt_relbuf(rel, so->currPos.buf);
}
}
}
@@ -2347,7 +2312,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
* indicate success.
*/
static bool
-_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
+ BlockNumber nextblkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -2355,7 +2321,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
_bt_initialize_more_data(so, dir);
- if (!_bt_readnextpage(scan, blkno, dir))
+ if (!_bt_readnextpage(scan, blkno, nextblkno, dir))
return false;
/* We have at least one item to return as scan's next item */
@@ -2367,43 +2333,29 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
/*
* _bt_walk_left() -- step left one page, if possible
*
- * The given buffer must be pinned and read-locked. This will be dropped
- * before stepping left. On return, we have pin and read lock on the
- * returned page, instead.
+ * On return, we have a pin and read lock on the returned page.
*
* Returns InvalidBuffer if there is no page to the left (no lock is held
* in that case).
*
* It is possible for the returned leaf page to be half-dead; caller must
* check that condition and step left again when required.
+ *
+ * oblknum is the block we're walking left from
+ * blkno is the left sibling of oblknum when we last read that page
+ * (i.e. blkno was stored in oblknum's bt_prev).
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_walk_left(Relation rel, BlockNumber blkno, BlockNumber oblknum)
{
Page page;
BTPageOpaque opaque;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
+ BlockNumber lblkno = blkno;
for (;;)
{
- BlockNumber obknum;
- BlockNumber lblkno;
- BlockNumber blkno;
+ Buffer buf;
int tries;
-
- /* if we're at end of tree, release buf and return failure */
- if (P_LEFTMOST(opaque))
- {
- _bt_relbuf(rel, buf);
- break;
- }
- /* remember original page we are stepping left from */
- obknum = BufferGetBlockNumber(buf);
- /* step left */
- blkno = lblkno = opaque->btpo_prev;
- _bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
buf = _bt_getbuf(rel, blkno, BT_READ);
@@ -2423,7 +2375,7 @@ _bt_walk_left(Relation rel, Buffer buf)
tries = 0;
for (;;)
{
- if (!P_ISDELETED(opaque) && opaque->btpo_next == obknum)
+ if (!P_ISDELETED(opaque) && opaque->btpo_next == oblknum)
{
/* Found desired page, return it */
return buf;
@@ -2437,7 +2389,7 @@ _bt_walk_left(Relation rel, Buffer buf)
}
/* Return to the original page to see what's up */
- buf = _bt_relandgetbuf(rel, buf, obknum, BT_READ);
+ buf = _bt_relandgetbuf(rel, buf, oblknum, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
if (P_ISDELETED(opaque))
@@ -2475,9 +2427,21 @@ _bt_walk_left(Relation rel, Buffer buf)
*/
if (opaque->btpo_prev == lblkno)
elog(ERROR, "could not find left sibling of block %u in index \"%s\"",
- obknum, RelationGetRelationName(rel));
+ oblknum, RelationGetRelationName(rel));
/* Okay to try again with new lblkno value */
}
+
+ /* if we're at end of tree, release buf and return failure */
+ if (P_LEFTMOST(opaque))
+ {
+ _bt_relbuf(rel, buf);
+ break;
+ }
+ /* remember original page we are stepping left from */
+ oblknum = BufferGetBlockNumber(buf);
+ /* step left */
+ blkno = lblkno = opaque->btpo_prev;
+ _bt_relbuf(rel, buf);
}
return InvalidBuffer;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index e765754d80..5cff74d6d2 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -87,6 +87,7 @@
#include "utils/memutils.h"
#ifdef LWLOCK_STATS
+#include "storage/ipc.h"
#include "utils/hsearch.h"
#endif
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9af9b3ecdc..2e263b4036 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -955,6 +955,7 @@ typedef struct BTScanPosData
XLogRecPtr lsn; /* pos in the WAL stream when page was read */
BlockNumber currPage; /* page referenced by items array */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -1191,8 +1192,8 @@ extern bool btcanreturn(Relation index, int attno);
* prototypes for internal functions in nbtree.c
*/
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno,
- bool first);
-extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+ BlockNumber *prevpageno, bool first);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page, BlockNumber scan_prev);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_primscan_schedule(IndexScanDesc scan,
BlockNumber prev_scan_page);
--
2.40.1
On Mon, 19 Aug 2024 at 13:43, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Sun, 11 Aug 2024 at 21:44, Peter Geoghegan <pg@bowt.ie> wrote:
> > On Tue, Aug 6, 2024 at 6:31 PM Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > +1, LGTM.
> > >
> > > This changes the backward scan code in _bt_readpage to have
> > > approximately equivalent handling to the forward scan case for
> > > end-of-scan cases, which is an improvement IMO.
>
> Here's a new patch that further improves the situation, so that we
> don't try to re-lock the buffer we just accessed when we're stepping
> backward in index scans, reducing buffer lock operations in the common
> case by 1/2.
Attached is an updated version of the patch, now v2, which fixes some
assertion failures for parallel plans by passing the correct
parameters to _bt_parallel_release for forward scans.
With the test setup below, it unifies the number of buffer accesses
between forward and backward scans:
CREATE TABLE test AS
SELECT generate_series(1, 1000000) as num,
'' j;
CREATE INDEX ON test (num);
VACUUM (FREEZE) test;
SET enable_seqscan = off; SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(j ORDER BY num DESC) -- or ASC
FROM test;
The test case will have an Index Scan, which in the DESC case is a
backward scan. Without this patch, the Index Scan has 7160 buffer
accesses for the ASC ordering, but 9892 in the DESC case (an increase
of 2732, approximately equal to the number of leaf pages in the
index), while with this patch the Index Scan has 7160 buffer accesses
for both ASC and DESC orderings.
In my previous mail I used buffer lock stats from an index-only scan
as proof of the patch working. It's been pointed out to me that an
IndexScan is easier to extract this data from, as it drops the pin on
the page after getting some results from a page.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v2-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
From 1f0406aaf427287d6de25d27d87d45b6093d06d7 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 19 Aug 2024 11:40:09 +0200
Subject: [PATCH v2] Avoid unneeded nbtree backwards scan buffer locks.
Instead of accessing the last scanned buffer, we now optimistically scan the left
sibling block of the current scan, optimistically decreasing the number of buffer
accesses in backward index scans, and similar buffer lock operations in backward
index-only scans.
---
src/include/access/nbtree.h | 5 +-
src/backend/access/nbtree/nbtree.c | 15 +-
src/backend/access/nbtree/nbtsearch.c | 196 +++++++++++---------------
src/backend/storage/lmgr/lwlock.c | 1 +
4 files changed, 99 insertions(+), 118 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9af9b3ecdc..2e263b4036 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -955,6 +955,7 @@ typedef struct BTScanPosData
XLogRecPtr lsn; /* pos in the WAL stream when page was read */
BlockNumber currPage; /* page referenced by items array */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -1191,8 +1192,8 @@ extern bool btcanreturn(Relation index, int attno);
* prototypes for internal functions in nbtree.c
*/
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno,
- bool first);
-extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+ BlockNumber *prevpageno, bool first);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page, BlockNumber scan_prev);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_primscan_schedule(IndexScanDesc scan,
BlockNumber prev_scan_page);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f7..86f6a29d8e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -67,6 +67,8 @@ typedef enum
typedef struct BTParallelScanDescData
{
BlockNumber btps_scanPage; /* latest or next page to be scanned */
+ BlockNumber btps_scanPrev; /* previous page to be scanned, used in
+ * parallel backward scans. */
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
@@ -602,7 +604,7 @@ btparallelrescan(IndexScanDesc scan)
* for first=false callers that require another primitive index scan.
*/
bool
-_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
+_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, BlockNumber *prevpageno, bool first)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool exit_loop = false;
@@ -611,6 +613,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
BTParallelScanDesc btscan;
*pageno = P_NONE;
+ if (prevpageno)
+ *prevpageno = P_NONE;
if (first)
{
@@ -671,6 +675,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
so->needPrimScan = true;
so->scanBehind = false;
*pageno = InvalidBlockNumber;
+
+ if (prevpageno)
+ *prevpageno = P_NONE;
exit_loop = true;
}
}
@@ -682,6 +689,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
*/
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
*pageno = btscan->btps_scanPage;
+ if (prevpageno)
+ *prevpageno = btscan->btps_scanPrev;
exit_loop = true;
}
SpinLockRelease(&btscan->btps_mutex);
@@ -706,7 +715,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
* scan lands on scan_page).
*/
void
-_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page,
+ BlockNumber scan_prev)
{
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
@@ -716,6 +726,7 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = scan_page;
+ btscan->btps_scanPrev = scan_prev;
btscan->btps_pageStatus = BTPARALLEL_IDLE;
SpinLockRelease(&btscan->btps_mutex);
ConditionVariableSignal(&btscan->btps_cv);
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2551df8a67..f123456f85 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -43,10 +43,9 @@ static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum,
ItemPointer heapTid, int tupleOffset);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
-static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
- ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, BlockNumber expectNext, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, BlockNumber nextblkno, ScanDirection dir);
+static Buffer _bt_walk_left(Relation rel, BlockNumber blkno, BlockNumber oblknum);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
@@ -926,7 +925,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (scan->parallel_scan != NULL)
{
- status = _bt_parallel_seize(scan, &blkno, true);
+ BlockNumber expectNext;
+ status = _bt_parallel_seize(scan, &blkno, &expectNext, true);
/*
* Initialize arrays (when _bt_parallel_seize didn't already set up
@@ -944,7 +944,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
else if (blkno != InvalidBlockNumber)
{
- if (!_bt_parallel_readpage(scan, blkno, dir))
+ if (!_bt_parallel_readpage(scan, blkno, expectNext, dir))
return false;
goto readcomplete;
}
@@ -1582,11 +1582,15 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
if (scan->parallel_scan)
{
if (ScanDirectionIsForward(dir))
+ {
pstate.prev_scan_page = opaque->btpo_next;
+ _bt_parallel_release(scan, pstate.prev_scan_page, P_NONE);
+ }
else
- pstate.prev_scan_page = BufferGetBlockNumber(so->currPos.buf);
-
- _bt_parallel_release(scan, pstate.prev_scan_page);
+ {
+ pstate.prev_scan_page = opaque->btpo_prev;
+ _bt_parallel_release(scan, pstate.prev_scan_page, so->currPos.currPage);
+ }
}
indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
@@ -1623,10 +1627,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
/*
* we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * where to step right to after we're done with these items.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -2044,6 +2048,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BlockNumber blkno = InvalidBlockNumber;
+ BlockNumber expectNext = P_NONE;
bool status;
Assert(BTScanPosIsValid(so->currPos));
@@ -2105,7 +2110,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* Seize the scan to get the next block number; if the scan has
* ended already, bail out.
*/
- status = _bt_parallel_seize(scan, &blkno, false);
+ status = _bt_parallel_seize(scan, &blkno, NULL, false);
if (!status)
{
/* release the previous buffer, if pinned */
@@ -2128,31 +2133,34 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
}
else
{
- /* Remember we left a page with data */
- so->currPos.moreRight = true;
-
if (scan->parallel_scan != NULL)
{
/*
* Seize the scan to get the current block number; if the scan has
* ended already, bail out.
*/
- status = _bt_parallel_seize(scan, &blkno, false);
- BTScanPosUnpinIfPinned(so->currPos);
+ status = _bt_parallel_seize(scan, &blkno, &expectNext, false);
if (!status)
{
+ BTScanPosUnpinIfPinned(so->currPos);
BTScanPosInvalidate(so->currPos);
return false;
}
}
else
{
- /* Not parallel, so just use our own notion of the current page */
- blkno = so->currPos.currPage;
+ /* Not parallel, so use the previously-saved prevPage link. */
+ blkno = so->currPos.prevPage;
+ expectNext = so->currPos.currPage;
}
+
+ /* Remember we left a page with data */
+ so->currPos.moreRight = true;
+
+ BTScanPosUnpinIfPinned(so->currPos);
}
- if (!_bt_readnextpage(scan, blkno, dir))
+ if (!_bt_readnextpage(scan, blkno, expectNext, dir))
return false;
/* We have at least one item to return as scan's next item */
@@ -2172,7 +2180,8 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* locks and pins, set so->currPos.buf to InvalidBuffer, and return false.
*/
static bool
-_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, BlockNumber expectNext,
+ ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -2184,6 +2193,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
if (ScanDirectionIsForward(dir))
{
+ Assert(expectNext == P_NONE);
+
for (;;)
{
/*
@@ -2214,14 +2225,14 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
else if (scan->parallel_scan != NULL)
{
/* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, opaque->btpo_next);
+ _bt_parallel_release(scan, opaque->btpo_next, P_NONE);
}
/* nope, keep going */
if (scan->parallel_scan != NULL)
{
_bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
+ status = _bt_parallel_seize(scan, &blkno, NULL, false);
if (!status)
{
BTScanPosInvalidate(so->currPos);
@@ -2237,75 +2248,32 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
else
{
- /*
- * Should only happen in parallel cases, when some other backend
- * advanced the scan.
- */
- if (so->currPos.currPage != blkno)
- {
- BTScanPosUnpinIfPinned(so->currPos);
- so->currPos.currPage = blkno;
- }
-
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
- {
- BTScanPosUnpinIfPinned(so->currPos);
- _bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
-
- /*
- * Walk left to the next page with data. This is much more complex
- * than the walk-right case because of the possibility that the page
- * to our left splits while we are in flight to it, plus the
- * possibility that the page we were on gets deleted after we leave
- * it. See nbtree/README for details.
- *
- * It might be possible to rearrange this code to have less overhead
- * in pinning and locking, but that would require capturing the left
- * sibling block number when the page is initially read, and then
- * optimistically starting there (rather than pinning the page twice).
- * It is not clear that this would be worth the complexity.
- */
- if (BTScanPosIsPinned(so->currPos))
- _bt_lockbuf(rel, so->currPos.buf, BT_READ);
- else
- so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
-
for (;;)
{
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
+ /*
+ * if we're at end of scan, give up and mark parallel scan as
+ * done, so that all the workers can finish their scan
+ */
+ if (blkno == P_NONE || !so->currPos.moreLeft)
{
- _bt_relbuf(rel, so->currPos.buf);
_bt_parallel_done(scan);
BTScanPosInvalidate(so->currPos);
return false;
}
- /* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
+ CHECK_FOR_INTERRUPTS();
- /* if we're physically at end of index, return failure */
- if (so->currPos.buf == InvalidBuffer)
- {
- _bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
+ Assert(!BTScanPosIsPinned(so->currPos));
- /*
- * Okay, we managed to move left to a non-deleted page. Done if
- * it's not half-dead and contains matching tuples. Else loop back
- * and do it all again.
- */
+ /* Step left one page */
+ so->currPos.buf = _bt_walk_left(rel, blkno, expectNext);
page = BufferGetPage(so->currPos.buf);
opaque = BTPageGetOpaque(page);
+
+ /* check for deleted page */
if (!P_IGNORE(opaque))
{
- PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
+ PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), false))
@@ -2314,25 +2282,26 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
else if (scan->parallel_scan != NULL)
{
/* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ _bt_parallel_release(scan, opaque->btpo_prev, blkno);
}
- /*
- * For parallel scans, get the last page scanned as it is quite
- * possible that by the time we try to seize the scan, some other
- * worker has already advanced the scan to a different page. We
- * must continue based on the latest page scanned by any worker.
- */
+ /* nope, keep going */
if (scan->parallel_scan != NULL)
{
_bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
+ status = _bt_parallel_seize(scan, &blkno, &expectNext, false);
+
if (!status)
{
BTScanPosInvalidate(so->currPos);
return false;
}
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ }
+ else
+ {
+ expectNext = blkno;
+ blkno = opaque->btpo_prev;
+ _bt_relbuf(rel, so->currPos.buf);
}
}
}
@@ -2347,7 +2316,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
* indicate success.
*/
static bool
-_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
+ BlockNumber nextblkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -2355,7 +2325,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
_bt_initialize_more_data(so, dir);
- if (!_bt_readnextpage(scan, blkno, dir))
+ if (!_bt_readnextpage(scan, blkno, nextblkno, dir))
return false;
/* We have at least one item to return as scan's next item */
@@ -2367,43 +2337,29 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
/*
* _bt_walk_left() -- step left one page, if possible
*
- * The given buffer must be pinned and read-locked. This will be dropped
- * before stepping left. On return, we have pin and read lock on the
- * returned page, instead.
+ * On return, we have a pin and read lock on the returned page.
*
* Returns InvalidBuffer if there is no page to the left (no lock is held
* in that case).
*
* It is possible for the returned leaf page to be half-dead; caller must
* check that condition and step left again when required.
+ *
+ * oblknum is the block we're walking left from
+ * blkno is the left sibling of oblknum when we last read that page
+ * (i.e. blkno was stored in oblknum's btpo_prev).
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_walk_left(Relation rel, BlockNumber blkno, BlockNumber oblknum)
{
Page page;
BTPageOpaque opaque;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
+ BlockNumber lblkno = blkno;
for (;;)
{
- BlockNumber obknum;
- BlockNumber lblkno;
- BlockNumber blkno;
+ Buffer buf;
int tries;
-
- /* if we're at end of tree, release buf and return failure */
- if (P_LEFTMOST(opaque))
- {
- _bt_relbuf(rel, buf);
- break;
- }
- /* remember original page we are stepping left from */
- obknum = BufferGetBlockNumber(buf);
- /* step left */
- blkno = lblkno = opaque->btpo_prev;
- _bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
buf = _bt_getbuf(rel, blkno, BT_READ);
@@ -2423,7 +2379,7 @@ _bt_walk_left(Relation rel, Buffer buf)
tries = 0;
for (;;)
{
- if (!P_ISDELETED(opaque) && opaque->btpo_next == obknum)
+ if (!P_ISDELETED(opaque) && opaque->btpo_next == oblknum)
{
/* Found desired page, return it */
return buf;
@@ -2437,7 +2393,7 @@ _bt_walk_left(Relation rel, Buffer buf)
}
/* Return to the original page to see what's up */
- buf = _bt_relandgetbuf(rel, buf, obknum, BT_READ);
+ buf = _bt_relandgetbuf(rel, buf, oblknum, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
if (P_ISDELETED(opaque))
@@ -2475,9 +2431,21 @@ _bt_walk_left(Relation rel, Buffer buf)
*/
if (opaque->btpo_prev == lblkno)
elog(ERROR, "could not find left sibling of block %u in index \"%s\"",
- obknum, RelationGetRelationName(rel));
+ oblknum, RelationGetRelationName(rel));
/* Okay to try again with new lblkno value */
}
+
+ /* if we're at end of tree, release buf and return failure */
+ if (P_LEFTMOST(opaque))
+ {
+ _bt_relbuf(rel, buf);
+ break;
+ }
+ /* remember original page we are stepping left from */
+ oblknum = BufferGetBlockNumber(buf);
+ /* step left */
+ blkno = lblkno = opaque->btpo_prev;
+ _bt_relbuf(rel, buf);
}
return InvalidBuffer;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index e765754d80..5cff74d6d2 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -87,6 +87,7 @@
#include "utils/memutils.h"
#ifdef LWLOCK_STATS
+#include "storage/ipc.h"
#include "utils/hsearch.h"
#endif
--
2.40.1
On Fri, 30 Aug 2024 at 21:43, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Mon, 19 Aug 2024 at 13:43, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Sun, 11 Aug 2024 at 21:44, Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Aug 6, 2024 at 6:31 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

+1, LGTM.

This changes the backward scan code in _bt_readpage to have an
approximately equivalent handling as the forward scan case for
end-of-scan cases, which is an improvement IMO.

Here's a new patch that further improves the situation, so that we
don't try to re-lock the buffer we just accessed when we're stepping
backward in index scans, reducing buffer lock operations in the common
case by 1/2.

Attached is an updated version of the patch, now v2, which fixes some
assertion failures for parallel plans by passing the correct
parameters to _bt_parallel_release for forward scans.
I noticed I attached an older version of the patch which still had 1
assertion failure case remaining (thanks cfbot), so here's v3 which
solves that problem.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachments:
v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch (application/octet-stream)
From bfd9a4c2b81c0bd02903afbb28ddb68ad3dfbcce Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 19 Aug 2024 11:40:09 +0200
Subject: [PATCH v3] Avoid unneeded nbtree backwards scan buffer locks.
Instead of re-locking the last scanned buffer to find its left sibling, we now
optimistically step to the left sibling block that was saved while the current
page was scanned, reducing the number of buffer accesses in backward index
scans, and the number of buffer lock operations in backward index-only scans.
---
src/include/access/nbtree.h | 5 +-
src/backend/access/nbtree/nbtree.c | 15 +-
src/backend/access/nbtree/nbtsearch.c | 198 +++++++++++---------------
src/backend/storage/lmgr/lwlock.c | 1 +
4 files changed, 101 insertions(+), 118 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9af9b3ecdc..2e263b4036 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -955,6 +955,7 @@ typedef struct BTScanPosData
XLogRecPtr lsn; /* pos in the WAL stream when page was read */
BlockNumber currPage; /* page referenced by items array */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -1191,8 +1192,8 @@ extern bool btcanreturn(Relation index, int attno);
* prototypes for internal functions in nbtree.c
*/
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno,
- bool first);
-extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+ BlockNumber *prevpageno, bool first);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page, BlockNumber scan_prev);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_primscan_schedule(IndexScanDesc scan,
BlockNumber prev_scan_page);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f7..86f6a29d8e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -67,6 +67,8 @@ typedef enum
typedef struct BTParallelScanDescData
{
BlockNumber btps_scanPage; /* latest or next page to be scanned */
+ BlockNumber btps_scanPrev; /* previous page to be scanned, used in
+ * parallel backward scans. */
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
@@ -602,7 +604,7 @@ btparallelrescan(IndexScanDesc scan)
* for first=false callers that require another primitive index scan.
*/
bool
-_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
+_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, BlockNumber *prevpageno, bool first)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool exit_loop = false;
@@ -611,6 +613,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
BTParallelScanDesc btscan;
*pageno = P_NONE;
+ if (prevpageno)
+ *prevpageno = P_NONE;
if (first)
{
@@ -671,6 +675,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
so->needPrimScan = true;
so->scanBehind = false;
*pageno = InvalidBlockNumber;
+
+ if (prevpageno)
+ *prevpageno = P_NONE;
exit_loop = true;
}
}
@@ -682,6 +689,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
*/
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
*pageno = btscan->btps_scanPage;
+ if (prevpageno)
+ *prevpageno = btscan->btps_scanPrev;
exit_loop = true;
}
SpinLockRelease(&btscan->btps_mutex);
@@ -706,7 +715,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
* scan lands on scan_page).
*/
void
-_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page,
+ BlockNumber scan_prev)
{
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
@@ -716,6 +726,7 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
SpinLockAcquire(&btscan->btps_mutex);
btscan->btps_scanPage = scan_page;
+ btscan->btps_scanPrev = scan_prev;
btscan->btps_pageStatus = BTPARALLEL_IDLE;
SpinLockRelease(&btscan->btps_mutex);
ConditionVariableSignal(&btscan->btps_cv);
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2551df8a67..75ba1f9789 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -43,10 +43,9 @@ static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum,
ItemPointer heapTid, int tupleOffset);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
-static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
- ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, BlockNumber expectNext, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, BlockNumber nextblkno, ScanDirection dir);
+static Buffer _bt_walk_left(Relation rel, BlockNumber blkno, BlockNumber oblknum);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
@@ -926,7 +925,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (scan->parallel_scan != NULL)
{
- status = _bt_parallel_seize(scan, &blkno, true);
+ BlockNumber expectNext;
+ status = _bt_parallel_seize(scan, &blkno, &expectNext, true);
/*
* Initialize arrays (when _bt_parallel_seize didn't already set up
@@ -944,7 +944,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
else if (blkno != InvalidBlockNumber)
{
- if (!_bt_parallel_readpage(scan, blkno, dir))
+ if (!_bt_parallel_readpage(scan, blkno, expectNext, dir))
return false;
goto readcomplete;
}
@@ -1582,11 +1582,15 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
if (scan->parallel_scan)
{
if (ScanDirectionIsForward(dir))
+ {
pstate.prev_scan_page = opaque->btpo_next;
+ _bt_parallel_release(scan, pstate.prev_scan_page, P_NONE);
+ }
else
- pstate.prev_scan_page = BufferGetBlockNumber(so->currPos.buf);
-
- _bt_parallel_release(scan, pstate.prev_scan_page);
+ {
+ pstate.prev_scan_page = opaque->btpo_prev;
+ _bt_parallel_release(scan, pstate.prev_scan_page, so->currPos.currPage);
+ }
}
indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
@@ -1623,10 +1627,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
/*
* we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * where to step right to after we're done with these items.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -2044,6 +2048,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BlockNumber blkno = InvalidBlockNumber;
+ BlockNumber expectNext = P_NONE;
bool status;
Assert(BTScanPosIsValid(so->currPos));
@@ -2105,7 +2110,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* Seize the scan to get the next block number; if the scan has
* ended already, bail out.
*/
- status = _bt_parallel_seize(scan, &blkno, false);
+ status = _bt_parallel_seize(scan, &blkno, NULL, false);
if (!status)
{
/* release the previous buffer, if pinned */
@@ -2128,31 +2133,34 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
}
else
{
- /* Remember we left a page with data */
- so->currPos.moreRight = true;
-
if (scan->parallel_scan != NULL)
{
/*
* Seize the scan to get the current block number; if the scan has
* ended already, bail out.
*/
- status = _bt_parallel_seize(scan, &blkno, false);
- BTScanPosUnpinIfPinned(so->currPos);
+ status = _bt_parallel_seize(scan, &blkno, &expectNext, false);
if (!status)
{
+ BTScanPosUnpinIfPinned(so->currPos);
BTScanPosInvalidate(so->currPos);
return false;
}
}
else
{
- /* Not parallel, so just use our own notion of the current page */
- blkno = so->currPos.currPage;
+ /* Not parallel, so use the previously-saved prevPage link. */
+ blkno = so->currPos.prevPage;
+ expectNext = so->currPos.currPage;
}
+
+ /* Remember we left a page with data */
+ so->currPos.moreRight = true;
+
+ BTScanPosUnpinIfPinned(so->currPos);
}
- if (!_bt_readnextpage(scan, blkno, dir))
+ if (!_bt_readnextpage(scan, blkno, expectNext, dir))
return false;
/* We have at least one item to return as scan's next item */
@@ -2172,7 +2180,8 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* locks and pins, set so->currPos.buf to InvalidBuffer, and return false.
*/
static bool
-_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, BlockNumber expectNext,
+ ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -2184,6 +2193,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
if (ScanDirectionIsForward(dir))
{
+ Assert(expectNext == P_NONE);
+
for (;;)
{
/*
@@ -2214,14 +2225,14 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
else if (scan->parallel_scan != NULL)
{
/* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, opaque->btpo_next);
+ _bt_parallel_release(scan, opaque->btpo_next, P_NONE);
}
/* nope, keep going */
if (scan->parallel_scan != NULL)
{
_bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
+ status = _bt_parallel_seize(scan, &blkno, NULL, false);
if (!status)
{
BTScanPosInvalidate(so->currPos);
@@ -2232,80 +2243,38 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
{
blkno = opaque->btpo_next;
_bt_relbuf(rel, so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
}
}
}
else
{
- /*
- * Should only happen in parallel cases, when some other backend
- * advanced the scan.
- */
- if (so->currPos.currPage != blkno)
- {
- BTScanPosUnpinIfPinned(so->currPos);
- so->currPos.currPage = blkno;
- }
-
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
- {
- BTScanPosUnpinIfPinned(so->currPos);
- _bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
-
- /*
- * Walk left to the next page with data. This is much more complex
- * than the walk-right case because of the possibility that the page
- * to our left splits while we are in flight to it, plus the
- * possibility that the page we were on gets deleted after we leave
- * it. See nbtree/README for details.
- *
- * It might be possible to rearrange this code to have less overhead
- * in pinning and locking, but that would require capturing the left
- * sibling block number when the page is initially read, and then
- * optimistically starting there (rather than pinning the page twice).
- * It is not clear that this would be worth the complexity.
- */
- if (BTScanPosIsPinned(so->currPos))
- _bt_lockbuf(rel, so->currPos.buf, BT_READ);
- else
- so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
-
for (;;)
{
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
+ /*
+ * if we're at end of scan, give up and mark parallel scan as
+ * done, so that all the workers can finish their scan
+ */
+ if (blkno == P_NONE || !so->currPos.moreLeft)
{
- _bt_relbuf(rel, so->currPos.buf);
_bt_parallel_done(scan);
BTScanPosInvalidate(so->currPos);
return false;
}
- /* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
+ CHECK_FOR_INTERRUPTS();
- /* if we're physically at end of index, return failure */
- if (so->currPos.buf == InvalidBuffer)
- {
- _bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
+ Assert(!BTScanPosIsPinned(so->currPos));
- /*
- * Okay, we managed to move left to a non-deleted page. Done if
- * it's not half-dead and contains matching tuples. Else loop back
- * and do it all again.
- */
+ /* Step left one page */
+ so->currPos.buf = _bt_walk_left(rel, blkno, expectNext);
page = BufferGetPage(so->currPos.buf);
opaque = BTPageGetOpaque(page);
+
+ /* check for deleted page */
if (!P_IGNORE(opaque))
{
- PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
+ PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), false))
@@ -2314,25 +2283,27 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
else if (scan->parallel_scan != NULL)
{
/* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ _bt_parallel_release(scan, opaque->btpo_prev, blkno);
}
- /*
- * For parallel scans, get the last page scanned as it is quite
- * possible that by the time we try to seize the scan, some other
- * worker has already advanced the scan to a different page. We
- * must continue based on the latest page scanned by any worker.
- */
+ /* nope, keep going */
if (scan->parallel_scan != NULL)
{
_bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
+ status = _bt_parallel_seize(scan, &blkno, &expectNext, false);
+
if (!status)
{
BTScanPosInvalidate(so->currPos);
return false;
}
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ }
+ else
+ {
+ expectNext = blkno;
+ blkno = opaque->btpo_prev;
+ _bt_relbuf(rel, so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
}
}
}
@@ -2347,7 +2318,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
* indicate success.
*/
static bool
-_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
+ BlockNumber nextblkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -2355,7 +2327,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
_bt_initialize_more_data(so, dir);
- if (!_bt_readnextpage(scan, blkno, dir))
+ if (!_bt_readnextpage(scan, blkno, nextblkno, dir))
return false;
/* We have at least one item to return as scan's next item */
@@ -2367,43 +2339,29 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
/*
* _bt_walk_left() -- step left one page, if possible
*
- * The given buffer must be pinned and read-locked. This will be dropped
- * before stepping left. On return, we have pin and read lock on the
- * returned page, instead.
+ * On return, we have a pin and read lock on the returned page.
*
* Returns InvalidBuffer if there is no page to the left (no lock is held
* in that case).
*
* It is possible for the returned leaf page to be half-dead; caller must
* check that condition and step left again when required.
+ *
+ * oblknum is the block we're walking left from
+ * blkno is the left sibling of oblknum when we last read that page
+ * (i.e. blkno was stored in oblknum's btpo_prev).
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_walk_left(Relation rel, BlockNumber blkno, BlockNumber oblknum)
{
Page page;
BTPageOpaque opaque;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
+ BlockNumber lblkno = blkno;
for (;;)
{
- BlockNumber obknum;
- BlockNumber lblkno;
- BlockNumber blkno;
+ Buffer buf;
int tries;
-
- /* if we're at end of tree, release buf and return failure */
- if (P_LEFTMOST(opaque))
- {
- _bt_relbuf(rel, buf);
- break;
- }
- /* remember original page we are stepping left from */
- obknum = BufferGetBlockNumber(buf);
- /* step left */
- blkno = lblkno = opaque->btpo_prev;
- _bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
buf = _bt_getbuf(rel, blkno, BT_READ);
@@ -2423,7 +2381,7 @@ _bt_walk_left(Relation rel, Buffer buf)
tries = 0;
for (;;)
{
- if (!P_ISDELETED(opaque) && opaque->btpo_next == obknum)
+ if (!P_ISDELETED(opaque) && opaque->btpo_next == oblknum)
{
/* Found desired page, return it */
return buf;
@@ -2437,7 +2395,7 @@ _bt_walk_left(Relation rel, Buffer buf)
}
/* Return to the original page to see what's up */
- buf = _bt_relandgetbuf(rel, buf, obknum, BT_READ);
+ buf = _bt_relandgetbuf(rel, buf, oblknum, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
if (P_ISDELETED(opaque))
@@ -2475,9 +2433,21 @@ _bt_walk_left(Relation rel, Buffer buf)
*/
if (opaque->btpo_prev == lblkno)
elog(ERROR, "could not find left sibling of block %u in index \"%s\"",
- obknum, RelationGetRelationName(rel));
+ oblknum, RelationGetRelationName(rel));
/* Okay to try again with new lblkno value */
}
+
+ /* if we're at end of tree, release buf and return failure */
+ if (P_LEFTMOST(opaque))
+ {
+ _bt_relbuf(rel, buf);
+ break;
+ }
+ /* remember original page we are stepping left from */
+ oblknum = BufferGetBlockNumber(buf);
+ /* step left */
+ blkno = lblkno = opaque->btpo_prev;
+ _bt_relbuf(rel, buf);
}
return InvalidBuffer;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index e765754d80..5cff74d6d2 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -87,6 +87,7 @@
#include "utils/memutils.h"
#ifdef LWLOCK_STATS
+#include "storage/ipc.h"
#include "utils/hsearch.h"
#endif
--
2.40.1
On Mon, Sep 02, 2024 at 01:31:55PM +0200, Matthias van de Meent wrote:
I noticed I attached an older version of the patch which still had 1
assertion failure case remaining (thanks cfbot), so here's v3 which
solves that problem.
Peter G., this is in your area of expertise. Could you look at what
Matthias has proposed?
--
Michael
On Fri, Jul 5, 2024 at 12:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

On HEAD, the following query requires 24 buffer hits (I'm getting a
standard/forward index scan for this):

select
abalance
from
pgbench_accounts
where
aid in (1, 500, 1000, 1500, 2000, 3000) order by aid asc;

However, we don't see that with the backwards scan variant:

select
abalance
from
pgbench_accounts
where
aid in (1, 500, 1000, 1500, 2000, 3000) order by aid desc;

We actually see 30 buffer hits for this on HEAD. Whereas with the
patch, both variants get only 24 buffer hits -- there's parity, at
least in cases like these two.
I'll now present a similar example that shows the remaining issue that
my patch (which became my commit 3f44959f) didn't address, and how the
issue is fixed by Matthias' more recent follow-up patch:
$ pgbench -i -s 1
Now run:
explain (analyze, buffers, costs off, summary off)
select
abalance
from
pgbench_accounts order by aid;
With a warm cache, this query gets a full index scan requiring 1915
buffer hits. However, we still see worse performance for an equivalent
backwards scan:
explain (analyze, buffers, costs off, summary off)
select
abalance
from
pgbench_accounts order by aid desc;
On master, the latter query gets 2188 buffer hits -- so clearly my
commit 3f44959f left money on the table. These extra buffer hits just
don't make sense.
Matthias' patch (I tested
v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch) will
result in both variants requiring 1915 buffer hits. (Plus, of course,
query execution should be faster with Matthias' new approach.)
Though we do need special extra steps for dealing with concurrent page
splits during backwards scans (steps involving a reread of the
original page when its sibling link turns out to be stale), there
hasn't been a concurrent split here -- which is why I find the
inconsistency to be unnatural. Matthias' patch fixes the problem by
passing _bt_walk_left the left link that will now have been stashed in
the local scan state back when the page was read by _bt_readpage, so
that it (_bt_walk_left) can optimistically *start* there (and *not*
start at the page that has already been read, as on the master
branch).
Of course, the left link might be stale, but it was always inherently
necessary for _bt_walk_left to be able to deal with that. Technically
we're being more optimistic about it now, but that optimism is
extremely well justified (concurrent page splits are rare). Arguably,
this latest patch from Matthias actually makes the code simpler. It
makes the backwards scan case more similar to the forwards scan case.
This is another missed opportunity for commit 2ed5b87f96, I'd say --
that 2015 commit left things in a slightly odd in-between state, which
this patch fully corrects.
Note: my example was deliberately constructed to use an index scan
backwards, rather than an index-only scan backwards. While both types
of backwards scan will benefit in the same way (less buffer lock
traffic), you'll only see a reduction in buffers hit for an
EXPLAIN (ANALYZE, BUFFERS) involving a plain index scan backwards. This
is because the mechanism added by commit 2ed5b87f96 will only drop a
leaf page buffer pin early for plain index scans, never for index-only
scans. My example also wouldn't make the difference
apparent with an unlogged table, for the same reason -- this
difference isn't really important, but seems worth noting to avoid any
confusion.
A nice side benefit of this approach is that it'll make it easier to
add better coverage for _bt_walk_left. The case where _bt_walk_left
notices a stale left link will still be very rare (we're really not
being all that optimistic in doing things this way), but it'll now be
easier to hit on purpose.
The code in v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
looks reasonable to me. I don't think that there are any impediments
to committing it sometime this week.
--
Peter Geoghegan
On Tue, Oct 8, 2024 at 12:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
The code in v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
looks reasonable to me. I don't think that there are any impediments
to committing it sometime this week.
I see significant benefits for parallel index-only backwards scans
with this patch. It's not that hard to see about a 10% decrease in
execution time. I imagine that there are disproportionate benefits for
the parallel case because the work we're avoiding is work that takes
place while a backend has seized the scan. Not bad!
I've been looking at this patch in more detail. I think that you
(Matthias) have gone too far in the direction of making the backwards
scan handling within _bt_readnextpage match the forwards scan
handling. You fail to update blkno when we have to deal with a
concurrent split/deletion that necessitates reading from another blkno
(when the correct left sibling page to read from changes) -- v3 still
used the potentially-stale blkno passed by _bt_readnextpage's caller.
Attached v4 deals with that issue. It also goes further in the
direction of simplifying the code in and around _bt_readnextpage.
The code in this area has always seemed to me to be quite a lot more
complicated than it really needed to be. Now seems like a good time to
refactor and simplify. It's likely that we'll need to do more work
nearby to enable prefetching during plain index scans, without
regressing the kill_prior_tuple/_bt_killitems mechanism -- Tomas
Vondra is currently working on that. The mechanism in question is
closely tied to the stuff that I'm simplifying now, particularly the
_bt_drop_lock_and_maybe_pin/TID-recycle-safety business.
The main simplification new to v4 is that v4 isolates the need to call
_bt_drop_lock_and_maybe_pin to only 2 functions: _bt_readnextpage, and
its new _bt_readfirstpage "sibling" function. The functions have
similar preconditions, and identical postconditions -- all of which
are now clearly documented.
It's not just _bt_drop_lock_and_maybe_pin that's only called from
these 2 functions in v4. All calls to _bt_readpage (plus their
associated PredicateLockPage calls) are now also isolated to those
same 2 functions. This shift enables removing the
_bt_parallel_readpage function entirely -- the point at the top of
_bt_first where _bt_parallel_readpage used to be called can now call
_bt_readnextpage directly instead. ISTM that having a separate
_bt_parallel_readpage function never made much sense -- it was
papering over the fact that the interfaces it uses are so confusing.
v4 also demotes _bt_steppage to a helper function that sets up a call
to _bt_readnextpage. Note that the 2017 parallel index scan commit
split _bt_steppage in two: it became the current _bt_steppage code,
plus the current _bt_readnextpage code. Making that relationship
explicit seems much clearer. I'm essentially finishing off the process
of splitting _bt_steppage up.
Side note: all of these simplifications were enabled by making the
preconditions for calling _bt_readnextpage work independently of the
scan direction (which was present in earlier versions of the patch
from Matthias). We now always do a
"BTScanPosUnpinIfPinned(so->currPos)" within _bt_steppage, so it's
easy to talk clearly about the preconditions/postconditions for
_bt_readnextpage without directly considering its _bt_steppage caller
(still the main caller, including during parallel index scans).
Performance improvement bonus: We do that
"BTScanPosUnpinIfPinned(so->currPos)" at a slightly different point in
_bt_steppage in v4, as compared to master, even during forwards scans:
we consistently unpin *before* the required next call to
_bt_parallel_seize takes place (unless, of course, it's a serial scan,
which'll never need to call _bt_parallel_seize). Presumably this is
why I see a ~5% increase in throughput for parallel index-only forward
(standard) scans with this v4 -- all parallel scans (backwards and
forwards alike) now receive at least some benefit from this patch. It
makes a certain amount of sense; we should do as little as possible
during any window where our backend has seized the scan.
I do need to do more testing, but the direction of v4 seems promising.
--
Peter Geoghegan
Attachments:
v4-0001-Optimize-and-simplify-nbtree-backwards-scans.patch (application/octet-stream)
From d864b395f0c0012312d518c28787ae40dcdbe5b1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 8 Oct 2024 11:40:54 -0400
Subject: [PATCH v4] Optimize and simplify nbtree backwards scans.
---
src/include/access/nbtree.h | 11 +-
src/backend/access/nbtree/nbtree.c | 40 ++-
src/backend/access/nbtree/nbtsearch.c | 463 +++++++++++++-------------
src/backend/access/nbtree/nbtutils.c | 2 +-
4 files changed, 267 insertions(+), 249 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d64300fb9..c082bdf64 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -954,6 +954,7 @@ typedef struct BTScanPosData
XLogRecPtr lsn; /* pos in the WAL stream when page was read */
BlockNumber currPage; /* page referenced by items array */
+ BlockNumber prevPage; /* page's left link when we scanned it */
BlockNumber nextPage; /* page's right link when we scanned it */
/*
@@ -1022,6 +1023,7 @@ typedef BTScanPosData *BTScanPos;
#define BTScanPosInvalidate(scanpos) \
do { \
(scanpos).currPage = InvalidBlockNumber; \
+ (scanpos).prevPage = InvalidBlockNumber; \
(scanpos).nextPage = InvalidBlockNumber; \
(scanpos).buf = InvalidBuffer; \
(scanpos).lsn = InvalidXLogRecPtr; \
@@ -1090,7 +1092,6 @@ typedef struct BTReadPageState
OffsetNumber minoff; /* Lowest non-pivot tuple's offset */
OffsetNumber maxoff; /* Highest non-pivot tuple's offset */
IndexTuple finaltup; /* Needed by scans with array keys */
- BlockNumber prev_scan_page; /* previous _bt_parallel_release block */
Page page; /* Page being read */
/* Per-tuple input parameters, set by _bt_readpage for _bt_checkkeys */
@@ -1192,11 +1193,13 @@ extern int btgettreeheight(Relation rel);
* prototypes for internal functions in nbtree.c
*/
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno,
- bool first);
-extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+ BlockNumber *currblkno, bool first);
+extern void _bt_parallel_release(IndexScanDesc scan,
+ BlockNumber next_scan_page,
+ BlockNumber curr_page);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_primscan_schedule(IndexScanDesc scan,
- BlockNumber prev_scan_page);
+ BlockNumber curr_page);
/*
* prototypes for functions in nbtdedup.c
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 56e502c4f..b926b99ba 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -66,7 +66,9 @@ typedef enum
*/
typedef struct BTParallelScanDescData
{
- BlockNumber btps_scanPage; /* latest or next page to be scanned */
+ BlockNumber btps_nextScanPage; /* next page to be scanned */
+ BlockNumber btps_lastCurrPage; /* page whose sibling link was copied into
+ * btps_scanPage */
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
@@ -548,7 +550,8 @@ btinitparallelscan(void *target)
BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
SpinLockInit(&bt_target->btps_mutex);
- bt_target->btps_scanPage = InvalidBlockNumber;
+ bt_target->btps_nextScanPage = InvalidBlockNumber;
+ bt_target->btps_lastCurrPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
ConditionVariableInit(&bt_target->btps_cv);
}
@@ -573,7 +576,8 @@ btparallelrescan(IndexScanDesc scan)
* consistency.
*/
SpinLockAcquire(&btscan->btps_mutex);
- btscan->btps_scanPage = InvalidBlockNumber;
+ btscan->btps_nextScanPage = InvalidBlockNumber;
+ btscan->btps_lastCurrPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
SpinLockRelease(&btscan->btps_mutex);
}
@@ -600,7 +604,8 @@ btparallelrescan(IndexScanDesc scan)
* Callers should ignore the value of pageno if the return value is false.
*/
bool
-_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
+_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno,
+ BlockNumber *currblkno, bool first)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool exit_loop = false;
@@ -609,6 +614,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
BTParallelScanDesc btscan;
*pageno = P_NONE;
+ *currblkno = InvalidBlockNumber;
if (first)
{
@@ -684,7 +690,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
* of advancing it to a new page!
*/
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
- *pageno = btscan->btps_scanPage;
+ *pageno = btscan->btps_nextScanPage;
+ *currblkno = btscan->btps_lastCurrPage;
exit_loop = true;
}
SpinLockRelease(&btscan->btps_mutex);
@@ -699,17 +706,22 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
/*
* _bt_parallel_release() -- Complete the process of advancing the scan to a
- * new page. We now have the new value btps_scanPage; some other backend
+ * new page. We now have the new value btps_nextScanPage; another backend
* can now begin advancing the scan.
*
- * Callers whose scan uses array keys must save their scan_page argument so
+ * Callers whose scan uses array keys must save their curr_page argument so
* that it can be passed to _bt_parallel_primscan_schedule, should caller
* determine that another primitive index scan is required. If that happens,
* scan_page won't be scanned by any backend (unless the next primitive index
* scan lands on scan_page).
+ *
+ * Note: unlike the serial case, parallel scans don't need to remember both
+ * sibling links. Caller can just pass us whichever link is next for the
+ * current scan direction because it cannot change within a parallel scan.
*/
void
-_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+_bt_parallel_release(IndexScanDesc scan, BlockNumber next_scan_page,
+ BlockNumber curr_page)
{
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
@@ -718,7 +730,8 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
parallel_scan->ps_offset);
SpinLockAcquire(&btscan->btps_mutex);
- btscan->btps_scanPage = scan_page;
+ btscan->btps_nextScanPage = next_scan_page;
+ btscan->btps_lastCurrPage = curr_page;
btscan->btps_pageStatus = BTPARALLEL_IDLE;
SpinLockRelease(&btscan->btps_mutex);
ConditionVariableSignal(&btscan->btps_cv);
@@ -774,13 +787,13 @@ _bt_parallel_done(IndexScanDesc scan)
/*
* _bt_parallel_primscan_schedule() -- Schedule another primitive index scan.
*
- * Caller passes the block number most recently passed to _bt_parallel_release
+ * Caller passes the curr_page most recently passed to _bt_parallel_release
* by its backend. Caller successfully schedules the next primitive index scan
* if the shared parallel state hasn't been seized since caller's backend last
* advanced the scan.
*/
void
-_bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
+_bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber curr_page)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
@@ -792,10 +805,11 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
parallel_scan->ps_offset);
SpinLockAcquire(&btscan->btps_mutex);
- if (btscan->btps_scanPage == prev_scan_page &&
+ if (btscan->btps_lastCurrPage == curr_page &&
btscan->btps_pageStatus == BTPARALLEL_IDLE)
{
- btscan->btps_scanPage = InvalidBlockNumber;
+ btscan->btps_nextScanPage = InvalidBlockNumber;
+ btscan->btps_lastCurrPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NEED_PRIMSCAN;
/* Serialize scan's current array keys */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fff7c89ea..047c891d8 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -43,10 +43,12 @@ static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum,
ItemPointer heapTid, int tupleOffset);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
-static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
- ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_readfirstpage(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber offnum);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
+ BlockNumber lastcurrblkno, ScanDirection dir);
+static Buffer _bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
+ BlockNumber lastcurrblkno);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
@@ -924,7 +926,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (scan->parallel_scan != NULL)
{
- status = _bt_parallel_seize(scan, &blkno, true);
+ BlockNumber lastcurrblkno;
+
+ status = _bt_parallel_seize(scan, &blkno, &lastcurrblkno, true);
/*
* Initialize arrays (when _bt_parallel_seize didn't already set up
@@ -942,7 +946,11 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
else if (blkno != InvalidBlockNumber)
{
- if (!_bt_parallel_readpage(scan, blkno, dir))
+ Assert(!so->needPrimScan);
+
+ _bt_initialize_more_data(so, dir);
+
+ if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir))
return false;
goto readcomplete;
}
@@ -1427,8 +1435,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
}
- PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
-
_bt_initialize_more_data(so, dir);
/* position to the precise item on the page */
@@ -1455,21 +1461,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* for the page. For example, when inskey is both < the leaf page's high
* key and > all of its non-pivot tuples, offnum will be "maxoff + 1".
*/
- if (!_bt_readpage(scan, dir, offnum, true))
- {
- /*
- * There's no actually-matching data on this page. Try to advance to
- * the next page. Return false if there's no matching data at all.
- */
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
- if (!_bt_steppage(scan, dir))
- return false;
- }
- else
- {
- /* We have at least one item to return as scan's first item */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
- }
+ if (!_bt_readfirstpage(scan, dir, offnum))
+ return false;
readcomplete:
/* OK, itemIndex says what to return */
@@ -1582,17 +1575,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
page = BufferGetPage(so->currPos.buf);
opaque = BTPageGetOpaque(page);
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- pstate.prev_scan_page = opaque->btpo_next;
- else
- pstate.prev_scan_page = BufferGetBlockNumber(so->currPos.buf);
-
- _bt_parallel_release(scan, pstate.prev_scan_page);
- }
-
indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
arrayKeys = so->numArrayKeys != 0;
minoff = P_FIRSTDATAKEY(opaque);
@@ -1626,10 +1608,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
/*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * we must save the page's right-link and left-link; this tells us where
+ * to step right/left to after we're done with these items
*/
+ so->currPos.prevPage = opaque->btpo_prev;
so->currPos.nextPage = opaque->btpo_next;
/* initialize tuple workspace to empty */
@@ -1640,6 +1622,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
* good.
*/
Assert(BTScanPosIsPinned(so->currPos));
+ Assert(!P_IGNORE(opaque));
+
+ /* allow next page to be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, so->currPos.nextPage,
+ so->currPos.currPage);
+ else
+ _bt_parallel_release(scan, so->currPos.prevPage,
+ so->currPos.currPage);
+ }
/*
* Prechecking the value of the continuescan flag for the last item on the
@@ -1934,7 +1928,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
* We don't need to visit page to the left when no more matches will
* be found there
*/
- if (!pstate.continuescan || P_LEFTMOST(opaque))
+ if (!pstate.continuescan)
so->currPos.moreLeft = false;
Assert(itemIndex >= 0);
@@ -2035,19 +2029,21 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
- * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
- * if pinned, we'll drop the pin before moving to next page. The buffer is
- * not locked on entry.
+ * On entry, if so->currPos.buf is valid the buffer is pinned but not locked.
+ * This is a wrapper on _bt_readnextpage that performs final steps for the
+ * current page (including dropping its pin). It sets up the _bt_readnextpage
+ * call using either local state saved by the last _bt_readpage call, or using
+ * shared state consumed by seizing the parallel scan.
*
- * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
- * read lock, on that page. If we do not hold the pin, we set so->currPos.buf
- * to InvalidBuffer. We return true to indicate success.
+ * Parallel scan callers that have already seized the scan should directly
+ * call _bt_readnextpage, rather than calling here.
*/
static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BlockNumber blkno = InvalidBlockNumber;
+ BlockNumber lastcurrblkno;
bool status;
Assert(BTScanPosIsValid(so->currPos));
@@ -2098,85 +2094,139 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
else
so->markPos.moreLeft = true;
}
+
+ /* mark/restore not supported by parallel scans */
+ Assert(!scan->parallel_scan);
}
- if (ScanDirectionIsForward(dir))
+ /* release the previously pinned leaf page buffer, if any */
+ BTScanPosUnpinIfPinned(so->currPos);
+
+ /* Walk to the next page with data */
+ if (!scan->parallel_scan)
{
- /* Walk right to the next page with data */
- if (scan->parallel_scan != NULL)
+ /* Not parallel, so use nextPage/prevPage from local scan state */
+ if (ScanDirectionIsForward(dir))
{
- /*
- * Seize the scan to get the next block number; if the scan has
- * ended already, bail out.
- */
- status = _bt_parallel_seize(scan, &blkno, false);
- if (!status)
- {
- /* release the previous buffer, if pinned */
- BTScanPosUnpinIfPinned(so->currPos);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
+ blkno = so->currPos.nextPage;
+ /* Remember we left a page with data */
+ so->currPos.moreLeft = true;
}
else
{
- /* Not parallel, so use the previously-saved nextPage link. */
- blkno = so->currPos.nextPage;
+ blkno = so->currPos.prevPage;
+ /* Remember we left a page with data */
+ so->currPos.moreRight = true;
}
-
- /* Remember we left a page with data */
- so->currPos.moreLeft = true;
-
- /* release the previous buffer, if pinned */
- BTScanPosUnpinIfPinned(so->currPos);
+ lastcurrblkno = so->currPos.currPage;
}
else
{
- /* Remember we left a page with data */
- so->currPos.moreRight = true;
+ /*
+ * Seize the scan to get the nextPage and currPage from shared
+ * parallel state
+ */
+ status = _bt_parallel_seize(scan, &blkno, &lastcurrblkno, false);
+ if (!status)
+ {
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
- if (scan->parallel_scan != NULL)
- {
- /*
- * Seize the scan to get the current block number; if the scan has
- * ended already, bail out.
- */
- status = _bt_parallel_seize(scan, &blkno, false);
- BTScanPosUnpinIfPinned(so->currPos);
- if (!status)
- {
- BTScanPosInvalidate(so->currPos);
- return false;
- }
- }
- else
- {
- /* Not parallel, so just use our own notion of the current page */
- blkno = so->currPos.currPage;
- }
+ /* no point in remembering we left a page with data here */
}
- if (!_bt_readnextpage(scan, blkno, dir))
+ if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir))
return false;
- /* We have at least one item to return as scan's next item */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ return true;
+}
+/*
+ * _bt_readfirstpage() -- Read first page containing valid data for scan
+ *
+ * Caller passes us a page offset number that's the first tuple that'll be
+ * read by _bt_readpage. This offset typically comes from _bt_binsrch.
+ *
+ * On entry, so->currPos.buf must be pinned and locked. Also, parallel scan
+ * callers must have seized the scan before calling here.
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page, and we return true. We hold a pin on the buffer on
+ * success exit, except for scans that _bt_drop_lock_and_maybe_pin indicates
+ * can safely have their pin dropped to avoid blocking progress by VACUUM.
+ *
+ * If there are no more matching records in the given direction, we set
+ * so->currPos.buf to InvalidBuffer, and return false. Note that this could
+ * just signal the end of the current primitive index scan.
+ *
+ * We always release the scan for a parallel scan caller, regardless of
+ * success or failure; we do this by calling _bt_parallel_release at the
+ * earliest opportunity.
+ */
+static bool
+_bt_readfirstpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+{
+ Relation rel = scan->indexRelation;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(BufferIsValid(so->currPos.buf));
+ Assert(!P_IGNORE(BTPageGetOpaque(BufferGetPage(so->currPos.buf))));
+
+ PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ if (_bt_readpage(scan, dir, offnum, true))
+ {
+ /* We have at least one item to return as scan's first item */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ return true;
+ }
+
+ /*
+ * There's no actually-matching data on this page. Try to advance to the
+ * next page.
+ */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ if (!_bt_steppage(scan, dir))
+ return false;
+
+ /*
+ * We have at least one item to return as scan's first item (just not from
+ * the first page it read). No need to call _bt_drop_lock_and_maybe_pin
+ * here; _bt_readnextpage already did that for us.
+ */
return true;
}
/*
* _bt_readnextpage() -- Read next page containing valid data for scan
*
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page, and we return true. Caller must release the lock (and
- * maybe the pin) on the buffer on success exit.
+ * caller's blkno is the next interesting page's link, taken from either the
+ * previously-saved right link or left link. lastcurrblkno is the page that
* was current at the point where the blkno link was saved, which we sometimes
+ * use to reason about concurrent page splits/page deletions (this is only
+ * required by backwards scans, but we always have it, just to be consistent).
*
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return false.
+ * On entry, so->currPos.buf must not be pinned or locked. Also, parallel
+ * scan callers must have seized the scan before calling here.
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page, and we return true. We hold a pin on the buffer on
+ * success exit, except for scans that _bt_drop_lock_and_maybe_pin indicates
+ * can safely have their pin dropped to avoid blocking progress by VACUUM.
+ *
+ * If there are no more matching records in the given direction, we set
+ * so->currPos.buf to InvalidBuffer, and return false. Note that this could
+ * just signal the end of the current primitive index scan.
+ *
+ * We always release the scan for a parallel scan caller, regardless of
+ * success or failure; we do this by calling _bt_parallel_release at the
+ * earliest opportunity.
*/
static bool
-_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
+ BlockNumber lastcurrblkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -2186,6 +2236,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
rel = scan->indexRelation;
+ Assert(!BTScanPosIsPinned(so->currPos));
+
if (ScanDirectionIsForward(dir))
{
for (;;)
@@ -2217,84 +2269,54 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
else if (scan->parallel_scan != NULL)
{
- /* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, opaque->btpo_next);
+ /* _bt_readpage not called, so do this for ourselves */
+ _bt_parallel_release(scan, opaque->btpo_next, blkno);
}
- /* nope, keep going */
if (scan->parallel_scan != NULL)
{
+ /* no matching tuples, so seize a page to the right */
_bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
+ status = _bt_parallel_seize(scan, &blkno, &lastcurrblkno, false);
if (!status)
{
+ /* don't call _bt_parallel_done */
BTScanPosInvalidate(so->currPos);
return false;
}
}
else
{
+ /* no matching tuples, so consider following right link */
+ lastcurrblkno = blkno;
blkno = opaque->btpo_next;
_bt_relbuf(rel, so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
}
}
}
else
{
- /*
- * Should only happen in parallel cases, when some other backend
- * advanced the scan.
- */
- if (so->currPos.currPage != blkno)
- {
- BTScanPosUnpinIfPinned(so->currPos);
- so->currPos.currPage = blkno;
- }
-
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
- {
- BTScanPosUnpinIfPinned(so->currPos);
- _bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
-
- /*
- * Walk left to the next page with data. This is much more complex
- * than the walk-right case because of the possibility that the page
- * to our left splits while we are in flight to it, plus the
- * possibility that the page we were on gets deleted after we leave
- * it. See nbtree/README for details.
- *
- * It might be possible to rearrange this code to have less overhead
- * in pinning and locking, but that would require capturing the left
- * sibling block number when the page is initially read, and then
- * optimistically starting there (rather than pinning the page twice).
- * It is not clear that this would be worth the complexity.
- */
- if (BTScanPosIsPinned(so->currPos))
- _bt_lockbuf(rel, so->currPos.buf, BT_READ);
- else
- so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
-
for (;;)
{
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
+ /*
+ * if we're at end of scan, give up and mark parallel scan as
+ * done, so that all the workers can finish their scan
+ */
+ if (blkno == P_NONE || !so->currPos.moreLeft)
{
- _bt_relbuf(rel, so->currPos.buf);
_bt_parallel_done(scan);
BTScanPosInvalidate(so->currPos);
return false;
}
- /* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
+ /* move left at least one page (checks for interrupts, too) */
+ so->currPos.buf = _bt_lock_and_validate_left(rel, &blkno,
+ lastcurrblkno);
- /* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
{
+ /* must have been a concurrent deletion of leftmost page */
_bt_parallel_done(scan);
BTScanPosInvalidate(so->currPos);
return false;
@@ -2309,7 +2331,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
opaque = BTPageGetOpaque(page);
if (!P_IGNORE(opaque))
{
- PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
+ PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), false))
@@ -2317,51 +2339,34 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
else if (scan->parallel_scan != NULL)
{
- /* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ /* _bt_readpage not called, so do this for ourselves */
+ Assert(P_ISHALFDEAD(opaque));
+ _bt_parallel_release(scan, opaque->btpo_prev, blkno);
}
- /*
- * For parallel scans, get the last page scanned as it is quite
- * possible that by the time we try to seize the scan, some other
- * worker has already advanced the scan to a different page. We
- * must continue based on the latest page scanned by any worker.
- */
if (scan->parallel_scan != NULL)
{
+ /* no matching tuples, so seize a page to the left */
_bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
+ status = _bt_parallel_seize(scan, &blkno, &lastcurrblkno, false);
if (!status)
{
+ /* don't call _bt_parallel_done */
BTScanPosInvalidate(so->currPos);
return false;
}
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ }
+ else
+ {
+ /* no matching tuples, so consider following left link */
+ lastcurrblkno = blkno;
+ blkno = opaque->btpo_prev;
+ _bt_relbuf(rel, so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
}
}
}
- return true;
-}
-
-/*
- * _bt_parallel_readpage() -- Read current page containing valid data for scan
- *
- * On success, release lock and maybe pin on buffer. We return true to
- * indicate success.
- */
-static bool
-_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
- Assert(!so->needPrimScan);
-
- _bt_initialize_more_data(so, dir);
-
- if (!_bt_readnextpage(scan, blkno, dir))
- return false;
-
/* We have at least one item to return as scan's next item */
_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
@@ -2369,48 +2374,41 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
/*
- * _bt_walk_left() -- step left one page, if possible
+ * _bt_lock_and_validate_left() -- lock caller's left sibling blkno,
+ * recovering from concurrent page splits/page deletions when necessary
*
- * The given buffer must be pinned and read-locked. This will be dropped
- * before stepping left. On return, we have pin and read lock on the
- * returned page, instead.
+ * Called during backwards scans, to deal with their unique concurrency rules.
*
- * Returns InvalidBuffer if there is no page to the left (no lock is held
- * in that case).
+ * blkno points to the block number of the page that we expect to move the
+ * scan to. We'll successfully move the scan there when we find that its
+ * right sibling link points to lastcurrblkno, the page that we just finished
+ * reading. Otherwise, we have to figure out which page is the correct one to
+ * visit next the hard way, which involves reasoning about both concurrent
+ * page splits and concurrent page deletions. See nbtree/README for details.
+ *
+ * On return, we have both a pin and a read lock on the returned page, whose
+ * block number will be set in *blkno. Returns InvalidBuffer if there is no
+ * page to the left (no lock is held in that case).
*
* It is possible for the returned leaf page to be half-dead; caller must
* check that condition and step left again when required.
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
+ BlockNumber lastcurrblkno)
{
- Page page;
- BTPageOpaque opaque;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
+ BlockNumber origblkno = *blkno; /* detects circular links */
for (;;)
{
- BlockNumber obknum;
- BlockNumber lblkno;
- BlockNumber blkno;
+ Buffer buf;
+ Page page;
+ BTPageOpaque opaque;
int tries;
- /* if we're at end of tree, release buf and return failure */
- if (P_LEFTMOST(opaque))
- {
- _bt_relbuf(rel, buf);
- break;
- }
- /* remember original page we are stepping left from */
- obknum = BufferGetBlockNumber(buf);
- /* step left */
- blkno = lblkno = opaque->btpo_prev;
- _bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, *blkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -2427,21 +2425,26 @@ _bt_walk_left(Relation rel, Buffer buf)
tries = 0;
for (;;)
{
- if (!P_ISDELETED(opaque) && opaque->btpo_next == obknum)
+ if (!P_ISDELETED(opaque) && opaque->btpo_next == lastcurrblkno)
{
/* Found desired page, return it */
return buf;
}
if (P_RIGHTMOST(opaque) || ++tries > 4)
break;
- blkno = opaque->btpo_next;
- buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
+ /* step right */
+ *blkno = opaque->btpo_next;
+ buf = _bt_relandgetbuf(rel, buf, *blkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
}
- /* Return to the original page to see what's up */
- buf = _bt_relandgetbuf(rel, buf, obknum, BT_READ);
+ /*
+ * Return to the original page (usually the page most recently read by
+ * _bt_readpage, which is passed by caller as lastcurrblkno) to see
+ * what's up with sibling links
+ */
+ buf = _bt_relandgetbuf(rel, buf, lastcurrblkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
if (P_ISDELETED(opaque))
@@ -2457,8 +2460,8 @@ _bt_walk_left(Relation rel, Buffer buf)
if (P_RIGHTMOST(opaque))
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
- blkno = opaque->btpo_next;
- buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
+ *blkno = opaque->btpo_next;
+ buf = _bt_relandgetbuf(rel, buf, *blkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
if (!P_ISDELETED(opaque))
@@ -2466,9 +2469,10 @@ _bt_walk_left(Relation rel, Buffer buf)
}
/*
- * Now return to top of loop, resetting obknum to point to this
- * nondeleted page, and try again.
+ * remember that this page is now the leftmost page whose key
+ * space we've already read all tuples from
*/
+ lastcurrblkno = BufferGetBlockNumber(buf);
}
else
{
@@ -2477,11 +2481,22 @@ _bt_walk_left(Relation rel, Buffer buf)
* to the left got split or deleted. Without this check, we'd go
* into an infinite loop if there's anything wrong.
*/
- if (opaque->btpo_prev == lblkno)
+ if (opaque->btpo_prev == origblkno)
elog(ERROR, "could not find left sibling of block %u in index \"%s\"",
- obknum, RelationGetRelationName(rel));
- /* Okay to try again with new lblkno value */
+ lastcurrblkno, RelationGetRelationName(rel));
+ /* Okay to try again, since left sibling link changed */
}
+
+ if (P_LEFTMOST(opaque))
+ {
+ /* previous leftmost page must have been concurrently deleted */
+ _bt_relbuf(rel, buf);
+ break;
+ }
+
+ /* next loop iteration will start over with a clean slate */
+ *blkno = origblkno = opaque->btpo_prev;
+ _bt_relbuf(rel, buf);
}
return InvalidBuffer;
@@ -2605,7 +2620,6 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
return false;
}
- PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
Assert(P_ISLEAF(opaque));
@@ -2637,21 +2651,8 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start, true))
- {
- /*
- * There's no actually-matching data on this page. Try to advance to
- * the next page. Return false if there's no matching data at all.
- */
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
- if (!_bt_steppage(scan, dir))
- return false;
- }
- else
- {
- /* We have at least one item to return as scan's first item */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
- }
+ if (!_bt_readfirstpage(scan, dir, start))
+ return false;
/* OK, itemIndex says what to return */
currItem = &so->currPos.items[so->currPos.itemIndex];
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index b4ba51357..03d5c671f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2442,7 +2442,7 @@ new_prim_scan:
so->needPrimScan = true; /* ...but call _bt_first again */
if (scan->parallel_scan)
- _bt_parallel_primscan_schedule(scan, pstate->prev_scan_page);
+ _bt_parallel_primscan_schedule(scan, so->currPos.currPage);
/* Caller's tuple doesn't match the new qual */
return false;
--
2.45.2
On Thu, Oct 10, 2024 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
The main simplification new to v4 is that v4 isolates the need to call
_bt_drop_lock_and_maybe_pin to only 2 functions: _bt_readnextpage, and
its new _bt_readfirstpage "sibling" function. The functions have
similar preconditions, and identical postconditions -- all of which
are now clearly documented.
Worked on this a little more today. Attached is v5, which goes even
further than v4 did. Highlights:
* Completely gets rid of _bt_initialize_more_data by moving its code
into the new _bt_readfirstpage function.
* Adds similar "initialize moreLeft/moreRight" code to the
corresponding point within _bt_readnextpage (that is, at the start of
_bt_readnextpage) -- meaning I moved a bit more code from _bt_steppage
into _bt_readnextpage.
Again, _bt_readnextpage and the new _bt_readfirstpage function are
intended to be "sibling" functions (while _bt_steppage is demoted to a
helper that sets up a call to _bt_readfirstpage). The matching
handling of currPos initialization within _bt_readnextpage and the
_bt_readfirstpage pushes this further than in v4.
* We now reset currPos state (including even its moreLeft/moreRight
fields) within _bt_parallel_seize, automatically and regardless of any
other details.
It doesn't really make sense that (up until now) we've trusted the
shared state that we seized from the parallel scan to kinda be in sync
with the local currPos state (at least its moreLeft/moreRight need to
be in sync, to avoid confusion within _bt_readnextpage). This allowed
me to remove several parallel-only BTScanPosInvalidate calls from
_bt_readnextpage.
This "currPos must be kept in sync with parallel scan state" confusion
is exemplified by the following code snippet (which FWIW was already
removed by earlier revisions of the patch), that appears in the
backwards scan block of _bt_readnextpage on master:
"""
/*
* Should only happen in parallel cases, when some other backend
* advanced the scan.
*/
if (so->currPos.currPage != blkno)
{
BTScanPosUnpinIfPinned(so->currPos);
so->currPos.currPage = blkno;
}
"""
Why should the parallel scan have any concern for what currPos is, in general?
* Now nbtree has only one PredicateLockPage call, inside _bt_readpage.
This saves us an extra BufferGetBlockNumber call. (v4 pushed
PredicateLockPage calls down one layer, v5 pushes them down another
layer still.)
* Improved precondition/postcondition comments, and generally cleaner
separation between _bt_steppage and _bt_readnextpage.
--
Peter Geoghegan
Attachments:
v5-0001-Optimize-and-simplify-nbtree-backwards-scans.patch (application/x-patch)
From b3c892d428688057378c5a74e2bb313c1a6b4c2e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 8 Oct 2024 11:40:54 -0400
Subject: [PATCH v5] Optimize and simplify nbtree backwards scans.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WgpBGRgTTxTWVPXc9+PB6fc1a7t+VyGXHzfnrFXcQVxnA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkBTuFv7W2+84jJT8mWZLXVL0GHq2hMUTn6c9Vw=eYrCw@mail.gmail.com
---
src/include/access/nbtree.h | 21 +-
src/backend/access/nbtree/nbtree.c | 49 ++-
src/backend/access/nbtree/nbtsearch.c | 603 +++++++++++++-------------
src/backend/access/nbtree/nbtutils.c | 2 +-
4 files changed, 347 insertions(+), 328 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d64300fb9..594eddaec 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -954,6 +954,7 @@ typedef struct BTScanPosData
XLogRecPtr lsn; /* pos in the WAL stream when page was read */
BlockNumber currPage; /* page referenced by items array */
+ BlockNumber prevPage; /* page's left link when we scanned it */
BlockNumber nextPage; /* page's right link when we scanned it */
/*
@@ -965,15 +966,6 @@ typedef struct BTScanPosData
bool moreLeft;
bool moreRight;
- /*
- * Direction of the scan at the time that _bt_readpage was called.
- *
- * Used by btrestrpos to "restore" the scan's array keys by resetting each
- * array to its first element's value (first in this scan direction). This
- * avoids the need to directly track the array keys in btmarkpos.
- */
- ScanDirection dir;
-
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
* location in the associated tuple storage workspace.
@@ -1022,6 +1014,7 @@ typedef BTScanPosData *BTScanPos;
#define BTScanPosInvalidate(scanpos) \
do { \
(scanpos).currPage = InvalidBlockNumber; \
+ (scanpos).prevPage = InvalidBlockNumber; \
(scanpos).nextPage = InvalidBlockNumber; \
(scanpos).buf = InvalidBuffer; \
(scanpos).lsn = InvalidXLogRecPtr; \
@@ -1071,6 +1064,7 @@ typedef struct BTScanOpaqueData
* determine if there is a mark, first look at markItemIndex, then at
* markPos.
*/
+ ScanDirection markDir; /* direction of scan with mark */
int markItemIndex; /* itemIndex, or -1 if not valid */
/* keep these last in struct for efficiency */
@@ -1090,7 +1084,6 @@ typedef struct BTReadPageState
OffsetNumber minoff; /* Lowest non-pivot tuple's offset */
OffsetNumber maxoff; /* Highest non-pivot tuple's offset */
IndexTuple finaltup; /* Needed by scans with array keys */
- BlockNumber prev_scan_page; /* previous _bt_parallel_release block */
Page page; /* Page being read */
/* Per-tuple input parameters, set by _bt_readpage for _bt_checkkeys */
@@ -1192,11 +1185,13 @@ extern int btgettreeheight(Relation rel);
* prototypes for internal functions in nbtree.c
*/
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno,
- bool first);
-extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+ BlockNumber *currblkno, bool first);
+extern void _bt_parallel_release(IndexScanDesc scan,
+ BlockNumber next_scan_page,
+ BlockNumber curr_page);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_primscan_schedule(IndexScanDesc scan,
- BlockNumber prev_scan_page);
+ BlockNumber curr_page);
/*
* prototypes for functions in nbtdedup.c
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 56e502c4f..db20b1138 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -66,7 +66,9 @@ typedef enum
*/
typedef struct BTParallelScanDescData
{
- BlockNumber btps_scanPage; /* latest or next page to be scanned */
+ BlockNumber btps_nextScanPage; /* next page to be scanned */
+ BlockNumber btps_lastCurrPage; /* page whose sibling link was copied into
+ * btps_nextScanPage */
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
@@ -520,7 +522,7 @@ btrestrpos(IndexScanDesc scan)
/* Reset the scan's array keys (see _bt_steppage for why) */
if (so->numArrayKeys)
{
- _bt_start_array_keys(scan, so->currPos.dir);
+ _bt_start_array_keys(scan, so->markDir);
so->needPrimScan = false;
}
}
@@ -548,7 +550,8 @@ btinitparallelscan(void *target)
BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
SpinLockInit(&bt_target->btps_mutex);
- bt_target->btps_scanPage = InvalidBlockNumber;
+ bt_target->btps_nextScanPage = InvalidBlockNumber;
+ bt_target->btps_lastCurrPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
ConditionVariableInit(&bt_target->btps_cv);
}
@@ -573,7 +576,8 @@ btparallelrescan(IndexScanDesc scan)
* consistency.
*/
SpinLockAcquire(&btscan->btps_mutex);
- btscan->btps_scanPage = InvalidBlockNumber;
+ btscan->btps_nextScanPage = InvalidBlockNumber;
+ btscan->btps_lastCurrPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
SpinLockRelease(&btscan->btps_mutex);
}
@@ -600,7 +604,8 @@ btparallelrescan(IndexScanDesc scan)
* Callers should ignore the value of pageno if the return value is false.
*/
bool
-_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
+_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno,
+ BlockNumber *currblkno, bool first)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool exit_loop = false;
@@ -609,6 +614,14 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
BTParallelScanDesc btscan;
*pageno = P_NONE;
+ *currblkno = InvalidBlockNumber;
+
+ /*
+ * Reset moreLeft/moreRight. We don't want to retain information from the
+ * last time the backend seized the scan.
+ */
+ BTScanPosInvalidate(so->currPos);
+ so->currPos.moreLeft = so->currPos.moreRight = true;
if (first)
{
@@ -684,7 +697,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
* of advancing it to a new page!
*/
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
- *pageno = btscan->btps_scanPage;
+ *pageno = btscan->btps_nextScanPage;
+ *currblkno = btscan->btps_lastCurrPage;
exit_loop = true;
}
SpinLockRelease(&btscan->btps_mutex);
@@ -699,17 +713,22 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
/*
* _bt_parallel_release() -- Complete the process of advancing the scan to a
- * new page. We now have the new value btps_scanPage; some other backend
+ * new page. We now have the new value btps_nextScanPage; another backend
* can now begin advancing the scan.
*
- * Callers whose scan uses array keys must save their scan_page argument so
+ * Callers whose scan uses array keys must save their curr_page argument so
* that it can be passed to _bt_parallel_primscan_schedule, should caller
* determine that another primitive index scan is required. If that happens,
* scan_page won't be scanned by any backend (unless the next primitive index
* scan lands on scan_page).
+ *
+ * Note: unlike the serial case, parallel scans don't need to remember both
+ * sibling links. Caller can just pass us whichever link is next for the scan
+ * direction because the direction cannot change when parallelism is in use.
*/
void
-_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+_bt_parallel_release(IndexScanDesc scan, BlockNumber next_scan_page,
+ BlockNumber curr_page)
{
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
@@ -718,7 +737,8 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
parallel_scan->ps_offset);
SpinLockAcquire(&btscan->btps_mutex);
- btscan->btps_scanPage = scan_page;
+ btscan->btps_nextScanPage = next_scan_page;
+ btscan->btps_lastCurrPage = curr_page;
btscan->btps_pageStatus = BTPARALLEL_IDLE;
SpinLockRelease(&btscan->btps_mutex);
ConditionVariableSignal(&btscan->btps_cv);
@@ -774,13 +794,13 @@ _bt_parallel_done(IndexScanDesc scan)
/*
* _bt_parallel_primscan_schedule() -- Schedule another primitive index scan.
*
- * Caller passes the block number most recently passed to _bt_parallel_release
+ * Caller passes the curr_page most recently passed to _bt_parallel_release
* by its backend. Caller successfully schedules the next primitive index scan
* if the shared parallel state hasn't been seized since caller's backend last
* advanced the scan.
*/
void
-_bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
+_bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber curr_page)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
@@ -792,10 +812,11 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
parallel_scan->ps_offset);
SpinLockAcquire(&btscan->btps_mutex);
- if (btscan->btps_scanPage == prev_scan_page &&
+ if (btscan->btps_lastCurrPage == curr_page &&
btscan->btps_pageStatus == BTPARALLEL_IDLE)
{
- btscan->btps_scanPage = InvalidBlockNumber;
+ btscan->btps_nextScanPage = InvalidBlockNumber;
+ btscan->btps_lastCurrPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NEED_PRIMSCAN;
/* Serialize scan's current array keys */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fff7c89ea..cd9092130 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -43,12 +43,13 @@ static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum,
ItemPointer heapTid, int tupleOffset);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
-static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
- ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum,
+ ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
+ BlockNumber lastcurrblkno, ScanDirection dir);
+static Buffer _bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
+ BlockNumber lastcurrblkno);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
-static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
/*
@@ -889,7 +890,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
int keysz = 0;
int i;
- bool status;
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
@@ -924,7 +924,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (scan->parallel_scan != NULL)
{
- status = _bt_parallel_seize(scan, &blkno, true);
+ bool status;
+ BlockNumber lastcurrblkno;
+
+ status = _bt_parallel_seize(scan, &blkno, &lastcurrblkno, true);
/*
* Initialize arrays (when _bt_parallel_seize didn't already set up
@@ -942,7 +945,14 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
else if (blkno != InvalidBlockNumber)
{
- if (!_bt_parallel_readpage(scan, blkno, dir))
+ Assert(!so->needPrimScan);
+
+ /*
+ * We anticipated starting another primitive scan, but some other
+ * worker beat us to it. Bypass _bt_readfirstpage by calling
+ * _bt_readnextpage directly.
+ */
+ if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir))
return false;
goto readcomplete;
}
@@ -1427,10 +1437,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
}
- PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
-
- _bt_initialize_more_data(so, dir);
-
/* position to the precise item on the page */
offnum = _bt_binsrch(rel, &inskey, buf);
Assert(!BTScanPosIsValid(so->currPos));
@@ -1455,21 +1461,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* for the page. For example, when inskey is both < the leaf page's high
* key and > all of its non-pivot tuples, offnum will be "maxoff + 1".
*/
- if (!_bt_readpage(scan, dir, offnum, true))
- {
- /*
- * There's no actually-matching data on this page. Try to advance to
- * the next page. Return false if there's no matching data at all.
- */
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
- if (!_bt_steppage(scan, dir))
- return false;
- }
- else
- {
- /* We have at least one item to return as scan's first item */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
- }
+ if (!_bt_readfirstpage(scan, offnum, dir))
+ return false;
readcomplete:
/* OK, itemIndex says what to return */
@@ -1545,14 +1538,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
* that there can be no more matching tuples in the current scan direction
* (could just be for the current primitive index scan when scan has arrays).
*
- * _bt_first caller passes us an offnum returned by _bt_binsrch, which might
- * be an out of bounds offnum such as "maxoff + 1" in certain corner cases.
- * _bt_checkkeys will stop the scan as soon as an equality qual fails (when
- * its scan key was marked required), so _bt_first _must_ pass us an offnum
- * exactly at the beginning of where equal tuples are to be found. When we're
- * passed an offnum past the end of the page, we might still manage to stop
- * the scan on this page by calling _bt_checkkeys against the high key.
- *
* In the case of a parallel scan, caller must have called _bt_parallel_seize
* prior to calling this function; this function will invoke
* _bt_parallel_release before returning.
@@ -1563,6 +1548,7 @@ static bool
_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
bool firstPage)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Page page;
BTPageOpaque opaque;
@@ -1582,18 +1568,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
page = BufferGetPage(so->currPos.buf);
opaque = BTPageGetOpaque(page);
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- pstate.prev_scan_page = opaque->btpo_next;
- else
- pstate.prev_scan_page = BufferGetBlockNumber(so->currPos.buf);
-
- _bt_parallel_release(scan, pstate.prev_scan_page);
- }
-
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ indnatts = IndexRelationGetNumberOfAttributes(rel);
arrayKeys = so->numArrayKeys != 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
@@ -1615,8 +1590,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
/*
* We note the buffer's block number so that we can release the pin later.
* This allows us to re-read the buffer if it is needed again for hinting.
+ * Also save the page's sibling links; this tells us where to step
+ * right/left to after we're done reading items from this page.
*/
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+ so->currPos.prevPage = opaque->btpo_prev;
+ so->currPos.nextPage = opaque->btpo_next;
/*
* We save the LSN of the page as we read it, so that we know whether it
@@ -1625,13 +1604,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
*/
so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -1640,6 +1612,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
* good.
*/
Assert(BTScanPosIsPinned(so->currPos));
+ Assert(!P_IGNORE(opaque));
+
+ /* allow next page to be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, so->currPos.nextPage,
+ so->currPos.currPage);
+ else
+ _bt_parallel_release(scan, so->currPos.prevPage,
+ so->currPos.currPage);
+ }
+
+ PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);
/*
* Prechecking the value of the continuescan flag for the last item on the
@@ -1804,7 +1790,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
int truncatt;
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ truncatt = BTreeTupleGetNAtts(itup, rel);
pstate.prechecked = false; /* precheck didn't cover HIKEY */
_bt_checkkeys(scan, &pstate, arrayKeys, itup, truncatt);
}
@@ -1934,7 +1920,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
* We don't need to visit page to the left when no more matches will
* be found there
*/
- if (!pstate.continuescan || P_LEFTMOST(opaque))
+ if (!pstate.continuescan)
so->currPos.moreLeft = false;
Assert(itemIndex >= 0);
@@ -2035,20 +2021,21 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
- * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
- * if pinned, we'll drop the pin before moving to next page. The buffer is
- * not locked on entry.
+ * On entry, if so->currPos.buf is valid the buffer is pinned but not locked.
+ * This is a wrapper on _bt_readnextpage that performs final steps for the
+ * current page (including dropping its pin). It sets up the _bt_readnextpage
+ * call using either local state saved by the last _bt_readpage call, or using
+ * shared parallel scan state (by seizing the parallel scan).
*
- * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
- * read lock, on that page. If we do not hold the pin, we set so->currPos.buf
- * to InvalidBuffer. We return true to indicate success.
+ * Parallel scan callers that have already seized the scan should directly
+ * call _bt_readnextpage, rather than calling here.
*/
static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- BlockNumber blkno = InvalidBlockNumber;
- bool status;
+ BlockNumber blkno,
+ lastcurrblkno;
Assert(BTScanPosIsValid(so->currPos));
@@ -2090,7 +2077,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* In effect, btrestpos leaves advancing the arrays up to the first
* _bt_readpage call (that takes place after it has restored markPos).
*/
- Assert(so->markPos.dir == dir);
+ Assert(so->markDir == dir);
if (so->needPrimScan)
{
if (ScanDirectionIsForward(dir))
@@ -2098,93 +2085,193 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
else
so->markPos.moreLeft = true;
}
+
+ /* mark/restore not supported by parallel scans */
+ Assert(!scan->parallel_scan);
}
- if (ScanDirectionIsForward(dir))
+ /* release the previously pinned leaf page buffer, if any */
+ BTScanPosUnpinIfPinned(so->currPos);
+
+ /* Walk to the next page with data */
+ if (!scan->parallel_scan)
{
- /* Walk right to the next page with data */
- if (scan->parallel_scan != NULL)
- {
- /*
- * Seize the scan to get the next block number; if the scan has
- * ended already, bail out.
- */
- status = _bt_parallel_seize(scan, &blkno, false);
- if (!status)
- {
- /* release the previous buffer, if pinned */
- BTScanPosUnpinIfPinned(so->currPos);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
- }
- else
- {
- /* Not parallel, so use the previously-saved nextPage link. */
+ /* Not parallel, so use nextPage/prevPage from local scan state */
+ if (ScanDirectionIsForward(dir))
blkno = so->currPos.nextPage;
- }
-
- /* Remember we left a page with data */
- so->currPos.moreLeft = true;
-
- /* release the previous buffer, if pinned */
- BTScanPosUnpinIfPinned(so->currPos);
+ else
+ blkno = so->currPos.prevPage;
+ lastcurrblkno = so->currPos.currPage;
}
else
{
- /* Remember we left a page with data */
- so->currPos.moreRight = true;
-
- if (scan->parallel_scan != NULL)
+ /*
+ * Seize the scan to get the nextPage and currPage from shared
+ * parallel state
+ */
+ if (!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
{
- /*
- * Seize the scan to get the current block number; if the scan has
- * ended already, bail out.
- */
- status = _bt_parallel_seize(scan, &blkno, false);
- BTScanPosUnpinIfPinned(so->currPos);
- if (!status)
- {
- BTScanPosInvalidate(so->currPos);
- return false;
- }
- }
- else
- {
- /* Not parallel, so just use our own notion of the current page */
- blkno = so->currPos.currPage;
+ /* don't call _bt_parallel_done */
+ return false;
}
}
- if (!_bt_readnextpage(scan, blkno, dir))
- return false;
+ /*
+ * _bt_readnextpage just works off of blkno/lastcurrblkno now
+ *
+ * (It does still expect so->currPos.moreLeft/so->currPos.moreRight to
+ * remain however the last _bt_readpage call for lastcurrblkno set them.)
+ */
+ BTScanPosInvalidate(so->currPos);
- /* We have at least one item to return as scan's next item */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir))
+ return false;
return true;
}
/*
- * _bt_readnextpage() -- Read next page containing valid data for scan
+ * _bt_readfirstpage() -- Read first page containing valid data for _bt_first
+ *
+ * _bt_first caller passes us an offnum returned by _bt_binsrch, which might
+ * be an out of bounds offnum such as "maxoff + 1" in certain corner cases.
+ * _bt_checkkeys will stop the scan as soon as an equality qual fails (when
+ * its scan key was marked required), so _bt_first _must_ pass us an offnum
+ * exactly at the beginning of where equal tuples are to be found. When we're
+ * passed an offnum past the end of the page, we might still manage to stop
+ * the scan on this page by calling _bt_checkkeys against the high key. See
+ * _bt_readpage for full details.
+ *
+ * On entry, so->currPos.buf must be pinned and locked. Also, parallel scan
+ * callers must have seized the scan before calling here.
*
* On success exit, so->currPos is updated to contain data from the next
- * interesting page, and we return true. Caller must release the lock (and
- * maybe the pin) on the buffer on success exit.
+ * interesting page, and we return true. We hold a pin on the buffer on
+ * success exit, except for scans that _bt_drop_lock_and_maybe_pin indicates
+ * can safely have their pin dropped to avoid blocking progress by VACUUM.
*
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return false.
+ * If there are no more matching records in the given direction, we set
+ * so->currPos.buf to InvalidBuffer, and return false. Note that this could
+ * just signal the end of the current primitive index scan.
+ *
+ * We always release the scan for a parallel scan caller, regardless of
+ * success or failure; we do this by calling _bt_parallel_release at the
+ * earliest opportunity.
*/
static bool
-_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+_bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Relation rel;
+
+ Assert(BufferIsValid(so->currPos.buf));
+
+ /* Initialize currPos for so->currPos.buf */
+ if (so->needPrimScan)
+ {
+ Assert(so->numArrayKeys);
+
+ so->currPos.moreLeft = true;
+ so->currPos.moreRight = true;
+ so->needPrimScan = false;
+ }
+ else if (ScanDirectionIsForward(dir))
+ {
+ so->currPos.moreLeft = false;
+ so->currPos.moreRight = true;
+ }
+ else
+ {
+ so->currPos.moreLeft = true;
+ so->currPos.moreRight = false;
+ }
+ so->markDir = dir;
+ so->numKilled = 0; /* just paranoia */
+ so->markItemIndex = -1; /* ditto */
+
+ if (_bt_readpage(scan, dir, offnum, true))
+ {
+ /*
+ * _bt_readpage succeeded. Drop the lock (and maybe the pin) on
+ * so->currPos.buf in preparation for btgettuple returning tuples.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ return true;
+ }
+
+ /* There's no actually-matching data on the page in so->currPos.buf */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+
+ /* Call _bt_readnextpage through its _bt_steppage wrapper function */
+ if (!_bt_steppage(scan, dir))
+ return false;
+
+ /*
+ * _bt_readpage for a later page succeeded. There's no need to call
+ * _bt_drop_lock_and_maybe_pin here, though; _bt_readnextpage already did
+ * that for us.
+ */
+ return true;
+}
+
+/*
+ * _bt_readnextpage() -- Read next page containing valid data for _bt_next
+ *
+ * Caller's blkno is the next interesting page's link, taken from either the
+ * previously-saved right link or left link. lastcurrblkno is the page that
+ * was current at the point where the blkno link was saved, which we use to
+ * reason about concurrent page splits/page deletions during backwards scans
+ * (_bt_parallel_seize also requires it, regardless of scan direction).
+ *
+ * We work off of blkno and lastcurrblkno. On entry, so->currPos must be
+ * invalid; caller shouldn't hold any locks or pins on any page. Parallel
+ * scan callers must have seized the scan before calling here (and must pass
+ * us the blkno + lastcurrblkno they obtained as part of seizing the scan).
+ * However, we still expect so->currPos.moreLeft/so->currPos.moreRight to
+ * remain however the last _bt_readpage call (for lastcurrblkno) set them.
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page, and we return true. We hold a pin on the buffer on
+ * success exit, except for scans that _bt_drop_lock_and_maybe_pin indicates
+ * can safely have their pin dropped to avoid blocking progress by VACUUM.
+ *
+ * If there are no more matching records in the given direction, we leave
+ * so->currPos.buf set to InvalidBuffer, and return false. No locks or pins
+ * are retained. Note that this could just signal the end of the current
+ * primitive index scan (in which case another call to _bt_first will be
+ * required to continue the scan in the current scan direction).
+ *
+ * On exit, we'll always have released the seized scan for parallel scan
+ * callers (regardless of whether we were successful or not).
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
+ BlockNumber lastcurrblkno, ScanDirection dir)
+{
+ Relation rel = scan->indexRelation;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
Page page;
BTPageOpaque opaque;
- bool status;
- rel = scan->indexRelation;
+ Assert(!BTScanPosIsValid(so->currPos));
+
+ /*
+ * Initialize currPos for blkno.
+ *
+ * We know that the scan just read a page (lastcurrblkno), so remember
+ * that we left a place with potentially matching tuples to blkno's left
+ * (or to blkno's right, when we're scanning backwards).
+ *
+ * Don't reset moreRight (or moreLeft if this is a backwards scan) here.
+ * If there is no more data to the right of lastcurrblkno, then there
+ * certainly cannot be any more to the right of blkno, its right sibling
+ * (when we're scanning backwards then moreLeft similarly applies to both
+ * lastcurrblkno and blkno).
+ */
+ if (ScanDirectionIsForward(dir))
+ so->currPos.moreLeft = true;
+ else
+ so->currPos.moreRight = true;
if (ScanDirectionIsForward(dir))
{
@@ -2197,7 +2284,6 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
if (blkno == P_NONE || !so->currPos.moreRight)
{
_bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
return false;
}
/* check for interrupts while we're not holding any buffer lock */
@@ -2209,7 +2295,6 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
/* check for deleted page */
if (!P_IGNORE(opaque))
{
- PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque), false))
@@ -2217,86 +2302,54 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
else if (scan->parallel_scan != NULL)
{
- /* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, opaque->btpo_next);
+ /* _bt_readpage not called, so do this for ourselves */
+ _bt_parallel_release(scan, opaque->btpo_next, blkno);
}
- /* nope, keep going */
if (scan->parallel_scan != NULL)
{
+ /* no matching tuples, so seize a page to the right */
_bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
- if (!status)
+
+ if (!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
{
- BTScanPosInvalidate(so->currPos);
+ /* don't call _bt_parallel_done */
return false;
}
}
else
{
+ /* no matching tuples, so consider following right link */
+ lastcurrblkno = blkno;
blkno = opaque->btpo_next;
_bt_relbuf(rel, so->currPos.buf);
}
+
+ BTScanPosInvalidate(so->currPos); /* working off new blkno now */
}
}
else
{
- /*
- * Should only happen in parallel cases, when some other backend
- * advanced the scan.
- */
- if (so->currPos.currPage != blkno)
- {
- BTScanPosUnpinIfPinned(so->currPos);
- so->currPos.currPage = blkno;
- }
-
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
- {
- BTScanPosUnpinIfPinned(so->currPos);
- _bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
-
- /*
- * Walk left to the next page with data. This is much more complex
- * than the walk-right case because of the possibility that the page
- * to our left splits while we are in flight to it, plus the
- * possibility that the page we were on gets deleted after we leave
- * it. See nbtree/README for details.
- *
- * It might be possible to rearrange this code to have less overhead
- * in pinning and locking, but that would require capturing the left
- * sibling block number when the page is initially read, and then
- * optimistically starting there (rather than pinning the page twice).
- * It is not clear that this would be worth the complexity.
- */
- if (BTScanPosIsPinned(so->currPos))
- _bt_lockbuf(rel, so->currPos.buf, BT_READ);
- else
- so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
-
for (;;)
{
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
+ /*
+ * if we're at end of scan, give up and mark parallel scan as
+ * done, so that all the workers can finish their scan
+ */
+ if (blkno == P_NONE || !so->currPos.moreLeft)
{
- _bt_relbuf(rel, so->currPos.buf);
_bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
return false;
}
- /* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
+ /* move left at least one page (checks for interrupts, too) */
+ so->currPos.buf = _bt_lock_and_validate_left(rel, &blkno,
+ lastcurrblkno);
- /* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
{
+ /* must have been a concurrent deletion of leftmost page */
_bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
return false;
}
@@ -2309,7 +2362,6 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
opaque = BTPageGetOpaque(page);
if (!P_IGNORE(opaque))
{
- PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), false))
@@ -2317,100 +2369,80 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
else if (scan->parallel_scan != NULL)
{
- /* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+ /* _bt_readpage not called, so do this for ourselves */
+ Assert(P_ISHALFDEAD(opaque));
+ _bt_parallel_release(scan, opaque->btpo_prev, blkno);
}
- /*
- * For parallel scans, get the last page scanned as it is quite
- * possible that by the time we try to seize the scan, some other
- * worker has already advanced the scan to a different page. We
- * must continue based on the latest page scanned by any worker.
- */
if (scan->parallel_scan != NULL)
{
+ /* no matching tuples, so seize a page to the left */
_bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
- if (!status)
+
+ if (!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
{
- BTScanPosInvalidate(so->currPos);
+ /* don't call _bt_parallel_done */
return false;
}
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
}
+ else
+ {
+ /* no matching tuples, so consider following left link */
+ lastcurrblkno = blkno;
+ blkno = opaque->btpo_prev;
+ _bt_relbuf(rel, so->currPos.buf);
+ }
+
+ BTScanPosInvalidate(so->currPos); /* working off new blkno now */
}
}
- return true;
-}
-
-/*
- * _bt_parallel_readpage() -- Read current page containing valid data for scan
- *
- * On success, release lock and maybe pin on buffer. We return true to
- * indicate success.
- */
-static bool
-_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
- Assert(!so->needPrimScan);
-
- _bt_initialize_more_data(so, dir);
-
- if (!_bt_readnextpage(scan, blkno, dir))
- return false;
-
- /* We have at least one item to return as scan's next item */
+ /*
+ * _bt_readpage succeeded. Drop the lock (and maybe the pin) on
+ * so->currPos.buf in preparation for btgettuple returning tuples.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
return true;
}
/*
- * _bt_walk_left() -- step left one page, if possible
+ * _bt_lock_and_validate_left() -- lock caller's left sibling blkno,
+ * recovering from concurrent page splits/page deletions when necessary
*
- * The given buffer must be pinned and read-locked. This will be dropped
- * before stepping left. On return, we have pin and read lock on the
- * returned page, instead.
+ * Called during backwards scans, to deal with their unique concurrency rules.
*
- * Returns InvalidBuffer if there is no page to the left (no lock is held
- * in that case).
+ * blkno points to the block number of the page that we expect to move the
+ * scan to. We'll successfully move the scan there when we find that its
+ * right sibling link points to lastcurrblkno, the page that we just finished
+ * reading. Otherwise, we have to figure out which page is the correct one to
+ * visit next the hard way, which involves reasoning about both concurrent
+ * page splits and concurrent page deletions. See nbtree/README for details.
+ *
+ * On return, we have both a pin and a read lock on the returned page, whose
+ * block number will be set in *blkno. Returns InvalidBuffer if there is no
+ * page to the left (no lock is held in that case).
*
* It is possible for the returned leaf page to be half-dead; caller must
* check that condition and step left again when required.
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
+ BlockNumber lastcurrblkno)
{
- Page page;
- BTPageOpaque opaque;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
+ BlockNumber origblkno = *blkno; /* detects circular links */
for (;;)
{
- BlockNumber obknum;
- BlockNumber lblkno;
- BlockNumber blkno;
+ Buffer buf;
+ Page page;
+ BTPageOpaque opaque;
int tries;
- /* if we're at end of tree, release buf and return failure */
- if (P_LEFTMOST(opaque))
- {
- _bt_relbuf(rel, buf);
- break;
- }
- /* remember original page we are stepping left from */
- obknum = BufferGetBlockNumber(buf);
- /* step left */
- blkno = lblkno = opaque->btpo_prev;
- _bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, *blkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -2427,21 +2459,26 @@ _bt_walk_left(Relation rel, Buffer buf)
tries = 0;
for (;;)
{
- if (!P_ISDELETED(opaque) && opaque->btpo_next == obknum)
+ if (!P_ISDELETED(opaque) && opaque->btpo_next == lastcurrblkno)
{
/* Found desired page, return it */
return buf;
}
if (P_RIGHTMOST(opaque) || ++tries > 4)
break;
- blkno = opaque->btpo_next;
- buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
+ /* step right */
+ *blkno = opaque->btpo_next;
+ buf = _bt_relandgetbuf(rel, buf, *blkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
}
- /* Return to the original page to see what's up */
- buf = _bt_relandgetbuf(rel, buf, obknum, BT_READ);
+ /*
+ * Return to the original page (usually the page most recently read by
+ * _bt_readpage, which is passed by caller as lastcurrblkno) to see
+ * what's up with sibling links
+ */
+ buf = _bt_relandgetbuf(rel, buf, lastcurrblkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
if (P_ISDELETED(opaque))
@@ -2457,8 +2494,8 @@ _bt_walk_left(Relation rel, Buffer buf)
if (P_RIGHTMOST(opaque))
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
- blkno = opaque->btpo_next;
- buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
+ *blkno = opaque->btpo_next;
+ buf = _bt_relandgetbuf(rel, buf, *blkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
if (!P_ISDELETED(opaque))
@@ -2466,9 +2503,10 @@ _bt_walk_left(Relation rel, Buffer buf)
}
/*
- * Now return to top of loop, resetting obknum to point to this
- * nondeleted page, and try again.
+ * remember that this page is now the leftmost page from whose
+ * key space we've already read all tuples
*/
+ lastcurrblkno = BufferGetBlockNumber(buf);
}
else
{
@@ -2477,11 +2515,22 @@ _bt_walk_left(Relation rel, Buffer buf)
* to the left got split or deleted. Without this check, we'd go
* into an infinite loop if there's anything wrong.
*/
- if (opaque->btpo_prev == lblkno)
+ if (opaque->btpo_prev == origblkno)
elog(ERROR, "could not find left sibling of block %u in index \"%s\"",
- obknum, RelationGetRelationName(rel));
- /* Okay to try again with new lblkno value */
+ lastcurrblkno, RelationGetRelationName(rel));
+ /* Okay to try again, since left sibling link changed */
}
+
+ if (P_LEFTMOST(opaque))
+ {
+ /* previous leftmost page must have been concurrently deleted */
+ _bt_relbuf(rel, buf);
+ break;
+ }
+
+ /* next loop iteration will start over with a clean slate */
+ *blkno = origblkno = opaque->btpo_prev;
+ _bt_relbuf(rel, buf);
}
return InvalidBuffer;
@@ -2605,7 +2654,6 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
return false;
}
- PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
Assert(P_ISLEAF(opaque));
@@ -2632,26 +2680,11 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/* remember which buffer we have pinned */
so->currPos.buf = buf;
- _bt_initialize_more_data(so, dir);
-
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start, true))
- {
- /*
- * There's no actually-matching data on this page. Try to advance to
- * the next page. Return false if there's no matching data at all.
- */
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
- if (!_bt_steppage(scan, dir))
- return false;
- }
- else
- {
- /* We have at least one item to return as scan's first item */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
- }
+ if (!_bt_readfirstpage(scan, start, dir))
+ return false;
/* OK, itemIndex says what to return */
currItem = &so->currPos.items[so->currPos.itemIndex];
@@ -2661,33 +2694,3 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
return true;
}
-
-/*
- * _bt_initialize_more_data() -- initialize moreLeft, moreRight and scan dir
- * from currPos
- */
-static inline void
-_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
-{
- so->currPos.dir = dir;
- if (so->needPrimScan)
- {
- Assert(so->numArrayKeys);
-
- so->currPos.moreLeft = true;
- so->currPos.moreRight = true;
- so->needPrimScan = false;
- }
- else if (ScanDirectionIsForward(dir))
- {
- so->currPos.moreLeft = false;
- so->currPos.moreRight = true;
- }
- else
- {
- so->currPos.moreLeft = true;
- so->currPos.moreRight = false;
- }
- so->numKilled = 0; /* just paranoia */
- so->markItemIndex = -1; /* ditto */
-}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index b4ba51357..03d5c671f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2442,7 +2442,7 @@ new_prim_scan:
so->needPrimScan = true; /* ...but call _bt_first again */
if (scan->parallel_scan)
- _bt_parallel_primscan_schedule(scan, pstate->prev_scan_page);
+ _bt_parallel_primscan_schedule(scan, so->currPos.currPage);
/* Caller's tuple doesn't match the new qual */
return false;
--
2.45.2
On Fri, Oct 11, 2024 at 7:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v5
Now I'm attaching a v6, which further polishes things. Current plan is to
commit something very close to this in the next day or two.
v6 is mostly just further comment polishing. But it also merges together
the two existing _bt_readnextpage loops (for forward and backward scan
directions) into one single loop that handles everything. This is
definitely a win for brevity, and might also be a win for clarity --
but I'm not 100% sure which way I prefer it just yet.
I'll need to make a final decision on this loop merging business
before committing. Anybody else have an opinion, either way?
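To make it easy for reviewers to reproduce the buffer-hit comparison from my earlier message, here's a sketch (this assumes a stock pgbench database; the exact hit counts will of course vary with scale factor and fillfactor):

```sql
-- Compare buffer hits for the forward and backward variants of the
-- query from upthread. With the patch applied, the "Buffers: shared
-- hit=N" line should report the same N for both scan directions.
EXPLAIN (ANALYZE, BUFFERS)
SELECT abalance FROM pgbench_accounts
WHERE aid IN (1, 500, 1000, 1500, 2000, 3000) ORDER BY aid ASC;

EXPLAIN (ANALYZE, BUFFERS)
SELECT abalance FROM pgbench_accounts
WHERE aid IN (1, 500, 1000, 1500, 2000, 3000) ORDER BY aid DESC;
```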
--
Peter Geoghegan
Attachments:
v6-0001-Optimize-and-simplify-nbtree-backwards-scans.patch (application/x-patch)
From 17959762e34cf8ee75af37f98dc5b085d05c7860 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 8 Oct 2024 11:40:54 -0400
Subject: [PATCH v6] Optimize and simplify nbtree backwards scans.
This enhancement completely supersedes the one added by commit 3f44959f.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WgpBGRgTTxTWVPXc9+PB6fc1a7t+VyGXHzfnrFXcQVxnA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkBTuFv7W2+84jJT8mWZLXVL0GHq2hMUTn6c9Vw=eYrCw@mail.gmail.com
---
src/include/access/nbtree.h | 23 +-
src/backend/access/nbtree/nbtree.c | 77 ++-
src/backend/access/nbtree/nbtsearch.c | 744 ++++++++++++--------------
src/backend/access/nbtree/nbtutils.c | 2 +-
4 files changed, 401 insertions(+), 445 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 90a68375a..22bf38934 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -954,6 +954,7 @@ typedef struct BTScanPosData
XLogRecPtr lsn; /* pos in the WAL stream when page was read */
BlockNumber currPage; /* page referenced by items array */
+ BlockNumber prevPage; /* page's left link when we scanned it */
BlockNumber nextPage; /* page's right link when we scanned it */
/*
@@ -965,15 +966,6 @@ typedef struct BTScanPosData
bool moreLeft;
bool moreRight;
- /*
- * Direction of the scan at the time that _bt_readpage was called.
- *
- * Used by btrestrpos to "restore" the scan's array keys by resetting each
- * array to its first element's value (first in this scan direction). This
- * avoids the need to directly track the array keys in btmarkpos.
- */
- ScanDirection dir;
-
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
* location in the associated tuple storage workspace.
@@ -1022,6 +1014,7 @@ typedef BTScanPosData *BTScanPos;
#define BTScanPosInvalidate(scanpos) \
do { \
(scanpos).currPage = InvalidBlockNumber; \
+ (scanpos).prevPage = InvalidBlockNumber; \
(scanpos).nextPage = InvalidBlockNumber; \
(scanpos).buf = InvalidBuffer; \
(scanpos).lsn = InvalidXLogRecPtr; \
@@ -1073,6 +1066,7 @@ typedef struct BTScanOpaqueData
* markPos.
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ ScanDirection markDir; /* direction of scan with mark */
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
@@ -1091,7 +1085,6 @@ typedef struct BTReadPageState
OffsetNumber minoff; /* Lowest non-pivot tuple's offset */
OffsetNumber maxoff; /* Highest non-pivot tuple's offset */
IndexTuple finaltup; /* Needed by scans with array keys */
- BlockNumber prev_scan_page; /* previous _bt_parallel_release block */
Page page; /* Page being read */
/* Per-tuple input parameters, set by _bt_readpage for _bt_checkkeys */
@@ -1192,12 +1185,14 @@ extern int btgettreeheight(Relation rel);
/*
* prototypes for internal functions in nbtree.c
*/
-extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno,
- bool first);
-extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
+ BlockNumber *curr_page, bool first);
+extern void _bt_parallel_release(IndexScanDesc scan,
+ BlockNumber next_scan_page,
+ BlockNumber curr_page);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_primscan_schedule(IndexScanDesc scan,
- BlockNumber prev_scan_page);
+ BlockNumber curr_page);
/*
* prototypes for functions in nbtdedup.c
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9b968aa0f..ff4fa2cf9 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -66,7 +66,9 @@ typedef enum
*/
typedef struct BTParallelScanDescData
{
- BlockNumber btps_scanPage; /* latest or next page to be scanned */
+ BlockNumber btps_nextScanPage; /* next page to be scanned */
+ BlockNumber btps_lastCurrPage; /* page whose sibling link was copied into
+ * btps_nextScanPage */
BTPS_State btps_pageStatus; /* indicates whether next page is
* available for scan. see above for
* possible states of parallel scan. */
@@ -522,7 +524,7 @@ btrestrpos(IndexScanDesc scan)
/* Reset the scan's array keys (see _bt_steppage for why) */
if (so->numArrayKeys)
{
- _bt_start_array_keys(scan, so->currPos.dir);
+ _bt_start_array_keys(scan, so->markDir);
so->needPrimScan = false;
}
}
@@ -550,7 +552,8 @@ btinitparallelscan(void *target)
BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
SpinLockInit(&bt_target->btps_mutex);
- bt_target->btps_scanPage = InvalidBlockNumber;
+ bt_target->btps_nextScanPage = InvalidBlockNumber;
+ bt_target->btps_lastCurrPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
ConditionVariableInit(&bt_target->btps_cv);
}
@@ -575,7 +578,8 @@ btparallelrescan(IndexScanDesc scan)
* consistency.
*/
SpinLockAcquire(&btscan->btps_mutex);
- btscan->btps_scanPage = InvalidBlockNumber;
+ btscan->btps_nextScanPage = InvalidBlockNumber;
+ btscan->btps_lastCurrPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
SpinLockRelease(&btscan->btps_mutex);
}
@@ -591,18 +595,21 @@ btparallelrescan(IndexScanDesc scan)
* start just yet (only backends that call from _bt_first are capable of
* starting primitive index scans, which they indicate by passing first=true).
*
- * If the return value is true, *pageno returns the next or current page
- * of the scan (depending on the scan direction). An invalid block number
- * means the scan hasn't yet started, or that caller needs to start the next
- * primitive index scan (if it's the latter case we'll set so.needPrimScan).
- * The first time a participating process reaches the last page, it will return
- * true and set *pageno to P_NONE; after that, further attempts to seize the
- * scan will return false.
+ * If the return value is true, *next_scan_page returns the next page of the
+ * scan, and *curr_page returns the page whose sibling link *next_scan_page
+ * came from. An invalid *next_scan_page means the scan hasn't yet started,
+ * or that caller needs to start the next primitive index scan (if it's the
+ * latter case we'll set so.needPrimScan). The first time a participating
+ * process reaches the last page, it will return true and set *next_scan_page
+ * to P_NONE; after that, further attempts to seize the scan will return
+ * false.
*
- * Callers should ignore the value of pageno if the return value is false.
+ * Callers should ignore the value of next_scan_page and curr_page if the
+ * return value is false.
*/
bool
-_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
+_bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
+ BlockNumber *curr_page, bool first)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool exit_loop = false;
@@ -610,7 +617,17 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
- *pageno = P_NONE;
+ *next_scan_page = P_NONE;
+ *curr_page = InvalidBlockNumber;
+
+ /*
+ * Reset so->currPos, and initialize moreLeft/moreRight such that the next
+ * call to _bt_readnextpage treats this backend similarly to a serial
+ * backend that steps from *curr_page to *next_scan_page (unless this
+ * backend's so->currPos is initialized by _bt_readfirstpage before then).
+ */
+ BTScanPosInvalidate(so->currPos);
+ so->currPos.moreLeft = so->currPos.moreRight = true;
if (first)
{
@@ -660,7 +677,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
array->cur_elem = btscan->btps_arrElems[i];
skey->sk_argument = array->elem_values[array->cur_elem];
}
- *pageno = InvalidBlockNumber;
+ *next_scan_page = InvalidBlockNumber;
exit_loop = true;
}
else
@@ -688,7 +705,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
* of advancing it to a new page!
*/
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
- *pageno = btscan->btps_scanPage;
+ *next_scan_page = btscan->btps_nextScanPage;
+ *curr_page = btscan->btps_lastCurrPage;
exit_loop = true;
}
SpinLockRelease(&btscan->btps_mutex);
@@ -703,17 +721,20 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
/*
* _bt_parallel_release() -- Complete the process of advancing the scan to a
- * new page. We now have the new value btps_scanPage; some other backend
+ * new page. We now have the new value btps_nextScanPage; another backend
* can now begin advancing the scan.
*
- * Callers whose scan uses array keys must save their scan_page argument so
+ * Callers whose scan uses array keys must save their curr_page argument so
* that it can be passed to _bt_parallel_primscan_schedule, should caller
- * determine that another primitive index scan is required. If that happens,
- * scan_page won't be scanned by any backend (unless the next primitive index
- * scan lands on scan_page).
+ * determine that another primitive index scan is required.
+ *
+ * Note: unlike the serial case, parallel scans don't need to remember both
+ * sibling links. next_scan_page is whichever link is next given the scan's
+ * direction, since its direction cannot change when parallelism is in use.
*/
void
-_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+_bt_parallel_release(IndexScanDesc scan, BlockNumber next_scan_page,
+ BlockNumber curr_page)
{
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
@@ -722,7 +743,8 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
parallel_scan->ps_offset);
SpinLockAcquire(&btscan->btps_mutex);
- btscan->btps_scanPage = scan_page;
+ btscan->btps_nextScanPage = next_scan_page;
+ btscan->btps_lastCurrPage = curr_page;
btscan->btps_pageStatus = BTPARALLEL_IDLE;
SpinLockRelease(&btscan->btps_mutex);
ConditionVariableSignal(&btscan->btps_cv);
@@ -778,13 +800,13 @@ _bt_parallel_done(IndexScanDesc scan)
/*
* _bt_parallel_primscan_schedule() -- Schedule another primitive index scan.
*
- * Caller passes the block number most recently passed to _bt_parallel_release
+ * Caller passes the curr_page most recently passed to _bt_parallel_release
* by its backend. Caller successfully schedules the next primitive index scan
* if the shared parallel state hasn't been seized since caller's backend last
* advanced the scan.
*/
void
-_bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
+_bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber curr_page)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
@@ -796,10 +818,11 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
parallel_scan->ps_offset);
SpinLockAcquire(&btscan->btps_mutex);
- if (btscan->btps_scanPage == prev_scan_page &&
+ if (btscan->btps_lastCurrPage == curr_page &&
btscan->btps_pageStatus == BTPARALLEL_IDLE)
{
- btscan->btps_scanPage = InvalidBlockNumber;
+ btscan->btps_nextScanPage = InvalidBlockNumber;
+ btscan->btps_lastCurrPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NEED_PRIMSCAN;
/* Serialize scan's current array keys */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 91ac6533f..8912f17df 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -43,12 +43,13 @@ static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum,
ItemPointer heapTid, int tupleOffset);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
-static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
- ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
+static bool _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum,
+ ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
+ BlockNumber lastcurrblkno, ScanDirection dir);
+static Buffer _bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
+ BlockNumber lastcurrblkno);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
-static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
/*
@@ -880,7 +881,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Buffer buf;
BTStack stack;
OffsetNumber offnum;
StrategyNumber strat;
@@ -889,10 +889,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
int keysz = 0;
int i;
- bool status;
StrategyNumber strat_total;
BTScanPosItem *currItem;
- BlockNumber blkno;
Assert(!BTScanPosIsValid(so->currPos));
@@ -924,7 +922,11 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (scan->parallel_scan != NULL)
{
- status = _bt_parallel_seize(scan, &blkno, true);
+ BlockNumber blkno,
+ lastcurrblkno;
+ bool status;
+
+ status = _bt_parallel_seize(scan, &blkno, &lastcurrblkno, true);
/*
* Initialize arrays (when _bt_parallel_seize didn't already set up
@@ -942,7 +944,13 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
else if (blkno != InvalidBlockNumber)
{
- if (!_bt_parallel_readpage(scan, blkno, dir))
+ Assert(!so->needPrimScan);
+
+ /*
+ * We anticipated starting another primitive scan, but some other
+ * worker beat us to it
+ */
+ if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir))
return false;
goto readcomplete;
}
@@ -1392,12 +1400,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* position ourselves on the target leaf page.
*/
Assert(ScanDirectionIsBackward(dir) == inskey.backward);
- stack = _bt_search(rel, NULL, &inskey, &buf, BT_READ);
+ stack = _bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ);
/* don't need to keep the stack around... */
_bt_freestack(stack);
- if (!BufferIsValid(buf))
+ if (!BufferIsValid(so->currPos.buf))
{
/*
* We only get here if the index is completely empty. Lock relation
@@ -1411,11 +1419,11 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
if (IsolationIsSerializable())
{
PredicateLockRelation(rel, scan->xs_snapshot);
- stack = _bt_search(rel, NULL, &inskey, &buf, BT_READ);
+ stack = _bt_search(rel, NULL, &inskey, &so->currPos.buf, BT_READ);
_bt_freestack(stack);
}
- if (!BufferIsValid(buf))
+ if (!BufferIsValid(so->currPos.buf))
{
/*
* Mark parallel scan as done, so that all the workers can finish
@@ -1427,17 +1435,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
}
- PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
-
- _bt_initialize_more_data(so, dir);
-
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, &inskey, buf);
- Assert(!BTScanPosIsValid(so->currPos));
- so->currPos.buf = buf;
+ offnum = _bt_binsrch(rel, &inskey, so->currPos.buf);
/*
- * Now load data from the first page of the scan.
+ * Now load data from the first page of the scan (usually but not always
+ * the page currently in so->currPos.buf).
*
* If inskey.nextkey = false and inskey.backward = false, offnum is
* positioned at the first non-pivot tuple >= inskey.scankeys.
@@ -1455,21 +1458,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* for the page. For example, when inskey is both < the leaf page's high
* key and > all of its non-pivot tuples, offnum will be "maxoff + 1".
*/
- if (!_bt_readpage(scan, dir, offnum, true))
- {
- /*
- * There's no actually-matching data on this page. Try to advance to
- * the next page. Return false if there's no matching data at all.
- */
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
- if (!_bt_steppage(scan, dir))
- return false;
- }
- else
- {
- /* We have at least one item to return as scan's first item */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
- }
+ if (!_bt_readfirstpage(scan, offnum, dir))
+ return false;
readcomplete:
/* OK, itemIndex says what to return */
@@ -1545,14 +1535,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
* that there can be no more matching tuples in the current scan direction
* (could just be for the current primitive index scan when scan has arrays).
*
- * _bt_first caller passes us an offnum returned by _bt_binsrch, which might
- * be an out of bounds offnum such as "maxoff + 1" in certain corner cases.
- * _bt_checkkeys will stop the scan as soon as an equality qual fails (when
- * its scan key was marked required), so _bt_first _must_ pass us an offnum
- * exactly at the beginning of where equal tuples are to be found. When we're
- * passed an offnum past the end of the page, we might still manage to stop
- * the scan on this page by calling _bt_checkkeys against the high key.
- *
* In the case of a parallel scan, caller must have called _bt_parallel_seize
* prior to calling this function; this function will invoke
* _bt_parallel_release before returning.
@@ -1563,6 +1545,7 @@ static bool
_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
bool firstPage)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Page page;
BTPageOpaque opaque;
@@ -1581,19 +1564,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
page = BufferGetPage(so->currPos.buf);
opaque = BTPageGetOpaque(page);
+ Assert(!P_IGNORE(opaque));
- /* allow next page be processed by parallel worker */
- if (scan->parallel_scan)
- {
- if (ScanDirectionIsForward(dir))
- pstate.prev_scan_page = opaque->btpo_next;
- else
- pstate.prev_scan_page = BufferGetBlockNumber(so->currPos.buf);
-
- _bt_parallel_release(scan, pstate.prev_scan_page);
- }
-
- indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
+ indnatts = IndexRelationGetNumberOfAttributes(rel);
arrayKeys = so->numArrayKeys != 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
@@ -1615,8 +1588,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
/*
* We note the buffer's block number so that we can release the pin later.
* This allows us to re-read the buffer if it is needed again for hinting.
+ * Also save the page's sibling links; this tells us where to step
+ * right/left to after we're done reading items from this page.
*/
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+ so->currPos.prevPage = opaque->btpo_prev;
+ so->currPos.nextPage = opaque->btpo_next;
/*
* We save the LSN of the page as we read it, so that we know whether it
@@ -1625,13 +1602,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
*/
so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
-
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -1641,6 +1611,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
*/
Assert(BTScanPosIsPinned(so->currPos));
+ /* allow next page to be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, so->currPos.nextPage,
+ so->currPos.currPage);
+ else
+ _bt_parallel_release(scan, so->currPos.prevPage,
+ so->currPos.currPage);
+ }
+
+ PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);
+
/*
* Prechecking the value of the continuescan flag for the last item on the
* page (for backwards scan it will be the first item on a page). If we
@@ -1829,7 +1812,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
int truncatt;
- truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+ truncatt = BTreeTupleGetNAtts(itup, rel);
pstate.prechecked = false; /* precheck didn't cover HIKEY */
_bt_checkkeys(scan, &pstate, arrayKeys, itup, truncatt);
}
@@ -1959,7 +1942,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
* We don't need to visit page to the left when no more matches will
* be found there
*/
- if (!pstate.continuescan || P_LEFTMOST(opaque))
+ if (!pstate.continuescan)
so->currPos.moreLeft = false;
Assert(itemIndex >= 0);
@@ -2060,20 +2043,25 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
- * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
- * if pinned, we'll drop the pin before moving to next page. The buffer is
- * not locked on entry.
+ * On entry, if so->currPos.buf is valid the buffer is pinned but not locked.
+ * The scan must have a valid so->currPos position in any case. If there's no
+ * pin held, then that must be because _bt_drop_lock_and_maybe_pin dropped the
+ * pin early as an optimization.
*
- * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
- * read lock, on that page. If we do not hold the pin, we set so->currPos.buf
- * to InvalidBuffer. We return true to indicate success.
+ * This is a wrapper on _bt_readnextpage that performs final steps for the
+ * current page. It sets up the _bt_readnextpage call using either local
+ * state saved in so->currPos by the most recent _bt_readpage call, or using
+ * shared parallel scan state (obtained by seizing the parallel scan here).
+ *
+ * Parallel scan callers that have already seized the scan should directly
+ * call _bt_readnextpage, rather than calling here.
*/
static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- BlockNumber blkno = InvalidBlockNumber;
- bool status;
+ BlockNumber blkno,
+ lastcurrblkno;
Assert(BTScanPosIsValid(so->currPos));
@@ -2098,6 +2086,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
so->currPos.nextTupleOffset);
so->markPos.itemIndex = so->markItemIndex;
so->markItemIndex = -1;
+ so->markDir = dir;
/*
* If we're just about to start the next primitive index scan
@@ -2115,7 +2104,6 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* In effect, btrestpos leaves advancing the arrays up to the first
* _bt_readpage call (that takes place after it has restored markPos).
*/
- Assert(so->markPos.dir == dir);
if (so->needPrimScan)
{
if (ScanDirectionIsForward(dir))
@@ -2123,319 +2111,302 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
else
so->markPos.moreLeft = true;
}
+
+ /* mark/restore not supported by parallel scans */
+ Assert(!scan->parallel_scan);
}
- if (ScanDirectionIsForward(dir))
+ /* release the pin when _bt_drop_lock_and_maybe_pin couldn't earlier on */
+ BTScanPosUnpinIfPinned(so->currPos);
+
+ /* Walk to the next page with data */
+ if (!scan->parallel_scan)
{
- /* Walk right to the next page with data */
- if (scan->parallel_scan != NULL)
- {
- /*
- * Seize the scan to get the next block number; if the scan has
- * ended already, bail out.
- */
- status = _bt_parallel_seize(scan, &blkno, false);
- if (!status)
- {
- /* release the previous buffer, if pinned */
- BTScanPosUnpinIfPinned(so->currPos);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
- }
- else
- {
- /* Not parallel, so use the previously-saved nextPage link. */
+ /* Not parallel, so use local state set by the last _bt_readpage */
+ if (ScanDirectionIsForward(dir))
blkno = so->currPos.nextPage;
- }
-
- /* Remember we left a page with data */
- so->currPos.moreLeft = true;
-
- /* release the previous buffer, if pinned */
- BTScanPosUnpinIfPinned(so->currPos);
- }
- else
- {
- /* Remember we left a page with data */
- so->currPos.moreRight = true;
-
- if (scan->parallel_scan != NULL)
- {
- /*
- * Seize the scan to get the current block number; if the scan has
- * ended already, bail out.
- */
- status = _bt_parallel_seize(scan, &blkno, false);
- BTScanPosUnpinIfPinned(so->currPos);
- if (!status)
- {
- BTScanPosInvalidate(so->currPos);
- return false;
- }
- }
else
- {
- /* Not parallel, so just use our own notion of the current page */
- blkno = so->currPos.currPage;
- }
- }
-
- if (!_bt_readnextpage(scan, blkno, dir))
- return false;
-
- /* We have at least one item to return as scan's next item */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
-
- return true;
-}
-
-/*
- * _bt_readnextpage() -- Read next page containing valid data for scan
- *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page, and we return true. Caller must release the lock (and
- * maybe the pin) on the buffer on success exit.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return false.
- */
-static bool
-_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Relation rel;
- Page page;
- BTPageOpaque opaque;
- bool status;
-
- rel = scan->indexRelation;
-
- if (ScanDirectionIsForward(dir))
- {
- for (;;)
- {
- /*
- * if we're at end of scan, give up and mark parallel scan as
- * done, so that all the workers can finish their scan
- */
- if (blkno == P_NONE || !so->currPos.moreRight)
- {
- _bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
- /* check for interrupts while we're not holding any buffer lock */
- CHECK_FOR_INTERRUPTS();
- /* step right one page */
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
- /* check for deleted page */
- if (!P_IGNORE(opaque))
- {
- PredicateLockPage(rel, blkno, scan->xs_snapshot);
- /* see if there are any matches on this page */
- /* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque), false))
- break;
- }
- else if (scan->parallel_scan != NULL)
- {
- /* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, opaque->btpo_next);
- }
-
- /* nope, keep going */
- if (scan->parallel_scan != NULL)
- {
- _bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
- if (!status)
- {
- BTScanPosInvalidate(so->currPos);
- return false;
- }
- }
- else
- {
- blkno = opaque->btpo_next;
- _bt_relbuf(rel, so->currPos.buf);
- }
- }
+ blkno = so->currPos.prevPage;
+ lastcurrblkno = so->currPos.currPage;
}
else
{
/*
- * Should only happen in parallel cases, when some other backend
- * advanced the scan.
+ * Seize the scan to get the nextPage and currPage from shared
+ * parallel state (saved from parallel scan's last _bt_readpage)
*/
- if (so->currPos.currPage != blkno)
- {
- BTScanPosUnpinIfPinned(so->currPos);
- so->currPos.currPage = blkno;
- }
+ if (!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+ return false;
+ }
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
+ return _bt_readnextpage(scan, blkno, lastcurrblkno, dir);
+}
+
+/*
+ * _bt_readfirstpage() -- Read first page containing valid data for _bt_first
+ *
+ * _bt_first caller passes us an offnum returned by _bt_binsrch, which might
+ * be an out of bounds offnum such as "maxoff + 1" in certain corner cases.
+ * _bt_checkkeys will stop the scan as soon as an equality qual fails (when
+ * its scan key was marked required), so _bt_first _must_ pass us an offnum
+ * exactly at the beginning of where equal tuples are to be found. When we're
+ * passed an offnum past the end of the page, we might still manage to stop
+ * the scan on this page by calling _bt_checkkeys against the high key. See
+ * _bt_readpage for full details.
+ *
+ * On entry, so->currPos must be pinned and locked (so offnum stays valid).
+ * Parallel scan callers must have seized the scan before calling here.
+ *
+ * On exit, we'll have updated so->currPos and retained locks and pins
+ * according to the same rules as those laid out for _bt_readnextpage exit.
+ * Like _bt_readnextpage, our return value indicates if there are any matching
+ * records in the given direction.
+ *
+ * We always release the scan for a parallel scan caller, regardless of
+ * success or failure; we'll call _bt_parallel_release as soon as possible.
+ */
+static bool
+_bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(BufferIsValid(so->currPos.buf));
+
+ so->numKilled = 0; /* just paranoia */
+ so->markItemIndex = -1; /* ditto */
+
+ /* Initialize moreLeft/moreRight in so->currPos */
+ if (so->needPrimScan)
+ {
+ Assert(so->numArrayKeys);
+
+ so->currPos.moreLeft = true;
+ so->currPos.moreRight = true;
+ so->needPrimScan = false;
+ }
+ else if (ScanDirectionIsForward(dir))
+ {
+ so->currPos.moreLeft = false;
+ so->currPos.moreRight = true;
+ }
+ else
+ {
+ so->currPos.moreLeft = true;
+ so->currPos.moreRight = false;
+ }
+
+ /*
+ * Attempt to load matching tuples from the page in so->currPos.buf.
+ *
+ * Note that _bt_readpage will finish initializing the so->currPos fields.
+ * _bt_readpage also releases parallel scan (even when it returns false).
+ */
+ if (_bt_readpage(scan, dir, offnum, true))
+ {
+ /*
+ * _bt_readpage succeeded. Drop the lock (and maybe the pin) on
+ * so->currPos.buf in preparation for btgettuple returning tuples.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ return true;
+ }
+
+ /* There's no actually-matching data on the page in so->currPos.buf */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+
+ /* Call _bt_readnextpage using its _bt_steppage wrapper function */
+ if (!_bt_steppage(scan, dir))
+ return false;
+
+ /* _bt_readpage for a later page (now in so->currPos) succeeded */
+ return true;
+}
+
+/*
+ * _bt_readnextpage() -- Read next page containing valid data for _bt_next
+ *
+ * Caller's blkno is the next interesting page's link, taken from either the
+ * previously-saved right link or left link. lastcurrblkno is the page that
+ * was current at the point where the blkno link was saved, which we use to
+ * reason about concurrent page splits/page deletions during backwards scans
+ * (_bt_parallel_seize also requires it, regardless of scan direction).
+ *
+ * On entry, caller shouldn't hold any locks or pins on any page (we work
+ * directly off of blkno and lastcurrblkno instead). Parallel scan callers
+ * must have seized the scan before calling here (blkno and lastcurrblkno
+ * arguments should come from the seized scan).
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page, and we return true. We hold a pin on the buffer on
+ * success exit, except for scans that _bt_drop_lock_and_maybe_pin indicates
+ * can safely have their pin dropped to avoid blocking progress by VACUUM.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, and return false.
+ *
+ * We always release the scan for a parallel scan caller, regardless of
+ * success or failure; we'll call _bt_parallel_release as soon as possible.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
+ BlockNumber lastcurrblkno, ScanDirection dir)
+{
+ Relation rel = scan->indexRelation;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+
+ /*
+ * Invalidate currPos, to make it safe for us to return 'false'.
+ *
+ * Note: this won't unset moreLeft and moreRight from lastcurrblkno.
+ */
+ BTScanPosInvalidate(so->currPos);
+
+ /*
+ * Prepare for blkno to become so->currPos.currPage in _bt_readpage.
+ *
+ * We know that the scan just read a page (lastcurrblkno), so remember
+ * that we left a place with potentially matching tuples to blkno's left
+ * (or to blkno's right, when we're scanning backwards).
+ *
+ * Note: we deliberately avoid updating moreRight (or moreLeft, during
+ * backwards scans) at this point. When there's no more data to the right
+ * of lastcurrblkno (during a forward scan), then clearly there also can't
+ * be any more data to the right of blkno, its right sibling page.
+ * (During backwards scans there likewise can't be any more data to the
+ * left of blkno, lastcurrblkno's left sibling page.)
+ */
+ if (ScanDirectionIsForward(dir))
+ so->currPos.moreLeft = true;
+ else
+ so->currPos.moreRight = true;
+
+ for (;;)
+ {
+ /*
+ * if we're at end of scan, give up and mark parallel scan as done, so
+ * that all the workers can finish their scan
+ */
+ if (blkno == P_NONE ||
+ (ScanDirectionIsForward(dir) ?
+ !so->currPos.moreRight : !so->currPos.moreLeft))
{
- BTScanPosUnpinIfPinned(so->currPos);
_bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
return false;
}
- /*
- * Walk left to the next page with data. This is much more complex
- * than the walk-right case because of the possibility that the page
- * to our left splits while we are in flight to it, plus the
- * possibility that the page we were on gets deleted after we leave
- * it. See nbtree/README for details.
- *
- * It might be possible to rearrange this code to have less overhead
- * in pinning and locking, but that would require capturing the left
- * sibling block number when the page is initially read, and then
- * optimistically starting there (rather than pinning the page twice).
- * It is not clear that this would be worth the complexity.
- */
- if (BTScanPosIsPinned(so->currPos))
- _bt_lockbuf(rel, so->currPos.buf, BT_READ);
- else
- so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
-
- for (;;)
+ if (ScanDirectionIsForward(dir))
{
- /* Done if we know that the left sibling link isn't of interest */
- if (!so->currPos.moreLeft)
- {
- _bt_relbuf(rel, so->currPos.buf);
- _bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
- return false;
- }
-
- /* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, so->currPos.buf);
-
- /* if we're physically at end of index, return failure */
+ /* read blkno, but check for interrupts first */
+ CHECK_FOR_INTERRUPTS();
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ }
+ else
+ {
+ /* read blkno, avoiding race (also checks for interrupts) */
+ so->currPos.buf = _bt_lock_and_validate_left(rel, &blkno,
+ lastcurrblkno);
if (so->currPos.buf == InvalidBuffer)
{
+ /* must have been a concurrent deletion of leftmost page */
_bt_parallel_done(scan);
- BTScanPosInvalidate(so->currPos);
return false;
}
+ }
- /*
- * Okay, we managed to move left to a non-deleted page. Done if
- * it's not half-dead and contains matching tuples. Else loop back
- * and do it all again.
- */
- page = BufferGetPage(so->currPos.buf);
- opaque = BTPageGetOpaque(page);
- if (!P_IGNORE(opaque))
+ page = BufferGetPage(so->currPos.buf);
+ opaque = BTPageGetOpaque(page);
+ lastcurrblkno = blkno;
+ if (!P_IGNORE(opaque))
+ {
+ /* see if there are any matches on this page */
+ if (ScanDirectionIsForward(dir))
+ {
+ /* note that this will clear moreRight if we can stop */
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque), false))
+ break;
+ blkno = so->currPos.nextPage;
+ }
+ else
{
- PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
- /* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), false))
break;
- }
- else if (scan->parallel_scan != NULL)
- {
- /* allow next page be processed by parallel worker */
- _bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
- }
-
- /*
- * For parallel scans, get the last page scanned as it is quite
- * possible that by the time we try to seize the scan, some other
- * worker has already advanced the scan to a different page. We
- * must continue based on the latest page scanned by any worker.
- */
- if (scan->parallel_scan != NULL)
- {
- _bt_relbuf(rel, so->currPos.buf);
- status = _bt_parallel_seize(scan, &blkno, false);
- if (!status)
- {
- BTScanPosInvalidate(so->currPos);
- return false;
- }
- so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ blkno = so->currPos.prevPage;
}
}
+ else
+ {
+ /* _bt_readpage not called, so do all this for ourselves */
+ if (ScanDirectionIsForward(dir))
+ blkno = opaque->btpo_next;
+ else
+ blkno = opaque->btpo_prev;
+ if (scan->parallel_scan)
+ _bt_parallel_release(scan, blkno, lastcurrblkno);
+ }
+
+ /* no matching tuples on this page */
+ _bt_relbuf(rel, so->currPos.buf);
+
+ /* working off of new blkno now */
+ BTScanPosInvalidate(so->currPos);
+
+ /* parallel scan seizes another page (don't use so->currPos blkno) */
+ if (scan->parallel_scan &&
+ !_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+ return false;
}
- return true;
-}
-
-/*
- * _bt_parallel_readpage() -- Read current page containing valid data for scan
- *
- * On success, release lock and maybe pin on buffer. We return true to
- * indicate success.
- */
-static bool
-_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
-{
- BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
- Assert(!so->needPrimScan);
-
- _bt_initialize_more_data(so, dir);
-
- if (!_bt_readnextpage(scan, blkno, dir))
- return false;
-
- /* We have at least one item to return as scan's next item */
+ /*
+ * _bt_readpage succeeded. Drop the lock (and maybe the pin) on
+ * so->currPos.buf in preparation for btgettuple returning tuples.
+ */
+ Assert(so->currPos.currPage == blkno);
+ Assert(BTScanPosIsPinned(so->currPos));
_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
return true;
}
/*
- * _bt_walk_left() -- step left one page, if possible
+ * _bt_lock_and_validate_left() -- lock caller's left sibling blkno,
+ * recovering from concurrent page splits/page deletions when necessary
*
- * The given buffer must be pinned and read-locked. This will be dropped
- * before stepping left. On return, we have pin and read lock on the
- * returned page, instead.
+ * Called during backwards scans, to deal with their unique concurrency rules.
*
- * Returns InvalidBuffer if there is no page to the left (no lock is held
- * in that case).
+ * blkno points to the block number of the page that we expect to move the
+ * scan to. We'll successfully move the scan there when we find that its
+ * right sibling link points to lastcurrblkno, the page that we just finished
+ * reading. Otherwise, we have to figure out which page is the correct one to
+ * visit next the hard way, which involves reasoning about both concurrent
+ * page splits and concurrent page deletions. See nbtree/README for details.
+ *
+ * On return, we have both a pin and a read lock on the returned page, whose
+ * block number will be set in *blkno. Returns InvalidBuffer if there is no
+ * page to the left (no lock is held in that case).
*
* It is possible for the returned leaf page to be half-dead; caller must
* check that condition and step left again when required.
*/
static Buffer
-_bt_walk_left(Relation rel, Buffer buf)
+_bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
+ BlockNumber lastcurrblkno)
{
- Page page;
- BTPageOpaque opaque;
-
- page = BufferGetPage(buf);
- opaque = BTPageGetOpaque(page);
+ BlockNumber origblkno = *blkno; /* detects circular links */
for (;;)
{
- BlockNumber obknum;
- BlockNumber lblkno;
- BlockNumber blkno;
+ Buffer buf;
+ Page page;
+ BTPageOpaque opaque;
int tries;
- /* if we're at end of tree, release buf and return failure */
- if (P_LEFTMOST(opaque))
- {
- _bt_relbuf(rel, buf);
- break;
- }
- /* remember original page we are stepping left from */
- obknum = BufferGetBlockNumber(buf);
- /* step left */
- blkno = lblkno = opaque->btpo_prev;
- _bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, *blkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -2452,21 +2423,26 @@ _bt_walk_left(Relation rel, Buffer buf)
tries = 0;
for (;;)
{
- if (!P_ISDELETED(opaque) && opaque->btpo_next == obknum)
+ if (!P_ISDELETED(opaque) && opaque->btpo_next == lastcurrblkno)
{
/* Found desired page, return it */
return buf;
}
if (P_RIGHTMOST(opaque) || ++tries > 4)
break;
- blkno = opaque->btpo_next;
- buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
+ /* step right */
+ *blkno = opaque->btpo_next;
+ buf = _bt_relandgetbuf(rel, buf, *blkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
}
- /* Return to the original page to see what's up */
- buf = _bt_relandgetbuf(rel, buf, obknum, BT_READ);
+ /*
+ * Return to the original page (usually the page most recently read by
+ * _bt_readpage, which is passed by caller as lastcurrblkno) to see
+ * what's up with sibling links
+ */
+ buf = _bt_relandgetbuf(rel, buf, lastcurrblkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
if (P_ISDELETED(opaque))
@@ -2482,8 +2458,8 @@ _bt_walk_left(Relation rel, Buffer buf)
if (P_RIGHTMOST(opaque))
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
- blkno = opaque->btpo_next;
- buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
+ *blkno = opaque->btpo_next;
+ buf = _bt_relandgetbuf(rel, buf, *blkno, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
if (!P_ISDELETED(opaque))
@@ -2491,9 +2467,10 @@ _bt_walk_left(Relation rel, Buffer buf)
}
/*
- * Now return to top of loop, resetting obknum to point to this
- * nondeleted page, and try again.
+ * remember that this page is actually now the leftmost one whose
+ * key space we've already read all tuples from
*/
+ lastcurrblkno = BufferGetBlockNumber(buf);
}
else
{
@@ -2502,11 +2479,22 @@ _bt_walk_left(Relation rel, Buffer buf)
* to the left got split or deleted. Without this check, we'd go
* into an infinite loop if there's anything wrong.
*/
- if (opaque->btpo_prev == lblkno)
+ if (opaque->btpo_prev == origblkno)
elog(ERROR, "could not find left sibling of block %u in index \"%s\"",
- obknum, RelationGetRelationName(rel));
- /* Okay to try again with new lblkno value */
+ lastcurrblkno, RelationGetRelationName(rel));
+ /* Okay to try again, since left sibling link changed */
}
+
+ if (P_LEFTMOST(opaque))
+ {
+ /* previous leftmost page must have been concurrently deleted */
+ _bt_relbuf(rel, buf);
+ break;
+ }
+
+ /* next loop iteration will start over with a clean slate */
+ *blkno = origblkno = opaque->btpo_prev;
+ _bt_relbuf(rel, buf);
}
return InvalidBuffer;
@@ -2606,7 +2594,6 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
{
Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Buffer buf;
Page page;
BTPageOpaque opaque;
OffsetNumber start;
@@ -2617,9 +2604,9 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* version of _bt_search(). We don't maintain a stack since we know we
* won't need it.
*/
- buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
+ so->currPos.buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir));
- if (!BufferIsValid(buf))
+ if (!BufferIsValid(so->currPos.buf))
{
/*
* Empty index. Lock the whole relation, as nothing finer to lock
@@ -2630,8 +2617,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
return false;
}
- PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
- page = BufferGetPage(buf);
+ page = BufferGetPage(so->currPos.buf);
opaque = BTPageGetOpaque(page);
Assert(P_ISLEAF(opaque));
@@ -2654,29 +2640,11 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
start = 0; /* keep compiler quiet */
}
- /* remember which buffer we have pinned */
- so->currPos.buf = buf;
-
- _bt_initialize_more_data(so, dir);
-
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start, true))
- {
- /*
- * There's no actually-matching data on this page. Try to advance to
- * the next page. Return false if there's no matching data at all.
- */
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
- if (!_bt_steppage(scan, dir))
- return false;
- }
- else
- {
- /* We have at least one item to return as scan's first item */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
- }
+ if (!_bt_readfirstpage(scan, start, dir))
+ return false;
/* OK, itemIndex says what to return */
currItem = &so->currPos.items[so->currPos.itemIndex];
@@ -2686,33 +2654,3 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
return true;
}
-
-/*
- * _bt_initialize_more_data() -- initialize moreLeft, moreRight and scan dir
- * from currPos
- */
-static inline void
-_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
-{
- so->currPos.dir = dir;
- if (so->needPrimScan)
- {
- Assert(so->numArrayKeys);
-
- so->currPos.moreLeft = true;
- so->currPos.moreRight = true;
- so->needPrimScan = false;
- }
- else if (ScanDirectionIsForward(dir))
- {
- so->currPos.moreLeft = false;
- so->currPos.moreRight = true;
- }
- else
- {
- so->currPos.moreLeft = true;
- so->currPos.moreRight = false;
- }
- so->numKilled = 0; /* just paranoia */
- so->markItemIndex = -1; /* ditto */
-}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 61ded00ca..76be65123 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2407,7 +2407,7 @@ new_prim_scan:
so->needPrimScan = true; /* ...but call _bt_first again */
if (scan->parallel_scan)
- _bt_parallel_primscan_schedule(scan, pstate->prev_scan_page);
+ _bt_parallel_primscan_schedule(scan, so->currPos.currPage);
/* Caller's tuple doesn't match the new qual */
return false;
--
2.45.2
On Wed, 16 Oct 2024 at 20:03, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Oct 11, 2024 at 7:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v5
Now I'm attaching a v6, which further polishes things. Current plan is to
commit something very close to this in the next day or two.
v6 is mostly just further comment polishing. But it also merges together
the two existing _bt_readnextpage loops (for forward and backward scan
directions) into one single loop that handles everything. This is
definitely a win for brevity, and might also be a win for clarity --
but I'm not 100% sure which way I prefer it just yet.
(patch v6)
- BlockNumber btps_scanPage; /* latest or next page to be scanned */
+ BlockNumber btps_nextScanPage; /* next page to be scanned */
+ BlockNumber btps_lastCurrPage; /* page whose sibling link was copied into
+ * btps_scanPage */
This reference to btps_scanPage in the comment needs to be updated to
its new name.
(from your v5 mail)
* Now nbtree has only one PredicateLockPage call, inside _bt_readpage.
This saves us an extra BufferGetBlockNumber call. (v4 pushed
PredicateLockPage calls down one layer, v5 pushes them down another
layer still.)
(and seen in patch v6)
+ /* allow next page to be processed by parallel worker */
+ if (scan->parallel_scan)
+ {
+ if (ScanDirectionIsForward(dir))
+ _bt_parallel_release(scan, so->currPos.nextPage,
+ so->currPos.currPage);
+ else
+ _bt_parallel_release(scan, so->currPos.prevPage,
+ so->currPos.currPage);
+ }
+
+ PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);
I'm a little bit concerned about this change: I'm not familiar with
predicate locks, PLP, or anything related to SSI, but previously we
called PLP in parallel scans while we still had hold of the scan,
while we now call PLP only after letting other backends do things,
allowing PLPs for this scan to happen concurrently with other backends
of this scan. In principle, that looks like an improvement in
parallelism by reducing the work done in the critical path.
However, before, we would always get predicate locks in ascending (or
descending for backward scans) value order, but now that strict
keyspace order access has been released an unfortunately timed
descheduling of the process between _bt_p_release and PLP's guts means
lock acquisition would be only approximately in index leaf page
order. If the lock acquisition order matters in the predicate lock
system, then this will likely be a new cause of lock
conflicts/deadlocks.
If that's a real issue (by which I mean predicate locking is indeed
sensitive to acquisition order) then this call to PredicateLockPage
must happen before _bt_p_release, so that users don't get conflicts
caused only by bad timing issues in single-directional index scans.
Apart from these two comments, LGTM.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Thu, Oct 17, 2024 at 5:13 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I'm a little bit concerned about this change: I'm not familiar with
predicate locks, PLP, or anything related to SSI, but previously we
called PLP in parallel scans while we still had hold of the scan,
while we now call PLP only after letting other backends do things,
allowing PLPs for this scan to happen concurrently with other backends
of this scan. In principle, that looks like an improvement in
parallelism by reducing the work done in the critical path.
It sort of is, but not really. FWIW, there were two thoughts behind this:
1. Try to avoid gratuitous BufferGetBlockNumber() calls (use a stashed
BlockNumber instead), since nbtree has a tendency to do too many of
those (something that Andres complained about not long ago).
2. Try to do as little as possible while the backend has seized the
scan, on general principle. If only to set a good example going
forward -- loudly advertising when the scan is seized should also help.
I doubt it matters very much if we call PredicateLockPage() while the
parallel scan is seized -- I wouldn't expect avoiding it (waiting
until the parallel scan has been released to call PredicateLockPage)
will measurably help performance. But I can't see how it could
possibly be incorrect, either.
There is no reason to think that the precise order that we call
PredicateLockPage() for each page matters, in general. Just as it
doesn't matter if the index is scanned forwards or backwards.
Apart from these two comments, LGTM.
Pushed something very close to v6 several hours ago.
Thanks
--
Peter Geoghegan
Hi, thanks for working on these improvements.
I noticed an unexpected behavior where the index scan continues instead
of stopping, even when it detects that there are no tuples that match
the conditions. (I observed this while reviewing the skip scan patch,
though it isn't directly related to this issue.)
On 2024-10-12 08:29, Peter Geoghegan wrote:
On Thu, Oct 10, 2024 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
* We now reset currPos state (including even its moreLeft/moreRight
fields) within _bt_parallel_seize, automatically and regardless of any
other details.
IIUC, the above change is the root cause. The commit 1bd4bc8 adds a
reset of the currPos state in _bt_parallel_seize(). However, this change
can overwrite currPos.moreRight, which should be preserved before
calling _bt_readnextpage().
* Test case
-- Prepare
DROP TABLE IF EXISTS test;
CREATE TABLE test (smallint smallint, bool bool);
INSERT INTO test (SELECT -20000+i%40000, random()>0.5 FROM
generate_series(1, 1_000_000) s(i));
CREATE INDEX test_smallint ON test (smallint);
VACUUM ANALYZE test;
-- Check the number of pages of the index
=# SELECT relpages FROM pg_class WHERE relname = 'test_smallint';
relpages
----------
937
(1 row)
-- Test
=# SET max_parallel_workers = 0;
=# EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT COUNT(*) FROM test WHERE
smallint < -10000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=5170.23..5170.24 rows=1 width=8) (actual
time=71.352..71.402 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=934
-> Gather (cost=5170.01..5170.22 rows=2 width=8) (actual
time=71.344..71.395 rows=1 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 0
Buffers: shared hit=934
-> Partial Aggregate (cost=4170.01..4170.02 rows=1 width=8)
(actual time=71.199..71.199 rows=1 loops=1)
Output: PARTIAL count(*)
Buffers: shared hit=934
-> Parallel Index Only Scan using test_smallint on
public.test (cost=0.42..3906.27 rows=105495 width=0) (actual
time=0.062..49.137 rows=250000 loops=1)
Output: "smallint"
Index Cond: (test."smallint" < '-10000'::integer)
Heap Fetches: 0
Buffers: shared hit=934
-- This is not the result I expected. Almost all relpages are
-- being read to retrieve only 25% of the tuples.
-- Without commit 1bd4bc8, the number was '236' in my environment.
Planning Time: 0.105 ms
Execution Time: 71.454 ms
(18 rows)
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Thu, Nov 7, 2024 at 5:44 AM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
On 2024-10-12 08:29, Peter Geoghegan wrote:
On Thu, Oct 10, 2024 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
* We now reset currPos state (including even its moreLeft/moreRight
fields) within _bt_parallel_seize, automatically and regardless of any
other details.
IIUC, the above change is the root cause. The commit 1bd4bc8 adds a
reset of the currPos state in _bt_parallel_seize(). However, this change
can overwrite currPos.moreRight, which should be preserved before
calling _bt_readnextpage().
I can easily recreate the problem using your test case. I agree with
your diagnosis. Thanks for the report!
We could fix this by going back to how things were before -- don't
unset moreLeft/moreRight within _bt_parallel_seize. But that doesn't
seem like a good idea to me. It would be weird to depend on the
currPos state from the last _bt_readpage call, even after seizing the
scan for the next page in line. If the scan has already finished
(according to moreLeft/moreRight), then why even try to seize the scan
once again? Why not just end the scan instead (including its call to
_bt_parallel_done)?
* Test case
-- Without commit 1bd4bc8,
the number was '236' in my environment.
Planning Time: 0.105 ms
Execution Time: 71.454 ms
(18 rows)
With the attached patch, I get back to the good/correct behavior.
I'm not sure that this is the exact right way to fix the bug, but it
seems like the right general approach.
--
Peter Geoghegan
Attachments:
v1-0001-Fix-confusion-with-nbtree-parallel-scan-currPos-s.patch (application/octet-stream)
From 44a0be6f76f6dc9587fb5fa18ec0be3c27712f94 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 7 Nov 2024 14:09:47 -0500
Subject: [PATCH v1] Fix confusion with nbtree parallel scan currPos state.
---
src/backend/access/nbtree/nbtsearch.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 1608dd49d..70996a7fd 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -2224,6 +2224,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
{
Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool seized = true;
Assert(so->currPos.currPage == lastcurrblkno || scan->parallel_scan != NULL);
Assert(!BTScanPosIsPinned(so->currPos));
@@ -2254,6 +2255,15 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
Assert(!so->needPrimScan);
+ /* parallel scan must never use so->currPos blkno */
+ if (!seized && scan->parallel_scan != NULL &&
+ !_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+ {
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+ seized = false;
+
if (ScanDirectionIsForward(dir))
{
/* read blkno, but check for interrupts first */
@@ -2308,14 +2318,6 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
/* no matching tuples on this page */
_bt_relbuf(rel, so->currPos.buf);
-
- /* parallel scan seizes another page (won't use so->currPos blkno) */
- if (scan->parallel_scan != NULL &&
- !_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
- {
- BTScanPosInvalidate(so->currPos);
- return false;
- }
}
/*
--
2.45.2
On Thu, Nov 7, 2024 at 2:19 PM Peter Geoghegan <pg@bowt.ie> wrote:
It would be weird to depend on the
currPos state from the last _bt_readpage call, even after seizing the
scan for the next page in line. If the scan has already finished
(according to moreLeft/moreRight), then why even try to seize the scan
once again? Why not just end the scan instead (including its call to
_bt_parallel_done)?
The problem with what I said here was that v1 didn't go far enough in
this direction. v1 would still needlessly seize the scan whenever
_bt_steppage's blkno was involved. This was still correct, I think --
but it wasn't clear.
Attached revision v2 pushes the fix further in this direction. It
explicitly documents that parallel _bt_readnextpage callers are
allowed to use their so->currPos state (including
blkno/nextPage/prevPage) to help _bt_readnextpage to decide whether it
can end the scan right away. However, parallel index scan callers
still won't be allowed to use this same so->currPos state to decide
what page to go to next -- _bt_readnextpage really will need to seize
the scan to get an authoritative blkno to make that part safe
(actually, one parallel scan _bt_readnextpage caller still seizes the
scan for itself, which is the one caller where _bt_readnextpage fully
trusts caller's blkno + lastcurrblkno).
I think that this approach is a win for readability. It makes the
parallel case closer to the serial case, which is generally a good
thing: it allows _bt_steppage to not care at all about whether or not
the scan is a parallel scan. More importantly, it makes the exact
extent to which parallel scans rely on their so->currPos state a great
deal clearer. That was the underlying problem here all along:
confusion about using the so->currPos state to end the scan vs. using
the so->currPos state to decide which leaf page to visit next. These
are two different things, and only the first one is generally safe.
One problem that my v2 approach created (at least when I first started
work on it) was that the new _bt_readnextpage loop code was
ill-prepared for a blkno that's P_NONE when we seize the scan -- the
part of the loop where that's checked for came first, before the scan
is seized. I could have fixed that new problem by adding an extra
defensive is-blkno-P_NONE check to _bt_readnextpage. But that's not
what I settled on for v2. It makes more sense to just stop allowing
the parallel scan's btps_nextScanPage to become P_NONE in the first
place. The fact that we ever allowed that never made much sense to me -- why
wouldn't we just stop the scan as soon as we reach the point where the
"next block" is P_NONE, which is only possible on the rightmost and
leftmost page? So that's what we do. It's simpler all around.
--
Peter Geoghegan
Attachments:
v2-0001-Fix-confusion-with-nbtree-parallel-scan-currPos-s.patch (application/octet-stream)
From ecad16b0bc692180d098bcf39fecfd3d0163c7d9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 7 Nov 2024 16:21:03 -0500
Subject: [PATCH v2] Fix confusion with nbtree parallel scan currPos state.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Diagnosed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Discussion: https://postgr.es/m/f8efb9c0f8d1a71b44fd7f8e42e49c25@oss.nttdata.com
---
src/backend/access/nbtree/nbtree.c | 31 ++++++---
src/backend/access/nbtree/nbtsearch.c | 97 +++++++++++----------------
2 files changed, 62 insertions(+), 66 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 2919b1263..cd3c6f4f7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -596,9 +596,7 @@ btparallelrescan(IndexScanDesc scan)
* scan, and *last_curr_page returns the page that *next_scan_page came from.
* An invalid *next_scan_page means the scan hasn't yet started, or that
* caller needs to start the next primitive index scan (if it's the latter
- * case we'll set so.needPrimScan). The first time a participating process
- * reaches the last page, it will return true and set *next_scan_page to
- * P_NONE; after that, further attempts to seize the scan will return false.
+ * case we'll set so.needPrimScan).
*
* Callers should ignore the value of *next_scan_page and *last_curr_page if
* the return value is false.
@@ -724,6 +722,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
* that it can be passed to _bt_parallel_primscan_schedule, should caller
* determine that another primitive index scan is required.
*
+ * If caller's next_scan_page is P_NONE, the scan has reached the index's
+ * rightmost/leftmost page. We'll end the parallel scan right away.
+ *
* Note: unlike the serial case, parallel scans don't need to remember both
* sibling links. next_scan_page is whichever link is next given the scan's
* direction. That's all we'll ever need, since the direction of a parallel
@@ -735,16 +736,30 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber next_scan_page,
{
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
+ bool ended_scan = false;
+
+ Assert(BlockNumberIsValid(next_scan_page));
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
SpinLockAcquire(&btscan->btps_mutex);
- btscan->btps_nextScanPage = next_scan_page;
- btscan->btps_lastCurrPage = curr_page;
- btscan->btps_pageStatus = BTPARALLEL_IDLE;
+ if (next_scan_page == P_NONE)
+ {
+ btscan->btps_pageStatus = BTPARALLEL_DONE;
+ ended_scan = true;
+ }
+ else
+ {
+ btscan->btps_nextScanPage = next_scan_page;
+ btscan->btps_lastCurrPage = curr_page;
+ btscan->btps_pageStatus = BTPARALLEL_IDLE;
+ }
SpinLockRelease(&btscan->btps_mutex);
- ConditionVariableSignal(&btscan->btps_cv);
+ if (ended_scan)
+ ConditionVariableBroadcast(&btscan->btps_cv);
+ else
+ ConditionVariableSignal(&btscan->btps_cv);
}
/*
@@ -770,7 +785,7 @@ _bt_parallel_done(IndexScanDesc scan)
/*
* Should not mark parallel scan done when there's still a pending
- * primitive index scan (defensive)
+ * primitive index scan
*/
if (so->needPrimScan)
return;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 1608dd49d..a3f215e17 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -46,7 +46,8 @@ static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum,
ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
- BlockNumber lastcurrblkno, ScanDirection dir);
+ BlockNumber lastcurrblkno, ScanDirection dir,
+ bool seized);
static Buffer _bt_lock_and_validate_left(Relation rel, BlockNumber *blkno,
BlockNumber lastcurrblkno);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -924,9 +925,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
BlockNumber blkno,
lastcurrblkno;
- bool status;
- status = _bt_parallel_seize(scan, &blkno, &lastcurrblkno, true);
+ if (!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, true))
+ return false;
/*
* Initialize arrays (when _bt_parallel_seize didn't already set up
@@ -935,14 +936,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
if (so->numArrayKeys && !so->needPrimScan)
_bt_start_array_keys(scan, dir);
- if (!status)
- return false;
- else if (blkno == P_NONE)
- {
- _bt_parallel_done(scan);
- return false;
- }
- else if (blkno != InvalidBlockNumber)
+ if (blkno != InvalidBlockNumber)
{
Assert(!so->needPrimScan);
@@ -950,7 +944,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* We anticipated starting another primitive scan, but some other
* worker beat us to it
*/
- if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir))
+ if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir, true))
return false;
goto readcomplete;
}
@@ -2012,12 +2006,8 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
* a valid block, in any case.
*
* This is a wrapper on _bt_readnextpage that performs final steps for the
- * current page. It sets up the _bt_readnextpage call using either local
- * state saved in so->currPos by the most recent _bt_readpage call, or using
- * shared parallel scan state (obtained by seizing the parallel scan here).
- *
- * Parallel scan callers that have already seized the scan should directly
- * call _bt_readnextpage, rather than calling here.
+ * current page. It sets up the _bt_readnextpage call using local state saved
+ * in so->currPos by the most recent _bt_readpage call.
*/
static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
@@ -2081,37 +2071,22 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
BTScanPosUnpinIfPinned(so->currPos);
/* Walk to the next page with data */
- if (!scan->parallel_scan)
- {
- /* Not parallel, so use local state set by the last _bt_readpage */
- if (ScanDirectionIsForward(dir))
- blkno = so->currPos.nextPage;
- else
- blkno = so->currPos.prevPage;
- lastcurrblkno = so->currPos.currPage;
-
- /*
- * Cancel primitive index scans that were scheduled when the call to
- * _bt_readpage for currPos happened to use the opposite direction to
- * the one that we're stepping in now. (It's okay to leave the scan's
- * array keys as-is, since the next _bt_readpage will advance them.)
- */
- if (so->currPos.dir != dir)
- so->needPrimScan = false;
- }
+ if (ScanDirectionIsForward(dir))
+ blkno = so->currPos.nextPage;
else
- {
- /*
- * Seize the scan to get the nextPage and currPage from shared
- * parallel state (saved from parallel scan's last _bt_readpage)
- */
- if (!_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
- return false;
+ blkno = so->currPos.prevPage;
+ lastcurrblkno = so->currPos.currPage;
- Assert(!so->needPrimScan);
- }
+ /*
+ * Cancel primitive index scans that were scheduled when the call to
+ * _bt_readpage for currPos happened to use the opposite direction to the
+ * one that we're stepping in now. (It's okay to leave the scan's array
+ * keys as-is, since the next _bt_readpage will advance them.)
+ */
+ if (so->currPos.dir != dir)
+ so->needPrimScan = false;
- return _bt_readnextpage(scan, blkno, lastcurrblkno, dir);
+ return _bt_readnextpage(scan, blkno, lastcurrblkno, dir, false);
}
/*
@@ -2203,8 +2178,13 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
*
* On entry, caller shouldn't hold any locks or pins on any page (we work
* directly off of blkno and lastcurrblkno instead). Parallel scan callers
- * must have seized the scan before calling here (blkno and lastcurrblkno
- * arguments should come from the seized scan).
+ * that seized the scan before calling here should pass seized=true; such a
+ * caller's blkno and lastcurrblkno arguments come from the seized scan.
+ * seized=false parallel scan callers just pass us the blkno/lastcurrblkno
+ * taken from their so->currPos, which can be used to end the scan, but will
+ * never be used to determine which page to read next (we must seize the scan
+ * to get the blkno that we have to read next, since the correct page to read
+ * might already be beyond a seized=false caller's blkno).
*
* On success exit, so->currPos is updated to contain data from the next
* interesting page, and we return true (parallel scan callers should not use
@@ -2220,12 +2200,12 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
*/
static bool
_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
- BlockNumber lastcurrblkno, ScanDirection dir)
+ BlockNumber lastcurrblkno, ScanDirection dir, bool seized)
{
Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- Assert(so->currPos.currPage == lastcurrblkno || scan->parallel_scan != NULL);
+ Assert(so->currPos.currPage == lastcurrblkno || seized);
Assert(!BTScanPosIsPinned(so->currPos));
/*
@@ -2254,6 +2234,14 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
Assert(!so->needPrimScan);
+ /* parallel scan must never actually visit so->currPos blkno */
+ if (!seized && scan->parallel_scan != NULL &&
+ !_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
+ {
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+
if (ScanDirectionIsForward(dir))
{
/* read blkno, but check for interrupts first */
@@ -2308,14 +2296,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
/* no matching tuples on this page */
_bt_relbuf(rel, so->currPos.buf);
-
- /* parallel scan seizes another page (won't use so->currPos blkno) */
- if (scan->parallel_scan != NULL &&
- !_bt_parallel_seize(scan, &blkno, &lastcurrblkno, false))
- {
- BTScanPosInvalidate(so->currPos);
- return false;
- }
+ seized = false; /* released by _bt_readpage (or by us) */
}
/*
--
2.45.2
On 2024-11-08 07:42, Peter Geoghegan wrote:
Attached revision v2 pushes the fix further in this direction. It
explicitly documents that parallel _bt_readnextpage callers are
allowed to use their so->currPos state (including
blkno/nextPage/prevPage) to help _bt_readnextpage to decide whether it
can end the scan right away. However, parallel index scan callers
still won't be allowed to use this same so->currPos state to decide
what page to go to next -- _bt_readnextpage really will need to seize
the scan to get an authoritative blkno to make that part safe
(actually, one parallel scan _bt_readnextpage caller still seizes the
scan for itself, which is the one caller where _bt_readnextpage fully
trusts caller's blkno + lastcurrblkno).
Thank you! I've confirmed that the v2 patch fixed the bug. As you
mentioned, I also feel that the v2 patch is now easier to understand.
Since I couldn't understand the reason, I have a question: is the
following deletion related to this change?
@@ -770,7 +785,7 @@ _bt_parallel_done(IndexScanDesc scan)
/*
* Should not mark parallel scan done when there's still a pending
- * primitive index scan (defensive)
+ * primitive index scan
*/
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Fri, Nov 8, 2024 at 8:25 AM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
Thank you! I've confirmed that the v2 patch fixed the bug. As you
mentioned, I also feel that the v2 patch is now easier to understand.
Great. I pushed something quite similar to v2 just now. Thanks!
Since I couldn't understand the reason, I have a question: is the
following deletion related to this change?

@@ -770,7 +785,7 @@ _bt_parallel_done(IndexScanDesc scan)
/*
* Should not mark parallel scan done when there's still a pending
- * primitive index scan (defensive)
+ * primitive index scan
*/
That's a good question.
Prior to this bugfix, the check of so->needPrimScan from within
_bt_parallel_done() was defensive; it wasn't strictly necessary. It
could have been "Assert(!so->needPrimScan)" instead (I guess that I
didn't make it into an assert like this because _bt_parallel_done
worked a little like this, prior to Postgres 17, when we had a
primscan counter instead of the current so->needPrimScan flag). But
that's no longer the case with the bugfix in place; the
so->needPrimScan check really is strictly necessary now.
It's hard to see why this is. Notice that _bt_parallel_seize() will
just return false when another primitive index scan is required. Prior
to this bugfix, we'd seize the parallel scan within _bt_steppage,
which could only succeed when "!so->needPrimScan" (which was actually
verified by an assertion that just got removed). With this bug fix,
nothing stops the so->needPrimScan flag from still being set from back
when we called _bt_readpage for the so->currPos we're using. And so,
and as I said, the check of so->needPrimScan from within
_bt_parallel_done() became strictly necessary (not just defensive) --
since so->needPrimScan definitely can be set when we call
_bt_parallel_done, and we definitely don't want to *really* end the
whole top-level scan when it is set (we must never confuse the end of
one primitive index scan with the end of the whole top-level parallel
index scan).
--
Peter Geoghegan
On 2024-11-09 03:40, Peter Geoghegan wrote:
On Fri, Nov 8, 2024 at 8:25 AM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
Since I couldn't understand the reason, I have a question: is the
following deletion related to this change?

@@ -770,7 +785,7 @@ _bt_parallel_done(IndexScanDesc scan)
/*
* Should not mark parallel scan done when there's still a pending
- * primitive index scan (defensive)
+ * primitive index scan
*/

That's a good question.
Prior to this bugfix, the check of so->needPrimScan from within
_bt_parallel_done() was defensive; it wasn't strictly necessary. It
could have been "Assert(!so->needPrimScan)" instead (I guess that I
didn't make it into an assert like this because _bt_parallel_done
worked a little like this, prior to Postgres 17, when we had a
primscan counter instead of the current so->needPrimScan flag). But
that's no longer the case with the bugfix in place; the
so->needPrimScan check really is strictly necessary now.
It's hard to see why this is. Notice that _bt_parallel_seize() will
just return false when another primitive index scan is required. Prior
to this bugfix, we'd seize the parallel scan within _bt_steppage,
which could only succeed when "!so->needPrimScan" (which was actually
verified by an assertion that just got removed). With this bug fix,
nothing stops the so->needPrimScan flag from still being set from back
when we called _bt_readpage for the so->currPos we're using. And so,
and as I said, the check of so->needPrimScan from within
_bt_parallel_done() became strictly necessary (not just defensive) --
since so->needPrimScan definitely can be set when we call
_bt_parallel_done, and we definitely don't want to *really* end the
whole top-level scan when it is set (we must never confuse the end of
one primitive index scan with the end of the whole top-level parallel
index scan).
I understand, thanks to your explanation.
Now, there is a case where _bt_readnextpage() calls _bt_parallel_seize(),
_bt_readpage() sets so->needPrimScan=true, and _bt_parallel_done() is
called with so->needPrimScan=true. Prior to this bugfix,
_bt_parallel_seize() was called after _bt_readpage() sets
so->needPrimScan=true, and it just returned false without calling
_bt_parallel_done().
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Sun, Nov 10, 2024 at 9:53 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
I understand, thanks to your explanation.
Cool.
Now, there is a case where _bt_readnextpage() calls _bt_parallel_seize(),
_bt_readpage() sets so->needPrimScan=true, and _bt_parallel_done() is
called with so->needPrimScan=true. Prior to this bugfix,
_bt_parallel_seize() was called after _bt_readpage() sets
so->needPrimScan=true, and it just returned false without calling
_bt_parallel_done().
You influenced me to add something about this to my follow-up commit caca6d8d:
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -2230,8 +2230,9 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
!so->currPos.moreRight : !so->currPos.moreLeft))
{
/* most recent _bt_readpage call (for lastcurrblkno) ended scan */
+ Assert(so->currPos.currPage == lastcurrblkno && !seized);
BTScanPosInvalidate(so->currPos);
- _bt_parallel_done(scan);
+ _bt_parallel_done(scan); /* iff !so->needPrimScan */
return false;
}
I added "iff !so->needPrimScan" to draw attention to the fact that we
don't necessarily really end the parallel scan when _bt_parallel_done
is called.
--
Peter Geoghegan
On 2024-11-11 12:19, Peter Geoghegan wrote:
On Sun, Nov 10, 2024 at 9:53 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
I understand, thanks to your explanation.
Cool.
Now, there is a case where _bt_readnextpage() calls _bt_parallel_seize(),
_bt_readpage() sets so->needPrimScan=true, and _bt_parallel_done() is
called with so->needPrimScan=true. Prior to this bugfix,
_bt_parallel_seize() was called after _bt_readpage() sets
so->needPrimScan=true, and it just returned false without calling
_bt_parallel_done().

You influenced me to add something about this to my follow-up commit
caca6d8d:

--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -2230,8 +2230,9 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
!so->currPos.moreRight : !so->currPos.moreLeft))
{
/* most recent _bt_readpage call (for lastcurrblkno) ended scan */
+ Assert(so->currPos.currPage == lastcurrblkno && !seized);
BTScanPosInvalidate(so->currPos);
- _bt_parallel_done(scan);
+ _bt_parallel_done(scan); /* iff !so->needPrimScan */
return false;
}

I added "iff !so->needPrimScan" to draw attention to the fact that we
don't necessarily really end the parallel scan when _bt_parallel_done
is called.
Thanks! The change made it easier for me to understand.
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Sun, Nov 10, 2024 at 11:36 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
> Thanks! The change made it easier for me to understand.
As a follow-up to all of the recent work in this area, I'd like to add
this wrapper function to return the next item from so->currPos.
The wrapper function has extra assertions, compared to what we do
already. It's slightly more defensive, and IMV slightly clearer.
--
Peter Geoghegan
Attachments:
v1-0001-Use-_bt_returnitem-wrapper-function.patch
From bb9f297ebf46559b294ef9a146e0bee59ec0440f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 12 Nov 2024 09:33:06 -0500
Subject: [PATCH v1] Use _bt_returnitem wrapper function
---
src/backend/access/nbtree/nbtsearch.c | 55 ++++++++++++++-------------
1 file changed, 29 insertions(+), 26 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2786a8564..b08e68e4f 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -42,6 +42,7 @@ static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum,
ItemPointer heapTid, int tupleOffset);
+static inline void _bt_returnitem(IndexScanDesc scan, BTScanOpaque so);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum,
ScanDirection dir);
@@ -890,7 +891,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
int keysz = 0;
StrategyNumber strat_total;
- BTScanPosItem *currItem;
Assert(!BTScanPosIsValid(so->currPos));
@@ -950,7 +950,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
if (!_bt_readnextpage(scan, blkno, lastcurrblkno, dir, true))
return false;
- goto readcomplete;
+
+ _bt_returnitem(scan, so);
+ return true;
}
}
else if (so->numArrayKeys && !so->needPrimScan)
@@ -1438,14 +1440,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
if (!_bt_readfirstpage(scan, offnum, dir))
return false;
-readcomplete:
- /* OK, itemIndex says what to return */
- Assert(BTScanPosIsValid(so->currPos));
- currItem = &so->currPos.items[so->currPos.itemIndex];
- scan->xs_heaptid = currItem->heapTid;
- if (scan->xs_want_itup)
- scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
-
+ _bt_returnitem(scan, so);
return true;
}
@@ -1468,7 +1463,6 @@ bool
_bt_next(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- BTScanPosItem *currItem;
Assert(BTScanPosIsValid(so->currPos));
@@ -1493,13 +1487,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
}
}
- /* OK, itemIndex says what to return */
- Assert(BTScanPosIsValid(so->currPos));
- currItem = &so->currPos.items[so->currPos.itemIndex];
- scan->xs_heaptid = currItem->heapTid;
- if (scan->xs_want_itup)
- scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
-
+ _bt_returnitem(scan, so);
return true;
}
@@ -1564,6 +1552,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
so->currPos.dir = dir;
so->currPos.nextTupleOffset = 0;
+ /* either moreLeft or moreRight should be set, and may be unset here */
+ Assert(ScanDirectionIsForward(dir) ? so->currPos.moreRight :
+ so->currPos.moreLeft);
PredicateLockPage(rel, so->currPos.currPage, scan->xs_snapshot);
@@ -2002,6 +1993,25 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
currItem->tupleOffset = tupleOffset;
}
+/*
+ * Return the index item from so->currPos.items[itemIndex] to the index scan
+ * via its scan descriptor
+ */
+static inline void
+_bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
+{
+ BTScanPosItem *currItem = &so->currPos.items[so->currPos.itemIndex];
+
+ /* Most recent _bt_readpage must have succeeded */
+ Assert(BTScanPosIsValid(so->currPos));
+ Assert(so->currPos.itemIndex >= so->currPos.firstItem);
+ Assert(so->currPos.itemIndex <= so->currPos.lastItem);
+
+ scan->xs_heaptid = currItem->heapTid;
+ if (so->currTuples)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
@@ -2543,7 +2553,6 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
Page page;
BTPageOpaque opaque;
OffsetNumber start;
- BTScanPosItem *currItem;
Assert(!BTScanPosIsValid(so->currPos));
@@ -2593,12 +2602,6 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
if (!_bt_readfirstpage(scan, start, dir))
return false;
- /* OK, itemIndex says what to return */
- Assert(BTScanPosIsValid(so->currPos));
- currItem = &so->currPos.items[so->currPos.itemIndex];
- scan->xs_heaptid = currItem->heapTid;
- if (scan->xs_want_itup)
- scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
-
+ _bt_returnitem(scan, so);
return true;
}
--
2.45.2
On 2024-11-13 00:55, Peter Geoghegan wrote:
> The wrapper function has extra assertions, compared to what we do
> already. It's slightly more defensive, and IMV slightly clearer.
Thanks, I agree with adding the function for refactoring and including
assertions for moreLeft or moreRight.
One thing I was concerned about is that "if (scan->xs_want_itup)" was
changed to "if (so->currTuples)". However, this isn’t an issue because
so->currTuples is only initialized if scan->xs_want_itup is set to
true in btrescan(), and it improves consistency with other functions
for index-only scans.
I also confirmed that make check-world passes with the patch.
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On Tue, Nov 12, 2024 at 10:14 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
> Thanks, I agree with adding the function for refactoring and including
> assertions for moreLeft or moreRight.
Pushed.
Thanks again
--
Peter Geoghegan