Limiting overshoot in nbtree's parallel SAOP index scans

Started by Matthias van de Meent, over 1 year ago. 8 messages
#1 Matthias van de Meent
boekewurm@gmail.com
1 attachment(s)

Hi,

With PG17's new SAOP handling, the performance of certain index scans
improved significantly in the serial case. However, for
parallel index scans the performance picture is not as
straightforward, and this has caused some issues earlier.

Background
----------
Before PG17, we had one btree descent + scan for every unique
combination of SAOP operators. With PG17, that changed to one btree
descent + scan, plus however many btree descents we decide are needed
to skip past keys we know won't match. The difference here is that we
went from always skipping, to only skipping when we think it's
beneficial.
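
As a toy illustration of the difference (not PostgreSQL code - the
page size, data layout, and function names below are all made up),
the following treats fixed-size chunks of a sorted array as leaf
pages and a binary search as a btree descent, and compares the two
strategies for `id = ANY (3, 997)`:

#include <stdio.h>

#define NVALUES 1000
#define PAGE_SZ 10

static int values[NVALUES];	/* the "index": sorted ints 0..999 */

/* a "descent": binary search for the leaf page holding key */
static int
descend(int key)
{
	int		lo = 0,
			hi = NVALUES - 1;

	while (lo < hi)
	{
		int		mid = (lo + hi) / 2;

		if (values[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo / PAGE_SZ;
}

int
main(void)
{
	int		first,
			last;

	for (int i = 0; i < NVALUES; i++)
		values[i] = i;

	first = descend(3);			/* id = ANY (3, 997) */
	last = descend(997);

	/* pre-PG17: one descent + one leaf page per array key */
	printf("pre-17: 2 descents, 2 leaf pages\n");

	/* PG17, if we never decide to skip: 1 descent, walk to the end */
	printf("17, never skipping: 1 descent, %d leaf pages\n",
		   last - first + 1);
	return 0;
}

On a C99 compiler this prints 2 leaf pages for the pre-PG17 strategy,
and 100 - the entire leaf level - for a PG17-style scan that never
decides to skip, which is exactly the worst case described below.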

In parallel index scans before PG17, there was a counter which
tracked the number of SAOP entries handled. This meant the IO
patterns of serial and parallel scans were about the same, as the
critical path for IO was almost exactly the same, with no relevant
race conditions - for a given scan, all backends would agree on the
state of the scan.

With the introduction of the new SAOP handling in PG17, however, the
shared state has become a bit more muddied. Because the default has
changed from "stop scanning at the end of a SAOP value's range" to
"continue scanning, unless ...", and because it is relatively
expensive to determine whether we need to stop or continue, we release
the parallel scan to continue before we know if a new primitive scan
would be preferred.

Problem
-------
The new parallel index scan code can only get into a new primitive
scan if no parallel worker has taken the next block in the time
between the call to _bt_parallel_release at the start of _bt_readpage,
and the call to _bt_parallel_primscan_schedule through _bt_checkkeys
slightly later in _bt_readpage. This makes the code susceptible to
race conditions, and in the worst case parallel SAOP index scans can
behave like a range index scan over the full value range of the SAOP
parameters - skidding past what would be the bounds in the serial
scan variant - rather than being limited to scanning only the leaf
pages that contain value ranges matching the SAOP parameters.

Given the above, a parallel index scan with ScanKey `id = ANY
(minvalue, maxvalue)` and bad enough timing/luck could thus scan the
whole index, while just 2 primitive index scans (the behaviour before
PG17) would be sufficient.

In practice it is quite unlikely that parallel index scans have such
bad luck, but I don't think luck or timing should be a big factor in
the performance picture of parallel index scans.
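
To make the race concrete, here is a single-threaded toy sketch
(again not PostgreSQL code: the struct and the three functions are
heavily simplified stand-ins for BTParallelScanDescData and their
nbtree.c namesakes) that replays the losing interleaving:

#include <stdio.h>
#include <stdbool.h>

/* toy stand-in for BTParallelScanDescData */
typedef struct
{
	int		scanPage;		/* next page up for grabs */
	bool	idle;			/* false while a backend is advancing */
} SharedScan;

/* stand-in for _bt_parallel_seize */
static int
seize(SharedScan *s)
{
	s->idle = false;
	return s->scanPage;
}

/* stand-in for _bt_parallel_release: publish the sibling link */
static void
release(SharedScan *s, int next_page)
{
	s->scanPage = next_page;
	s->idle = true;
}

/* stand-in for _bt_parallel_primscan_schedule: only succeeds if
 * nobody seized the scan since we released prev_scan_page */
static bool
schedule_primscan(SharedScan *s, int prev_scan_page)
{
	return s->idle && s->scanPage == prev_scan_page;
}

int
main(void)
{
	SharedScan	s = {.scanPage = 0, .idle = true};
	int			page = seize(&s);	/* backend 1 takes page 0 */

	release(&s, page + 1);		/* publishes sibling page 1 early in
								 * _bt_readpage */
	(void) seize(&s);			/* backend 2 wakes up, takes page 1 */

	/* backend 1 finishes its page and wants a new primitive scan */
	if (!schedule_primscan(&s, page + 1))
		printf("backend 1: schedule failed, scan keeps walking right\n");
	return 0;
}

In the toy, as in nbtree, the schedule call only succeeds if the scan
is still idle at the page we released; backend 2's seize makes it
fail, and nothing stops that from repeating on every page.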

Solution
--------

I think I've found a reasonably elegant solution, which limits the
number of buffers for which we can fail to start a new primitive scan to the
number of parallel workers + 1: If we detect that a parallel worker
in the same primscan range thinks this is the right moment to start a
new primscan (but couldn't due to concurrent advancement), we don't
release the parallel scan immediately, but instead only release it
after reading the page's contents, to find out if we really should
start a new primitive scan. If so, we do that, and if not, we instead
mark that the primitive scan has reached a new primscan range, do some
cleanup, and then continue business as usual.
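
In toy form (same caveats as before; only the roles of
btps_checkPrimScan and btps_arrElemsGen match the attached patch,
everything else is simplified), the handshake looks like this:

#include <stdio.h>
#include <stdbool.h>

typedef struct
{
	int			scanPage;
	bool		idle;
	bool		checkPrimScan;	/* a backend missed a skip in this range */
	unsigned	arrElemsGen;	/* bumped on every new primitive scan */
} SharedScan;

/* on failure, flag the range, but only if the shared array keys are
 * still at our generation */
static bool
schedule_primscan(SharedScan *s, int prev_page, unsigned my_gen)
{
	if (s->idle && s->scanPage == prev_page)
	{
		s->arrElemsGen++;		/* scheduled: new primitive scan */
		return true;
	}
	if (s->arrElemsGen == my_gen)
		s->checkPrimScan = true;	/* ask the current reader to test */
	return false;
}

/* refuse to release while a skip test is pending; the caller then
 * finishes reading its page before releasing late */
static bool
release_early(SharedScan *s, int next_page)
{
	if (s->checkPrimScan)
		return false;			/* hold the scan for now */
	s->scanPage = next_page;
	s->idle = true;
	return true;
}

/* the whole page turned out not to need a skip: the range is handled */
static void
release_late(SharedScan *s, int next_page)
{
	s->scanPage = next_page;
	s->idle = true;
	s->checkPrimScan = false;
	s->arrElemsGen++;
}

int
main(void)
{
	SharedScan	s = {.scanPage = 1, .idle = false};

	/* backend 1 fails to schedule (backend 2 already advanced) ... */
	(void) schedule_primscan(&s, 1, 0);
	/* ... so backend 2 may not release early ... */
	if (!release_early(&s, 2))
		printf("backend 2 holds the scan while reading its page\n");
	/* ... and releases late once it knows no skip is needed */
	release_late(&s, 2);
	printf("gen=%u, checkPrimScan=%d\n", s.arrElemsGen,
		   (int) s.checkPrimScan);
	return 0;
}

The generation number ensures a stale 'check' request left over from
an older set of array keys can't make a later reader hold the scan
for nothing.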

There is some room to tune the number of buffers for which we can fail
to start a new primitive scan, by locally keeping track of the number
of pages we've processed without starting an expected primitive scan,
but I feel that waiting any longer than it takes to do the skip check
ASAP would result in overall worse performance.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachments:

v1-0001-Avoid-full-btree-index-scans-when-skipping-is-pos.patch (application/octet-stream)
From 54127e3be41ac51ff3640960c08b4e65aa4a4892 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 11 Oct 2024 15:57:27 +0200
Subject: [PATCH v1] Avoid full btree index scans when skipping is possible

Previously, we could ignore the skip signal until the end of the range of
values producible by the index scan key. Now, we can fail to start a new
primscan only for up to the number of parallel workers + 1 buffers, at the cost
of doing a bit more before releasing the scan while we process the 'we may
need a new primitive scan' signal.

If we detect that a parallel worker in the same primscan range thinks
this is the right moment to start a new primitive scan, we don't release
the parallel scan immediately, but instead only release it after reading
the page's contents to find out if we really should start a new primitive
scan.  If so, we start that new primitive scan, and if instead we find
we've already skidded into the range of pages we would've arrived on with
the skip scan, we instead mark that the primitive scan has reached a new
primscan range, do some cleanup, and then continue the scan as usual.
---
 src/include/access/nbtree.h           |  10 ++-
 src/backend/access/nbtree/nbtree.c    | 112 +++++++++++++++++++++++---
 src/backend/access/nbtree/nbtsearch.c |  15 +++-
 3 files changed, 121 insertions(+), 16 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d64300fb97..efe6728dd3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1052,6 +1052,11 @@ typedef struct BTScanOpaqueData
 	FmgrInfo   *orderProcs;		/* ORDER procs for required equality keys */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
+	/* local state for coordinating skips in parallel scans */
+	bool		testPrimScan;	/* Are we trying to do a new primitive scan */
+	uint32		arrElemsGen;	/* Generation number of prim scan we want to
+								 * improve on */
+
 	/* info about killed items if any (killedItems is NULL if never used) */
 	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
@@ -1193,7 +1198,10 @@ extern int	btgettreeheight(Relation rel);
  */
 extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno,
 							   bool first);
-extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+extern void _bt_parallel_opt_release_early(IndexScanDesc scan,
+										   BlockNumber scan_page);
+extern void _bt_parallel_opt_release_late(IndexScanDesc scan,
+										  BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_primscan_schedule(IndexScanDesc scan,
 										   BlockNumber prev_scan_page);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 56e502c4fc..04c8d6e786 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -70,12 +70,15 @@ typedef struct BTParallelScanDescData
 	BTPS_State	btps_pageStatus;	/* indicates whether next page is
 									 * available for scan. see above for
 									 * possible states of parallel scan. */
+	uint32		btps_arrElemsGen;	/* number of new prim scan opportunities */
+	bool		btps_checkPrimScan;	/* did we skid past the most opportune
+									 * endpoint of a primitive scan? */
 	slock_t		btps_mutex;		/* protects above variables, btps_arrElems */
 	ConditionVariable btps_cv;	/* used to synchronize parallel scan */
 
 	/*
 	 * btps_arrElems is used when scans need to schedule another primitive
-	 * index scan.  Holds BTArrayKeyInfo.cur_elem offsets for scan keys.
+	 * index scan.  Holds the values for BTScanOpaque->arrayKeys[.].cur_elem.
 	 */
 	int			btps_arrElems[FLEXIBLE_ARRAY_MEMBER];
 }			BTParallelScanDescData;
@@ -335,6 +338,9 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->arrayKeys = NULL;
 	so->orderProcs = NULL;
 	so->arrayContext = NULL;
+	
+	so->testPrimScan = false;
+	so->arrElemsGen = 0;
 
 	so->killedItems = NULL;		/* until needed */
 	so->numKilled = 0;
@@ -550,6 +556,8 @@ btinitparallelscan(void *target)
 	SpinLockInit(&bt_target->btps_mutex);
 	bt_target->btps_scanPage = InvalidBlockNumber;
 	bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->btps_arrElemsGen = 0;
+	bt_target->btps_checkPrimScan = false;
 	ConditionVariableInit(&bt_target->btps_cv);
 }
 
@@ -575,13 +583,15 @@ btparallelrescan(IndexScanDesc scan)
 	SpinLockAcquire(&btscan->btps_mutex);
 	btscan->btps_scanPage = InvalidBlockNumber;
 	btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	btscan->btps_arrElemsGen = 0;
+	btscan->btps_checkPrimScan = false;
 	SpinLockRelease(&btscan->btps_mutex);
 }
 
 /*
  * _bt_parallel_seize() -- Begin the process of advancing the scan to a new
- *		page.  Other scans must wait until we call _bt_parallel_release()
- *		or _bt_parallel_done().
+ *		page.  Other scans must wait until we call _bt_parallel_done(),
+ *		[_btp]_opt_release_early/late(), or [_btp]_primscan_schedule().
  *
  * The return value is true if we successfully seized the scan and false
  * if we did not.  The latter case occurs when no pages remain, or when
@@ -640,6 +650,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
 		{
 			/* We're done with this parallel index scan */
 			status = false;
+			so->testPrimScan = false;
+			so->arrElemsGen = 0;
 		}
 		else if (btscan->btps_pageStatus == BTPARALLEL_NEED_PRIMSCAN)
 		{
@@ -659,6 +671,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
 				}
 				*pageno = InvalidBlockNumber;
 				exit_loop = true;
+				so->arrElemsGen = btscan->btps_arrElemsGen;
+				so->testPrimScan = false;
 			}
 			else
 			{
@@ -685,6 +699,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
 			 */
 			btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
 			*pageno = btscan->btps_scanPage;
+
+			so->arrElemsGen = btscan->btps_arrElemsGen;
+
 			exit_loop = true;
 		}
 		SpinLockRelease(&btscan->btps_mutex);
@@ -698,7 +715,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
 }
 
 /*
- * _bt_parallel_release() -- Complete the process of advancing the scan to a
+ * _bt_parallel_opt_release_early() -- Complete the process of advancing the scan to a
  *		new page.  We now have the new value btps_scanPage; some other backend
  *		can now begin advancing the scan.
  *
@@ -709,19 +726,76 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
  * scan lands on scan_page).
  */
 void
-_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+_bt_parallel_opt_release_early(IndexScanDesc scan, BlockNumber scan_page)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->btps_mutex);
+	/*
+	 * If a parallel worker noticed that it had skipped past the end of a
+	 * primitive scan after another backend already acquired the parallel scan
+	 * status, we don't release the scan before reading the page's contents.
+	 * Instead, we merely flag locally that the scan must be released late.
+	 */
+	if (likely(!btscan->btps_checkPrimScan))
+	{
+		btscan->btps_scanPage = scan_page;
+		btscan->btps_pageStatus = BTPARALLEL_IDLE;
+		SpinLockRelease(&btscan->btps_mutex);
+		ConditionVariableSignal(&btscan->btps_cv);
+	}
+	else
+	{
+		BTScanOpaque	so = (BTScanOpaque) scan->opaque;
+		so->testPrimScan = true;
+		SpinLockRelease(&btscan->btps_mutex);
+	}
+}
+
+/*
+ * _bt_parallel_opt_release_late() -- Complete the process of advancing the
+ *		scan to a new page.
+ *
+ * We're only called when a concurrent backend wanted to schedule a skip scan,
+ * but failed to do so because the parallel scan already advanced past its
+ * own page.  
+ */
+void
+_bt_parallel_opt_release_late(IndexScanDesc scan, BlockNumber scan_page)
 {
 	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	BTParallelScanDesc btscan;
+	
+	if (!so->testPrimScan)
+		return;
 
 	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
 												  parallel_scan->ps_offset);
 
 	SpinLockAcquire(&btscan->btps_mutex);
+	Assert(btscan->btps_checkPrimScan);
 	btscan->btps_scanPage = scan_page;
 	btscan->btps_pageStatus = BTPARALLEL_IDLE;
+
+	/*
+	 * A late release implies that 1) a concurrent backend noticed we
+	 * should've started a new primitive scan, and that 2) the current scan
+	 * position is already at-or-past the point where that scan would've
+	 * started.  So, we do what a new primitive scan would've done with the
+	 * shared state: we increase the generation number, and unset
+	 * checkPrimScan.
+	 */
+	btscan->btps_checkPrimScan = false;
+	btscan->btps_arrElemsGen += 1;
 	SpinLockRelease(&btscan->btps_mutex);
 	ConditionVariableSignal(&btscan->btps_cv);
+
+	so->testPrimScan = false;
 }
 
 /*
@@ -738,6 +812,7 @@ _bt_parallel_done(IndexScanDesc scan)
 	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
 	BTParallelScanDesc btscan;
 	bool		status_changed = false;
+	so->testPrimScan = false;
 
 	/* Do nothing, for non-parallel scans */
 	if (parallel_scan == NULL)
@@ -774,10 +849,10 @@ _bt_parallel_done(IndexScanDesc scan)
 /*
  * _bt_parallel_primscan_schedule() -- Schedule another primitive index scan.
  *
- * Caller passes the block number most recently passed to _bt_parallel_release
- * by its backend.  Caller successfully schedules the next primitive index scan
- * if the shared parallel state hasn't been seized since caller's backend last
- * advanced the scan.
+ * Caller passes the block number most recently passed to
+ * _bt_parallel_opt_release_early by its backend.  Caller successfully
+ * schedules the next primitive index scan if the shared parallel state hasn't
+ * been seized since caller's backend last advanced the scan.
  */
 void
 _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
@@ -792,11 +867,13 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
 												  parallel_scan->ps_offset);
 
 	SpinLockAcquire(&btscan->btps_mutex);
-	if (btscan->btps_scanPage == prev_scan_page &&
-		btscan->btps_pageStatus == BTPARALLEL_IDLE)
+	if ((btscan->btps_scanPage == prev_scan_page &&
+		 btscan->btps_pageStatus == BTPARALLEL_IDLE) ||
+		unlikely(so->testPrimScan))
 	{
 		btscan->btps_scanPage = InvalidBlockNumber;
 		btscan->btps_pageStatus = BTPARALLEL_NEED_PRIMSCAN;
+		btscan->btps_arrElemsGen += 1;
 
 		/* Serialize scan's current array keys */
 		for (int i = 0; i < so->numArrayKeys; i++)
@@ -806,7 +883,20 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
 			btscan->btps_arrElems[i] = array->cur_elem;
 		}
 	}
+	/*
+	 * If the shared array keys are still those of the primitive scan we used
+	 * to access prev_scan_page, make a note that the next page may be a good
+	 * opportunity to start a new primitive scan.  Once marked, a worker will
+	 * not release the scan until it has processed its page and knows for
+	 * sure whether a new prim scan is needed.
+	 */
+	else if (btscan->btps_arrElemsGen == so->arrElemsGen)
+	{
+		btscan->btps_checkPrimScan = true;
+	}
 	SpinLockRelease(&btscan->btps_mutex);
+
+	so->testPrimScan = false;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fff7c89ead..8a9a0e6626 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1555,7 +1555,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  *
  * In the case of a parallel scan, caller must have called _bt_parallel_seize
  * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
+ * _bt_parallel_opt_release_early before returning.
  *
  * Returns true if any matching items found on the page, false if none.
  */
@@ -1590,7 +1590,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		else
 			pstate.prev_scan_page = BufferGetBlockNumber(so->currPos.buf);
 
-		_bt_parallel_release(scan, pstate.prev_scan_page);
+		_bt_parallel_opt_release_early(scan, pstate.prev_scan_page);
 	}
 
 	indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
@@ -1943,6 +1943,13 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
 	}
 
+	/*
+	 * If !continuescan, releasing will be (or has been) done by either
+	 * [_btp]_done or [_btp]_primscan_schedule.
+	 */
+	if (scan->parallel_scan && pstate.continuescan)
+		_bt_parallel_opt_release_late(scan, pstate.prev_scan_page);
+
 	return (so->currPos.firstItem <= so->currPos.lastItem);
 }
 
@@ -2218,7 +2225,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
 			else if (scan->parallel_scan != NULL)
 			{
 				/* allow next page be processed by parallel worker */
-				_bt_parallel_release(scan, opaque->btpo_next);
+				_bt_parallel_opt_release_early(scan, opaque->btpo_next);
 			}
 
 			/* nope, keep going */
@@ -2318,7 +2325,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
 			else if (scan->parallel_scan != NULL)
 			{
 				/* allow next page be processed by parallel worker */
-				_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+				_bt_parallel_opt_release_early(scan, BufferGetBlockNumber(so->currPos.buf));
 			}
 
 			/*
-- 
2.45.2

#2 Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#1)
Re: Limiting overshoot in nbtree's parallel SAOP index scans

On Fri, Oct 11, 2024 at 10:27 AM Matthias van de Meent
<boekewurm@gmail.com> wrote:

With the introduction of the new SAOP handling in PG17, however, the
shared state has become a bit more muddied. Because the default has
changed from "stop scanning at the end of a SAOP value's range" to
"continue scanning, unless ...", and because it is relatively
expensive to determine whether we need to stop or continue, we release
the parallel scan to continue before we know if a new primitive scan
would be preferred.

It's not just relatively expensive. It's essential that _bt_readpage
be able to release the parallel scan at the earliest opportunity (just
as soon as the next sibling link is known). Whereas a parallel backend
worker will only decide that it's likely a good idea to do another
primitive index scan when _bt_readpage has finished executing.

Clearly it would not be sensible to wait until _bt_readpage had gotten
as far as deciding on the need for another descent of the index. To do
so would essentially serialize the scan. There are clear advantages to
doing as little coordination as possible among backends. (I don't
think that you disagree with any of this, but just want to establish
context for anybody else that's reading along -- you seem to want to
do some limited/conditional form of this behavior.)

Problem
-------
The new parallel index scan code can only get into a new primitive
scan if no parallel worker has taken the next block in the time
between the call to _bt_parallel_release at the start of _bt_readpage,
and the call to _bt_parallel_primscan_schedule through _bt_checkkeys
slightly later in _bt_readpage. This makes the code susceptible to
race conditions

Do you have a test case?

I was aware of the issues in this area when designing the
_bt_parallel_primscan_schedule mechanism. I believe it's true that the
scan is only strictly guaranteed to be able to make forward progress
via naively scanning the next leaf page, again and again. This is a
fairly theoretical point, though.

I believe that it's practically impossible for this to cause us
problems. That would require a group of backends that all want to
start another primitive index scan, but each block each other out,
constantly. Not just once, mind you -- again and again (enough to be
noticeable). This would have to happen in spite of the fact that we
only release one backend per _bt_parallel_release call
(_bt_parallel_release uses ConditionVariableSignal under the hood,
which only releases the oldest worker waiting on the CV, not all
backends).

There is no signalling of condition variables within
_bt_parallel_primscan_schedule, either. So you just have this one call
to ConditionVariableSignal. Why should it be possible to get a
stampede of parallel backends? (I'm not 100% sure if your concern is
even theoretically valid.)

Given the above, a parallel index scan with ScanKey `id = ANY
(minvalue, maxvalue)` and bad enough timing/luck could thus scan the
whole index, while just 2 primitive index scans (the behaviour before
PG17) would be sufficient.

In practice it is quite unlikely that parallel index scans have such
bad luck, but I don't think luck or timing should be a big factor in
the performance picture of parallel index scans.

It's also possible for a hash table to degenerate when there are
excessively many hash collisions. This is certainly possible with
adversarial inputs -- it's not hard to write a test case that'll
generate some. So hash tables generally also "have luck as a big
factor", in about the same sense. No?

I don't think that it's fundamentally unreasonable to have concerns
about things like this, of course. Safety-critical avionics systems
tend to just not use hash tables at all.

I think I've found a reasonably elegant solution, which limits the
number of buffers for which we can fail to start a new primitive scan to the
number of parallel workers + 1: If we detect that a parallel worker
in the same primscan range thinks this is the right moment to start a
new primscan (but couldn't due to concurrent advancement), we don't
release the parallel scan immediately, but instead only release it
after reading the page's contents, to find out if we really should
start a new primitive scan. If so, we do that, and if not, we instead
mark that the primitive scan has reached a new primscan range, do some
cleanup, and then continue business as usual.

I'm opposed to this patch. It'll introduce a behavior that's more
difficult to reason about and test than the one that it purports to
fix.

--
Peter Geoghegan

#3 Matthias van de Meent
boekewurm@gmail.com
In reply to: Peter Geoghegan (#2)
Re: Limiting overshoot in nbtree's parallel SAOP index scans

On Wed, 16 Oct 2024 at 20:52, Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Oct 11, 2024 at 10:27 AM Matthias van de Meent
<boekewurm@gmail.com> wrote:

With the introduction of the new SAOP handling in PG17, however, the
shared state has become a bit more muddied. Because the default has
changed from "stop scanning at the end of a SAOP value's range" to
"continue scanning, unless ...", and because it is relatively
expensive to determine whether we need to stop or continue, we release
the parallel scan to continue before we know if a new primitive scan
would be preferred.

It's not just relatively expensive. It's essential that _bt_readpage
be able to release the parallel scan at the earliest opportunity (just
as soon as the next sibling link is known).

Why is this essential? AFAIK, that's only for performance reasons -
and it's also only for performance reasons that we start a new
primitive scan, so in my view there is a trade-off between releasing
the scan early (and making better use of parallel resources) and
scheduling a new primitive scan (to reduce overall resource usage).

Whereas a parallel backend
worker will only decide that it's likely a good idea to do another
primitive index scan when _bt_readpage has finished executing.

Correct.

Clearly it would not be sensible to wait until _bt_readpage had gotten
as far as deciding on the need for another descent of the index. To do
so would essentially serialize the scan. There are clear advantages to
doing as little coordination as possible among backends. (I don't
think that you disagree with any of this, but just want to establish
context for anybody else that's reading along -- you seem to want to
do some limited/conditional form of this behavior.)

In general, yes, but if we see strong signals that it may be worth the
wait, then why not?

Problem
-------
The new parallel index scan code can only get into a new primitive
scan if no parallel worker has taken the next block in the time
between the call to _bt_parallel_release at the start of _bt_readpage,
and the call to _bt_parallel_primscan_schedule through _bt_checkkeys
slightly later in _bt_readpage. This makes the code susceptible to
race conditions

Do you have a test case?

Yes, see below.

I was aware of the issues in this area when designing the
_bt_parallel_primscan_schedule mechanism. I believe it's true that the
scan is only strictly guaranteed to be able to make forward progress
via naively scanning the next leaf page, again and again. This is a
fairly theoretical point, though.

I believe that it's practically impossible for this to cause us
problems. That would require a group of backends that all want to
start another primitive index scan, but each block each other out,
constantly. Not just once, mind you -- again and again (enough to be
noticeable). This would have to happen in spite of the fact that we
only release one backend per _bt_parallel_release call
(_bt_parallel_release uses ConditionVariableSignal under the hood,
which only releases the oldest worker waiting on the CV, not all
backends).

I may not have communicated this well enough, but what happens in the
case at issue is as follows:

0. backend 1 has seized the scan; backend 2 is now waiting for another
page to process.
1. backend 1: _release
2. backend 1: start _readpage
3. backend 2: wake up
4. backend 2: _seize
5. backend 1: finish _readpage;
6. backend 1: _primscan_schedule -> fail to schedule.
... swap roles of backend 1 and 2, and restart at 0.

This is easy to replicate with a somewhat expensive compare operator,
one which takes enough CPU time that a signaled backend can wake up
and seize the scan before a page is processed in _readpage. Because
the pages contain no interesting tuples, the scan infrastructure won't
ever need to visit heap or VM pages or return any data from the scan
in this overshoot case, and thus won't easily get into a state where
backends stop seizing the scan for long enough that _primscan_schedule
actually does something - after _readpage finds out a new primitive
scan is needed, it immediately returns and is ready to _seize the scan
again.

A test case for this:

Schema and data population:
CREATE TABLE test (a int, b bigint);
INSERT INTO test (a, b) SELECT i % 101, i % 1001 FROM
generate_series(1, 10000000) i;
CREATE INDEX ON test (a, b);
DELETE FROM test WHERE a = 0 AND b = 1000;
DELETE FROM test WHERE a = 100 AND b = 0;
VACUUM (FREEZE, INDEX_CLEANUP on, ANALYZE) test;

This gets you an index of 9239 blocks.

Config to force actual parallel scan (note: debug_parallel_query
doesn't exhibit the issue due to single-worker processing of data :-(
):
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 1;
SET min_parallel_index_scan_size = 1;

Test query:
SELECT count(*) FROM test WHERE a IN (0, 100) AND b IN (0, 1000); -- count = 196

In v17 and the master branch you'll note 16 buffer hits for the test
query. However, when we use more expensive btree compare operations
(e.g. by adding pg_usleep(1) to both btint8cmp and btint4cmp), the
buffer access count starts to vary a lot and skyrockets to 30+ on my
machine, in some cases reaching >100 buffer hits. After applying my
patch, the buffer access count is capped to a much more agreeable
16-18 hits - it still shows signs of overshooting the serial bounds,
but the number of buffers by which we overshoot our target is capped and thus
significantly lower.

There is no signalling of condition variables within
_bt_parallel_primscan_schedule, either. So you just have this one call
to ConditionVariableSignal. Why should it be possible to get a
stampede of parallel backends? (I'm not 100% sure if your concern is
even theoretically valid.)

See above: the concern isn't necessarily a stampede, but just that in
some cases it is likely that a parallel worker is ready to seize the
scan before this backend's page is read to a large enough extent to
decide to schedule a new primitive scan, thus starving the scan of
opportunities to start new primscans.

Given the above, a parallel index scan with ScanKey `id = ANY
(minvalue, maxvalue)` and bad enough timing/luck could thus scan the
whole index, while just 2 primitive index scans (the behaviour before
PG17) would be sufficient.

In practice it is quite unlikely that parallel index scans have such
bad luck, but I don't think luck or timing should be a big factor in
the performance picture of parallel index scans.

It's also possible for a hash table to degenerate when there are
excessively many hash collisions. This is certainly possible with
adversarial inputs -- it's not hard to write a test case that'll
generate some. So hash tables generally also "have luck as a big
factor", in about the same sense. No?

For hash tables, the hash function is assumed to be really good or
perfect; and degeneration is expected when this assumption is wrong.
In btree parallel scans, we explicitly expect the scan to be seized
before we get to start a new primscan. And if that happens for one
page, why would we assume that that won't also happen on the next
page, or it's next page? Presumably, it's a rare occasion, but if it
happened then the timing must line up, and if it did line up once, why
would we assume it is a one-in-a-million case?

I'm opposed to this patch. It'll introduce a behavior that's more
difficult to reason about and test then the one that it purports to
fix.

I'm happy to clarify and document the updated behavior of the parallel
scan, to make reasoning easier. I'd prefer to discontinue the
reasoning of "we may scan the whole index in parallel SAOP scans" in
favor of "we may scan a limited amount of index pages more in parallel
SAOP scans than in serial SAOP scans".

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#4 Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#3)
Re: Limiting overshoot in nbtree's parallel SAOP index scans

On Wed, Oct 16, 2024 at 5:48 PM Matthias van de Meent
<boekewurm@gmail.com> wrote:

In v17 and the master branch you'll note 16 buffer hits for the test
query. However, when we use more expensive btree compare operations
(e.g. by adding pg_usleep(1) to both btint8cmp and btint4cmp), the
buffer access count starts to vary a lot and skyrockets to 30+ on my
machine, in some cases reaching >100 buffer hits. After applying my
patch, the buffer access count is capped to a much more agreeable
16-18 hits - it still shows signs of overshooting the serial bounds,
but the number of buffers by which we overshoot our target is capped and thus
significantly lower.

It's not exactly capped, though. Since in any case you're always prone
to getting extra leaf page reads at the end of each primitive index
scan. That's not something that's new to Postgres 17, though.

Anyway, I'm still not convinced. Your test case requires adding a one
second delay to each ORDER proc comparison, and so has an unrealistic
adversarial character. It uses an index-only scan that is drastically
faster if we don't use a parallel scan at all. The serial case is
0.046 ms for me, whereas the parallel case is 3.094 ms (obviously
that's without the addition of a 1 second delay). You've thrown
everything but the kitchen sink at the issue, and yet the impact on
buffer hits really isn't too bad.

Does anybody else have an opinion on this?

--
Peter Geoghegan

#5 Matthias van de Meent
boekewurm@gmail.com
In reply to: Peter Geoghegan (#4)
Re: Limiting overshoot in nbtree's parallel SAOP index scans

On Thu, 17 Oct 2024 at 00:33, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Oct 16, 2024 at 5:48 PM Matthias van de Meent
<boekewurm@gmail.com> wrote:

In v17 and the master branch you'll note 16 buffer hits for the test
query. However, when we use more expensive btree compare operations
(e.g. by adding pg_usleep(1) to both btint8cmp and btint4cmp), the
buffer access count starts to vary a lot and skyrockets to 30+ on my
machine, in some cases reaching >100 buffer hits. After applying my
patch, the buffer access count is capped to a much more agreeable
16-18 hits - it still shows signs of overshooting the serial bounds,
but the number of buffers by which we overshoot our target is capped and thus
significantly lower.

It's not exactly capped, though. Since in any case you're always prone
to getting extra leaf page reads at the end of each primitive index
scan. That's not something that's new to Postgres 17, though.

True, but the SAOP-enabled continued overshoot _is_ new: previously,
each backend would only get up to one additional buffer access for
every SAOP scan entry, while now it's only limited by outer SAOP
bounds and index size.

Anyway, I'm still not convinced. Your test case requires adding a one
second delay to each ORDER proc comparison,

microsecond. pg_usleep() takes microseconds as argument, and so
pg_usleep(1) should sleep for >=1 microsecond, but definitely much
less than 1 second.

and so has an unrealistic adversarial character.

I don't think it's too adversarial to expect some compare operations
for a page to take one microsecond, especially when that compare
operation is related to SAOPs. I think complex comparators like those
for jsonb or collated text can certainly take much more CPU time than
your usual integer compare operation, which can definitely be long
enough for other workers to seize the scan. It's just that I didn't
want to spend a lot of time building a dataset that would expose the
issue when a single and obvious modification can do the same.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#6 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Matthias van de Meent (#1)
1 attachment(s)
Re: Limiting overshoot in nbtree's parallel SAOP index scans

On Fri, 11 Oct 2024 at 16:27, Matthias van de Meent <boekewurm@gmail.com> wrote:

Hi,

With PG17's new SAOP handling, the performance of certain index scans
improved significantly in the serial case. However, for
parallel index scans the performance picture is not as
straightforward, and this has caused some issues earlier.

This is patch v2, now updated for HEAD to work with the newly added
skip scan changes, and with some added polish.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachments:

v2-0001-Avoid-full-parallel-btree-index-scans-in-skip-sca.patch (application/x-patch)
From 206383329cbfd8c1f00e5352fb358b847bc9b19d Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 11 Oct 2024 15:57:27 +0200
Subject: [PATCH v2] Avoid full parallel btree index scans in skip scans

Previously, we could ignore the skip signal until the end of the range of
values producible by the index scan key. Now, we can fail to start a new
primscan only for up to the number of parallel workers + 1 buffers, at the
cost of processing a full page before we release the parallel scan if
we detect that we may have overshot our previous skip scan's range with
the next startpoint not close in sight.

If we detect that a parallel worker in the same primscan range thinks
this is the right moment to start a new primitive scan, we don't release
the parallel scan immediately, but instead only release it after reading
the page's contents to find out if we really should start a new primitive
scan.  If so, we start that new primitive scan, and if instead we find
we've already skidded into the range of pages we would've arrived on with
the skip scan, we instead mark that the primitive scan has reached a new
primscan range, do some cleanup, and then continue the scan as usual.
---
 src/include/access/nbtree.h           |  14 ++-
 src/backend/access/nbtree/nbtree.c    | 126 +++++++++++++++++++++++---
 src/backend/access/nbtree/nbtsearch.c |  34 +++++--
 3 files changed, 151 insertions(+), 23 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588d3..49675dbf3d8 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1067,6 +1067,11 @@ typedef struct BTScanOpaqueData
 	FmgrInfo   *orderProcs;		/* ORDER procs for required equality keys */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
+	/* local state for coordinating skips in parallel scans */
+	bool		testPrimScan;	/* Are we trying to do a new primitive scan */
+	uint32		arrElemsGen;	/* Generation number of prim scan we want to
+								 * improve on */
+
 	/* info about killed items if any (killedItems is NULL if never used) */
 	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
@@ -1215,9 +1220,12 @@ extern StrategyNumber bttranslatecmptype(CompareType cmptype, Oid opfamily);
  */
 extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
 							   BlockNumber *last_curr_page, bool first);
-extern void _bt_parallel_release(IndexScanDesc scan,
-								 BlockNumber next_scan_page,
-								 BlockNumber curr_page);
+extern void _bt_parallel_opt_release_early(IndexScanDesc scan,
+										   BlockNumber next_scan_page,
+										   BlockNumber scan_page);
+extern void _bt_parallel_opt_release_late(IndexScanDesc scan,
+										  BlockNumber next_scan_page,
+										  BlockNumber curr_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_primscan_schedule(IndexScanDesc scan,
 										   BlockNumber curr_page);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 765659887af..d282d8f5b18 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -72,6 +72,9 @@ typedef struct BTParallelScanDescData
 	BTPS_State	btps_pageStatus;	/* indicates whether next page is
 									 * available for scan. see above for
 									 * possible states of parallel scan. */
+	uint32		btps_arrElemsGen;	/* number of new prim scan opportunities */
+	bool		btps_checkPrimScan;	/* did we skid past the most opportune
+									 * endpoint of a primitive scan? */
 	LWLock		btps_lock;		/* protects shared parallel state */
 	ConditionVariable btps_cv;	/* used to synchronize parallel scan */
 
@@ -356,6 +359,9 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->arrayKeys = NULL;
 	so->orderProcs = NULL;
 	so->arrayContext = NULL;
+	
+	so->testPrimScan = false;
+	so->arrElemsGen = 0;
 
 	so->killedItems = NULL;		/* until needed */
 	so->numKilled = 0;
@@ -731,6 +737,8 @@ btinitparallelscan(void *target)
 	bt_target->btps_nextScanPage = InvalidBlockNumber;
 	bt_target->btps_lastCurrPage = InvalidBlockNumber;
 	bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->btps_arrElemsGen = 0;
+	bt_target->btps_checkPrimScan = false;
 	ConditionVariableInit(&bt_target->btps_cv);
 }
 
@@ -757,13 +765,15 @@ btparallelrescan(IndexScanDesc scan)
 	btscan->btps_nextScanPage = InvalidBlockNumber;
 	btscan->btps_lastCurrPage = InvalidBlockNumber;
 	btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	btscan->btps_arrElemsGen = 0;
+	btscan->btps_checkPrimScan = false;
 	LWLockRelease(&btscan->btps_lock);
 }
 
 /*
  * _bt_parallel_seize() -- Begin the process of advancing the scan to a new
- *		page.  Other scans must wait until we call _bt_parallel_release()
- *		or _bt_parallel_done().
+ *		page.  Other scans must wait until we call _bt_parallel_done(),
+ *		[_btp]_opt_release_early/late(), or [_btp]_primscan_schedule().
  *
  * The return value is true if we successfully seized the scan and false
  * if we did not.  The latter case occurs when no pages remain, or when
@@ -835,6 +845,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
 		{
 			/* We're done with this parallel index scan */
 			status = false;
+			so->testPrimScan = false;
+			so->arrElemsGen = 0;
 		}
 		else if (btscan->btps_pageStatus == BTPARALLEL_IDLE &&
 				 btscan->btps_nextScanPage == P_NONE)
@@ -855,6 +867,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
 				/* Restore scan's array keys from serialized values */
 				_bt_parallel_restore_arrays(rel, btscan, so);
 				exit_loop = true;
+				so->arrElemsGen = btscan->btps_arrElemsGen;
+				so->testPrimScan = false;
 			}
 			else
 			{
@@ -884,6 +898,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
 			Assert(btscan->btps_nextScanPage != P_NONE);
 			*next_scan_page = btscan->btps_nextScanPage;
 			*last_curr_page = btscan->btps_lastCurrPage;
+
+			so->arrElemsGen = btscan->btps_arrElemsGen;
 			exit_loop = true;
 		}
 		LWLockRelease(&btscan->btps_lock);
@@ -901,7 +917,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
 }
 
 /*
- * _bt_parallel_release() -- Complete the process of advancing the scan to a
+ * _bt_parallel_opt_release_early() -- Complete the process of advancing the scan to a
  *		new page.  We now have the new value btps_nextScanPage; another backend
  *		can now begin advancing the scan.
  *
@@ -919,8 +935,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
  * scan can never change.
  */
 void
-_bt_parallel_release(IndexScanDesc scan, BlockNumber next_scan_page,
-					 BlockNumber curr_page)
+_bt_parallel_opt_release_early(IndexScanDesc scan, BlockNumber next_scan_page,
+							   BlockNumber curr_page)
 {
 	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
 	BTParallelScanDesc btscan;
@@ -931,11 +947,76 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber next_scan_page,
 												  parallel_scan->ps_offset_am);
 
 	LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
-	btscan->btps_nextScanPage = next_scan_page;
+	/*
+	 * If a parallel worker noticed that it had skipped past the end of a
+	 * primitive scan after another backend already acquired the parallel scan
+	 * status, we don't release the scan before reading the page's contents.
+	 * Instead, we merely flag locally that the scan must be released late.
+	 */
+	if (likely(!btscan->btps_checkPrimScan))
+	{
+		btscan->btps_nextScanPage = next_scan_page;
+		btscan->btps_lastCurrPage = curr_page;
+		btscan->btps_pageStatus = BTPARALLEL_IDLE;
+		LWLockRelease(&btscan->btps_lock);
+		ConditionVariableSignal(&btscan->btps_cv);
+	}
+	else
+	{
+		BTScanOpaque	so = (BTScanOpaque) scan->opaque;
+		so->testPrimScan = true;
+		LWLockRelease(&btscan->btps_lock);
+	}
+}
+
+/*
+ * _bt_parallel_opt_release_late() -- Complete the process of advancing the
+ *		scan to a new page.
+ *
+ * We're only called when a concurrent backend wanted to schedule a skip scan,
+ * but failed to do so because the parallel scan already advanced past its
+ * own page.
+ */
+void
+_bt_parallel_opt_release_late(IndexScanDesc scan, BlockNumber next_scan_page,
+							  BlockNumber curr_page)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTParallelScanDesc btscan;
+
+	if (!so->testPrimScan)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset_am);
+
+	LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
+	Assert(btscan->btps_checkPrimScan);
 	btscan->btps_lastCurrPage = curr_page;
+	btscan->btps_nextScanPage = next_scan_page;
 	btscan->btps_pageStatus = BTPARALLEL_IDLE;
+
+	/*
+	 * A late release implies that
+	 *
+	 * 1) a concurrent backend noticed we should've started a new primitive
+	 * scan, but that
+	 * 
+	 * 2) parallel workers already read far enough ahead that the current
+	 * scan position is already at (or past) the point where that next
+	 * primitive index scan would've positioned itself.
+	 *
+	 * So, we do what executing a primitive scan would've done with the shared
+	 * state: we increase the generation number, unset checkPrimScan, and
+	 * continue on as normal.
+	 */
+	btscan->btps_checkPrimScan = false;
+	btscan->btps_arrElemsGen += 1;
 	LWLockRelease(&btscan->btps_lock);
 	ConditionVariableSignal(&btscan->btps_cv);
+
+	so->testPrimScan = false;
 }
 
 /*
@@ -952,6 +1033,7 @@ _bt_parallel_done(IndexScanDesc scan)
 	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
 	BTParallelScanDesc btscan;
 	bool		status_changed = false;
+	so->testPrimScan = false;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -990,10 +1072,10 @@ _bt_parallel_done(IndexScanDesc scan)
 /*
  * _bt_parallel_primscan_schedule() -- Schedule another primitive index scan.
  *
- * Caller passes the curr_page most recently passed to _bt_parallel_release
- * by its backend.  Caller successfully schedules the next primitive index scan
- * if the shared parallel state hasn't been seized since caller's backend last
- * advanced the scan.
+ * Caller passes the curr_page most recently passed to
+ * _bt_parallel_opt_release_early by its backend.  Caller successfully
+ * schedules the next primitive index scan if the shared parallel state hasn't
+ * been seized since caller's backend last advanced the scan.
  */
 void
 _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber curr_page)
@@ -1009,17 +1091,37 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber curr_page)
 												  parallel_scan->ps_offset_am);
 
 	LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
-	if (btscan->btps_lastCurrPage == curr_page &&
-		btscan->btps_pageStatus == BTPARALLEL_IDLE)
+	if ((btscan->btps_lastCurrPage == curr_page &&
+		 btscan->btps_pageStatus == BTPARALLEL_IDLE) ||
+		unlikely(so->testPrimScan))
 	{
+		/*
+		 * When called with so->testPrimScan, we had not yet released the page
+		 * for parallel work, so the page state should still be ADVANCING.
+		 */
+		Assert(!so->testPrimScan || btscan->btps_pageStatus == BTPARALLEL_ADVANCING);
 		btscan->btps_nextScanPage = InvalidBlockNumber;
 		btscan->btps_lastCurrPage = InvalidBlockNumber;
 		btscan->btps_pageStatus = BTPARALLEL_NEED_PRIMSCAN;
+		btscan->btps_arrElemsGen += 1;
 
 		/* Serialize scan's current array keys */
 		_bt_parallel_serialize_arrays(rel, btscan, so);
 	}
+	/*
+	 * If the shared array keys are still those of the primitive scan we used
+	 * to access curr_page, make a note that the next page may be a good
+	 * opportunity to start a new primitive scan.  Once marked, a worker will
+	 * not release the scan until it has processed its page and knows for
+	 * sure whether a new prim scan is needed.
+	 */
+	else if (btscan->btps_arrElemsGen == so->arrElemsGen)
+	{
+		btscan->btps_checkPrimScan = true;
+	}
 	LWLockRelease(&btscan->btps_lock);
+
+	so->testPrimScan = false;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fe9a3886913..2c3294f9cc7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1585,7 +1585,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  *
  * In the case of a parallel scan, caller must have called _bt_parallel_seize
  * prior to calling this function; this function will invoke
- * _bt_parallel_release before returning.
+ * _bt_parallel_opt_release_early before returning.
  *
  * Returns true if any matching items found on the page, false if none.
  */
@@ -1619,11 +1619,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	{
 		/* allow next/prev page to be read by other worker without delay */
 		if (ScanDirectionIsForward(dir))
-			_bt_parallel_release(scan, so->currPos.nextPage,
-								 so->currPos.currPage);
+			_bt_parallel_opt_release_early(scan, so->currPos.nextPage,
+										   so->currPos.currPage);
 		else
-			_bt_parallel_release(scan, so->currPos.prevPage,
-								 so->currPos.currPage);
+			_bt_parallel_opt_release_early(scan, so->currPos.prevPage,
+										   so->currPos.currPage);
 	}
 
 	/* initialize remaining currPos fields related to current page */
@@ -1993,6 +1993,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	 */
 	Assert(!pstate.forcenonrequired);
 
+	/*
+	 * If !continuescan, releasing will be (or has been) done by either
+	 * _bt_p*_done or _bt_p*_primscan_schedule.
+	 */
+	if (scan->parallel_scan && pstate.continuescan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_opt_release_late(scan, so->currPos.nextPage,
+										  so->currPos.currPage);
+		else
+			_bt_parallel_opt_release_late(scan, so->currPos.prevPage,
+										  so->currPos.currPage);
+	}
+
 	return (so->currPos.firstItem <= so->currPos.lastItem);
 }
 
@@ -2213,7 +2227,8 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  * records in the given direction.
  *
  * We always release the scan for a parallel scan caller, regardless of
- * success or failure; we'll call _bt_parallel_release as soon as possible.
+ * success or failure; we'll call _bt_parallel_opt_release_early as soon as
+ * possible.
  */
 static bool
 _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
@@ -2301,7 +2316,8 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
  * locks and pins, invalidate so->currPos, and return false.
  *
  * We always release the scan for a parallel scan caller, regardless of
- * success or failure; we'll call _bt_parallel_release as soon as possible.
+ * success or failure; we'll call _bt_parallel_opt_release_early as soon as
+ * possible.
  */
 static bool
 _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
@@ -2398,8 +2414,10 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
 				blkno = opaque->btpo_next;
 			else
 				blkno = opaque->btpo_prev;
+
+			/* allow next page be processed by parallel worker */
 			if (scan->parallel_scan != NULL)
-				_bt_parallel_release(scan, blkno, lastcurrblkno);
+				_bt_parallel_opt_release_early(scan, blkno, lastcurrblkno);
 		}
 
 		/* no matching tuples on this page */
-- 
2.45.2

#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: Matthias van de Meent (#5)
Re: Limiting overshoot in nbtree's parallel SAOP index scans

On Thu, Oct 17, 2024 at 5:03 AM Matthias van de Meent
<boekewurm@gmail.com> wrote:

On Thu, 17 Oct 2024 at 00:33, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Oct 16, 2024 at 5:48 PM Matthias van de Meent
<boekewurm@gmail.com> wrote:

In v17 and the master branch you'll note 16 buffer hits for the test
query. However, when we use more expensive btree compare operations
(e.g. by adding pg_usleep(1) to both btint8cmp and btint4cmp), the
buffer access count starts to vary a lot and skyrockets to 30+ on my
machine, in some cases reaching >100 buffer hits. After applying my
patch, the buffer access count is capped to a much more agreeable
16-18 hits - it still shows signs of overshooting the serial bounds,
but the number of buffers by which we overshoot our target is capped and thus
significantly lower.

It's not exactly capped, though. Since in any case you're always prone
to getting extra leaf page reads at the end of each primitive index
scan. That's not something that's new to Postgres 17, though.

True, but the SAOP-enabled continued overshoot _is_ new: previously,
each backend would only get up to one additional buffer access for
every SAOP scan entry, while now it's only limited by outer SAOP
bounds and index size.

IIUC, the problem you are trying to solve is that we will miss
primitive index scans in case of parallel index scans. The problem can
happen when the second parallel backend process (say b-2) has seized
the page after the first parallel backend (say b-1) has released the
current page and before b-1 could check whether it can schedule
another primitive index scan. And you want to fix it by delaying the
release of the current page by b-1 in some cases; if so, then it has
some downsides as well, as pointed out by Peter. Is my understanding
correct? If so, then we may also need to prove that releasing the page
at a later point doesn't harm any other cases.

I am not sure if I understood the problem completely, so can you
please explain in a bit more layman's language how it works before
17 and after 17?

BTW, I also want to clarify my understanding of primitive index scans
and how it has changed in PG17, is it related to how we optimize SAOP
scans by reducing the number of leaf page scans?

--
With Regards,
Amit Kapila.

#8 Matthias van de Meent
boekewurm@gmail.com
In reply to: Amit Kapila (#7)
Re: Limiting overshoot in nbtree's parallel SAOP index scans

On Wed, 14 May 2025 at 08:35, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 17, 2024 at 5:03 AM Matthias van de Meent
<boekewurm@gmail.com> wrote:

On Thu, 17 Oct 2024 at 00:33, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Oct 16, 2024 at 5:48 PM Matthias van de Meent
<boekewurm@gmail.com> wrote:

In v17 and the master branch you'll note 16 buffer hits for the test
query. However, when we use more expensive btree compare operations
(e.g. by adding pg_usleep(1) to both btint8cmp and btint4cmp), the
buffer access count starts to vary a lot and skyrockets to 30+ on my
machine, in some cases reaching >100 buffer hits. After applying my
patch, the buffer access count is capped to a much more agreeable
16-18 hits - it still shows signs of overshooting the serial bounds,
but the number of buffers by which we overshoot our target is capped and thus
significantly lower.

It's not exactly capped, though. Since in any case you're always prone
to getting extra leaf page reads at the end of each primitive index
scan. That's not something that's new to Postgres 17, though.

True, but the SAOP-enabled continued overshoot _is_ new: previously,
each backend would only get up to one additional buffer access for
every SAOP scan entry, while now it's only limited by outer SAOP
bounds and index size.

IIUC, the problem you are trying to solve is that we will miss
primitive index scans in case of parallel index scans.

Correct - we may continuously miss the opportunity to start a new
primitive index scan when processing the contents of a page takes
longer than the time it takes for another backend to get woken up and
acquire the shared scan state.

The problem can
happen when the second parallel backend process (say b-2) has seized
the page after the first parallel backend (say b-1) has released the
current page and before b-1 could check whether it can schedule
another primitive index scan.

Correct.

And you want to fix it by delaying the
release of current page by b-1 in some cases, if so, then it has some
downsides as well as pointed out by Peter. Is my understanding
correct?

Yes, exactly.

If so, then we may also need to prove that releasing the page
at a later point doesn't harm any other cases.

There is no way to prove it won't harm other cases, as this is based
on heuristics and assumptions; it avoids harm only insofar as those
assumptions hold.

The assumption behind the patch is that if you've missed the
opportinity to start a new primitive index scan twice in a row, then
it's likely you'll miss the same chance the next time too, and the one
after that.

Furthermore, the skip scan architecture itself is based on heuristics
where the assumption is that starting a new primitive scan is cheaper
than scanning the index to the next matching element.

Finally, missing the opportunity to start a new skip scan requires
that at least 1 other worker is in approximately the same state as we
are: the other backend that continued the parallel index scan, which
made us miss the opportunity to start a new primitive index scan.

In this case, we know the parallel scan has already spent 2 page
scans' worth of time waiting for a new primitive scan. If our first
assumption holds, then by keeping the parallel scan we'll only block
the other backends from doing work that is wasteful for at most for
the duration of processing one page. If the assumption is false, then
we'll have lost the time of at most n_workers backends for that same
duration - workers that probably had pages with data they still had to
process.

I am not sure if I understood the problem completely, so can you
please explain it in a bit more layman's language, how it works before
17 and after 17?

Before PG17, every IN()-list item had its own primitive index scan
(after deduplication of the elements).
This caused excess primitive scans for elements that were on the
same page (hence the change in PG17), but guaranteed separate
primitive scans for elements that were far apart, limiting pages
accessed to at most [SAOP scan count * (pages_with_values_per_primscan
+ n_workers)].

Starting with PG17, the index scan is (in principle) a single
primitive scan, unless 1.) a backend finds that a new primitive scan
is needed, and 2.) that backend can signal the shared state about this
before the scan state can be acquired by another worker.
This guarantees we don't use a separate primitive scan for every
element in the scan when they're located on the same index pages, but
could mean we'd scan the whole index when someone issued an index scan
for = ANY(INT_MIN, INT_MAX) if the 2nd condition for starting a new
primitive scan is always false.

BTW, I also want to clarify my understanding of primitive index scans
and how it has changed in PG17, is it related to how we optimize SAOP
scans by reducing the number of leaf page scans?

Exactly, yes.

I hope this clarified a few items.

Kind regards,

Matthias van de Meent
Databricks