From 167a36a6f38383a493cea88ba574a498e4b37dce Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Fri, 11 Jul 2025 09:05:45 -0400
Subject: [PATCH v6 2/2] Remove the buffer_strategy_lock and make the clock
 hand a 64-bit atomic

Change nextVictimBuffer to an atomic uint64 and simply atomically
increment it by 1 at each tick.  The next victim buffer is the value of
nextVictimBuffer modulo the number of buffers (NBuffers).  Modulo can be
expensive, so we compute it as if NBuffers were required to be a power
of 2 and account for the difference.  Because nextVictimBuffer is only
ever incremented, it now encodes enough information to provide the
number of completed passes of the clock-sweep algorithm as well.  This
eliminates the need for a separate counter and its related maintenance.
While wrap-around of nextVictimBuffer would require at least 200 years
on today's hardware, should it happen BgBufferSync will properly
determine the delta of passes.

With the freelist and completePasses removed, none of the remaining
items in the BufferStrategyControl structure require strict
coordination, so the buffer_strategy_lock can be eliminated as well.
---
 src/backend/storage/buffer/README     |  48 ++++---
 src/backend/storage/buffer/bufmgr.c   |  20 ++-
 src/backend/storage/buffer/freelist.c | 176 +++++++++++++-------------
 src/backend/storage/buffer/localbuf.c |   2 +-
 src/include/storage/buf_internals.h   |   4 +-
 5 files changed, 131 insertions(+), 119 deletions(-)

diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index cd52effd911..d1ab222eeb8 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -127,11 +127,10 @@ bits of the tag's hash value.  The rules stated above apply to each partition
 independently.  If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
-* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that select buffers for replacement.  A spinlock is
-used here rather than a lightweight lock for efficiency; no other locks of any
-sort should be acquired while buffer_strategy_lock is held.  This is essential
-to allow buffer replacement to happen in multiple backends with reasonable
+* Operations that select buffers for replacement no longer require a lock;
+instead they use atomic operations to coordinate across backends when
+accessing members of the BufferStrategyControl data structure.  This allows
+buffer replacement to happen in multiple backends with reasonable
 concurrency.
 
 * Each buffer header contains a spinlock that must be taken when examining
@@ -158,9 +157,9 @@ unset by sleeping on the buffer's condition variable.
 Normal Buffer Replacement Strategy
 ----------------------------------
 
-To choose a victim buffer to recycle when there are no free buffers available,
-we use a simple clock-sweep algorithm, which avoids the need to take
-system-wide locks during common operations.  It works like this:
+To choose a victim buffer to recycle we use a simple clock-sweep algorithm,
+which avoids the need to take system-wide locks during common operations.  It
+works like this:
 
 Each buffer header contains a usage counter, which is incremented (up to a
 small limit value) whenever the buffer is pinned.  (This requires only the
@@ -168,19 +167,17 @@ buffer header spinlock, which would have to be taken anyway to increment the
 buffer reference count, so it's nearly free.)
 
 The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
-through all the available buffers.  nextVictimBuffer is protected by the
-buffer_strategy_lock.
+through all the available buffers.  nextVictimBuffer is a 64-bit atomic
+counter; the hand's position and the completed-pass count are derived from it.
 
 The algorithm for a process that needs to obtain a victim buffer is:
 
-1. Obtain buffer_strategy_lock.
+1. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time.
 
-2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
-nextVictimBuffer for next time. Release buffer_strategy_lock.
-
-3. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used.  Decrement its usage count (if nonzero), reacquire
-buffer_strategy_lock, and return to step 3 to examine the next buffer.
+2. If the selected buffer is pinned or has a nonzero usage count, it cannot be
+used.  Decrement its usage count (if nonzero) and return to step 1 to examine
+the next buffer.
 
-4. Pin the selected buffer, and return.
+3. Pin the selected buffer, and return.
 
@@ -196,9 +193,9 @@ Buffer Ring Replacement Strategy
 When running a query that needs to access a large number of pages just once,
 such as VACUUM or a large sequential scan, a different strategy is used.
 A page that has been touched only by such a scan is unlikely to be needed
-again soon, so instead of running the normal clock sweep algorithm and
+again soon, so instead of running the normal clock-sweep algorithm and
 blowing out the entire buffer cache, a small ring of buffers is allocated
-using the normal clock sweep algorithm and those buffers are reused for the
+using the normal clock-sweep algorithm and those buffers are reused for the
 whole scan.  This also implies that much of the write traffic caused by such
 a statement will be done by the backend itself and not pushed off onto other
 processes.
@@ -244,13 +241,12 @@ nextVictimBuffer (which it does not change!), looking for buffers that are
 dirty and not pinned nor marked with a positive usage count.  It pins,
 writes, and releases any such buffer.
 
-If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take buffer_strategy_lock in order to look
-for buffers to write; it needs only to spinlock each buffer header for long
-enough to check the dirtybit.  Even without that assumption, the writer
-only needs to take the lock long enough to read the variable value, not
-while scanning the buffers.  (This is a very substantial improvement in
-the contention cost of the writer compared to PG 8.0.)
+Reading nextVictimBuffer is an atomic operation, so the writer does not
+need to take any lock in order to look for buffers to write; it needs only
+to spinlock each buffer header for long enough to check the dirtybit.  The
+writer never holds any lock while scanning the buffers.  (This is a very
+substantial improvement in the contention cost of the writer compared to
+PG 8.0.)
 
 The background writer takes shared content lock on a buffer while writing it
 out (and anyone else who flushes buffer contents to disk must do so too).
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index af5ef025229..0be6f4d8c80 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3593,7 +3593,7 @@ BufferSync(int flags)
  * This is called periodically by the background writer process.
  *
  * Returns true if it's appropriate for the bgwriter process to go into
- * low-power hibernation mode.  (This happens if the strategy clock sweep
+ * low-power hibernation mode.  (This happens if the strategy clock-sweep
  * has been "lapped" and no buffer allocations have occurred recently,
  * or if the bgwriter has been effectively disabled by setting
  * bgwriter_lru_maxpages to 0.)
@@ -3643,7 +3643,7 @@ BgBufferSync(WritebackContext *wb_context)
 	uint32		new_recent_alloc;
 
 	/*
-	 * Find out where the clock sweep currently is, and how many buffer
+	 * Find out where the clock-sweep currently is, and how many buffer
 	 * allocations have happened since our last call.
 	 */
 	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
@@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
 
 	/*
 	 * Compute strategy_delta = how many buffers have been scanned by the
-	 * clock sweep since last time.  If first time through, assume none. Then
-	 * see if we are still ahead of the clock sweep, and if so, how many
+	 * clock-sweep since last time.  If first time through, assume none. Then
+	 * see if we are still ahead of the clock-sweep, and if so, how many
 	 * buffers we could scan before we'd catch up with it and "lap" it. Note:
 	 * weird-looking coding of xxx_passes comparisons are to avoid bogus
 	 * behavior when the passes counts wrap around.
 	 */
 	if (saved_info_valid)
 	{
-		int32		passes_delta = strategy_passes - prev_strategy_passes;
+		int32		passes_delta;
+
+		if (unlikely(prev_strategy_passes > strategy_passes))
+		{
+			/* wrap-around case; the +1 covers the step from UINT32_MAX to 0 */
+			passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes + 1);
+		}
+		else
+		{
+			passes_delta = (int32) (strategy_passes - prev_strategy_passes);
+		}
 
 		strategy_delta = strategy_buf_id - prev_strategy_buf_id;
 		strategy_delta += (long) passes_delta * NBuffers;
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 162c140fb9d..0b49d178362 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "port/pg_bitutils.h"
 #include "storage/buf_internals.h"
@@ -29,21 +30,17 @@
  */
 typedef struct
 {
-	/* Spinlock: protects the values below */
-	slock_t		buffer_strategy_lock;
-
 	/*
-	 * Clock sweep hand: index of next buffer to consider grabbing. Note that
-	 * this isn't a concrete buffer - we only ever increase the value. So, to
-	 * get an actual buffer, it needs to be used modulo NBuffers.
+	 * This serves as both the clock-sweep hand and the number of complete
+	 * passes through the buffer pool: the hand is the counter modulo
+	 * NBuffers, and the pass count is the counter divided by NBuffers.
 	 */
-	pg_atomic_uint32 nextVictimBuffer;
+	pg_atomic_uint64 nextVictimBuffer;
 
 	/*
 	 * Statistics.  These counters should be wide enough that they can't
 	 * overflow during a single bgwriter cycle.
 	 */
-	uint32		completePasses; /* Complete cycles of the clock sweep */
 	pg_atomic_uint32 numBufferAllocs;	/* Buffers allocated since last reset */
 
 	/*
@@ -83,12 +80,71 @@ typedef struct BufferAccessStrategyData
 	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
 }			BufferAccessStrategyData;
 
+static uint32 NBuffersPow2Mask; /* (next power-of-2 >= NBuffers) - 1 */
+static uint32 NBuffersPow2Shift;	/* Amount to bitshift for division */
+static uint32 NBuffersPerCycle; /* Complete passes per power-of-2 cycle */
 
 /* Prototypes for internal functions */
 static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
 									 uint32 *buf_state);
 static void AddBufferToRing(BufferAccessStrategy strategy,
 							BufferDesc *buf);
+static inline uint32 clock_passes(uint64 hand);
+static inline uint32 clock_read(uint64 hand);
+
+/*
+ * Calculate the number of complete passes through the buffer pool that
+ * have happened thus far.  A "pass" is the clock hand moving through all
+ * NBuffers buffers in the pool once.  The clock hand is a 64-bit counter
+ * that only increases, so the number of passes is the counter divided by
+ * NBuffers, computed here with shifts and masks instead of a division.
+ */
+static inline uint32
+clock_passes(uint64 hand)
+{
+	uint32		result;
+
+	/* Calculate complete next power-of-2 cycles by bitshifting */
+	uint64		pow2_passes = hand >> NBuffersPow2Shift;
+
+	/* Determine the hand's current position in the cycle */
+	uint64		masked_hand = hand & NBuffersPow2Mask;
+
+	/* Has the hand passed NBuffers yet? */
+	uint32		extra_passes = (masked_hand >= NBuffers) ? 1 : 0;
+
+	/*
+	 * Combine total passes, multiply complete power-of-2 cycles by passes
+	 * per-cycle, then add any extra pass from the current incomplete cycle.
+	 */
+	result = (uint32) (pow2_passes * NBuffersPerCycle) + extra_passes;
+
+	Assert(NBuffers != (int) (NBuffersPow2Mask + 1) ||
+		   result == ((uint32) (hand / NBuffers)));
+
+	return result;
+}
+
+/*
+ * The hand's value is a 64-bit counter that only increases, so its
+ * position is the counter modulo NBuffers.  To avoid the modulo operation
+ * we mask with the next power of 2 minus one and adjust for the
+ * difference.
+ */
+static inline uint32
+clock_read(uint64 hand)
+{
+	/* Determine the hand's current position in the cycle */
+	uint32		result = (uint32) (hand & NBuffersPow2Mask);
+
+	/* Adjust if the next power of 2 masked counter is more than NBuffers */
+	if (result >= NBuffers)
+		result -= NBuffers;
+
+	Assert(result < (uint32) NBuffers);
+
+	return result;
+}
 
 /*
  * ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -99,6 +155,7 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
 static inline uint32
 ClockSweepTick(void)
 {
+	uint64		hand;
 	uint32		victim;
 
 	/*
@@ -106,52 +163,11 @@ ClockSweepTick(void)
 	 * doing this, this can lead to buffers being returned slightly out of
 	 * apparent order.
 	 */
-	victim =
-		pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
+	hand = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);
+	victim = clock_read(hand);
 
-	if (victim >= NBuffers)
-	{
-		uint32		originalVictim = victim;
-
-		/* always wrap what we look up in BufferDescriptors */
-		victim = victim % NBuffers;
+	Assert(victim < NBuffers);
 
-		/*
-		 * If we're the one that just caused a wraparound, force
-		 * completePasses to be incremented while holding the spinlock. We
-		 * need the spinlock so StrategySyncStart() can return a consistent
-		 * value consisting of nextVictimBuffer and completePasses.
-		 */
-		if (victim == 0)
-		{
-			uint32		expected;
-			uint32		wrapped;
-			bool		success = false;
-
-			expected = originalVictim + 1;
-
-			while (!success)
-			{
-				/*
-				 * Acquire the spinlock while increasing completePasses. That
-				 * allows other readers to read nextVictimBuffer and
-				 * completePasses in a consistent manner which is required for
-				 * StrategySyncStart().  In theory delaying the increment
-				 * could lead to an overflow of nextVictimBuffers, but that's
-				 * highly unlikely and wouldn't be particularly harmful.
-				 */
-				SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
-				wrapped = expected % NBuffers;
-
-				success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
-														 &expected, wrapped);
-				if (success)
-					StrategyControl->completePasses++;
-				SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-			}
-		}
-	}
 	return victim;
 }
 
@@ -193,10 +209,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 
 	*from_ring = false;
 
-	/*
-	 * If given a strategy object, see whether it can select a buffer. We
-	 * assume strategy objects don't need buffer_strategy_lock.
-	 */
+	/* If given a strategy object, see whether it can select a buffer */
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy, buf_state);
@@ -241,7 +254,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	 */
 	pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
 
-	/* Use the "clock sweep" algorithm to find a free buffer */
+	/* Use the "clock-sweep" algorithm to find a free buffer */
 	trycounter = NBuffers;
 	for (;;)
 	{
@@ -297,32 +310,25 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
  * allocs if non-NULL pointers are passed.  The alloc count is reset after
  * being read.
  */
-int
+uint32
 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
-	uint32		nextVictimBuffer;
-	int			result;
+	uint64		counter;
+	uint32		result;
 
-	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-	nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
-	result = nextVictimBuffer % NBuffers;
+	counter = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
+	result = clock_read(counter);
 
 	if (complete_passes)
 	{
-		*complete_passes = StrategyControl->completePasses;
-
-		/*
-		 * Additionally add the number of wraparounds that happened before
-		 * completePasses could be incremented. C.f. ClockSweepTick().
-		 */
-		*complete_passes += nextVictimBuffer / NBuffers;
+		*complete_passes = clock_passes(counter);
 	}
 
 	if (num_buf_alloc)
 	{
 		*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
 	}
-	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
 	return result;
 }
 
@@ -337,21 +343,14 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 void
 StrategyNotifyBgWriter(int bgwprocno)
 {
-	/*
-	 * We acquire buffer_strategy_lock just to ensure that the store appears
-	 * atomic to StrategyGetBuffer.  The bgwriter should call this rather
-	 * infrequently, so there's no performance penalty from being safe.
-	 */
-	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	StrategyControl->bgwprocno = bgwprocno;
-	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 }
 
 
 /*
  * StrategyShmemSize
  *
- * estimate the size of shared memory used by the freelist-related structures.
+ * Estimate the size of shared memory used by the freelist-related structures.
  *
  * Note: for somewhat historical reasons, the buffer lookup hashtable size
  * is also determined here.
@@ -404,18 +403,25 @@ StrategyInitialize(bool init)
 
 	if (!found)
 	{
+		uint32		NBuffersPow2;
+
 		/*
 		 * Only done once, usually in postmaster
 		 */
 		Assert(init);
 
-		SpinLockInit(&StrategyControl->buffer_strategy_lock);
-
-		/* Initialize the clock sweep pointer */
-		pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+		/* Initialize combined clock-sweep pointer/complete passes counter */
+		pg_atomic_init_u64(&StrategyControl->nextVictimBuffer, 0);
+		/* Find the smallest power of 2 larger than NBuffers */
+		NBuffersPow2 = pg_nextpower2_32(NBuffers);
+		/* Using that, find the number of positions to shift for division */
+		NBuffersPow2Shift = pg_leftmost_one_pos32(NBuffersPow2);
+		/* Passes per power-of-2 cycle; 1, since NBuffersPow2 < 2 * NBuffers */
+		NBuffersPerCycle = NBuffersPow2 / NBuffers;
+		/* The bitmask to extract the lower portion of the clock */
+		NBuffersPow2Mask = NBuffersPow2 - 1;
 
 		/* Clear statistics */
-		StrategyControl->completePasses = 0;
 		pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
 
 		/* No pending notification */
@@ -659,7 +665,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	 *
 	 * If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
 	 * since our own previous usage of the ring element would have left it
-	 * there, but it might've been decremented by clock sweep since then). A
+	 * there, but it might've been decremented by clock-sweep since then). A
 	 * higher usage_count indicates someone else has touched the buffer, so we
 	 * shouldn't re-use it.
 	 */
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3da9c41ee1d..7a34f5e430a 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -229,7 +229,7 @@ GetLocalVictimBuffer(void)
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
 	/*
-	 * Need to get a new buffer.  We use a clock sweep algorithm (essentially
+	 * Need to get a new buffer.  We use a clock-sweep algorithm (essentially
 	 * the same as what freelist.c does now...)
 	 */
 	trycounter = NLocBuffer;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index d4449e11384..f2283ea8e22 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -81,7 +81,7 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
  * accuracy and speed of the clock-sweep buffer management algorithm.  A
  * large value (comparable to NBuffers) would approximate LRU semantics.
  * But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of
- * clock sweeps to find a free buffer, so in practice we don't want the
+ * clock-sweeps to find a free buffer, so in practice we don't want the
  * value to be very large.
  */
 #define BM_MAX_USAGE_COUNT	5
@@ -439,7 +439,7 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 								 BufferDesc *buf, bool from_ring);
 
-extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
 extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
-- 
2.49.0

