[PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Hello,
In recent conversations [1] about how best to adapt the code to become NUMA-aware, Andres commented, "FWIW, I've started to wonder if we shouldn't just get rid of the freelist entirely" and, because I'm a glutton for punishment (and I think the idea has some merit), I took him up on this task.
In freelist.c, StrategyGetBuffer() currently tries first to use the BufferAccessStrategy, if present, via GetBufferFromRing(). Failing that, the second step is to check the "freelist", as defined by StrategyControl->firstFreeBuffer/lastFreeBuffer, for any available buffers, and finally it will "Use the clock-sweep algorithm to find a free buffer." The freelist was intended to be "a list of buffers that are prime candidates for replacement", but I question its value versus the overhead of managing it. Without the list, some operations are likely faster (I plan to measure this) because, as Andres points out in [1], "just using clock sweep actually makes things like DROP TABLE perform better because it doesn't need to maintain the freelist anymore." It may be that with a very large NBuffers, most of which are in use, this approach is slower (I plan to measure this too), but in those cases I'd imagine the freelist is likely empty anyway, so the code would fall through to the clock-sweep algorithm regardless; I'm not sure there is a performance penalty at all.
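To make that concrete, here is a condensed sketch (not the patch text itself) of what StrategyGetBuffer()'s selection logic reduces to once the freelist probe is gone; the ring check and the clock-sweep loop already exist in freelist.c, only the freelist step between them goes away:

    /* 1. If given a strategy object, see whether its ring can supply a buffer. */
    if (strategy != NULL)
    {
        buf = GetBufferFromRing(strategy, buf_state);
        if (buf != NULL)
        {
            *from_ring = true;
            return buf;
        }
    }

    /* 2. No ring hit (and no freelist anymore): run the clock-sweep. */
    trycounter = NBuffers;
    for (;;)
    {
        buf = GetBufferDescriptor(ClockSweepTick());
        local_buf_state = LockBufHdr(buf);

        if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
        {
            if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
            {
                /* recently used: decay its usage count and keep sweeping */
                local_buf_state -= BUF_USAGECOUNT_ONE;
                trycounter = NBuffers;
            }
            else
            {
                /* unpinned and usage_count == 0: this is our victim */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                *buf_state = local_buf_state;
                return buf;
            }
        }
        else if (--trycounter == 0)
        {
            /* every buffer was pinned during a full revolution: give up */
            UnlockBufHdr(buf, local_buf_state);
            elog(ERROR, "no unpinned buffers available");
        }
        UnlockBufHdr(buf, local_buf_state);
    }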
This change does remove the have_free_buffer() function used by the contrib/pg_prewarm module. On the surface this doesn't seem to cause any issues, but honestly I've not thought too deeply on this one.
v2-0001 Eliminate the freelist from the buffer manager and depend on clock-sweep.
Once removed [2] and tests passing [3], I took a long hard look at the buffer_strategy_lock that used to serialize concurrent access to members of BufferStrategyControl, and I couldn't find a good reason to keep it around. Let's review what it guards:
completePasses: a count of the number of times the clock-sweep hand wraps around. StrategySyncStart() provides this to the bgwriter, which in turn uses it to compute a strategic location at which to start scanning for pages to evict. There's an interesting comment there that states a "requirement" and then undercuts it with "but that's highly unlikely and wouldn't be particularly harmful." I tried to find a reason that nextVictimBuffer could overflow, or that a completePasses value off by one or more could somehow cause harm in the bgwriter, and either I missed it (please tell me) or there isn't one. However, it does make sense to change completePasses into an atomic value so that it is consistent across backends and in the bgwriter.
bgwprocno: when not -1, the proc number used to find the bgwriter's allocation-notification latch (ProcGlobal->allProcs[bgwprocno].procLatch). This is a "power savings" feature whose goal is to signal the bgwriter: "When a backend starts using buffers again, it will wake us up by setting our latch." The code there reads, "Since we don't want to rely on a spinlock for this we force a read from shared memory once, and then set the latch based on that value," and uses INT_ACCESS_ONCE() to read the value and set the latch. StrategyNotifyBgWriter() is where bgwprocno is set; I see no reason to use atomics or other synchronization here.
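For reference, the existing wakeup path in StrategyGetBuffer() already looks roughly like this (condensed from freelist.c):

    bgwprocno = INT_ACCESS_ONCE(StrategyControl->bgwprocno);
    if (bgwprocno != -1)
    {
        /* reset bgwprocno first, before setting the latch */
        StrategyControl->bgwprocno = -1;

        /* procLatch is never deallocated, so the worst case is setting the
         * latch of the wrong (or no) process if the bgwriter dies at just
         * the wrong moment */
        SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
    }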
And that's all there is to it now that the freelist is gone. As a result, IMO it seems unnecessary to require a spin lock for access to BufferStrategyControl.
v2-0002 Remove the buffer_strategy_lock.
The attached patch is also available as a branch on GitHub [2], if that is of interest or helpful; it passes check-world [3]. (I use the GitHub PR to trigger CirrusCI tests, not as the way to convey the change set.)
I also made a few minor changes so that we consistently refer to "clock-sweep" (not "clock sweep" or "clocksweep"). I'm not wedded to that, but consistency isn't a bad thing, right?
As an aside, we're really implementing "generalized CLOCK" (GCLOCK) [4][5], which uses per-page counters rather than a single bit, as pointed out [6] in the CLOCK-Pro paper [7], but I digress.
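For illustration only (a toy sketch, not PostgreSQL code), a generalized-clock sweep over an array of usage counters looks roughly like this:

    /* The hand decrements nonzero counters as it sweeps and stops at the
     * first page whose counter has already reached zero. */
    static int
    gclock_pick_victim(unsigned char *usage_count, int npages, int *hand)
    {
        for (;;)
        {
            int candidate = *hand;

            *hand = (*hand + 1) % npages;   /* advance the clock hand */
            if (usage_count[candidate] == 0)
                return candidate;           /* evict this page */
            usage_count[candidate]--;       /* decay and keep sweeping */
        }
    }

PostgreSQL's per-buffer usage_count plus ClockSweepTick() is the same scheme, just spread across the buffer headers.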
I'd like to hear ideas for worst cases to test and/or benchmark. I plan on attempting what I saw Michael do with flamegraphs (before/after/delta) in a follow-up to this. If people feel this has merit I'll add it to CF/2.
thank you for your time considering this idea,
-greg
[1]: /messages/by-id/ndvygkpdx44pmi4xbkf52gfrl77cohpefr42tipvd5dgiaeuyd@fe2og2kxyjnc
[2]: https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v2
[3]: https://github.com/gburd/postgres/pull/4/checks
[4]: V. F. Nicola, A. Dan, and D. M. Dias, "Analysis of the Generalized Clock Buffer Replacement Scheme for Database Transaction Processing", Proceedings of the 1992 ACM SIGMETRICS Conference, June 1992, pp. 35-46.
[5]: A. J. Smith, "Sequentiality and Prefetching in Database Systems", ACM Trans. on Database Systems, Vol. 3, No. 3, 1978, pp. 223-247.
[6]: "In a generalized CLOCK version called GCLOCK [25,17], a counter is associated with each page rather than a single bit. Its counter will be incremented if a page is hit. The cycling clock hand sweeps over the pages decrementing their counters until a page whose counter is zero is found for replacement."
[7]: CLOCK-Pro: An Effective Improvement of the CLOCK Replacement https://www.usenix.org/legacy/event/usenix05/tech/general/full_papers/jiang/jiang.pdf
Attachments:
v2-0001-Eliminate-the-freelist-from-the-buffer-manager-an.patch (application/octet-stream)
From ffe34b140b850d67bbb98b04b36cd5fedec76c29 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Thu, 10 Jul 2025 14:45:32 -0400
Subject: [PATCH v2 1/2] Eliminate the freelist from the buffer manager and
depend on clocksweep.
This set of changes removes the list of available buffers and instead simply
uses the clocksweep algorithm to find and return an available buffer. While on
the surface this appears to be removing an optimization, it is in fact
eliminating code that induces overhead in the form of synchronization that is
problematic for multi-core systems. This change removes the
have_free_buffer() function that was used in the pg_prewarm module.
---
contrib/pg_prewarm/autoprewarm.c | 16 +---
src/backend/storage/buffer/README | 42 +++------
src/backend/storage/buffer/buf_init.c | 9 --
src/backend/storage/buffer/bufmgr.c | 29 +-----
src/backend/storage/buffer/freelist.c | 126 +-------------------------
src/include/storage/buf_internals.h | 12 +--
6 files changed, 26 insertions(+), 208 deletions(-)
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index c01b9c7e6a4..68f21d94473 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -410,10 +410,6 @@ apw_load_buffers(void)
apw_state->database = current_db;
Assert(apw_state->prewarm_start_idx < apw_state->prewarm_stop_idx);
- /* If we've run out of free buffers, don't launch another worker. */
- if (!have_free_buffer())
- break;
-
/*
* Likewise, don't launch if we've already been told to shut down.
* (The launch would fail anyway, but we might as well skip it.)
@@ -462,12 +458,6 @@ apw_read_stream_next_block(ReadStream *stream,
{
BlockInfoRecord blk = p->block_info[p->pos];
- if (!have_free_buffer())
- {
- p->pos = apw_state->prewarm_stop_idx;
- return InvalidBlockNumber;
- }
-
if (blk.tablespace != p->tablespace)
return InvalidBlockNumber;
@@ -526,7 +516,7 @@ autoprewarm_database_main(Datum main_arg)
* Loop until we run out of blocks to prewarm or until we run out of free
* buffers.
*/
- while (i < apw_state->prewarm_stop_idx && have_free_buffer())
+ while (i < apw_state->prewarm_stop_idx)
{
Oid tablespace = blk.tablespace;
RelFileNumber filenumber = blk.filenumber;
@@ -574,8 +564,8 @@ autoprewarm_database_main(Datum main_arg)
*/
while (i < apw_state->prewarm_stop_idx &&
blk.tablespace == tablespace &&
- blk.filenumber == filenumber &&
- have_free_buffer())
+ blk.filenumber == filenumber)
+ /* have_free_buffer()) */
{
ForkNumber forknum = blk.forknum;
BlockNumber nblocks;
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a182fcd660c..cd52effd911 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -128,11 +128,11 @@ independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement. A spinlock is used here rather than a lightweight
-lock for efficiency; no other locks of any sort should be acquired while
-buffer_strategy_lock is held. This is essential to allow buffer replacement
-to happen in multiple backends with reasonable concurrency.
+exclusion for operations that select buffers for replacement. A spinlock is
+used here rather than a lightweight lock for efficiency; no other locks of any
+sort should be acquired while buffer_strategy_lock is held. This is essential
+to allow buffer replacement to happen in multiple backends with reasonable
+concurrency.
* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
@@ -158,18 +158,9 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-There is a "free list" of buffers that are prime candidates for replacement.
-In particular, buffers that are completely free (contain no valid page) are
-always in this list. We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that. The list is singly-linked using fields in the
-buffer headers; we maintain head and tail pointers in global variables.
-(Note: although the list links are in the buffer headers, they are
-considered to be protected by the buffer_strategy_lock, not the buffer-header
-spinlocks.) To choose a victim buffer to recycle when there are no free
-buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations. It works like
-this:
+To choose a victim buffer to recycle when there are no free buffers available,
+we use a simple clock-sweep algorithm, which avoids the need to take
+system-wide locks during common operations. It works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -184,20 +175,14 @@ The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain buffer_strategy_lock.
-2. If buffer free list is nonempty, remove its head buffer. Release
-buffer_strategy_lock. If the buffer is pinned or has a nonzero usage count,
-it cannot be used; ignore it go back to step 1. Otherwise, pin the buffer,
-and return it.
+2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time. Release buffer_strategy_lock.
-3. Otherwise, the buffer free list is empty. Select the buffer pointed to by
-nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
-Release buffer_strategy_lock.
-
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
+3. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used. Decrement its usage count (if nonzero), reacquire
buffer_strategy_lock, and return to step 3 to examine the next buffer.
-5. Pin the selected buffer, and return.
+4. Pin the selected buffer, and return.
(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -234,7 +219,7 @@ the ring strategy effectively degrades to the normal strategy.
VACUUM uses a ring like sequential scans, however, the size of this ring is
controlled by the vacuum_buffer_usage_limit GUC. Dirty pages are not removed
-from the ring. Instead, WAL is flushed if needed to allow reuse of the
+from the ring. Instead, the WAL is flushed if needed to allow reuse of the
buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's buffers
were sent to the freelist, which was effectively a buffer ring of 1 buffer,
resulting in excessive WAL flushing.
@@ -277,3 +262,4 @@ As of 8.4, background writer starts during recovery mode when there is
some form of potentially extended recovery to perform. It performs an
identical service to normal processing, except that checkpoints it
writes are technically restartpoints.
+
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..6fd3a6bbac5 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -128,20 +128,11 @@ BufferManagerShmemInit(void)
pgaio_wref_clear(&buf->io_wref);
- /*
- * Initially link all the buffers together as unused. Subsequent
- * management of this list is done by freelist.c.
- */
- buf->freeNext = i + 1;
-
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
-
- /* Correct last entry of linked list */
- GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
/* Init other shared buffer-management stuff */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bd68d7e0ca9..9c059441a5c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2099,12 +2099,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
UnpinBuffer(victim_buf_hdr);
- /*
- * The victim buffer we acquired previously is clean and unused, let
- * it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
-
/* remaining code should match code at top of routine */
existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
@@ -2163,8 +2157,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
/*
- * InvalidateBuffer -- mark a shared buffer invalid and return it to the
- * freelist.
+ * InvalidateBuffer -- mark a shared buffer invalid.
*
* The buffer header spinlock must be held at entry. We drop it before
* returning. (This is sane because the caller must have locked the
@@ -2262,11 +2255,6 @@ retry:
* Done with mapping lock.
*/
LWLockRelease(oldPartitionLock);
-
- /*
- * Insert the buffer at the head of the list of free buffers.
- */
- StrategyFreeBuffer(buf);
}
/*
@@ -2684,11 +2672,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
{
BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(buf_hdr);
UnpinBuffer(buf_hdr);
}
@@ -2763,12 +2746,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
valid = PinBuffer(existing_hdr, strategy);
LWLockRelease(partition_lock);
-
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
UnpinBuffer(victim_buf_hdr);
buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
@@ -3666,8 +3643,8 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the freelist clock sweep currently is, and how many
- * buffer allocations have happened since our last call.
+ * Find out where the clock sweep currently is, and how many buffer
+ * allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..a228ff27377 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -39,14 +39,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -163,23 +155,6 @@ ClockSweepTick(void)
return victim;
}
-/*
- * have_free_buffer -- a lockless check to see if there is a free buffer in
- * buffer pool.
- *
- * If the result is true that will become stale once free buffers are moved out
- * by other operations, so the caller who strictly want to use a free buffer
- * should not call this.
- */
-bool
-have_free_buffer(void)
-{
- if (StrategyControl->firstFreeBuffer >= 0)
- return true;
- else
- return false;
-}
-
/*
* StrategyGetBuffer
*
@@ -243,75 +218,14 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
/*
- * We count buffer allocation requests so that the bgwriter can estimate
- * the rate of buffer consumption. Note that buffers recycled by a
- * strategy object are intentionally not counted here.
+ * We keep an approximate count of buffer allocation requests so that the
+ * bgwriter can estimate the rate of buffer consumption. Note that
+ * buffers recycled by a strategy object are intentionally not counted
+ * here.
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /*
- * First check, without acquiring the lock, whether there's buffers in the
- * freelist. Since we otherwise don't require the spinlock in every
- * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
- * uselessly in most cases. That obviously leaves a race where a buffer is
- * put on the freelist but we don't see the store yet - but that's pretty
- * harmless, it'll just get used during the next buffer acquisition.
- *
- * If there's buffers on the freelist, acquire the spinlock to pop one
- * buffer of the freelist. Then check whether that buffer is usable and
- * repeat if not.
- *
- * Note that the freeNext fields are considered to be protected by the
- * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
- * manipulate them without holding the spinlock.
- */
- if (StrategyControl->firstFreeBuffer >= 0)
- {
- while (true)
- {
- /* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- if (StrategyControl->firstFreeBuffer < 0)
- {
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- break;
- }
-
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
-
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
-
- /*
- * Release the lock so someone else can access the freelist while
- * we check out this buffer.
- */
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot
- * use it; discard it and retry. (This can only happen if VACUUM
- * put a valid buffer in the freelist and then someone else used
- * it before we got to it. It's probably impossible altogether as
- * of 8.3, but we'd better check anyway.)
- */
- local_buf_state = LockBufHdr(buf);
- if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
- && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
- {
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- *buf_state = local_buf_state;
- return buf;
- }
- UnlockBufHdr(buf, local_buf_state);
- }
- }
-
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /* Use the "clock sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -356,29 +270,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
-/*
- * StrategyFreeBuffer: put a buffer on the freelist
- */
-void
-StrategyFreeBuffer(BufferDesc *buf)
-{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- /*
- * It is possible that we are told to put something in the freelist that
- * is already in it; don't screw up the list if so.
- */
- if (buf->freeNext == FREENEXT_NOT_IN_LIST)
- {
- buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
- StrategyControl->firstFreeBuffer = buf->buf_id;
- }
-
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-}
-
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -504,13 +395,6 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
-
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..00eade63971 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -217,8 +217,7 @@ BufMappingPartitionLockByIndex(uint32 index)
* single atomic variable. This layout allow us to do some operations in a
* single atomic operation, without actually acquiring and releasing spinlock;
* for instance, increase or decrease refcount. buf_id field never changes
- * after initialization, so does not need locking. freeNext is protected by
- * the buffer_strategy_lock not buffer header lock. The LWLock can take care
+ * after initialization, so does not need locking. The LWLock can take care
* of itself. The buffer header lock is *not* used to control access to the
* data in the buffer!
*
@@ -264,7 +263,6 @@ typedef struct BufferDesc
pg_atomic_uint32 state;
int wait_backend_pgprocno; /* backend of pin-count waiter */
- int freeNext; /* link in freelist chain */
PgAioWaitRef io_wref; /* set iff AIO is in progress */
LWLock content_lock; /* to lock access to buffer contents */
@@ -360,13 +358,6 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc)
return (LWLock *) (&bdesc->content_lock);
}
-/*
- * The freeNext field is either the index of the next freelist entry,
- * or one of these special values:
- */
-#define FREENEXT_END_OF_LIST (-1)
-#define FREENEXT_NOT_IN_LIST (-2)
-
/*
* Functions for acquiring/releasing a shared buffer header's spinlock. Do
* not apply these to local buffers!
@@ -453,7 +444,6 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
-extern bool have_free_buffer(void);
/* buf_table.c */
extern Size BufTableShmemSize(int size);
--
2.49.0
v2-0002-Remove-the-buffer_strategy_lock.patch (application/octet-stream)
From be039fdf42ba3bf6637697f6fec38874f94d09b6 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Fri, 11 Jul 2025 09:05:45 -0400
Subject: [PATCH v2 2/2] Remove the buffer_strategy_lock.
With the removal of the freelist the remaining items in the
BufferStrategyControl structure no longer require strict coordination. Atomic
operations will suffice.
---
src/backend/storage/buffer/README | 39 +++++++++----------
src/backend/storage/buffer/bufmgr.c | 8 ++--
src/backend/storage/buffer/freelist.c | 56 ++++++---------------------
src/backend/storage/buffer/localbuf.c | 2 +-
src/include/storage/buf_internals.h | 2 +-
5 files changed, 36 insertions(+), 71 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index cd52effd911..a60f77d7ee9 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -127,11 +127,10 @@ bits of the tag's hash value. The rules stated above apply to each partition
independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
-* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that select buffers for replacement. A spinlock is
-used here rather than a lightweight lock for efficiency; no other locks of any
-sort should be acquired while buffer_strategy_lock is held. This is essential
-to allow buffer replacement to happen in multiple backends with reasonable
+* Operations that select buffers for replacement don't require a lock, but
+rather use atomic operations to ensure coordination across backends when
+accessing members of the BufferStrategyControl datastructure. This allows
+buffer replacement to happen in multiple backends with reasonable
concurrency.
* Each buffer header contains a spinlock that must be taken when examining
@@ -173,14 +172,12 @@ buffer_strategy_lock.
The algorithm for a process that needs to obtain a victim buffer is:
-1. Obtain buffer_strategy_lock.
+1. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time.
-2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
-nextVictimBuffer for next time. Release buffer_strategy_lock.
-
-3. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used. Decrement its usage count (if nonzero), reacquire
-buffer_strategy_lock, and return to step 3 to examine the next buffer.
+2. If the selected buffer is pinned or has a nonzero usage count, it cannot be
+used. Decrement its usage count (if nonzero), return to step 3 to examine the
+next buffer.
4. Pin the selected buffer, and return.
@@ -196,9 +193,9 @@ Buffer Ring Replacement Strategy
When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
-again soon, so instead of running the normal clock sweep algorithm and
+again soon, so instead of running the normal clock-sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
-using the normal clock sweep algorithm and those buffers are reused for the
+using the normal clock-sweep algorithm and those buffers are reused for the
whole scan. This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.
@@ -244,13 +241,13 @@ nextVictimBuffer (which it does not change!), looking for buffers that are
dirty and not pinned nor marked with a positive usage count. It pins,
writes, and releases any such buffer.
-If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take buffer_strategy_lock in order to look
-for buffers to write; it needs only to spinlock each buffer header for long
-enough to check the dirtybit. Even without that assumption, the writer
-only needs to take the lock long enough to read the variable value, not
-while scanning the buffers. (This is a very substantial improvement in
-the contention cost of the writer compared to PG 8.0.)
+We enforce reading nextVictimBuffer within an atomic action so that the writer
+doesn't even need to take buffer_strategy_lock in order to look for buffers to
+write; it needs only to spinlock each buffer header for long enough to check
+the dirtybit. Even without that assumption, the writer only needs to take the
+lock long enough to read the variable value, not while scanning the buffers.
+(This is a very substantial improvement in the contention cost of the writer
+compared to PG 8.0.)
The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9c059441a5c..d068f77362d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3593,7 +3593,7 @@ BufferSync(int flags)
* This is called periodically by the background writer process.
*
* Returns true if it's appropriate for the bgwriter process to go into
- * low-power hibernation mode. (This happens if the strategy clock sweep
+ * low-power hibernation mode. (This happens if the strategy clock-sweep
* has been "lapped" and no buffer allocations have occurred recently,
* or if the bgwriter has been effectively disabled by setting
* bgwriter_lru_maxpages to 0.)
@@ -3643,7 +3643,7 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the clock sweep currently is, and how many buffer
+ * Find out where the clock-sweep currently is, and how many buffer
* allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
@@ -3664,8 +3664,8 @@ BgBufferSync(WritebackContext *wb_context)
/*
* Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
* buffers we could scan before we'd catch up with it and "lap" it. Note:
* weird-looking coding of xxx_passes comparisons are to avoid bogus
* behavior when the passes counts wrap around.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index a228ff27377..267a9d84df3 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,11 +29,8 @@
*/
typedef struct
{
- /* Spinlock: protects the values below */
- slock_t buffer_strategy_lock;
-
/*
- * Clock sweep hand: index of next buffer to consider grabbing. Note that
+ * Clock-sweep hand: index of next buffer to consider grabbing. Note that
* this isn't a concrete buffer - we only ever increase the value. So, to
* get an actual buffer, it needs to be used modulo NBuffers.
*/
@@ -43,7 +40,7 @@ typedef struct
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
*/
- uint32 completePasses; /* Complete cycles of the clock sweep */
+ pg_atomic_uint32 completePasses; /* Complete cycles of the clock-sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
@@ -116,12 +113,7 @@ ClockSweepTick(void)
/* always wrap what we look up in BufferDescriptors */
victim = victim % NBuffers;
- /*
- * If we're the one that just caused a wraparound, force
- * completePasses to be incremented while holding the spinlock. We
- * need the spinlock so StrategySyncStart() can return a consistent
- * value consisting of nextVictimBuffer and completePasses.
- */
+ /* Increment completePasses if we just caused a wraparound */
if (victim == 0)
{
uint32 expected;
@@ -132,23 +124,12 @@ ClockSweepTick(void)
while (!success)
{
- /*
- * Acquire the spinlock while increasing completePasses. That
- * allows other readers to read nextVictimBuffer and
- * completePasses in a consistent manner which is required for
- * StrategySyncStart(). In theory delaying the increment
- * could lead to an overflow of nextVictimBuffers, but that's
- * highly unlikely and wouldn't be particularly harmful.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
wrapped = expected % NBuffers;
success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
&expected, wrapped);
if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ pg_atomic_fetch_add_u32(&StrategyControl->completePasses, 1);
}
}
}
@@ -177,10 +158,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*from_ring = false;
- /*
- * If given a strategy object, see whether it can select a buffer. We
- * assume strategy objects don't need buffer_strategy_lock.
- */
+ /* If given a strategy object, see whether it can select a buffer */
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
@@ -225,7 +203,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /* Use the "clock sweep" algorithm to find a free buffer */
+ /* Use the "clock-sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -287,13 +265,12 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
uint32 nextVictimBuffer;
int result;
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
result = nextVictimBuffer % NBuffers;
if (complete_passes)
{
- *complete_passes = StrategyControl->completePasses;
+ *complete_passes = pg_atomic_read_u32(&StrategyControl->completePasses);
/*
* Additionally add the number of wraparounds that happened before
@@ -306,7 +283,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
{
*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
return result;
}
@@ -321,21 +298,14 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
void
StrategyNotifyBgWriter(int bgwprocno)
{
- /*
- * We acquire buffer_strategy_lock just to ensure that the store appears
- * atomic to StrategyGetBuffer. The bgwriter should call this rather
- * infrequently, so there's no performance penalty from being safe.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
StrategyControl->bgwprocno = bgwprocno;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
/*
* StrategyShmemSize
*
- * estimate the size of shared memory used by the freelist-related structures.
+ * Estimate the size of shared memory used by the freelist-related structures.
*
* Note: for somewhat historical reasons, the buffer lookup hashtable size
* is also determined here.
@@ -393,13 +363,11 @@ StrategyInitialize(bool init)
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
-
- /* Initialize the clock sweep pointer */
+ /* Initialize the clock-sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
/* Clear statistics */
- StrategyControl->completePasses = 0;
+ pg_atomic_init_u32(&StrategyControl->completePasses, 0);
pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
/* No pending notification */
@@ -643,7 +611,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*
* If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
* since our own previous usage of the ring element would have left it
- * there, but it might've been decremented by clock sweep since then). A
+ * there, but it might've been decremented by clock-sweep since then). A
* higher usage_count indicates someone else has touched the buffer, so we
* shouldn't re-use it.
*/
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3da9c41ee1d..7a34f5e430a 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -229,7 +229,7 @@ GetLocalVictimBuffer(void)
ResourceOwnerEnlarge(CurrentResourceOwner);
/*
- * Need to get a new buffer. We use a clock sweep algorithm (essentially
+ * Need to get a new buffer. We use a clock-sweep algorithm (essentially
* the same as what freelist.c does now...)
*/
trycounter = NLocBuffer;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 00eade63971..133a0dd7fd5 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -81,7 +81,7 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
* accuracy and speed of the clock-sweep buffer management algorithm. A
* large value (comparable to NBuffers) would approximate LRU semantics.
* But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of
- * clock sweeps to find a free buffer, so in practice we don't want the
+ * clock-sweeps to find a free buffer, so in practice we don't want the
* value to be very large.
*/
#define BM_MAX_USAGE_COUNT 5
--
2.49.0
On Fri, Jul 11, 2025 at 01:26:53PM -0400, Greg Burd wrote:
This change does remove the have_free_buffer() function used by the
contrib/pg_prewarm module. On the surface this doesn't seem to cause any
issues, but honestly I've not thought too deeply on this one.
Hm. ISTM we'll either need to invent another similarly inexpensive way to
test for this or to justify to ourselves that it's not necessary. My guess
is that we do want to keep autoprewarm from evicting things, but FWIW the
docs already say "prewarming may also evict other data from cache" [0].
Once removed [2] and tests passing [3] I took a long hard look at the
buffer_strategy_lock that used to serialize concurrent access to members
of BufferStrategyControl and I couldn't find a good reason to keep it
around. Let's review what it is guarding:

completePasses: a count of the number of times the clock-sweep hand wraps
around. StrategySyncStart() provides this to the bgwriter which in turn
uses it to compute a strategic location at which to start scanning for
pages to evict. There's an interesting comment that indicates both a
"requirement" and an equivocal "but that's highly unlikely and wouldn't
be particularly harmful" statement conflicting with itself. I tried to
find a reason that nextVictimBuffers could overflow or that the use of
the completePasses value could somehow cause harm if off by one or more
in the bgwriter and either I missed it (please tell me) or there isn't
one. However, it does make sense to change completePasses into an atomic
value so that it is consistent across backends and in the bgwriter.

bgwprocno: when not -1 is the PID of the allocation notification latch
(ProcGlobal->allProcs[bgwprocno].procLatch). This is a "power savings"
feature where the goal is to signal the bgwriter "When a backend starts
using buffers again, it will wake us up by setting our latch." Here the
code reads, "Since we don't want to rely on a spinlock for this we force
a read from shared memory once, and then set the latch based on that
value." and uses INT_ACCESS_ONCE() to read the value and set the latch.
The function StrategyNotifyBgWriter() is where bgwprocno is set, I see no
reason to use atomic or other synchronization here.

And that's all there is to it now that the freelist is gone. As a
result, IMO it seems unnecessary to require a spin lock for access to
BufferStrategyControl.
I haven't followed your line of reasoning closely yet, but I typically
recommend that patches that replace locks with atomics use functions with
full barrier semantics (e.g., pg_atomic_read_membarrier_u32(),
pg_atomic_fetch_add_u32()) to make things easier to reason about. But that
might not be as straightforward in cases like StrategySyncStart() where we
atomically retrieve two values that are used together. Nevertheless,
minimizing cognitive load might be nice, and there's a chance it doesn't
impact performance very much.
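For instance (just a sketch, assuming completePasses becomes a pg_atomic_uint32 as in 0002), the read in StrategySyncStart() could use the full-barrier variant:

    *complete_passes = pg_atomic_read_membarrier_u32(&StrategyControl->completePasses);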
[0]: https://www.postgresql.org/docs/devel/pgprewarm.html
--
nathan
Hi,
On 2025-07-11 13:26:53 -0400, Greg Burd wrote:
In conversations [1] recently about considering how best to adapt the code
to become NUMA-aware Andres commented, "FWIW, I've started to wonder if we
shouldn't just get rid of the freelist entirely" and because I'm a glutton
for punishment (and I think this idea has some merit) I took him up on this
task.
In freelist.c the function StrategyGetBuffer() currently tries first to use
the BufferAccessStrategy, if present, via GetBufferFromRing(). Failing
that, the second step is to check the "freelist" as defined by
StrategyControl->firstFreeBuffer/lastFreeBuffer to determine if it contains
any available buffers, and finally it will "Use the clock-sweep algorithm to
find a free buffer." The freelist was intended to be "a list of buffers
that are prime candidates for replacement" but I question the value of that
versus the overhead of managing it. Without the list some operations are
likely (I plan to measure this) faster due, as Anders points out in [1],
"just using clock sweep actually makes things like DROP TABLE perform better
because it doesn't need to maintain the freelist anymore." It may be the
case that with very large NBuffers where most are in use that this approach
is slower (I plan to measure this too), but in those cases I'd imagine the
freelist is likely empty and so the code will use the clock-sweep algorithm
anyway, so I'm not sure there is a performance penalty at all.

This change does remove the have_free_buffer() function used by the
contrib/pg_prewarm module. On the surface this doesn't seem to cause any
issues, but honestly I've not thought too deeply on this one.
I think we'll likely need something to replace it.
TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
right. The goal of the use of have_free_buffer() is obviously to stop prewarming
shared buffers if doing so would just evict buffers. But it's not clear to me
that we should just stop when there aren't any free buffers - what if the
previous buffer contents aren't the right ones? It'd make more sense to me to
stop autoprewarm once NBuffers have been prewarmed...
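(Just to sketch the idea, with a hypothetical counter that exists in no patch: apw_read_stream_next_block() could stop once NBuffers blocks have been handed out, e.g.

    if (p->prewarmed_blocks >= NBuffers)
    {
        p->pos = apw_state->prewarm_stop_idx;
        return InvalidBlockNumber;
    }
    p->prewarmed_blocks++;

where prewarmed_blocks is a made-up field for illustration.)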
v2-0001 Eliminate the freelist from the buffer manager and depend on clock-sweep.
Once removed [2] and tests passing [3] I took a long hard look at the
buffer_strategy_lock that used to serialize concurrent access to members of
BufferStrategyControl and I couldn't find a good reason to keep it around.
Let's review what it is guarding:completePasses: a count of the number of times the clock-sweep hand wraps
around. StrategySyncStart() provides this to the bgwriter which in turn
uses it to compute a strategic location at which to start scanning for pages
to evict. There's an interesting comment that indicates both a
"requirement" and an equivocal "but that's highly unlikely and wouldn't be
particularly harmful" statement conflicting with itself. I tried to find a
reason that nextVictimBuffers could overflow or that the use of the
completePasses value could somehow cause harm if off by one or more in the
bgwriter and either I missed it (please tell me) or there isn't one.
However, it does make sense to change completePasses into an atomic value so
that it is consistent across backends and in the bgwriter.
I don't think it's *quite* that easy. If you just maintain nextVictimBuffer
and completePasses as separate atomic counters, without a lock making the two
consistent, StrategySyncStart(), as coded right now / in your patch, won't
necessarily return reasonable values for the two. Which I think would lead to
bgwriter suddenly becoming overly active for one cycle and then very inactive
for a while after.
With really large shared buffers that'd be rather rare to be hit, but with
small shared buffers I think it'd be common enough to worry.
The most obvious way around this would be to make the clock hand a 64bit
atomic, which would avoid the need to have a separate tracker for the number
of passes. Unfortunately doing so would require doing a modulo operation each
clock tick, which I think would likely be too expensive on common platforms -
on small shared_buffers I actually see existing, relatively rarely reached,
modulo in ClockSweepTick() show up on a Skylake-X system.
It'd be easier if we could rely on NBuffers to be a power of two, but that
doesn't seem like a realistic requirement.
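(For anyone following along, the reason a power of two would help is that the wraparound modulo could then be a simple mask:

    /* valid only when NBuffers is a power of two, which we can't assume */
    victim = counter & (NBuffers - 1);    /* same as counter % NBuffers */

so the general case keeps the division.)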
I think the easiest way here would be to make StrategySyncStart(), a
relatively rare operation, retry whenever it detects that it saw out-of-sync
nextVictimBuffer. E.g. something roughly like
    while (true)
    {
        completePasses = atomic_read_u32(&StrategyControl->completePasses);
        pg_memory_barrier();
        nextVictimBuffer = atomic_read_u32(&StrategyControl->nextVictimBuffer);
        pg_memory_barrier();

        if (completePasses == atomic_read_u32(&StrategyControl->completePasses) &&
            nextVictimBuffer <= atomic_read_u32(&StrategyControl->nextVictimBuffer))
            break;
    }
which I think would detect the case that we read a nextVictimBuffer value that was
decreased while reading a completePasses value that was increased.
I think while at it, we should make ClockSweepTick() decrement
nextVictimBuffer by atomically subtracting NBuffers, rather than using CAS. I
recently noticed that the CAS sometimes has to retry a fair number of times,
which in turn makes the `victim % NBuffers` show up in profiles.
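(A rough reading of that suggestion, not tested: in the wraparound branch of ClockSweepTick(), the backend that observes the wrap would do something like

    if (victim == 0)
    {
        /* pull the hand back one full revolution; no CAS retry loop */
        pg_atomic_fetch_sub_u32(&StrategyControl->nextVictimBuffer, NBuffers);
        pg_atomic_fetch_add_u32(&StrategyControl->completePasses, 1);
    }

assuming completePasses has become the atomic counter from 0002.)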
I'd like to hear ideas for worst cases to test and/or benchmark. I plan on
attempting what I saw Michael do with flamegraphs before/after/delta in a
follow up to this. If people feel this has merit I'll add it to CF/2.
What I've benchmarked is both single threaded and concurrent clock sweep, by
doing pg_prewarm() of per-pgbench-client relations. I used
c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_:client_id');") -t1
(with c=1 for the single-threaded case obviously)
The reason for the pg_shmem_allocations_numa is to ensure that shared_buffers
is actually mapped, as otherwise the bottleneck will be the kernel zeroing out
buffers.
The reason for doing -t1 is that I wanted to compare freelist vs clock sweep,
rather than clock sweep in general.
Note that I patched EvictUnpinnedBufferInternal() to call
StrategyFreeBuffer(), otherwise running this a second time won't actually
measure the freelist. And the first time run after postmaster start will
always be more noisy...
Greetings,
Andres Freund
On Fri, Jul 11, 2025, at 2:50 PM, Nathan Bossart wrote:
On Fri, Jul 11, 2025 at 01:26:53PM -0400, Greg Burd wrote:
This change does remove the have_free_buffer() function used by the
contrib/pg_prewarm module. On the surface this doesn't seem to cause any
issues, but honestly I've not thought too deeply on this one.

Hm. ISTM we'll either need to invent another similarly inexpensive way to
test for this or to justify to ourselves that it's not necessary. My guess
is that we do want to keep autoprewarm from evicting things, but FWIW the
docs already say "prewarming may also evict other data from cache" [0].
Thank you for spending time reviewing my proposal/patch!
I briefly considered how one might use what's left after surgery to produce a similar boolean signal, to no avail. I think autoprewarm was simply trying to warm at most NBuffers buffers and then stop; the freelist at startup was just a convenient thing to drain to get that done. Maybe I'll try adapting autoprewarm to consider that global instead.
Once removed [2] and tests passing [3] I took a long hard look at the
buffer_strategy_lock that used to serialize concurrent access to members
of BufferStrategyControl and I couldn't find a good reason to keep it
around. Let's review what it is guarding:

completePasses: a count of the number of times the clock-sweep hand wraps
around. StrategySyncStart() provides this to the bgwriter which in turn
uses it to compute a strategic location at which to start scanning for
pages to evict. There's an interesting comment that indicates both a
"requirement" and an equivocal "but that's highly unlikely and wouldn't
be particularly harmful" statement conflicting with itself. I tried to
find a reason that nextVictimBuffers could overflow or that the use of
the completePasses value could somehow cause harm if off by one or more
in the bgwriter and either I missed it (please tell me) or there isn't
one. However, it does make sense to change completePasses into an atomic
value so that it is consistent across backends and in the bgwriter.

bgwprocno: when not -1 is the PID of the allocation notification latch
(ProcGlobal->allProcs[bgwprocno].procLatch). This is a "power savings"
feature where the goal is to signal the bgwriter "When a backend starts
using buffers again, it will wake us up by setting our latch." Here the
code reads, "Since we don't want to rely on a spinlock for this we force
a read from shared memory once, and then set the latch based on that
value." and uses INT_ACCESS_ONCE() to read the value and set the latch.
The function StrategyNotifyBgWriter() is where bgwprocno is set, I see no
reason to use atomic or other synchronization here.

And that's all there is to it now that the freelist is gone. As a
result, IMO it seems unnecessary to require a spin lock for access to
BufferStrategyControl.

I haven't followed your line of reasoning closely yet, but I typically
recommend that patches that replace locks with atomics use functions with
full barrier semantics (e.g., pg_atomic_read_membarrier_u32(),
pg_atomic_fetch_add_u32()) to make things easier to reason about. But that
might not be as straightforward in cases like StrategySyncStart() where we
atomically retrieve two values that are used together. Nevertheless,
minimizing cognitive load might be nice, and there's a chance it doesn't
impact performance very much.
Good thought. I'll review carefully and see if I can either explain solid reasons here why I believe they don't need full-barrier semantics, or change the patch accordingly.
again, thank you for your time, best.
-greg
[0] https://www.postgresql.org/docs/devel/pgprewarm.html
--
nathan
On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
Hi,
On 2025-07-11 13:26:53 -0400, Greg Burd wrote:
In conversations [1] recently about considering how best to adapt the code to become NUMA-aware Andres commented, "FWIW, I've started to wonder if we shouldn't just get rid of the freelist entirely" and because I'm a glutton for punishment (and I think this idea has some merit) I took him up on this task.

In freelist.c the function StrategyGetBuffer() currently tries first to use the BufferAccessStrategy, if present, via GetBufferFromRing(). Failing that, the second step is to check the "freelist" as defined by StrategyControl->firstFreeBuffer/lastFreeBuffer to determine if it contains any available buffers, and finally it will "Use the clock-sweep algorithm to find a free buffer." The freelist was intended to be "a list of buffers that are prime candidates for replacement" but I question the value of that versus the overhead of managing it. Without the list some operations are likely (I plan to measure this) faster due, as Andres points out in [1], "just using clock sweep actually makes things like DROP TABLE perform better because it doesn't need to maintain the freelist anymore." It may be the case that with very large NBuffers where most are in use that this approach is slower (I plan to measure this too), but in those cases I'd imagine the freelist is likely empty and so the code will use the clock-sweep algorithm anyway, so I'm not sure there is a performance penalty at all.

This change does remove the have_free_buffer() function used by the contrib/pg_prewarm module. On the surface this doesn't seem to cause any issues, but honestly I've not thought too deeply on this one.
Thank you for spending time reviewing my proposal and patch! I value your time and insights and in this case your inspiration. :)
I think we'll likely need something to replace it.
Fair, this (v5) patch doesn't yet try to address this.
TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite right. The goal of the use have_free_buffer() is obviously to stop prewarming shared buffers if doing so would just evict buffers. But it's not clear to me that we should just stop when there aren't any free buffers - what if the previous buffer contents aren't the right ones? It'd make more sense to me to stop autoprewarm once NBuffers have been prewarmed...
I had the same high-level reaction: that autoprewarm was leveraging something convenient but not necessarily required or even correct. I'd considered using NBuffers as you describe due to similar intuitions; I'll dig into that idea for the next revision after I get to know autoprewarm a bit better.
v2-0001 Eliminate the freelist from the buffer manager and depend on clock-sweep.

Once removed [2] and tests passing [3] I took a long hard look at the buffer_strategy_lock that used to serialize concurrent access to members of BufferStrategyControl and I couldn't find a good reason to keep it around. Let's review what it is guarding:

completePasses: a count of the number of times the clock-sweep hand wraps around. StrategySyncStart() provides this to the bgwriter which in turn uses it to compute a strategic location at which to start scanning for pages to evict. There's an interesting comment that indicates both a "requirement" and an equivocal "but that's highly unlikely and wouldn't be particularly harmful" statement conflicting with itself. I tried to find a reason that nextVictimBuffers could overflow or that the use of the completePasses value could somehow cause harm if off by one or more in the bgwriter and either I missed it (please tell me) or there isn't one. However, it does make sense to change completePasses into an atomic value so that it is consistent across backends and in the bgwriter.

I don't think it's *quite* that easy. If you just maintain nextVictimBuffer and completePasses as separate atomic counters, without a lock making the two consistent, StrategySyncStart(), as coded right now / in your patch, won't necessarily return reasonable values for the two. Which I think would lead to bgwriter suddenly becoming overly active for one cycle and then very inactive for a while after.
Keeping them separate as atomics still requires coordination that I think is unnecessary. I spent some time and came up with a working version [1] where the two values (nextVictimBuffer and completePasses) were in a single uint64 atomic, but as soon as I had it working I didn't like the idea at all.
With really large shared buffers that'd be rather rare to be hit, but with
small shared buffers I think it'd be common enough to worry.
Agreed, not coordinating these values isn't a viable solution.
The most obvious way around this would be to make the clock hand a 64bit atomic, which would avoid the need to have a separate tracker for the number of passes. Unfortunately doing so would require doing a modulo operation each clock tick, which I think would likely be too expensive on common platforms - on small shared_buffers I actually see existing, relatively rarely reached, modulo in ClockSweepTick() show up on a Skylake-X system.
So, this idea came back to me today as I tossed out the union branch and started over.
a) can't require a power of 2 for NBuffers
b) would like a power of 2 for NBuffers to make a few things more efficient
c) a simple uint64 atomic counter would simplify things
The attached (v5) patch takes this approach *and* avoids the modulo you were concerned with. My approach is to have nextVictimBuffer as a uint64 that only increments (and at some point 200 years or so might wrap around, but I digress). To get the actual "victim" you modulo that, but not with "%" you call clock_modulo(). In that function I use a "next power of 2" value rather than NBuffers to efficiently find the modulo and adjust for the actual value. Same for completePasses which is now a function clock_passes() that does similar trickery and returns the number of times the counter (nextVictimBuffer) has "wrapped" around modulo NBuffers.
Now that both values exist in the same uint64 it can be the atomic vessel that coordinates them, no synchronization problems at all and no requirement for the buffer_strategy_lock.
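To spell out the intended semantics, here's a minimal sketch (illustration only; clock_hand() and clock_pass_count() are made-up names, and the actual patch replaces the division/modulo below with the next-power-of-2 trickery described above):

    /* Illustration: one monotonically increasing 64-bit counter encodes both */
    static inline uint32
    clock_hand(uint64 counter)
    {
        /* which buffer the clock hand currently points at */
        return (uint32) (counter % NBuffers);
    }

    static inline uint32
    clock_pass_count(uint64 counter)
    {
        /* how many complete sweeps of the pool have finished */
        return (uint32) (counter / NBuffers);
    }

A reader gets both pieces of information from a single atomic load of the counter, which is what lets the spinlock go away.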
It'd be easier if we could rely on NBuffers to be a power of two, but that
doesn't seem like a realistic requirement.
Yes, this was the good idea. I ran with it in a slightly more creative way that doesn't require users to set NBuffers to a power of 2.
I think the easiest way here would be to make StrategySyncStart(), a relatively rare operation, retry whenever it detects that it saw out-of-sync nextVictimBuffer. E.g. something roughly like

    while (true)
    {
        completePasses = atomic_read_u32(&StrategyControl->completePasses);
        pg_memory_barrier();
        nextVictimBuffer = atomic_read_u32(&StrategyControl->nextVictimBuffer);
        pg_memory_barrier();

        if (completePasses == atomic_read_u32(&StrategyControl->completePasses) &&
            nextVictimBuffer <= atomic_read_u32(&StrategyControl->nextVictimBuffer))
            break;
    }

which I think would detect the case that we read a nextVictimBuffer value that was decreased while reading a completePasses value that was increased.
Okay, interesting idea. If you dislike this approach I'll circle back and
consider it.
I think while at it, we should make ClockSweepTick() decrement nextVictimBuffer by atomically subtracting NBuffers, rather than using CAS. I recently noticed that the CAS sometimes has to retry a fair number of times, which in turn makes the `victim % NBuffers` show up in profiles.
In my (v5) patch there is one CAS that increments NBuffers. All other operations on NBuffers are atomic reads. The modulo you mention is gone entirely, unnecessary AFAICT.
I'd like to hear ideas for worst cases to test and/or benchmark. I plan on attempting what I saw Michael do with flamegraphs before/after/delta in a follow up to this. If people feel this has merit I'll add it to CF/2.
What I've benchmarked is both single threaded and concurrent clock sweep, by doing pg_prewarm() of per-pgbench-client relations. I used

c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_:client_id');") -t1

(with c=1 for the single-threaded case obviously)
The reason for the pg_shmem_allocations_numa is to ensure that shared_buffers is actually mapped, as otherwise the bottleneck will be the kernel zeroing out buffers.
The reason for doing -t1 is that I wanted to compare freelist vs clock sweep, rather than clock sweep in general.
Note that I patched EvictUnpinnedBufferInternal() to call StrategyFreeBuffer(), otherwise running this a second time won't actually measure the freelist. And the first run after postmaster start will always be more noisy...
This is very helpful, thanks! I've started doing some of this but I was anxious to get this out before the weekend. I'll work on the prewarm module and get some benchmarks done next week.
Meanwhile, the tests except for Windows pass [2] for this new patch [3]. I'll dig into the Windows issues next week as well.
Greetings,
Andres Freund
I'm very curious to hear back your thoughts (or anyone else) on this
approach.
best,
-greg
[1]: https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v4
[2]: https://github.com/gburd/postgres/pull/6/checks
[3]: https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v5
Attachments:
v5-0001-Eliminate-the-freelist-from-the-buffer-manager-an.patch (text/x-patch)
From 79c17005460c588ad6e96c2fdcf5893d789239b2 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Thu, 10 Jul 2025 14:45:32 -0400
Subject: [PATCH v5 1/2] Eliminate the freelist from the buffer manager and
depend on clocksweep.
This set of changes removes the list of available buffers and instead simply
uses the clocksweep algorithm to find and return an available buffer. While on
the surface this appears to be removing an optimization it is in fact
eliminating code that induces overhead in the form of synchronization that is
problematic for multi-core systems. This change removes the
have_free_buffer() function that was used in the pg_prewarm module.
---
contrib/pg_prewarm/autoprewarm.c | 16 +---
src/backend/storage/buffer/README | 42 +++------
src/backend/storage/buffer/buf_init.c | 9 --
src/backend/storage/buffer/bufmgr.c | 29 +-----
src/backend/storage/buffer/freelist.c | 126 +-------------------------
src/include/storage/buf_internals.h | 12 +--
6 files changed, 26 insertions(+), 208 deletions(-)
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index c01b9c7e6a4..68f21d94473 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -410,10 +410,6 @@ apw_load_buffers(void)
apw_state->database = current_db;
Assert(apw_state->prewarm_start_idx < apw_state->prewarm_stop_idx);
- /* If we've run out of free buffers, don't launch another worker. */
- if (!have_free_buffer())
- break;
-
/*
* Likewise, don't launch if we've already been told to shut down.
* (The launch would fail anyway, but we might as well skip it.)
@@ -462,12 +458,6 @@ apw_read_stream_next_block(ReadStream *stream,
{
BlockInfoRecord blk = p->block_info[p->pos];
- if (!have_free_buffer())
- {
- p->pos = apw_state->prewarm_stop_idx;
- return InvalidBlockNumber;
- }
-
if (blk.tablespace != p->tablespace)
return InvalidBlockNumber;
@@ -526,7 +516,7 @@ autoprewarm_database_main(Datum main_arg)
* Loop until we run out of blocks to prewarm or until we run out of free
* buffers.
*/
- while (i < apw_state->prewarm_stop_idx && have_free_buffer())
+ while (i < apw_state->prewarm_stop_idx)
{
Oid tablespace = blk.tablespace;
RelFileNumber filenumber = blk.filenumber;
@@ -574,8 +564,8 @@ autoprewarm_database_main(Datum main_arg)
*/
while (i < apw_state->prewarm_stop_idx &&
blk.tablespace == tablespace &&
- blk.filenumber == filenumber &&
- have_free_buffer())
+ blk.filenumber == filenumber)
+ /* have_free_buffer()) */
{
ForkNumber forknum = blk.forknum;
BlockNumber nblocks;
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a182fcd660c..cd52effd911 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -128,11 +128,11 @@ independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement. A spinlock is used here rather than a lightweight
-lock for efficiency; no other locks of any sort should be acquired while
-buffer_strategy_lock is held. This is essential to allow buffer replacement
-to happen in multiple backends with reasonable concurrency.
+exclusion for operations that select buffers for replacement. A spinlock is
+used here rather than a lightweight lock for efficiency; no other locks of any
+sort should be acquired while buffer_strategy_lock is held. This is essential
+to allow buffer replacement to happen in multiple backends with reasonable
+concurrency.
* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
@@ -158,18 +158,9 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-There is a "free list" of buffers that are prime candidates for replacement.
-In particular, buffers that are completely free (contain no valid page) are
-always in this list. We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that. The list is singly-linked using fields in the
-buffer headers; we maintain head and tail pointers in global variables.
-(Note: although the list links are in the buffer headers, they are
-considered to be protected by the buffer_strategy_lock, not the buffer-header
-spinlocks.) To choose a victim buffer to recycle when there are no free
-buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations. It works like
-this:
+To choose a victim buffer to recycle when there are no free buffers available,
+we use a simple clock-sweep algorithm, which avoids the need to take
+system-wide locks during common operations. It works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -184,20 +175,14 @@ The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain buffer_strategy_lock.
-2. If buffer free list is nonempty, remove its head buffer. Release
-buffer_strategy_lock. If the buffer is pinned or has a nonzero usage count,
-it cannot be used; ignore it go back to step 1. Otherwise, pin the buffer,
-and return it.
+2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time. Release buffer_strategy_lock.
-3. Otherwise, the buffer free list is empty. Select the buffer pointed to by
-nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
-Release buffer_strategy_lock.
-
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
+3. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used. Decrement its usage count (if nonzero), reacquire
buffer_strategy_lock, and return to step 3 to examine the next buffer.
-5. Pin the selected buffer, and return.
+4. Pin the selected buffer, and return.
(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -234,7 +219,7 @@ the ring strategy effectively degrades to the normal strategy.
VACUUM uses a ring like sequential scans, however, the size of this ring is
controlled by the vacuum_buffer_usage_limit GUC. Dirty pages are not removed
-from the ring. Instead, WAL is flushed if needed to allow reuse of the
+from the ring. Instead, the WAL is flushed if needed to allow reuse of the
buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's buffers
were sent to the freelist, which was effectively a buffer ring of 1 buffer,
resulting in excessive WAL flushing.
@@ -277,3 +262,4 @@ As of 8.4, background writer starts during recovery mode when there is
some form of potentially extended recovery to perform. It performs an
identical service to normal processing, except that checkpoints it
writes are technically restartpoints.
+
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..6fd3a6bbac5 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -128,20 +128,11 @@ BufferManagerShmemInit(void)
pgaio_wref_clear(&buf->io_wref);
- /*
- * Initially link all the buffers together as unused. Subsequent
- * management of this list is done by freelist.c.
- */
- buf->freeNext = i + 1;
-
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
-
- /* Correct last entry of linked list */
- GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
/* Init other shared buffer-management stuff */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6afdd28dba6..af5ef025229 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2099,12 +2099,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
UnpinBuffer(victim_buf_hdr);
- /*
- * The victim buffer we acquired previously is clean and unused, let
- * it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
-
/* remaining code should match code at top of routine */
existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
@@ -2163,8 +2157,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
/*
- * InvalidateBuffer -- mark a shared buffer invalid and return it to the
- * freelist.
+ * InvalidateBuffer -- mark a shared buffer invalid.
*
* The buffer header spinlock must be held at entry. We drop it before
* returning. (This is sane because the caller must have locked the
@@ -2262,11 +2255,6 @@ retry:
* Done with mapping lock.
*/
LWLockRelease(oldPartitionLock);
-
- /*
- * Insert the buffer at the head of the list of free buffers.
- */
- StrategyFreeBuffer(buf);
}
/*
@@ -2684,11 +2672,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
{
BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(buf_hdr);
UnpinBuffer(buf_hdr);
}
@@ -2763,12 +2746,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
valid = PinBuffer(existing_hdr, strategy);
LWLockRelease(partition_lock);
-
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
UnpinBuffer(victim_buf_hdr);
buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
@@ -3666,8 +3643,8 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the freelist clock sweep currently is, and how many
- * buffer allocations have happened since our last call.
+ * Find out where the clock sweep currently is, and how many buffer
+ * allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..a228ff27377 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -39,14 +39,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -163,23 +155,6 @@ ClockSweepTick(void)
return victim;
}
-/*
- * have_free_buffer -- a lockless check to see if there is a free buffer in
- * buffer pool.
- *
- * If the result is true that will become stale once free buffers are moved out
- * by other operations, so the caller who strictly want to use a free buffer
- * should not call this.
- */
-bool
-have_free_buffer(void)
-{
- if (StrategyControl->firstFreeBuffer >= 0)
- return true;
- else
- return false;
-}
-
/*
* StrategyGetBuffer
*
@@ -243,75 +218,14 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
/*
- * We count buffer allocation requests so that the bgwriter can estimate
- * the rate of buffer consumption. Note that buffers recycled by a
- * strategy object are intentionally not counted here.
+ * We keep an approximate count of buffer allocation requests so that the
+ * bgwriter can estimate the rate of buffer consumption. Note that
+ * buffers recycled by a strategy object are intentionally not counted
+ * here.
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /*
- * First check, without acquiring the lock, whether there's buffers in the
- * freelist. Since we otherwise don't require the spinlock in every
- * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
- * uselessly in most cases. That obviously leaves a race where a buffer is
- * put on the freelist but we don't see the store yet - but that's pretty
- * harmless, it'll just get used during the next buffer acquisition.
- *
- * If there's buffers on the freelist, acquire the spinlock to pop one
- * buffer of the freelist. Then check whether that buffer is usable and
- * repeat if not.
- *
- * Note that the freeNext fields are considered to be protected by the
- * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
- * manipulate them without holding the spinlock.
- */
- if (StrategyControl->firstFreeBuffer >= 0)
- {
- while (true)
- {
- /* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- if (StrategyControl->firstFreeBuffer < 0)
- {
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- break;
- }
-
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
-
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
-
- /*
- * Release the lock so someone else can access the freelist while
- * we check out this buffer.
- */
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot
- * use it; discard it and retry. (This can only happen if VACUUM
- * put a valid buffer in the freelist and then someone else used
- * it before we got to it. It's probably impossible altogether as
- * of 8.3, but we'd better check anyway.)
- */
- local_buf_state = LockBufHdr(buf);
- if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
- && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
- {
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- *buf_state = local_buf_state;
- return buf;
- }
- UnlockBufHdr(buf, local_buf_state);
- }
- }
-
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /* Use the "clock sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -356,29 +270,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
-/*
- * StrategyFreeBuffer: put a buffer on the freelist
- */
-void
-StrategyFreeBuffer(BufferDesc *buf)
-{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- /*
- * It is possible that we are told to put something in the freelist that
- * is already in it; don't screw up the list if so.
- */
- if (buf->freeNext == FREENEXT_NOT_IN_LIST)
- {
- buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
- StrategyControl->firstFreeBuffer = buf->buf_id;
- }
-
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-}
-
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -504,13 +395,6 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
-
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..00eade63971 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -217,8 +217,7 @@ BufMappingPartitionLockByIndex(uint32 index)
* single atomic variable. This layout allow us to do some operations in a
* single atomic operation, without actually acquiring and releasing spinlock;
* for instance, increase or decrease refcount. buf_id field never changes
- * after initialization, so does not need locking. freeNext is protected by
- * the buffer_strategy_lock not buffer header lock. The LWLock can take care
+ * after initialization, so does not need locking. The LWLock can take care
* of itself. The buffer header lock is *not* used to control access to the
* data in the buffer!
*
@@ -264,7 +263,6 @@ typedef struct BufferDesc
pg_atomic_uint32 state;
int wait_backend_pgprocno; /* backend of pin-count waiter */
- int freeNext; /* link in freelist chain */
PgAioWaitRef io_wref; /* set iff AIO is in progress */
LWLock content_lock; /* to lock access to buffer contents */
@@ -360,13 +358,6 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc)
return (LWLock *) (&bdesc->content_lock);
}
-/*
- * The freeNext field is either the index of the next freelist entry,
- * or one of these special values:
- */
-#define FREENEXT_END_OF_LIST (-1)
-#define FREENEXT_NOT_IN_LIST (-2)
-
/*
* Functions for acquiring/releasing a shared buffer header's spinlock. Do
* not apply these to local buffers!
@@ -453,7 +444,6 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
-extern bool have_free_buffer(void);
/* buf_table.c */
extern Size BufTableShmemSize(int size);
--
2.49.0
v5-0002-Remove-the-buffer_strategy_lock-and-make-the-cloc.patch (text/x-patch)
From 21b7f03f62ca69e412978ecbefbd4279fc68662e Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Fri, 11 Jul 2025 09:05:45 -0400
Subject: [PATCH v5 2/2] Remove the buffer_strategy_lock and make the clock
hand a 64 bit atomic
Change nextVictimBuffer to an atomic uint64 and simply atomically
increment it by 1 at each tick. The next victim buffer is the value
of nextVictimBuffer modulo the number of buffers (NBuffers). Modulo can
be expensive so we implement that as if the value of NBuffers was
required to be a power of 2 and account for the difference. The value of
nextVictimBuffer, because it is only ever incremented, now encodes
enough information to provide the number of completed passes of the
clock-sweep algorithm as well. This eliminates the need for a separate
counter and related maintenance. While wrap-around of nextVictimBuffer
would require at least 200 years on today's hardware, should that happen
BgBufferSync will properly determine the delta of passes.
With the removal of the freelist and completePasses none of the remaining
items in the BufferStrategyControl structure require strict coordination
and so it is possible to eliminate the buffer_strategy_lock as well.
---
src/backend/storage/buffer/README | 48 ++++----
src/backend/storage/buffer/bufmgr.c | 20 +++-
src/backend/storage/buffer/freelist.c | 166 +++++++++++++-------------
src/backend/storage/buffer/localbuf.c | 2 +-
src/include/storage/buf_internals.h | 4 +-
5 files changed, 121 insertions(+), 119 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index cd52effd911..d1ab222eeb8 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -127,11 +127,10 @@ bits of the tag's hash value. The rules stated above apply to each partition
independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
-* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that select buffers for replacement. A spinlock is
-used here rather than a lightweight lock for efficiency; no other locks of any
-sort should be acquired while buffer_strategy_lock is held. This is essential
-to allow buffer replacement to happen in multiple backends with reasonable
+* Operations that select buffers for replacement don't require a lock, but
+rather use atomic operations to ensure coordination across backends when
+accessing members of the BufferStrategyControl datastructure. This allows
+buffer replacement to happen in multiple backends with reasonable
concurrency.
* Each buffer header contains a spinlock that must be taken when examining
@@ -158,9 +157,9 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-To choose a victim buffer to recycle when there are no free buffers available,
-we use a simple clock-sweep algorithm, which avoids the need to take
-system-wide locks during common operations. It works like this:
+To choose a victim buffer to recycle we use a simple clock-sweep algorithm,
+which avoids the need to take system-wide locks during common operations. It
+works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -168,19 +167,17 @@ buffer header spinlock, which would have to be taken anyway to increment the
buffer reference count, so it's nearly free.)
The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
-through all the available buffers. nextVictimBuffer is protected by the
-buffer_strategy_lock.
+through all the available buffers. nextVictimBuffer and completePasses are
+atomic values.
The algorithm for a process that needs to obtain a victim buffer is:
-1. Obtain buffer_strategy_lock.
+1. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time.
-2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
-nextVictimBuffer for next time. Release buffer_strategy_lock.
-
-3. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used. Decrement its usage count (if nonzero), reacquire
-buffer_strategy_lock, and return to step 3 to examine the next buffer.
+2. If the selected buffer is pinned or has a nonzero usage count, it cannot be
+used. Decrement its usage count (if nonzero), return to step 3 to examine the
+next buffer.
4. Pin the selected buffer, and return.
@@ -196,9 +193,9 @@ Buffer Ring Replacement Strategy
When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
-again soon, so instead of running the normal clock sweep algorithm and
+again soon, so instead of running the normal clock-sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
-using the normal clock sweep algorithm and those buffers are reused for the
+using the normal clock-sweep algorithm and those buffers are reused for the
whole scan. This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.
@@ -244,13 +241,12 @@ nextVictimBuffer (which it does not change!), looking for buffers that are
dirty and not pinned nor marked with a positive usage count. It pins,
writes, and releases any such buffer.
-If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take buffer_strategy_lock in order to look
-for buffers to write; it needs only to spinlock each buffer header for long
-enough to check the dirtybit. Even without that assumption, the writer
-only needs to take the lock long enough to read the variable value, not
-while scanning the buffers. (This is a very substantial improvement in
-the contention cost of the writer compared to PG 8.0.)
+We enforce reading nextVictimBuffer within an atomic action so it needs only to
+spinlock each buffer header for long enough to check the dirtybit. Even
+without that assumption, the writer only needs to take the lock long enough to
+read the variable value, not while scanning the buffers. (This is a very
+substantial improvement in the contention cost of the writer compared to PG
+8.0.)
The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index af5ef025229..0be6f4d8c80 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3593,7 +3593,7 @@ BufferSync(int flags)
* This is called periodically by the background writer process.
*
* Returns true if it's appropriate for the bgwriter process to go into
- * low-power hibernation mode. (This happens if the strategy clock sweep
+ * low-power hibernation mode. (This happens if the strategy clock-sweep
* has been "lapped" and no buffer allocations have occurred recently,
* or if the bgwriter has been effectively disabled by setting
* bgwriter_lru_maxpages to 0.)
@@ -3643,7 +3643,7 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the clock sweep currently is, and how many buffer
+ * Find out where the clock-sweep currently is, and how many buffer
* allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
@@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
/*
* Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
* buffers we could scan before we'd catch up with it and "lap" it. Note:
* weird-looking coding of xxx_passes comparisons are to avoid bogus
* behavior when the passes counts wrap around.
*/
if (saved_info_valid)
{
- int32 passes_delta = strategy_passes - prev_strategy_passes;
+ int32 passes_delta;
+
+ if (unlikely(prev_strategy_passes > strategy_passes))
+ {
+ /* wrap-around case */
+ passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
+ }
+ else
+ {
+ passes_delta = (int32) (strategy_passes - prev_strategy_passes);
+ }
strategy_delta = strategy_buf_id - prev_strategy_buf_id;
strategy_delta += (long) passes_delta * NBuffers;
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index a228ff27377..940ec533d66 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include <math.h>
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
@@ -29,21 +30,17 @@
*/
typedef struct
{
- /* Spinlock: protects the values below */
- slock_t buffer_strategy_lock;
-
/*
- * Clock sweep hand: index of next buffer to consider grabbing. Note that
- * this isn't a concrete buffer - we only ever increase the value. So, to
- * get an actual buffer, it needs to be used modulo NBuffers.
+ * This is used as both the clock-sweep hand and the number of complete
+ * passes through the buffer pool. The lower bits below NBuffers are the
+ * clock-sweep and the upper bits are the number of complete passes.
*/
- pg_atomic_uint32 nextVictimBuffer;
+ pg_atomic_uint64 nextVictimBuffer;
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
*/
- uint32 completePasses; /* Complete cycles of the clock sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
@@ -83,6 +80,9 @@ typedef struct BufferAccessStrategyData
Buffer buffers[FLEXIBLE_ARRAY_MEMBER];
} BufferAccessStrategyData;
+static uint32 NBuffersPow2; /* NBuffers rounded up to the next power of 2 */
+static uint32 NBuffersPow2Shift; /* Amount to bitshift NBuffers for
+ * division */
/* Prototypes for internal functions */
static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
@@ -90,6 +90,58 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+ /*
+ * Calculate the number of complete passes through the buffer pool that have
+ * happened thus far. A "pass" is defined as the clock hand moving through
+ * all the buffers (NBuffers) in the pool once.
+ *
+ * This implements full_128bit_multiply(counter, reciprocal) >> 64 without
+ * the need for 128-bit arithmetic/types.
+ */
+static inline uint32
+clock_passes(uint64 counter)
+{
+ uint32 result;
+
+ /* Calculate complete power of 2 cycles by bitshifting */
+ uint64 pow2_passes = counter >> NBuffersPow2Shift;
+ /* Determine the counter's current position in the cycle */
+ uint64 masked_counter = counter & (NBuffersPow2 - 1);
+ /* Has the counter passed NBuffers yet? */
+ uint32 extra_passes = (masked_counter >= NBuffers) ? 1 : 0;
+ /* Calculate passes per power-of-2, typically 1 or 2 */
+ uint32 passes_per_cycle = NBuffersPow2 / NBuffers;
+ /*
+ * Combine total passes, multiply complete power-of-2 cycles by passes per
+ * cycle, then add any extra pass from the current incomplete cycle.
+ */
+ result = (uint32) (pow2_passes * passes_per_cycle) + extra_passes;
+
+ Assert(result <= UINT32_MAX);
+ Assert(result == ((uint32) (counter / NBuffers)));
+
+ return result;
+}
+
+ /*
+ * Calculate the counter modulo the number of buffers in the pool (NBuffers).
+ */
+static inline uint32
+clock_modulo(uint64 counter)
+{
+ /* Determine the counter's current position in the cycle */
+ uint64 result = (uint32) counter & (NBuffersPow2 - 1);
+
+ /* Adjust if the next power of 2 masked counter is more than NBuffers */
+ if (result >= NBuffers)
+ result -= NBuffers;
+
+ Assert(result < NBuffers);
+ Assert(result == (uint32) (counter % NBuffers));
+
+ return result;
+}
+
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
*
@@ -99,6 +151,7 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
static inline uint32
ClockSweepTick(void)
{
+ uint64 counter;
uint32 victim;
/*
@@ -106,52 +159,11 @@ ClockSweepTick(void)
* doing this, this can lead to buffers being returned slightly out of
* apparent order.
*/
- victim =
- pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
-
- if (victim >= NBuffers)
- {
- uint32 originalVictim = victim;
+ counter = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);
+ victim = clock_modulo(counter);
- /* always wrap what we look up in BufferDescriptors */
- victim = victim % NBuffers;
+ Assert(victim < NBuffers);
- /*
- * If we're the one that just caused a wraparound, force
- * completePasses to be incremented while holding the spinlock. We
- * need the spinlock so StrategySyncStart() can return a consistent
- * value consisting of nextVictimBuffer and completePasses.
- */
- if (victim == 0)
- {
- uint32 expected;
- uint32 wrapped;
- bool success = false;
-
- expected = originalVictim + 1;
-
- while (!success)
- {
- /*
- * Acquire the spinlock while increasing completePasses. That
- * allows other readers to read nextVictimBuffer and
- * completePasses in a consistent manner which is required for
- * StrategySyncStart(). In theory delaying the increment
- * could lead to an overflow of nextVictimBuffers, but that's
- * highly unlikely and wouldn't be particularly harmful.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- wrapped = expected % NBuffers;
-
- success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
- &expected, wrapped);
- if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- }
- }
- }
return victim;
}
@@ -177,10 +189,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*from_ring = false;
- /*
- * If given a strategy object, see whether it can select a buffer. We
- * assume strategy objects don't need buffer_strategy_lock.
- */
+ /* If given a strategy object, see whether it can select a buffer */
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
@@ -225,7 +234,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /* Use the "clock sweep" algorithm to find a free buffer */
+ /* Use the "clock-sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -281,32 +290,25 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* allocs if non-NULL pointers are passed. The alloc count is reset after
* being read.
*/
-int
+uint32
StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
{
- uint32 nextVictimBuffer;
- int result;
+ uint64 counter;
+ uint32 result;
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
- nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
- result = nextVictimBuffer % NBuffers;
+ counter = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
+ result = clock_modulo(counter);
if (complete_passes)
{
- *complete_passes = StrategyControl->completePasses;
-
- /*
- * Additionally add the number of wraparounds that happened before
- * completePasses could be incremented. C.f. ClockSweepTick().
- */
- *complete_passes += nextVictimBuffer / NBuffers;
+ *complete_passes = clock_passes(counter);
}
if (num_buf_alloc)
{
*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
return result;
}
@@ -321,21 +323,14 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
void
StrategyNotifyBgWriter(int bgwprocno)
{
- /*
- * We acquire buffer_strategy_lock just to ensure that the store appears
- * atomic to StrategyGetBuffer. The bgwriter should call this rather
- * infrequently, so there's no performance penalty from being safe.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
StrategyControl->bgwprocno = bgwprocno;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
/*
* StrategyShmemSize
*
- * estimate the size of shared memory used by the freelist-related structures.
+ * Estimate the size of shared memory used by the freelist-related structures.
*
* Note: for somewhat historical reasons, the buffer lookup hashtable size
* is also determined here.
@@ -393,13 +388,14 @@ StrategyInitialize(bool init)
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
-
- /* Initialize the clock sweep pointer */
- pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+ /* Initialize combined clock-sweep pointer/complete passes counter */
+ pg_atomic_init_u64(&StrategyControl->nextVictimBuffer, 0);
+ /* Find the smallest power of 2 larger than NBuffers */
+ NBuffersPow2 = pg_nextpower2_32(NBuffers);
+ /* Using that, find the number of positions to shift for division */
+ NBuffersPow2Shift = pg_leftmost_one_pos32(NBuffersPow2);
/* Clear statistics */
- StrategyControl->completePasses = 0;
pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
/* No pending notification */
@@ -643,7 +639,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*
* If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
* since our own previous usage of the ring element would have left it
- * there, but it might've been decremented by clock sweep since then). A
+ * there, but it might've been decremented by clock-sweep since then). A
* higher usage_count indicates someone else has touched the buffer, so we
* shouldn't re-use it.
*/
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3da9c41ee1d..7a34f5e430a 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -229,7 +229,7 @@ GetLocalVictimBuffer(void)
ResourceOwnerEnlarge(CurrentResourceOwner);
/*
- * Need to get a new buffer. We use a clock sweep algorithm (essentially
+ * Need to get a new buffer. We use a clock-sweep algorithm (essentially
* the same as what freelist.c does now...)
*/
trycounter = NLocBuffer;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 00eade63971..97002acb757 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -81,7 +81,7 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
* accuracy and speed of the clock-sweep buffer management algorithm. A
* large value (comparable to NBuffers) would approximate LRU semantics.
* But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of
- * clock sweeps to find a free buffer, so in practice we don't want the
+ * clock-sweeps to find a free buffer, so in practice we don't want the
* value to be very large.
*/
#define BM_MAX_USAGE_COUNT 5
@@ -439,7 +439,7 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
-extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
--
2.49.0
Hi,
I'd be curious if anybody wants to argue for keeping the clock sweep. Except
for the have_free_buffer() use in autoprewarm, it's a rather trivial
patch. And I really couldn't measure regressions above the noise level, even
in absurdly extreme use cases.
On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
I think we'll likely need something to replace it.
Fair, this (v5) patch doesn't yet try to address this.
TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite right. The goal of the use of have_free_buffer() is obviously to stop prewarming shared buffers if doing so would just evict buffers. But it's not clear to me that we should just stop when there aren't any free buffers - what if the previous buffer contents aren't the right ones? It'd make more sense to me to stop autoprewarm once NBuffers have been prewarmed...
I had the same high level reaction, that autoprewarm was leveraging something convenient but not necessarily required or even correct. I'd considered using NBuffers as you describe due to similar intuitions, I'll dig into that idea for the next revision after I get to know autoprewarm a bit better.
Cool. I do think that'll be good enough.
The most obvious way around this would be to make the clock hand a 64bit atomic, which would avoid the need to have a separate tracker for the number of passes. Unfortunately doing so would require doing a modulo operation each clock tick, which I think would likely be too expensive on common platforms - on small shared_buffers I actually see existing, relatively rarely reached, modulo in ClockSweepTick() show up on a Skylake-X system.
So, this idea came back to me today as I tossed out the union branch and started over.
a) can't require a power of 2 for NBuffers
b) would like a power of 2 for NBuffers to make a few things more efficient
c) a simple uint64 atomic counter would simplify things
The attached (v5) patch takes this approach *and* avoids the modulo you were concerned with. My approach is to have nextVictimBuffer as a uint64 that only increments (and at some point 200 years or so might wrap around, but I digress). To get the actual "victim" you modulo that, but not with "%" you call clock_modulo(). In that function I use a "next power of 2" value rather than NBuffers to efficiently find the modulo and adjust for the actual value. Same for completePasses which is now a function clock_passes() that does similar trickery and returns the number of times the counter (nextVictimBuffer) has "wrapped" around modulo NBuffers.
Yea, that could work! It'd be interesting to see some performance numbers for
this...
Now that both values exist in the same uint64 it can be the atomic vessel
that coordinates them, no synchronization problems at all and no requirement
for the buffer_strategy_lock.
Nice!
I think while at it, we should make ClockSweepTick() decrement nextVictimBuffer by atomically subtracting NBuffers, rather than using CAS. I recently noticed that the CAS sometimes has to retry a fair number of times, which in turn makes the `victim % NBuffers` show up in profiles.
In my (v5) patch there is one CAS that increments NBuffers. All other operations on NBuffers are atomic reads. The modulo you mention is gone entirely, unnecessary AFAICT.
There shouldn't be any CASes needed now, right? Just a fetch-add? The latter
often scales *way* better under contention.
[Looks at the patch ...]
Which I think is true in your patch, I don't see any CAS.
Meanwhile, the tests except for Windows pass [2] for this new patch [3].
I'll dig into the Windows issues next week as well.
FWIW, there are backtraces generated on windows. E.g.
000000cd`827fdea0 00007ff7`6ad82f88 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
000000cd`827fdee0 00007ff7`6aae2b7c postgres!ExceptionalCondition(
char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
000000cd`827fdf20 00007ff7`6aae272c postgres!clock_modulo(
unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
000000cd`827fdf60 00007ff7`6aad8647 postgres!StrategySyncStart(
unsigned int * complete_passes = 0x000000cd`827fdfc0,
unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
000000cd`827fdfa0 00007ff7`6aa254a3 postgres!BgBufferSync(
struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
000000cd`827fe030 00007ff7`6aa278a7 postgres!BackgroundWriterMain(
void * startup_data = 0x00000000`00000000,
unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
000000cd`827ff5a0 00007ff7`6a8daf19 postgres!SubPostmasterMain(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
000000cd`827ff620 00007ff7`6af0f5a9 postgres!main(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
I.e. your new assertion failed for some reason that i can't *immediately* see.
@@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
 /*
  * Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
  * buffers we could scan before we'd catch up with it and "lap" it. Note:
  * weird-looking coding of xxx_passes comparisons are to avoid bogus
  * behavior when the passes counts wrap around.
  */
 if (saved_info_valid)
 {
-    int32 passes_delta = strategy_passes - prev_strategy_passes;
+    int32 passes_delta;
+
+    if (unlikely(prev_strategy_passes > strategy_passes))
+    {
+        /* wrap-around case */
+        passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
+    }
+    else
+    {
+        passes_delta = (int32) (strategy_passes - prev_strategy_passes);
+    }

     strategy_delta = strategy_buf_id - prev_strategy_buf_id;
     strategy_delta += (long) passes_delta * NBuffers;
That seems somewhat independent of the rest of the change, or am I missing something?
+static uint32 NBuffersPow2; /* NBuffers rounded up to the next power of 2 */
+static uint32 NBuffersPow2Shift; /* Amount to bitshift NBuffers for
+ * division */
For performance in ClockSweepTick() it might make more sense to store the mask (i.e. NBuffersPow2 - 1), rather than the actual power of two.
Greetings,
Andres Freund
On 7/18/25 13:03, Andres Freund wrote:
Hi,
Hello. Thanks again for taking the time to review the email and patch,
I think we're onto something good here.
I'd be curious if anybody wants to argue for keeping the clock sweep. Except
for the have_free_buffer() use in autoprewarm, it's a rather trivial
patch. And I really couldn't measure regressions above the noise level, even
in absurdly extreme use cases.
Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
for keeping the freelist"?
On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
I think we'll likely need something to replace it.
Fair, this (v5) patch doesn't yet try to address this.
TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite right. The goal of the use of have_free_buffer() is obviously to stop prewarming shared buffers if doing so would just evict buffers. But it's not clear to me that we should just stop when there aren't any free buffers - what if the previous buffer contents aren't the right ones? It'd make more sense to me to stop autoprewarm once NBuffers have been prewarmed...
I had the same high level reaction, that autoprewarm was leveraging something convenient but not necessarily required or even correct. I'd considered using NBuffers as you describe due to similar intuitions, I'll dig into that idea for the next revision after I get to know autoprewarm a bit better.
Cool. I do think that'll be good enough.
I re-added the have_free_buffer() function, only now it returns false once nextVictimBuffer > NBuffers, signaling to autoprewarm that the clock has made its first complete pass. With that I reverted my changes in the autoprewarm module. The net should be the same behavior as before at startup when using that module.
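Roughly speaking, the check reduces to comparing the 64-bit counter against NBuffers (a sketch of the idea, not the exact v6 code):

    bool
    have_free_buffer(void)
    {
        /*
         * Sketch: buffers look "free" only until the clock hand has finished
         * its first complete pass over the pool.
         */
        return pg_atomic_read_u64(&StrategyControl->nextVictimBuffer) < (uint64) NBuffers;
    }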
The most obvious way around this would be to make the clock hand a 64bit atomic, which would avoid the need to have a separate tracker for the number of passes. Unfortunately doing so would require doing a modulo operation each clock tick, which I think would likely be too expensive on common platforms - on small shared_buffers I actually see existing, relatively rarely reached, modulo in ClockSweepTick() show up on a Skylake-X system.
So, this idea came back to me today as I tossed out the union branch and started over.
a) can't require a power of 2 for NBuffers
b) would like a power of 2 for NBuffers to make a few things more efficient
c) a simple uint64 atomic counter would simplify things
The attached (v5) patch takes this approach *and* avoids the modulo you were concerned with. My approach is to have nextVictimBuffer as a uint64 that only increments (and at some point 200 years or so might wrap around, but I digress). To get the actual "victim" you modulo that, but not with "%" you call clock_modulo(). In that function I use a "next power of 2" value rather than NBuffers to efficiently find the modulo and adjust for the actual value. Same for completePasses which is now a function clock_passes() that does similar trickery and returns the number of times the counter (nextVictimBuffer) has "wrapped" around modulo NBuffers.
Yea, that could work! It'd be interesting to see some performance numbers for this...
Still no performance comparisons yet, but my gut says this should reduce
contention across cores on a very hot path so I'd imagine some
performance improvement.
Now that both values exist in the same uint64 it can be the atomic vessel that coordinates them, no synchronization problems at all and no requirement for the buffer_strategy_lock.
Nice!
I think while at it, we should make ClockSweepTick() decrement nextVictimBuffer by atomically subtracting NBuffers, rather than using CAS. I recently noticed that the CAS sometimes has to retry a fair number of times, which in turn makes the `victim % NBuffers` show up in profiles.
In my (v5) patch there is one CAS that increments NBuffers. All other operations on NBuffers are atomic reads. The modulo you mention is gone entirely, unnecessary AFAICT.
There shouldn't be any CASes needed now, right? Just a fetch-add? The latter often scales *way* better under contention.
[Looks at the patch ...]
Which I think is true in your patch, I don't see any CAS.
You are correct, no CAS at all anymore, just a mental mistake in the last email. Now there are only atomic reads and a single atomic fetch-add in ClockSweepTick().
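i.e. the hot path (as in the attached v5 patch, and unchanged in spirit for v6) is now just:

    counter = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);
    victim = clock_modulo(counter);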
Meanwhile, the tests except for Windows pass [2] for this new patch [3].
I'll dig into the Windows issues next week as well.
FWIW, there are backtraces generated on windows. E.g.
000000cd`827fdea0 00007ff7`6ad82f88 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
000000cd`827fdee0 00007ff7`6aae2b7c postgres!ExceptionalCondition(
char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
000000cd`827fdf20 00007ff7`6aae272c postgres!clock_modulo(
unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
000000cd`827fdf60 00007ff7`6aad8647 postgres!StrategySyncStart(
unsigned int * complete_passes = 0x000000cd`827fdfc0,
unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
000000cd`827fdfa0 00007ff7`6aa254a3 postgres!BgBufferSync(
struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
000000cd`827fe030 00007ff7`6aa278a7 postgres!BackgroundWriterMain(
void * startup_data = 0x00000000`00000000,
unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
000000cd`827ff5a0 00007ff7`6a8daf19 postgres!SubPostmasterMain(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
000000cd`827ff620 00007ff7`6af0f5a9 postgres!main(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
I.e. your new assertion failed for some reason that i can't *immediately* see.
I put that in as a precaution and as a way to communicate the intention
of the other code above it. I never imagined it would assert. I've
changed clock_read() to only assert when the modulo differs and left
that assert in the calling ClockSweepTick() function because it was
redundant and I'm curious to see if we see a similar assert when testing
the modulo.
@@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
 /*
  * Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
  * buffers we could scan before we'd catch up with it and "lap" it. Note:
  * weird-looking coding of xxx_passes comparisons are to avoid bogus
  * behavior when the passes counts wrap around.
  */
 if (saved_info_valid)
 {
-    int32 passes_delta = strategy_passes - prev_strategy_passes;
+    int32 passes_delta;
+
+    if (unlikely(prev_strategy_passes > strategy_passes))
+    {
+        /* wrap-around case */
+        passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
+    }
+    else
+    {
+        passes_delta = (int32) (strategy_passes - prev_strategy_passes);
+    }

     strategy_delta = strategy_buf_id - prev_strategy_buf_id;
     strategy_delta += (long) passes_delta * NBuffers;

That seems somewhat independent of the rest of the change, or am I missing something?
That change is there to cover the possibility of someone managing to
overflow and wrap a uint64 which is *highly* unlikely. If this degree
of paranoia isn't required I'm happy to remove it.
+static uint32 NBuffersPow2; /* NBuffers rounded up to the next power of 2 */
+static uint32 NBuffersPow2Shift; /* Amount to bitshift NBuffers for
+ * division */
For performance in ClockSweepTick() it might make more sense to store the mask (i.e. NBuffersPow2 - 1), rather than the actual power of two.
Agreed, I've done that and created one more calculated value that can be pre-computed once at runtime and never recomputed (unless NBuffers changes).
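Roughly, the precomputation now looks like this (a sketch; NBuffersPow2Mask is my placeholder name for the extra value, the v6 patch may spell it differently):

    /* computed once in StrategyInitialize(), dependent only on NBuffers */
    NBuffersPow2 = pg_nextpower2_32(NBuffers);
    NBuffersPow2Shift = pg_leftmost_one_pos32(NBuffersPow2);
    NBuffersPow2Mask = NBuffersPow2 - 1;

    /* ... letting clock_modulo() mask directly instead of recomputing the mask */
    result = (uint32) counter & NBuffersPow2Mask;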
Greetings,
Andres Freund
thanks again for the review, v6 attached and re-based onto afa5c365ec5, also on GitHub at [1][2].
-greg
[1]: https://github.com/gburd/postgres/pull/7/checks
[2]: https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v6
Attachments:
v6-0001-Eliminate-the-freelist-from-the-buffer-manager-an.patch (text/x-patch)
From 4b747751d9c2fb679496f8c0c0d4dd4373a14b48 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Thu, 10 Jul 2025 14:45:32 -0400
Subject: [PATCH v6 1/2] Eliminate the freelist from the buffer manager and
depend on clock-sweep.
This set of changes removes the list of available buffers and instead
simply uses the clock-sweep algorithm to find and return an available
buffer. While on the surface this appears to be removing an
optimization, it is in fact eliminating code that induces overhead in the
form of synchronization that is problematic for multi-core systems.
This also changes the have_free_buffer() function to return true until
every buffer in the pool has been considered once by the clock-sweep
algorithm so as to inform the pg_prewarm module as to when to stop
warming.
---
src/backend/storage/buffer/README | 42 +++------
src/backend/storage/buffer/buf_init.c | 9 --
src/backend/storage/buffer/bufmgr.c | 29 +------
src/backend/storage/buffer/freelist.c | 120 +++-----------------------
src/include/storage/buf_internals.h | 11 +--
5 files changed, 28 insertions(+), 183 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a182fcd660c..cd52effd911 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -128,11 +128,11 @@ independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement. A spinlock is used here rather than a lightweight
-lock for efficiency; no other locks of any sort should be acquired while
-buffer_strategy_lock is held. This is essential to allow buffer replacement
-to happen in multiple backends with reasonable concurrency.
+exclusion for operations that select buffers for replacement. A spinlock is
+used here rather than a lightweight lock for efficiency; no other locks of any
+sort should be acquired while buffer_strategy_lock is held. This is essential
+to allow buffer replacement to happen in multiple backends with reasonable
+concurrency.
* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
@@ -158,18 +158,9 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-There is a "free list" of buffers that are prime candidates for replacement.
-In particular, buffers that are completely free (contain no valid page) are
-always in this list. We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that. The list is singly-linked using fields in the
-buffer headers; we maintain head and tail pointers in global variables.
-(Note: although the list links are in the buffer headers, they are
-considered to be protected by the buffer_strategy_lock, not the buffer-header
-spinlocks.) To choose a victim buffer to recycle when there are no free
-buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations. It works like
-this:
+To choose a victim buffer to recycle when there are no free buffers available,
+we use a simple clock-sweep algorithm, which avoids the need to take
+system-wide locks during common operations. It works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -184,20 +175,14 @@ The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain buffer_strategy_lock.
-2. If buffer free list is nonempty, remove its head buffer. Release
-buffer_strategy_lock. If the buffer is pinned or has a nonzero usage count,
-it cannot be used; ignore it go back to step 1. Otherwise, pin the buffer,
-and return it.
+2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time. Release buffer_strategy_lock.
-3. Otherwise, the buffer free list is empty. Select the buffer pointed to by
-nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
-Release buffer_strategy_lock.
-
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
+3. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used. Decrement its usage count (if nonzero), reacquire
buffer_strategy_lock, and return to step 3 to examine the next buffer.
-5. Pin the selected buffer, and return.
+4. Pin the selected buffer, and return.
(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -234,7 +219,7 @@ the ring strategy effectively degrades to the normal strategy.
VACUUM uses a ring like sequential scans, however, the size of this ring is
controlled by the vacuum_buffer_usage_limit GUC. Dirty pages are not removed
-from the ring. Instead, WAL is flushed if needed to allow reuse of the
+from the ring. Instead, the WAL is flushed if needed to allow reuse of the
buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's buffers
were sent to the freelist, which was effectively a buffer ring of 1 buffer,
resulting in excessive WAL flushing.
@@ -277,3 +262,4 @@ As of 8.4, background writer starts during recovery mode when there is
some form of potentially extended recovery to perform. It performs an
identical service to normal processing, except that checkpoints it
writes are technically restartpoints.
+
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..6fd3a6bbac5 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -128,20 +128,11 @@ BufferManagerShmemInit(void)
pgaio_wref_clear(&buf->io_wref);
- /*
- * Initially link all the buffers together as unused. Subsequent
- * management of this list is done by freelist.c.
- */
- buf->freeNext = i + 1;
-
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
-
- /* Correct last entry of linked list */
- GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
/* Init other shared buffer-management stuff */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6afdd28dba6..af5ef025229 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2099,12 +2099,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
UnpinBuffer(victim_buf_hdr);
- /*
- * The victim buffer we acquired previously is clean and unused, let
- * it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
-
/* remaining code should match code at top of routine */
existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
@@ -2163,8 +2157,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
/*
- * InvalidateBuffer -- mark a shared buffer invalid and return it to the
- * freelist.
+ * InvalidateBuffer -- mark a shared buffer invalid.
*
* The buffer header spinlock must be held at entry. We drop it before
* returning. (This is sane because the caller must have locked the
@@ -2262,11 +2255,6 @@ retry:
* Done with mapping lock.
*/
LWLockRelease(oldPartitionLock);
-
- /*
- * Insert the buffer at the head of the list of free buffers.
- */
- StrategyFreeBuffer(buf);
}
/*
@@ -2684,11 +2672,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
{
BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(buf_hdr);
UnpinBuffer(buf_hdr);
}
@@ -2763,12 +2746,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
valid = PinBuffer(existing_hdr, strategy);
LWLockRelease(partition_lock);
-
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
UnpinBuffer(victim_buf_hdr);
buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
@@ -3666,8 +3643,8 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the freelist clock sweep currently is, and how many
- * buffer allocations have happened since our last call.
+ * Find out where the clock sweep currently is, and how many buffer
+ * allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..162c140fb9d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -39,14 +39,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -164,17 +156,16 @@ ClockSweepTick(void)
}
/*
- * have_free_buffer -- a lockless check to see if there is a free buffer in
- * buffer pool.
+ * have_free_buffer -- check if we've filled the buffer pool at startup
*
- * If the result is true that will become stale once free buffers are moved out
- * by other operations, so the caller who strictly want to use a free buffer
- * should not call this.
+ * Used exclusively by autoprewarm.
*/
bool
have_free_buffer(void)
{
- if (StrategyControl->firstFreeBuffer >= 0)
+ uint64 hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
+
+ if (hand < NBuffers)
return true;
else
return false;
@@ -243,75 +234,14 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
/*
- * We count buffer allocation requests so that the bgwriter can estimate
- * the rate of buffer consumption. Note that buffers recycled by a
- * strategy object are intentionally not counted here.
+ * We keep an approximate count of buffer allocation requests so that the
+ * bgwriter can estimate the rate of buffer consumption. Note that
+ * buffers recycled by a strategy object are intentionally not counted
+ * here.
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /*
- * First check, without acquiring the lock, whether there's buffers in the
- * freelist. Since we otherwise don't require the spinlock in every
- * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
- * uselessly in most cases. That obviously leaves a race where a buffer is
- * put on the freelist but we don't see the store yet - but that's pretty
- * harmless, it'll just get used during the next buffer acquisition.
- *
- * If there's buffers on the freelist, acquire the spinlock to pop one
- * buffer of the freelist. Then check whether that buffer is usable and
- * repeat if not.
- *
- * Note that the freeNext fields are considered to be protected by the
- * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
- * manipulate them without holding the spinlock.
- */
- if (StrategyControl->firstFreeBuffer >= 0)
- {
- while (true)
- {
- /* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- if (StrategyControl->firstFreeBuffer < 0)
- {
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- break;
- }
-
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
-
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
-
- /*
- * Release the lock so someone else can access the freelist while
- * we check out this buffer.
- */
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot
- * use it; discard it and retry. (This can only happen if VACUUM
- * put a valid buffer in the freelist and then someone else used
- * it before we got to it. It's probably impossible altogether as
- * of 8.3, but we'd better check anyway.)
- */
- local_buf_state = LockBufHdr(buf);
- if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
- && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
- {
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- *buf_state = local_buf_state;
- return buf;
- }
- UnlockBufHdr(buf, local_buf_state);
- }
- }
-
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /* Use the "clock sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -356,29 +286,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
-/*
- * StrategyFreeBuffer: put a buffer on the freelist
- */
-void
-StrategyFreeBuffer(BufferDesc *buf)
-{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- /*
- * It is possible that we are told to put something in the freelist that
- * is already in it; don't screw up the list if so.
- */
- if (buf->freeNext == FREENEXT_NOT_IN_LIST)
- {
- buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
- StrategyControl->firstFreeBuffer = buf->buf_id;
- }
-
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-}
-
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -504,13 +411,6 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
-
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..d4449e11384 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -217,8 +217,7 @@ BufMappingPartitionLockByIndex(uint32 index)
* single atomic variable. This layout allow us to do some operations in a
* single atomic operation, without actually acquiring and releasing spinlock;
* for instance, increase or decrease refcount. buf_id field never changes
- * after initialization, so does not need locking. freeNext is protected by
- * the buffer_strategy_lock not buffer header lock. The LWLock can take care
+ * after initialization, so does not need locking. The LWLock can take care
* of itself. The buffer header lock is *not* used to control access to the
* data in the buffer!
*
@@ -264,7 +263,6 @@ typedef struct BufferDesc
pg_atomic_uint32 state;
int wait_backend_pgprocno; /* backend of pin-count waiter */
- int freeNext; /* link in freelist chain */
PgAioWaitRef io_wref; /* set iff AIO is in progress */
LWLock content_lock; /* to lock access to buffer contents */
@@ -360,13 +358,6 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc)
return (LWLock *) (&bdesc->content_lock);
}
-/*
- * The freeNext field is either the index of the next freelist entry,
- * or one of these special values:
- */
-#define FREENEXT_END_OF_LIST (-1)
-#define FREENEXT_NOT_IN_LIST (-2)
-
/*
* Functions for acquiring/releasing a shared buffer header's spinlock. Do
* not apply these to local buffers!
--
2.49.0
v6-0002-Remove-the-buffer_strategy_lock-and-make-the-cloc.patchtext/x-patch; charset=UTF-8; name=v6-0002-Remove-the-buffer_strategy_lock-and-make-the-cloc.patchDownload
From 167a36a6f38383a493cea88ba574a498e4b37dce Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Fri, 11 Jul 2025 09:05:45 -0400
Subject: [PATCH v6 2/2] Remove the buffer_strategy_lock and make the clock
hand a 64 bit atomic
Change nextVictimBuffer to an atomic uint64 and simply atomically
increment it by 1 at each tick. The next victim buffer is the value
of nextVictimBuffer modulo the number of buffers (NBuffers). Modulo can
be expensive so we implement that as if the value of NBuffers was
required to be a power of 2 and account for the difference. The value of
nextVictimBuffer, because it is only ever incremented, now encodes
enough information to provide the number of completed passes of the
clock-sweep algorithm as well. This eliminates the need for a separate
counter and related maintenance. While wrap-around of nextVictimBuffer
would require at least 200 years on today's hardware, should that happen
BgBufferSync will properly determine the delta of passes.
With the removal of the freelist and completePasses, none of the remaining
items in the BufferStrategyControl structure require strict coordination
and so it is possible to eliminate the buffer_strategy_lock as well.
---
src/backend/storage/buffer/README | 48 ++++---
src/backend/storage/buffer/bufmgr.c | 20 ++-
src/backend/storage/buffer/freelist.c | 176 +++++++++++++-------------
src/backend/storage/buffer/localbuf.c | 2 +-
src/include/storage/buf_internals.h | 4 +-
5 files changed, 131 insertions(+), 119 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index cd52effd911..d1ab222eeb8 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -127,11 +127,10 @@ bits of the tag's hash value. The rules stated above apply to each partition
independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
-* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that select buffers for replacement. A spinlock is
-used here rather than a lightweight lock for efficiency; no other locks of any
-sort should be acquired while buffer_strategy_lock is held. This is essential
-to allow buffer replacement to happen in multiple backends with reasonable
+* Operations that select buffers for replacement don't require a lock, but
+rather use atomic operations to ensure coordination across backends when
+accessing members of the BufferStrategyControl datastructure. This allows
+buffer replacement to happen in multiple backends with reasonable
concurrency.
* Each buffer header contains a spinlock that must be taken when examining
@@ -158,9 +157,9 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-To choose a victim buffer to recycle when there are no free buffers available,
-we use a simple clock-sweep algorithm, which avoids the need to take
-system-wide locks during common operations. It works like this:
+To choose a victim buffer to recycle we use a simple clock-sweep algorithm,
+which avoids the need to take system-wide locks during common operations. It
+works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -168,19 +167,17 @@ buffer header spinlock, which would have to be taken anyway to increment the
buffer reference count, so it's nearly free.)
The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
-through all the available buffers. nextVictimBuffer is protected by the
-buffer_strategy_lock.
+through all the available buffers. nextVictimBuffer and completePasses are
+atomic values.
The algorithm for a process that needs to obtain a victim buffer is:
-1. Obtain buffer_strategy_lock.
+1. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time.
-2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
-nextVictimBuffer for next time. Release buffer_strategy_lock.
-
-3. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used. Decrement its usage count (if nonzero), reacquire
-buffer_strategy_lock, and return to step 3 to examine the next buffer.
+2. If the selected buffer is pinned or has a nonzero usage count, it cannot be
+used. Decrement its usage count (if nonzero), return to step 3 to examine the
+next buffer.
4. Pin the selected buffer, and return.
@@ -196,9 +193,9 @@ Buffer Ring Replacement Strategy
When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
-again soon, so instead of running the normal clock sweep algorithm and
+again soon, so instead of running the normal clock-sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
-using the normal clock sweep algorithm and those buffers are reused for the
+using the normal clock-sweep algorithm and those buffers are reused for the
whole scan. This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.
@@ -244,13 +241,12 @@ nextVictimBuffer (which it does not change!), looking for buffers that are
dirty and not pinned nor marked with a positive usage count. It pins,
writes, and releases any such buffer.
-If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take buffer_strategy_lock in order to look
-for buffers to write; it needs only to spinlock each buffer header for long
-enough to check the dirtybit. Even without that assumption, the writer
-only needs to take the lock long enough to read the variable value, not
-while scanning the buffers. (This is a very substantial improvement in
-the contention cost of the writer compared to PG 8.0.)
+We enforce reading nextVictimBuffer within an atomic action so it needs only to
+spinlock each buffer header for long enough to check the dirtybit. Even
+without that assumption, the writer only needs to take the lock long enough to
+read the variable value, not while scanning the buffers. (This is a very
+substantial improvement in the contention cost of the writer compared to PG
+8.0.)
The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index af5ef025229..0be6f4d8c80 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3593,7 +3593,7 @@ BufferSync(int flags)
* This is called periodically by the background writer process.
*
* Returns true if it's appropriate for the bgwriter process to go into
- * low-power hibernation mode. (This happens if the strategy clock sweep
+ * low-power hibernation mode. (This happens if the strategy clock-sweep
* has been "lapped" and no buffer allocations have occurred recently,
* or if the bgwriter has been effectively disabled by setting
* bgwriter_lru_maxpages to 0.)
@@ -3643,7 +3643,7 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the clock sweep currently is, and how many buffer
+ * Find out where the clock-sweep currently is, and how many buffer
* allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
@@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
/*
* Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
* buffers we could scan before we'd catch up with it and "lap" it. Note:
* weird-looking coding of xxx_passes comparisons are to avoid bogus
* behavior when the passes counts wrap around.
*/
if (saved_info_valid)
{
- int32 passes_delta = strategy_passes - prev_strategy_passes;
+ int32 passes_delta;
+
+ if (unlikely(prev_strategy_passes > strategy_passes))
+ {
+ /* wrap-around case */
+ passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
+ }
+ else
+ {
+ passes_delta = (int32) (strategy_passes - prev_strategy_passes);
+ }
strategy_delta = strategy_buf_id - prev_strategy_buf_id;
strategy_delta += (long) passes_delta * NBuffers;
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 162c140fb9d..0b49d178362 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
*/
#include "postgres.h"
+#include <math.h>
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
@@ -29,21 +30,17 @@
*/
typedef struct
{
- /* Spinlock: protects the values below */
- slock_t buffer_strategy_lock;
-
/*
- * Clock sweep hand: index of next buffer to consider grabbing. Note that
- * this isn't a concrete buffer - we only ever increase the value. So, to
- * get an actual buffer, it needs to be used modulo NBuffers.
+ * This is used as both the clock-sweep hand and the number of complete
+ * passes through the buffer pool. The lower bits below NBuffers are the
+ * clock-sweep and the upper bits are the number of complete passes.
*/
- pg_atomic_uint32 nextVictimBuffer;
+ pg_atomic_uint64 nextVictimBuffer;
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
*/
- uint32 completePasses; /* Complete cycles of the clock sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
@@ -83,12 +80,71 @@ typedef struct BufferAccessStrategyData
Buffer buffers[FLEXIBLE_ARRAY_MEMBER];
} BufferAccessStrategyData;
+static uint32 NBuffersPow2Mask; /* Next power-of-2 >= NBuffers - 1 */
+static uint32 NBuffersPow2Shift; /* Amount to bitshift for division */
+static uint32 NBuffersPerCycle; /* Number of buffers in a complete cycle */
/* Prototypes for internal functions */
static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
uint32 *buf_state);
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+static inline uint32 clock_passes(uint64 hand);
+static inline uint32 clock_read(uint64 hand);
+
+ /*
+ * Calculate the number of complete passes through the buffer pool that have
+ * happened thus far. A "pass" is defined as the clock hand moving through
+ * all the buffers (NBuffers) in the pool once. Our clock hand is a 64-bit
+ * counter that only increases. The number of passes is the upper bits of the
+ * counter divided by NBuffers.
+ */
+static inline uint32
+clock_passes(uint64 hand)
+{
+ uint32 result;
+
+ /* Calculate complete next power-of-2 cycles by bitshifting */
+ uint64 pow2_passes = hand >> NBuffersPow2Shift;
+
+ /* Determine the hand's current position in the cycle */
+ uint64 masked_hand = hand & NBuffersPow2Mask;
+
+ /* Has the hand passed NBuffers yet? */
+ uint32 extra_passes = (masked_hand >= NBuffers) ? 1 : 0;
+
+ /*
+ * Combine total passes, multiply complete power-of-2 cycles by passes
+ * per-cycle, then add any extra pass from the current incomplete cycle.
+ */
+ result = (uint32) (pow2_passes * NBuffersPerCycle) + extra_passes;
+
+ Assert(result <= UINT32_MAX);
+ Assert(result == ((uint32) (hand / NBuffers)));
+
+ return result;
+}
+
+ /*
+ * The hand's value is a 64-bit counter that only increases, so its position
+ * is determined by the lower bits of the counter modulo by NBuffers. To
+ * avoid the modulo operation we use the next power-of-2 mask and adjust for
+ * the difference.
+ */
+static inline uint32
+clock_read(uint64 hand)
+{
+ /* Determine the hand's current position in the cycle */
+ uint64 result = (uint32) hand & NBuffersPow2Mask;
+
+ /* Adjust if the next power of 2 masked counter is more than NBuffers */
+ if (result >= NBuffers)
+ result -= NBuffers;
+
+ Assert(result == (uint32) (hand % NBuffers));
+
+ return result;
+}
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -99,6 +155,7 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
static inline uint32
ClockSweepTick(void)
{
+ uint64 hand;
uint32 victim;
/*
@@ -106,52 +163,11 @@ ClockSweepTick(void)
* doing this, this can lead to buffers being returned slightly out of
* apparent order.
*/
- victim =
- pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
+ hand = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);
+ victim = clock_read(hand);
- if (victim >= NBuffers)
- {
- uint32 originalVictim = victim;
-
- /* always wrap what we look up in BufferDescriptors */
- victim = victim % NBuffers;
+ Assert(victim < NBuffers);
- /*
- * If we're the one that just caused a wraparound, force
- * completePasses to be incremented while holding the spinlock. We
- * need the spinlock so StrategySyncStart() can return a consistent
- * value consisting of nextVictimBuffer and completePasses.
- */
- if (victim == 0)
- {
- uint32 expected;
- uint32 wrapped;
- bool success = false;
-
- expected = originalVictim + 1;
-
- while (!success)
- {
- /*
- * Acquire the spinlock while increasing completePasses. That
- * allows other readers to read nextVictimBuffer and
- * completePasses in a consistent manner which is required for
- * StrategySyncStart(). In theory delaying the increment
- * could lead to an overflow of nextVictimBuffers, but that's
- * highly unlikely and wouldn't be particularly harmful.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- wrapped = expected % NBuffers;
-
- success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
- &expected, wrapped);
- if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- }
- }
- }
return victim;
}
@@ -193,10 +209,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*from_ring = false;
- /*
- * If given a strategy object, see whether it can select a buffer. We
- * assume strategy objects don't need buffer_strategy_lock.
- */
+ /* If given a strategy object, see whether it can select a buffer */
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
@@ -241,7 +254,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /* Use the "clock sweep" algorithm to find a free buffer */
+ /* Use the "clock-sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -297,32 +310,25 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* allocs if non-NULL pointers are passed. The alloc count is reset after
* being read.
*/
-int
+uint32
StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
{
- uint32 nextVictimBuffer;
- int result;
+ uint64 counter;
+ uint32 result;
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
- nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
- result = nextVictimBuffer % NBuffers;
+ counter = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
+ result = clock_read(counter);
if (complete_passes)
{
- *complete_passes = StrategyControl->completePasses;
-
- /*
- * Additionally add the number of wraparounds that happened before
- * completePasses could be incremented. C.f. ClockSweepTick().
- */
- *complete_passes += nextVictimBuffer / NBuffers;
+ *complete_passes = clock_passes(counter);
}
if (num_buf_alloc)
{
*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
return result;
}
@@ -337,21 +343,14 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
void
StrategyNotifyBgWriter(int bgwprocno)
{
- /*
- * We acquire buffer_strategy_lock just to ensure that the store appears
- * atomic to StrategyGetBuffer. The bgwriter should call this rather
- * infrequently, so there's no performance penalty from being safe.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
StrategyControl->bgwprocno = bgwprocno;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
/*
* StrategyShmemSize
*
- * estimate the size of shared memory used by the freelist-related structures.
+ * Estimate the size of shared memory used by the freelist-related structures.
*
* Note: for somewhat historical reasons, the buffer lookup hashtable size
* is also determined here.
@@ -404,18 +403,25 @@ StrategyInitialize(bool init)
if (!found)
{
+ uint32 NBuffersPow2;
+
/*
* Only done once, usually in postmaster
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
-
- /* Initialize the clock sweep pointer */
- pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+ /* Initialize combined clock-sweep pointer/complete passes counter */
+ pg_atomic_init_u64(&StrategyControl->nextVictimBuffer, 0);
+ /* Find the smallest power of 2 larger than NBuffers */
+ NBuffersPow2 = pg_nextpower2_32(NBuffers);
+ /* Using that, find the number of positions to shift for division */
+ NBuffersPow2Shift = pg_leftmost_one_pos32(NBuffersPow2);
+ /* Calculate passes per power-of-2, typically 1 or 2 */
+ NBuffersPerCycle = NBuffersPow2 / NBuffers;
+ /* The bitmask to extract the lower portion of the clock */
+ NBuffersPow2Mask = NBuffersPow2 - 1;
/* Clear statistics */
- StrategyControl->completePasses = 0;
pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
/* No pending notification */
@@ -659,7 +665,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*
* If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
* since our own previous usage of the ring element would have left it
- * there, but it might've been decremented by clock sweep since then). A
+ * there, but it might've been decremented by clock-sweep since then). A
* higher usage_count indicates someone else has touched the buffer, so we
* shouldn't re-use it.
*/
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3da9c41ee1d..7a34f5e430a 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -229,7 +229,7 @@ GetLocalVictimBuffer(void)
ResourceOwnerEnlarge(CurrentResourceOwner);
/*
- * Need to get a new buffer. We use a clock sweep algorithm (essentially
+ * Need to get a new buffer. We use a clock-sweep algorithm (essentially
* the same as what freelist.c does now...)
*/
trycounter = NLocBuffer;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index d4449e11384..f2283ea8e22 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -81,7 +81,7 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
* accuracy and speed of the clock-sweep buffer management algorithm. A
* large value (comparable to NBuffers) would approximate LRU semantics.
* But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of
- * clock sweeps to find a free buffer, so in practice we don't want the
+ * clock-sweeps to find a free buffer, so in practice we don't want the
* value to be very large.
*/
#define BM_MAX_USAGE_COUNT 5
@@ -439,7 +439,7 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
-extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
--
2.49.0
Hi,
On 2025-07-21 13:37:04 -0400, Greg Burd wrote:
On 7/18/25 13:03, Andres Freund wrote:
Hello. Thanks again for taking the time to review the email and patch,
I think we're onto something good here.
I'd be curious if anybody wants to argue for keeping the clock sweep. Except
for the have_free_buffer() use in autoprewarm, it's a rather trivial
patch. And I really couldn't measure regressions above the noise level, even
in absurdly extreme use cases.
Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
for keeping the freelist"?
Err, yes :(
On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
I think we'll likely need something to replace it.
Fair, this (v5) patch doesn't yet try to address this.
TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
right. The goal of the have_free_buffer() use is obviously to stop prewarming
shared buffers if doing so would just evict buffers. But it's not clear to me
that we should just stop when there aren't any free buffers - what if the
previous buffer contents aren't the right ones? It'd make more sense to me to
stop autoprewarm once NBuffers have been prewarmed...
I had the same high level reaction, that autoprewarm was leveraging something
convenient but not necessarily required or even correct. I'd considered using
NBuffers as you describe due to similar intuitions, I'll dig into that idea for
the next revision after I get to know autoprewarm a bit better.
Cool. I do think that'll be good enough.
I re-added the have_free_buffer() function, only now it returns false
once nextVictimBuffer > NBuffers, signaling to autoprewarm that the clock
has made its first complete pass. With that I reverted my changes in
the autoprewarm module. The net should be the same behavior as before
at startup when using that module.
I don't think we should have a have_free_buffer() that doesn't actually test
whether we have a free buffer, that seems too likely to cause
misunderstandings down the line. What if we instead just limit the amount of
buffers we load in apw_load_buffers()? apw_load_buffers() knows NBuffers and
the number of to-be-loaded buffers, so that shouldn't be hard.
Meanwhile, the tests except for Windows pass [2] for this new patch [3].
I'll dig into the Windows issues next week as well.
FWIW, there are backtraces generated on windows. E.g.
000000cd`827fdea0 00007ff7`6ad82f88 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
000000cd`827fdee0 00007ff7`6aae2b7c postgres!ExceptionalCondition(
char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
000000cd`827fdf20 00007ff7`6aae272c postgres!clock_modulo(
unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
000000cd`827fdf60 00007ff7`6aad8647 postgres!StrategySyncStart(
unsigned int * complete_passes = 0x000000cd`827fdfc0,
unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
000000cd`827fdfa0 00007ff7`6aa254a3 postgres!BgBufferSync(
struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
000000cd`827fe030 00007ff7`6aa278a7 postgres!BackgroundWriterMain(
void * startup_data = 0x00000000`00000000,
unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
000000cd`827ff5a0 00007ff7`6a8daf19 postgres!SubPostmasterMain(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
000000cd`827ff620 00007ff7`6af0f5a9 postgres!main(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
I.e. your new assertion failed for some reason that i can't *immediately* see.
I put that in as a precaution and as a way to communicate the intention
of the other code above it. I never imagined it would assert. I've
changed clock_read() to only assert when the modulo differs, and left
that assert in the calling ClockSweepTick() function even though it is
redundant, because I'm curious to see whether we hit a similar assert
when testing the modulo.
Do you understand why it triggered? Because I don't immediately. The fact that
it triggered only on windows, where the compiler is rather different, makes it
worth understanding imo.
@@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
/*
* Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
* buffers we could scan before we'd catch up with it and "lap" it. Note:
* weird-looking coding of xxx_passes comparisons are to avoid bogus
* behavior when the passes counts wrap around.
*/
if (saved_info_valid)
{
- int32 passes_delta = strategy_passes - prev_strategy_passes;
+ int32 passes_delta;
+
+ if (unlikely(prev_strategy_passes > strategy_passes))
+ {
+ /* wrap-around case */
+ passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
+ }
+ else
+ {
+ passes_delta = (int32) (strategy_passes - prev_strategy_passes);
+ }

strategy_delta = strategy_buf_id - prev_strategy_buf_id;
strategy_delta += (long) passes_delta * NBuffers;

That seems somewhat independent of the rest of the change, or am I missing something?
That change is there to cover the possibility of someone managing to
overflow and wrap a uint64 which is *highly* unlikely.
That risk existed previously too - I'm not against shoring things up, I'd just
do it in a precursor commit, to make this easier to review.
If this degree of paranoia isn't required I'm happy to remove it.
That does indeed seem really unlikely. Assuming that postgres stays up for 10
years without a single restart, it'd be ~59 billion ticks a second.
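For reference, the rough arithmetic behind that figure:

2^64 ticks ≈ 1.8e19
10 years ≈ 3.16e8 seconds
1.8e19 / 3.16e8 ≈ 5.8e10, i.e. roughly 58-59 billion ticks per second.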
I don't mind a defense, but I think we'd be better off putting it into
ClockSweepTick() or such, simply erroring out if we ever hit this. It's
unlikely that we'd get (and keep) all the relevant untested code correct ime.
Then we also can assert that prev_strategy_passes <= strategy_passes.
Greetings,
Andres Freund
On 7/21/25 14:35, Andres Freund wrote:
Hi,
On 2025-07-21 13:37:04 -0400, Greg Burd wrote:
On 7/18/25 13:03, Andres Freund wrote:
Hello. Thanks again for taking the time to review the email and patch,
I think we're onto something good here.
I'd be curious if anybody wants to argue for keeping the clock sweep. Except
for the have_free_buffer() use in autoprewarm, it's a rather trivial
patch. And I really couldn't measure regressions above the noise level, even
in absurdly extreme use cases.
Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
for keeping the freelist"?
Err, yes :(
Phew. :) No worries.
On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
I think we'll likely need something to replace it.
Fair, this (v5) patch doesn't yet try to address this.
TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
right. The goal of the have_free_buffer() use is obviously to stop prewarming
shared buffers if doing so would just evict buffers. But it's not clear to me
that we should just stop when there aren't any free buffers - what if the
previous buffer contents aren't the right ones? It'd make more sense to me to
stop autoprewarm once NBuffers have been prewarmed...
I had the same high level reaction, that autoprewarm was leveraging something
convenient but not necessarily required or even correct. I'd considered using
NBuffers as you describe due to similar intuitions, I'll dig into that idea for
the next revision after I get to know autoprewarm a bit better.
Cool. I do think that'll be good enough.
I re-added the have_free_buffer() function, only now it returns false
once nextVictimBuffer > NBuffers, signaling to autoprewarm that the clock
has made its first complete pass. With that I reverted my changes in
the autoprewarm module. The net should be the same behavior as before
at startup when using that module.
I don't think we should have a have_free_buffer() that doesn't actually test
whether we have a free buffer, that seems too likely to cause
misunderstandings down the line. What if we instead just limit the amount of
buffers we load in apw_load_buffers()? apw_load_buffers() knows NBuffers and
the number of to-be-loaded buffers, so that shouldn't be hard.
I'm glad you said that, I wasn't thrilled with that either and I'm not
sure why I didn't just correct for that in the last patch set. I'm now
capping num_elements to NBuffers at most.
Meanwhile, the tests except for Windows pass [2] for this new patch [3].
I'll dig into the Windows issues next week as well.
FWIW, there are backtraces generated on windows. E.g.
000000cd`827fdea0 00007ff7`6ad82f88 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
000000cd`827fdee0 00007ff7`6aae2b7c postgres!ExceptionalCondition(
char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
000000cd`827fdf20 00007ff7`6aae272c postgres!clock_modulo(
unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
000000cd`827fdf60 00007ff7`6aad8647 postgres!StrategySyncStart(
unsigned int * complete_passes = 0x000000cd`827fdfc0,
unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
000000cd`827fdfa0 00007ff7`6aa254a3 postgres!BgBufferSync(
struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
000000cd`827fe030 00007ff7`6aa278a7 postgres!BackgroundWriterMain(
void * startup_data = 0x00000000`00000000,
unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
000000cd`827ff5a0 00007ff7`6a8daf19 postgres!SubPostmasterMain(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
000000cd`827ff620 00007ff7`6af0f5a9 postgres!main(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
I.e. your new assertion failed for some reason that i can't *immediately* see.
I put that in as a precaution and as a way to communicate the intention
of the other code above it. I never imagined it would assert. I've
changed clock_read() to only assert when the modulo differs, and left
that assert in the calling ClockSweepTick() function even though it is
redundant, because I'm curious to see whether we hit a similar assert
when testing the modulo.
Do you understand why it triggered? Because I don't immediately. The fact that
it triggered only on windows, where the compiler is rather different, makes it
worth understanding imo.
I dug into the ASM for both GCC 15.1 and MSVC 19.latest (thanks
godbolt.org!) for x86_64 and there was a critical difference. It starts
with the fact that I'd used uint32 for my NBuffersPow2Mask rather than
uint64. That then translates to two different compiled outputs for
clock_read() (was: clock_modulo()).
gcc-15.1 -O2
clock_read(unsigned long long):
        and     edi, DWORD PTR NBuffersPow2Mask[rip]
        mov     edx, DWORD PTR NBuffers[rip]
        mov     rax, rdi
        sub     rax, rdx
        cmp     rdi, rdx
        cmovb   rax, rdi
        ret
msvc-19.latest /O2
hand$ = 8
unsigned int clock_read(unsigned __int64) PROC          ; clock_read, COMDAT
        mov     edx, ecx
        and     rdx, QWORD PTR unsigned __int64 NBuffersPow2Mask ; NBuffersPow2Mask
        mov     ecx, DWORD PTR unsigned int NBuffers     ; NBuffers
        mov     eax, edx
        sub     eax, ecx
        cmp     rdx, rcx
        cmovb   eax, edx
        ret     0
unsigned int clock_read(unsigned __int64) ENDP          ; clock_read
Here's what I think was happening: the MSVC compiler produced assembly
for "hand & NBuffersPow2Mask" that uses "rdx QWORD" while GCC uses "edi
DWORD". The 32-bit AND operation (edi) automatically zeros the upper 32
bits of rdi after performing the AND with the uint64 value of hand, while
"rdx QWORD" does not, potentially leaving some of the upper 32 bits set.
My guess is that on Windows, when the value of the clock hand exceeded
UINT32_MAX (as can happen in as little as 3 seconds in a tight loop, but
likely took longer in the test run), the bits not masked out would inflate
the resulting value so that it was > NBuffers and also differed from the
simple modulo calculation, causing the failed assertion.
Changing NBuffersPow2Mask to a uint64 and using similarly sized types in
these functions more closely aligns the assembly code and should fix this.
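To make the intended widths explicit, the shape of that fix is roughly the
following (a sketch only, not the literal v7 code; the names follow the
patch):

static uint64 NBuffersPow2Mask;		/* now 64 bits wide */

static inline uint32
clock_read(uint64 hand)
{
	uint64		pos = hand & NBuffersPow2Mask;	/* no narrowing cast on hand */

	/* fold the [NBuffers, next power of 2) gap back into range */
	if (pos >= (uint64) NBuffers)
		pos -= NBuffers;

	return (uint32) pos;
}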
@@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
/*
* Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
* buffers we could scan before we'd catch up with it and "lap" it. Note:
* weird-looking coding of xxx_passes comparisons are to avoid bogus
* behavior when the passes counts wrap around.
*/
if (saved_info_valid)
{
- int32 passes_delta = strategy_passes - prev_strategy_passes;
+ int32 passes_delta;
+
+ if (unlikely(prev_strategy_passes > strategy_passes))
+ {
+ /* wrap-around case */
+ passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
+ }
+ else
+ {
+ passes_delta = (int32) (strategy_passes - prev_strategy_passes);
+ }

strategy_delta = strategy_buf_id - prev_strategy_buf_id;
strategy_delta += (long) passes_delta * NBuffers;

That seems somewhat independent of the rest of the change, or am I missing something?
That change is there to cover the possibility of someone managing to
overflow and wrap a uint64 which is *highly* unlikely.
That risk existed previously too - I'm not against shoring things up, I'd just
do it in a precursor commit, to make this easier to review.
If this degree of paranoia isn't required I'm happy to remove it.
That does indeed seem really unlikely. Assuming that postgres stays up for 10
years without a single restart, it'd be ~59 billion ticks a second.
Agreed, it's overkill.
I don't mind a defense, but I think we'd be better off putting it into
ClockSweepTick() or such, simply erroring out if we ever hit this. It's
unlikely that we'd get (and keep) all the relevant untested code correct ime.
Then we also can assert that prev_strategy_passes <= strategy_passes.
Added assertions and comments to explain the decision.
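For the archives, the kind of guard being discussed would look roughly like
this (a sketch only, not the attached v7 code; the names follow the patch),
with BgBufferSync() then free to simply Assert(prev_strategy_passes <=
strategy_passes):

static inline uint32
ClockSweepTick(void)
{
	uint64		hand;

	hand = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);

	/* a return value of UINT64_MAX means the counter is about to wrap */
	if (unlikely(hand == UINT64_MAX))
		elog(ERROR, "clock-sweep hand overflowed");

	return clock_read(hand);
}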
Greetings,
Andres Freund
rebased onto ce6513e96a1, patch set v7 attached and available on GitHub
[1]: https://github.com/gburd/postgres/pull/9
-greg
[1]: https://github.com/gburd/postgres/pull/9
[2]: https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v7
Attachments:
v7-0001-Eliminate-the-freelist-from-the-buffer-manager-an.patchtext/x-patch; charset=UTF-8; name=v7-0001-Eliminate-the-freelist-from-the-buffer-manager-an.patchDownload
From ac7132fca3678ff7d942c6ce4112e4ff1fd77173 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Thu, 10 Jul 2025 14:45:32 -0400
Subject: [PATCH v7 1/2] Eliminate the freelist from the buffer manager and
depend on clock-sweep
This set of changes removes the list of available buffers and instead
simply uses the clock-sweep algorithm to find and return an available
buffer. While on the surface this appears to be removing an
optimization, it is in fact eliminating code that induces overhead in the
form of synchronization that is problematic for multi-core systems.
This also removes the have_free_buffer() function and simply caps the
pg_autoprewarm process to at most NBuffers.
---
contrib/pg_prewarm/autoprewarm.c | 31 ++++---
src/backend/storage/buffer/README | 42 +++------
src/backend/storage/buffer/buf_init.c | 9 --
src/backend/storage/buffer/bufmgr.c | 29 +------
src/backend/storage/buffer/freelist.c | 120 +++-----------------------
src/include/storage/buf_internals.h | 12 +--
6 files changed, 43 insertions(+), 200 deletions(-)
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index c01b9c7e6a4..2722b0bb443 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -370,6 +370,16 @@ apw_load_buffers(void)
apw_state->prewarm_start_idx = apw_state->prewarm_stop_idx = 0;
apw_state->prewarmed_blocks = 0;
+
+ /* Don't prewarm more than we can fit. */
+ if (num_elements > NBuffers)
+ {
+ num_elements = NBuffers;
+ ereport(LOG,
+ (errmsg("autoprewarm: capping prewarmed blocks to %d (shared_buffers size)",
+ NBuffers)));
+ }
+
/* Get the info position of the first block of the next database. */
while (apw_state->prewarm_start_idx < num_elements)
{
@@ -410,10 +420,6 @@ apw_load_buffers(void)
apw_state->database = current_db;
Assert(apw_state->prewarm_start_idx < apw_state->prewarm_stop_idx);
- /* If we've run out of free buffers, don't launch another worker. */
- if (!have_free_buffer())
- break;
-
/*
* Likewise, don't launch if we've already been told to shut down.
* (The launch would fail anyway, but we might as well skip it.)
@@ -462,12 +468,6 @@ apw_read_stream_next_block(ReadStream *stream,
{
BlockInfoRecord blk = p->block_info[p->pos];
- if (!have_free_buffer())
- {
- p->pos = apw_state->prewarm_stop_idx;
- return InvalidBlockNumber;
- }
-
if (blk.tablespace != p->tablespace)
return InvalidBlockNumber;
@@ -523,10 +523,10 @@ autoprewarm_database_main(Datum main_arg)
blk = block_info[i];
/*
- * Loop until we run out of blocks to prewarm or until we run out of free
+ * Loop until we run out of blocks to prewarm or until we run out of
* buffers.
*/
- while (i < apw_state->prewarm_stop_idx && have_free_buffer())
+ while (i < apw_state->prewarm_stop_idx)
{
Oid tablespace = blk.tablespace;
RelFileNumber filenumber = blk.filenumber;
@@ -568,14 +568,13 @@ autoprewarm_database_main(Datum main_arg)
/*
* We have a relation; now let's loop until we find a valid fork of
- * the relation or we run out of free buffers. Once we've read from
- * all valid forks or run out of options, we'll close the relation and
+ * the relation or we run out of buffers. Once we've read from all
+ * valid forks or run out of options, we'll close the relation and
* move on.
*/
while (i < apw_state->prewarm_stop_idx &&
blk.tablespace == tablespace &&
- blk.filenumber == filenumber &&
- have_free_buffer())
+ blk.filenumber == filenumber)
{
ForkNumber forknum = blk.forknum;
BlockNumber nblocks;
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a182fcd660c..cd52effd911 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -128,11 +128,11 @@ independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement. A spinlock is used here rather than a lightweight
-lock for efficiency; no other locks of any sort should be acquired while
-buffer_strategy_lock is held. This is essential to allow buffer replacement
-to happen in multiple backends with reasonable concurrency.
+exclusion for operations that select buffers for replacement. A spinlock is
+used here rather than a lightweight lock for efficiency; no other locks of any
+sort should be acquired while buffer_strategy_lock is held. This is essential
+to allow buffer replacement to happen in multiple backends with reasonable
+concurrency.
* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
@@ -158,18 +158,9 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-There is a "free list" of buffers that are prime candidates for replacement.
-In particular, buffers that are completely free (contain no valid page) are
-always in this list. We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that. The list is singly-linked using fields in the
-buffer headers; we maintain head and tail pointers in global variables.
-(Note: although the list links are in the buffer headers, they are
-considered to be protected by the buffer_strategy_lock, not the buffer-header
-spinlocks.) To choose a victim buffer to recycle when there are no free
-buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations. It works like
-this:
+To choose a victim buffer to recycle when there are no free buffers available,
+we use a simple clock-sweep algorithm, which avoids the need to take
+system-wide locks during common operations. It works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -184,20 +175,14 @@ The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain buffer_strategy_lock.
-2. If buffer free list is nonempty, remove its head buffer. Release
-buffer_strategy_lock. If the buffer is pinned or has a nonzero usage count,
-it cannot be used; ignore it go back to step 1. Otherwise, pin the buffer,
-and return it.
+2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time. Release buffer_strategy_lock.
-3. Otherwise, the buffer free list is empty. Select the buffer pointed to by
-nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
-Release buffer_strategy_lock.
-
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
+3. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used. Decrement its usage count (if nonzero), reacquire
buffer_strategy_lock, and return to step 3 to examine the next buffer.
-5. Pin the selected buffer, and return.
+4. Pin the selected buffer, and return.
(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -234,7 +219,7 @@ the ring strategy effectively degrades to the normal strategy.
VACUUM uses a ring like sequential scans, however, the size of this ring is
controlled by the vacuum_buffer_usage_limit GUC. Dirty pages are not removed
-from the ring. Instead, WAL is flushed if needed to allow reuse of the
+from the ring. Instead, the WAL is flushed if needed to allow reuse of the
buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's buffers
were sent to the freelist, which was effectively a buffer ring of 1 buffer,
resulting in excessive WAL flushing.
@@ -277,3 +262,4 @@ As of 8.4, background writer starts during recovery mode when there is
some form of potentially extended recovery to perform. It performs an
identical service to normal processing, except that checkpoints it
writes are technically restartpoints.
+
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..6fd3a6bbac5 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -128,20 +128,11 @@ BufferManagerShmemInit(void)
pgaio_wref_clear(&buf->io_wref);
- /*
- * Initially link all the buffers together as unused. Subsequent
- * management of this list is done by freelist.c.
- */
- buf->freeNext = i + 1;
-
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
-
- /* Correct last entry of linked list */
- GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
/* Init other shared buffer-management stuff */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6afdd28dba6..af5ef025229 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2099,12 +2099,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
UnpinBuffer(victim_buf_hdr);
- /*
- * The victim buffer we acquired previously is clean and unused, let
- * it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
-
/* remaining code should match code at top of routine */
existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
@@ -2163,8 +2157,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
/*
- * InvalidateBuffer -- mark a shared buffer invalid and return it to the
- * freelist.
+ * InvalidateBuffer -- mark a shared buffer invalid.
*
* The buffer header spinlock must be held at entry. We drop it before
* returning. (This is sane because the caller must have locked the
@@ -2262,11 +2255,6 @@ retry:
* Done with mapping lock.
*/
LWLockRelease(oldPartitionLock);
-
- /*
- * Insert the buffer at the head of the list of free buffers.
- */
- StrategyFreeBuffer(buf);
}
/*
@@ -2684,11 +2672,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
{
BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(buf_hdr);
UnpinBuffer(buf_hdr);
}
@@ -2763,12 +2746,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
valid = PinBuffer(existing_hdr, strategy);
LWLockRelease(partition_lock);
-
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
UnpinBuffer(victim_buf_hdr);
buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
@@ -3666,8 +3643,8 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the freelist clock sweep currently is, and how many
- * buffer allocations have happened since our last call.
+ * Find out where the clock sweep currently is, and how many buffer
+ * allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..162c140fb9d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -39,14 +39,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -164,17 +156,16 @@ ClockSweepTick(void)
}
/*
- * have_free_buffer -- a lockless check to see if there is a free buffer in
- * buffer pool.
+ * have_free_buffer -- check if we've filled the buffer pool at startup
*
- * If the result is true that will become stale once free buffers are moved out
- * by other operations, so the caller who strictly want to use a free buffer
- * should not call this.
+ * Used exclusively by autoprewarm.
*/
bool
have_free_buffer(void)
{
- if (StrategyControl->firstFreeBuffer >= 0)
+ uint64 hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
+
+ if (hand < NBuffers)
return true;
else
return false;
@@ -243,75 +234,14 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
/*
- * We count buffer allocation requests so that the bgwriter can estimate
- * the rate of buffer consumption. Note that buffers recycled by a
- * strategy object are intentionally not counted here.
+ * We keep an approximate count of buffer allocation requests so that the
+ * bgwriter can estimate the rate of buffer consumption. Note that
+ * buffers recycled by a strategy object are intentionally not counted
+ * here.
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /*
- * First check, without acquiring the lock, whether there's buffers in the
- * freelist. Since we otherwise don't require the spinlock in every
- * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
- * uselessly in most cases. That obviously leaves a race where a buffer is
- * put on the freelist but we don't see the store yet - but that's pretty
- * harmless, it'll just get used during the next buffer acquisition.
- *
- * If there's buffers on the freelist, acquire the spinlock to pop one
- * buffer of the freelist. Then check whether that buffer is usable and
- * repeat if not.
- *
- * Note that the freeNext fields are considered to be protected by the
- * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
- * manipulate them without holding the spinlock.
- */
- if (StrategyControl->firstFreeBuffer >= 0)
- {
- while (true)
- {
- /* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- if (StrategyControl->firstFreeBuffer < 0)
- {
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- break;
- }
-
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
-
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
-
- /*
- * Release the lock so someone else can access the freelist while
- * we check out this buffer.
- */
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot
- * use it; discard it and retry. (This can only happen if VACUUM
- * put a valid buffer in the freelist and then someone else used
- * it before we got to it. It's probably impossible altogether as
- * of 8.3, but we'd better check anyway.)
- */
- local_buf_state = LockBufHdr(buf);
- if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
- && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
- {
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- *buf_state = local_buf_state;
- return buf;
- }
- UnlockBufHdr(buf, local_buf_state);
- }
- }
-
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /* Use the "clock sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -356,29 +286,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
-/*
- * StrategyFreeBuffer: put a buffer on the freelist
- */
-void
-StrategyFreeBuffer(BufferDesc *buf)
-{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- /*
- * It is possible that we are told to put something in the freelist that
- * is already in it; don't screw up the list if so.
- */
- if (buf->freeNext == FREENEXT_NOT_IN_LIST)
- {
- buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
- StrategyControl->firstFreeBuffer = buf->buf_id;
- }
-
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-}
-
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -504,13 +411,6 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
-
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..00eade63971 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -217,8 +217,7 @@ BufMappingPartitionLockByIndex(uint32 index)
* single atomic variable. This layout allow us to do some operations in a
* single atomic operation, without actually acquiring and releasing spinlock;
* for instance, increase or decrease refcount. buf_id field never changes
- * after initialization, so does not need locking. freeNext is protected by
- * the buffer_strategy_lock not buffer header lock. The LWLock can take care
+ * after initialization, so does not need locking. The LWLock can take care
* of itself. The buffer header lock is *not* used to control access to the
* data in the buffer!
*
@@ -264,7 +263,6 @@ typedef struct BufferDesc
pg_atomic_uint32 state;
int wait_backend_pgprocno; /* backend of pin-count waiter */
- int freeNext; /* link in freelist chain */
PgAioWaitRef io_wref; /* set iff AIO is in progress */
LWLock content_lock; /* to lock access to buffer contents */
@@ -360,13 +358,6 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc)
return (LWLock *) (&bdesc->content_lock);
}
-/*
- * The freeNext field is either the index of the next freelist entry,
- * or one of these special values:
- */
-#define FREENEXT_END_OF_LIST (-1)
-#define FREENEXT_NOT_IN_LIST (-2)
-
/*
* Functions for acquiring/releasing a shared buffer header's spinlock. Do
* not apply these to local buffers!
@@ -453,7 +444,6 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
-extern bool have_free_buffer(void);
/* buf_table.c */
extern Size BufTableShmemSize(int size);
--
2.49.0
Attachment: v7-0002-Remove-the-buffer_strategy_lock-and-make-the-cloc.patch (text/x-patch)
From ad47fe649103903a8eab1b4feda42d7e8aefd93c Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Fri, 11 Jul 2025 09:05:45 -0400
Subject: [PATCH v7 2/2] Remove the buffer_strategy_lock and make the clock
hand a 64 bit atomic
Change nextVictimBuffer to an atomic uint64 and simply atomically
increment it by 1 at each tick. The next victim buffer is the value
of nextVictimBuffer modulo the number of buffers (NBuffers). Modulo can
be expensive, so we implement it as if the value of NBuffers were
required to be a power of 2 and account for the difference. The value of
nextVictimBuffer, because it is only ever incremented, now encodes
enough information to provide the number of completed passes of the
clock-sweep algorithm as well. This eliminates the need for a separate
counter and its related maintenance. Wrap-around of nextVictimBuffer
would require roughly 10 years at ~59 billion ticks per second without a
restart; should that happen, restart the server and upgrade, it's out of date.
With the removal of the freelist and completePasses, none of the remaining
items in the BufferStrategyControl structure require strict coordination,
so it is possible to eliminate the buffer_strategy_lock as well.
---
src/backend/storage/buffer/README | 48 +++----
src/backend/storage/buffer/bufmgr.c | 19 ++-
src/backend/storage/buffer/freelist.c | 196 ++++++++++++--------------
src/backend/storage/buffer/localbuf.c | 2 +-
src/include/storage/buf_internals.h | 4 +-
5 files changed, 133 insertions(+), 136 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index cd52effd911..d1ab222eeb8 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -127,11 +127,10 @@ bits of the tag's hash value. The rules stated above apply to each partition
independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
-* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that select buffers for replacement. A spinlock is
-used here rather than a lightweight lock for efficiency; no other locks of any
-sort should be acquired while buffer_strategy_lock is held. This is essential
-to allow buffer replacement to happen in multiple backends with reasonable
+* Operations that select buffers for replacement don't require a lock, but
+rather use atomic operations to ensure coordination across backends when
+accessing members of the BufferStrategyControl data structure. This allows
+buffer replacement to happen in multiple backends with reasonable
concurrency.
* Each buffer header contains a spinlock that must be taken when examining
@@ -158,9 +157,9 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-To choose a victim buffer to recycle when there are no free buffers available,
-we use a simple clock-sweep algorithm, which avoids the need to take
-system-wide locks during common operations. It works like this:
+To choose a victim buffer to recycle we use a simple clock-sweep algorithm,
+which avoids the need to take system-wide locks during common operations. It
+works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -168,19 +167,17 @@ buffer header spinlock, which would have to be taken anyway to increment the
buffer reference count, so it's nearly free.)
The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
-through all the available buffers. nextVictimBuffer is protected by the
-buffer_strategy_lock.
+through all the available buffers. nextVictimBuffer is a 64-bit atomic
+counter from which the number of complete passes is also derived.
The algorithm for a process that needs to obtain a victim buffer is:
-1. Obtain buffer_strategy_lock.
+1. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time.
-2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
-nextVictimBuffer for next time. Release buffer_strategy_lock.
-
-3. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used. Decrement its usage count (if nonzero), reacquire
-buffer_strategy_lock, and return to step 3 to examine the next buffer.
+2. If the selected buffer is pinned or has a nonzero usage count, it cannot be
+used. Decrement its usage count (if nonzero), return to step 1 to examine the
+next buffer.
4. Pin the selected buffer, and return.
@@ -196,9 +193,9 @@ Buffer Ring Replacement Strategy
When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
-again soon, so instead of running the normal clock sweep algorithm and
+again soon, so instead of running the normal clock-sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
-using the normal clock sweep algorithm and those buffers are reused for the
+using the normal clock-sweep algorithm and those buffers are reused for the
whole scan. This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.
@@ -244,13 +241,12 @@ nextVictimBuffer (which it does not change!), looking for buffers that are
dirty and not pinned nor marked with a positive usage count. It pins,
writes, and releases any such buffer.
-If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take buffer_strategy_lock in order to look
-for buffers to write; it needs only to spinlock each buffer header for long
-enough to check the dirtybit. Even without that assumption, the writer
-only needs to take the lock long enough to read the variable value, not
-while scanning the buffers. (This is a very substantial improvement in
-the contention cost of the writer compared to PG 8.0.)
+Reading nextVictimBuffer is an atomic operation, so the writer doesn't need to
+take any lock in order to look for buffers to write; it needs only to spinlock
+each buffer header for long enough to check the dirtybit. The writer reads the
+variable value once and does not hold anything while scanning the buffers.
+(This is a very substantial improvement in the contention cost of the writer
+compared to PG 8.0.)
The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index af5ef025229..a1c711f4d8b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3593,7 +3593,7 @@ BufferSync(int flags)
* This is called periodically by the background writer process.
*
* Returns true if it's appropriate for the bgwriter process to go into
- * low-power hibernation mode. (This happens if the strategy clock sweep
+ * low-power hibernation mode. (This happens if the strategy clock-sweep
* has been "lapped" and no buffer allocations have occurred recently,
* or if the bgwriter has been effectively disabled by setting
* bgwriter_lru_maxpages to 0.)
@@ -3643,7 +3643,7 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the clock sweep currently is, and how many buffer
+ * Find out where the clock-sweep currently is, and how many buffer
* allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
@@ -3664,15 +3664,24 @@ BgBufferSync(WritebackContext *wb_context)
/*
* Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
* buffers we could scan before we'd catch up with it and "lap" it. Note:
* weird-looking coding of xxx_passes comparisons are to avoid bogus
* behavior when the passes counts wrap around.
*/
if (saved_info_valid)
{
- int32 passes_delta = strategy_passes - prev_strategy_passes;
+ int32 passes_delta;
+
+ /*
+ * Should the uint64 hand of the clock-sweep strategy ever wrap, which
+ * would require roughly 10 years of continuous operation at ~59
+ * billion ticks per second without restart, we give up.
+ */
+ Assert(prev_strategy_passes <= strategy_passes);
+
+ passes_delta = (int32) (strategy_passes - prev_strategy_passes);
strategy_delta = strategy_buf_id - prev_strategy_buf_id;
strategy_delta += (long) passes_delta * NBuffers;
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 162c140fb9d..6c41cd0b233 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,21 +29,17 @@
*/
typedef struct
{
- /* Spinlock: protects the values below */
- slock_t buffer_strategy_lock;
-
/*
- * Clock sweep hand: index of next buffer to consider grabbing. Note that
- * this isn't a concrete buffer - we only ever increase the value. So, to
- * get an actual buffer, it needs to be used modulo NBuffers.
+ * This is used as both the clock-sweep hand and the number of complete
+ * passes through the buffer pool. The low-order part (the value modulo
+ * NBuffers) is the clock-sweep hand; the rest counts complete passes.
*/
- pg_atomic_uint32 nextVictimBuffer;
+ pg_atomic_uint64 nextVictimBuffer;
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
*/
- uint32 completePasses; /* Complete cycles of the clock sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
@@ -83,12 +79,71 @@ typedef struct BufferAccessStrategyData
Buffer buffers[FLEXIBLE_ARRAY_MEMBER];
} BufferAccessStrategyData;
+static uint64 NBuffersPow2Mask; /* (next power-of-2 >= NBuffers) - 1 */
+static uint32 NBuffersPow2Shift; /* Amount to bitshift for division */
+static uint32 NBuffersPerCycle; /* Number of buffers in a complete cycle */
/* Prototypes for internal functions */
static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
uint32 *buf_state);
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+static inline uint32 clock_passes(uint64 hand);
+static inline uint32 clock_read(uint64 hand);
+
+ /*
+ * Calculate the number of complete passes through the buffer pool that have
+ * happened thus far. A "pass" is defined as the clock hand moving through
+ * all the buffers (NBuffers) in the pool once. Our clock hand is a 64-bit
+ * counter that only increases. The number of passes is the upper bits of the
+ * counter divided by NBuffers.
+ */
+static inline uint32
+clock_passes(uint64 hand)
+{
+ uint32 result;
+
+ /* Calculate complete next power-of-2 cycles by bitshifting */
+ uint64 pow2_passes = hand >> NBuffersPow2Shift;
+
+ /* Determine the hand's current position in the cycle */
+ uint64 masked_hand = hand & NBuffersPow2Mask;
+
+ /* Has the hand passed NBuffers yet? */
+ uint32 extra_passes = (masked_hand >= NBuffers) ? 1 : 0;
+
+ /*
+ * Combine total passes, multiply complete power-of-2 cycles by passes
+ * per-cycle, then add any extra pass from the current incomplete cycle.
+ */
+ result = (uint32) (pow2_passes * NBuffersPerCycle) + extra_passes;
+
+ Assert(result <= UINT32_MAX);
+ Assert(result == ((uint32) (hand / NBuffers)));
+
+ return result;
+}
+
+ /*
+ * The hand's value is a 64-bit counter that only increases, so its position
+ * is determined by the value of the counter modulo NBuffers. To
+ * avoid the modulo operation we use the next power-of-2 mask and adjust for
+ * the difference.
+ */
+static inline uint32
+clock_read(uint64 hand)
+{
+ /* Determine the hand's current position in the cycle */
+ uint64 result = hand & NBuffersPow2Mask;
+
+ /* Adjust if the next power of 2 masked counter is more than NBuffers */
+ if (result >= NBuffers)
+ result -= NBuffers;
+
+ Assert(result == (uint32) (hand % NBuffers));
+
+ return result;
+}
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -99,78 +154,25 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
static inline uint32
ClockSweepTick(void)
{
+ uint64 hand;
uint32 victim;
/*
* Atomically move hand ahead one buffer - if there's several processes
* doing this, this can lead to buffers being returned slightly out of
- * apparent order.
+ * apparent order. Continuous operation of the clock-sweep algorithm
+ * without restart (estimates range between 10 and 200 years) would
+ * wrap the clock hand, so we force a restart by asserting here.
*/
- victim =
- pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
-
- if (victim >= NBuffers)
- {
- uint32 originalVictim = victim;
+ hand = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);
+ Assert(hand < UINT64_MAX);
- /* always wrap what we look up in BufferDescriptors */
- victim = victim % NBuffers;
+ victim = clock_read(hand);
+ Assert(victim < NBuffers);
- /*
- * If we're the one that just caused a wraparound, force
- * completePasses to be incremented while holding the spinlock. We
- * need the spinlock so StrategySyncStart() can return a consistent
- * value consisting of nextVictimBuffer and completePasses.
- */
- if (victim == 0)
- {
- uint32 expected;
- uint32 wrapped;
- bool success = false;
-
- expected = originalVictim + 1;
-
- while (!success)
- {
- /*
- * Acquire the spinlock while increasing completePasses. That
- * allows other readers to read nextVictimBuffer and
- * completePasses in a consistent manner which is required for
- * StrategySyncStart(). In theory delaying the increment
- * could lead to an overflow of nextVictimBuffers, but that's
- * highly unlikely and wouldn't be particularly harmful.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- wrapped = expected % NBuffers;
-
- success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
- &expected, wrapped);
- if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- }
- }
- }
return victim;
}
-/*
- * have_free_buffer -- check if we've filled the buffer pool at startup
- *
- * Used exclusively by autoprewarm.
- */
-bool
-have_free_buffer(void)
-{
- uint64 hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
-
- if (hand < NBuffers)
- return true;
- else
- return false;
-}
-
/*
* StrategyGetBuffer
*
@@ -193,10 +195,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*from_ring = false;
- /*
- * If given a strategy object, see whether it can select a buffer. We
- * assume strategy objects don't need buffer_strategy_lock.
- */
+ /* If given a strategy object, see whether it can select a buffer */
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
@@ -241,7 +240,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /* Use the "clock sweep" algorithm to find a free buffer */
+ /* Use the "clock-sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -297,32 +296,25 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* allocs if non-NULL pointers are passed. The alloc count is reset after
* being read.
*/
-int
+uint32
StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
{
- uint32 nextVictimBuffer;
- int result;
+ uint64 counter;
+ uint32 result;
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
- nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
- result = nextVictimBuffer % NBuffers;
+ counter = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
+ result = clock_read(counter);
if (complete_passes)
{
- *complete_passes = StrategyControl->completePasses;
-
- /*
- * Additionally add the number of wraparounds that happened before
- * completePasses could be incremented. C.f. ClockSweepTick().
- */
- *complete_passes += nextVictimBuffer / NBuffers;
+ *complete_passes = clock_passes(counter);
}
if (num_buf_alloc)
{
*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
return result;
}
@@ -337,21 +329,14 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
void
StrategyNotifyBgWriter(int bgwprocno)
{
- /*
- * We acquire buffer_strategy_lock just to ensure that the store appears
- * atomic to StrategyGetBuffer. The bgwriter should call this rather
- * infrequently, so there's no performance penalty from being safe.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
StrategyControl->bgwprocno = bgwprocno;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
/*
* StrategyShmemSize
*
- * estimate the size of shared memory used by the freelist-related structures.
+ * Estimate the size of shared memory used by the freelist-related structures.
*
* Note: for somewhat historical reasons, the buffer lookup hashtable size
* is also determined here.
@@ -404,18 +389,25 @@ StrategyInitialize(bool init)
if (!found)
{
+ uint32 NBuffersPow2;
+
/*
* Only done once, usually in postmaster
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
-
- /* Initialize the clock sweep pointer */
- pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+ /* Initialize combined clock-sweep pointer/complete passes counter */
+ pg_atomic_init_u64(&StrategyControl->nextVictimBuffer, 0);
+ /* Find the smallest power of 2 larger than NBuffers */
+ NBuffersPow2 = pg_nextpower2_32(NBuffers);
+ /* Using that, find the number of positions to shift for division */
+ NBuffersPow2Shift = pg_leftmost_one_pos32(NBuffersPow2);
+ /* Calculate passes per power-of-2, typically 1 or 2 */
+ NBuffersPerCycle = NBuffersPow2 / NBuffers;
+ /* The bitmask to extract the lower portion of the clock */
+ NBuffersPow2Mask = NBuffersPow2 - 1;
/* Clear statistics */
- StrategyControl->completePasses = 0;
pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
/* No pending notification */
@@ -659,7 +651,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*
* If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
* since our own previous usage of the ring element would have left it
- * there, but it might've been decremented by clock sweep since then). A
+ * there, but it might've been decremented by clock-sweep since then). A
* higher usage_count indicates someone else has touched the buffer, so we
* shouldn't re-use it.
*/
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3da9c41ee1d..7a34f5e430a 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -229,7 +229,7 @@ GetLocalVictimBuffer(void)
ResourceOwnerEnlarge(CurrentResourceOwner);
/*
- * Need to get a new buffer. We use a clock sweep algorithm (essentially
+ * Need to get a new buffer. We use a clock-sweep algorithm (essentially
* the same as what freelist.c does now...)
*/
trycounter = NLocBuffer;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 00eade63971..97002acb757 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -81,7 +81,7 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
* accuracy and speed of the clock-sweep buffer management algorithm. A
* large value (comparable to NBuffers) would approximate LRU semantics.
* But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of
- * clock sweeps to find a free buffer, so in practice we don't want the
+ * clock-sweeps to find a free buffer, so in practice we don't want the
* value to be very large.
*/
#define BM_MAX_USAGE_COUNT 5
@@ -439,7 +439,7 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
-extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
--
2.49.0
On 7/22/25 14:43, Greg Burd wrote:
On 7/21/25 14:35, Andres Freund wrote:
Hi,
On 2025-07-21 13:37:04 -0400, Greg Burd wrote:
On 7/18/25 13:03, Andres Freund wrote:
Hello. Thanks again for taking the time to review the email and patch, I think we're onto something good here.
I'd be curious if anybody wants to argue for keeping the clock sweep. Except for the have_free_buffer() use in autoprewarm, it's a rather trivial patch. And I really couldn't measure regressions above the noise level, even in absurdly extreme use cases.
Hmmm... was "argue for keeping the clock sweep" supposed to read "argue for keeping the freelist"?
Err, yes :(
Phew. :) No worries.
On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
I think we'll likely need something to replace it.
Fair, this (v5) patch doesn't yet try to address this.
TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite right. The goal of the use of have_free_buffer() is obviously to stop prewarming shared buffers if doing so would just evict buffers. But it's not clear to me that we should just stop when there aren't any free buffers - what if the previous buffer contents aren't the right ones? It'd make more sense to me to stop autoprewarm once NBuffers have been prewarmed...
I had the same high level reaction, that autoprewarm was leveraging something convenient but not necessarily required or even correct. I'd considered using NBuffers as you describe due to similar intuitions, I'll dig into that idea for the next revision after I get to know autoprewarm a bit better.
Cool. I do think that'll be good enough.
I re-added the have_free_buffer() function only now it returns false once nextVictimBuffer > NBuffers, signaling to autoprewarm that the clock has made its first complete pass. With that I reverted my changes in the autoprewarm module. The net should be the same behavior as before at startup when using that module.
I don't think we should have a have_free_buffer() that doesn't actually test whether we have a free buffer, that seems too likely to cause misunderstandings down the line. What if we instead just limit the amount of buffers we load in apw_load_buffers()? apw_load_buffers() knows NBuffers and the number of to-be-loaded buffers, so that shouldn't be hard.
I'm glad you said that, I wasn't thrilled with that either and I'm not sure why I didn't just correct for that in the last patch set. I'm now capping num_elements to NBuffers at most.
Meanwhile, the tests except for Windows pass [2] for this new patch [3]. I'll dig into the Windows issues next week as well.
FWIW, there are backtraces generated on windows. E.g.
000000cd`827fdea0 00007ff7`6ad82f88 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
000000cd`827fdee0 00007ff7`6aae2b7c postgres!ExceptionalCondition(
char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
000000cd`827fdf20 00007ff7`6aae272c postgres!clock_modulo(
unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
000000cd`827fdf60 00007ff7`6aad8647 postgres!StrategySyncStart(
unsigned int * complete_passes = 0x000000cd`827fdfc0,
unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
000000cd`827fdfa0 00007ff7`6aa254a3 postgres!BgBufferSync(
struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
000000cd`827fe030 00007ff7`6aa278a7 postgres!BackgroundWriterMain(
void * startup_data = 0x00000000`00000000,
unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
000000cd`827ff5a0 00007ff7`6a8daf19 postgres!SubPostmasterMain(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
000000cd`827ff620 00007ff7`6af0f5a9 postgres!main(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
I.e. your new assertion failed for some reason that i can't *immediately* see.
I put that in as a precaution and as a way to communicate the intention
of the other code above it. I never imagined it would assert. I've
changed clock_read() to only assert when the modulo differs and left
that assert in the calling ClockSweepTick() function because it was
redundant and I'm curious to see if we see a similar assert when testing
the modulo.
Do you understand why it triggered? Because I don't immediately. The fact that it triggered only on windows, where the compiler is rather different, makes it worth understanding imo.
I dug into the ASM for both GCC 15.1 and MSVC 19.latest (thanks
godbolt.org!) for x86_64 and there was a critical difference. It starts
with the fact that I'd used uint32 for my NBuffersPow2Mask rather than
uint64. That then translates to two different compiled outputs for
clock_read() (was: clock_modulo()).

gcc-15.1 -O2:

clock_read(unsigned long long):
        and     edi, DWORD PTR NBuffersPow2Mask[rip]
        mov     edx, DWORD PTR NBuffers[rip]
        mov     rax, rdi
        sub     rax, rdx
        cmp     rdi, rdx
        cmovb   rax, rdi
        ret

msvc-19.latest /O2:

hand$ = 8
unsigned int clock_read(unsigned __int64) PROC                 ; clock_read, COMDAT
        mov     edx, ecx
        and     rdx, QWORD PTR unsigned __int64 NBuffersPow2Mask ; NBuffersPow2Mask
        mov     ecx, DWORD PTR unsigned int NBuffers           ; NBuffers
        mov     eax, edx
        sub     eax, ecx
        cmp     rdx, rcx
        cmovb   eax, edx
        ret     0
unsigned int clock_read(unsigned __int64) ENDP                 ; clock_read

Here's what I think was happening, the MSVC compiler produced assembly
for "hand & NBuffersPow2Mask" that uses "rdx QWORD" while GCC uses "edi
DWORD". The 32-bit AND operation (edi) automatically zeros the upper 32
bits of rdi after performing the and with the uint64 value of hand while
"rdx QWORD" does not potentially leaving some of the upper 32 bits set.
My guess is that on Windows when the value of the clock hand exceeded
UINT32_MAX (as can happen in as little as 3 seconds in a tight loop but
likely took longer in the test run) the bits not masked out would
inflate the resulting value, which would be > NBuffers and also differ
from the simple modulo calculation, causing the failed assertion.
Changing the value of NBuffersPow2Mask to uint64 and using similarly
sized types in these functions more closely aligns the assembly code and
should fix this.
And it did make the code more correct, it just didn't fix that
particular bug. The bug was in the logic for my optimized modulo. Big
thank you to my PostgreSQL committer mentor Noah Misch who asked a
simple question yesterday in our monthly call, "if this algorithm is
correct, why isn't it how CPUs and/or compilers implement modulo?" He then
went on to suggest that I write an exhaustive test for it (maybe even
coerce an LLM to do that for me) and test it. So I did, and it was wrong.
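For anyone who wants to reproduce that result, here is a minimal standalone sketch of the kind of exhaustive check Noah suggested (this is not the attached test harness; NBuffers and the mask-and-adjust shortcut mirror the v7 patch, everything else is made up for illustration). For a non-power-of-2 pool size it reports a mismatch almost immediately, e.g. NBuffers = 100 and hand = 150 mask to 22 while 150 % 100 is 50:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Smallest power of 2 >= n, for n > 0. */
static uint64_t
next_pow2(uint64_t n)
{
	uint64_t	p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int
main(void)
{
	/* A few pool sizes, including non-powers of 2. */
	const uint32_t sizes[] = {1, 3, 100, 128, 1000, 16384, 131071};

	for (size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++)
	{
		uint32_t	NBuffers = sizes[s];
		uint64_t	mask = next_pow2(NBuffers) - 1;

		/* Walk the hand through several wraparounds of NBuffers. */
		for (uint64_t hand = 0; hand < (uint64_t) NBuffers * 8; hand++)
		{
			uint64_t	guess = hand & mask;

			if (guess >= NBuffers)
				guess -= NBuffers;

			if (guess != hand % NBuffers)
			{
				printf("mismatch: NBuffers=%u hand=%llu shortcut=%llu modulo=%llu\n",
					   NBuffers,
					   (unsigned long long) hand,
					   (unsigned long long) guess,
					   (unsigned long long) (hand % NBuffers));
				return EXIT_FAILURE;
			}
		}
	}
	printf("no mismatches found\n");
	return EXIT_SUCCESS;
}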
I felt I'd give it one more try. It turns out there is an algorithm [1]https://stackoverflow.com/a/26047426/366692
based on some work that went into GMP a while ago [2]https://gmplib.org/~tege/divcnst-pldi94.pdf, so I implemented
it and gave it a try (attached test.c). Without optimization it's 10-42%
faster, but when compiled with GCC -O3 or -Ofast that advantage goes
away and it is slightly slower. So, simplicity for the win and let the
compiler do the hard parts, I guess.
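For the curious, a minimal sketch of that family of tricks, in the spirit of [1] and [2]: this is a Lemire-style remainder-by-multiplication, not the attached test.c, and it assumes the GCC/Clang unsigned __int128 extension; the divisor 131071 is just an arbitrary non-power-of-2 stand-in for NBuffers.

#include <assert.h>
#include <stdint.h>

/* Precompute the "magic" reciprocal for a fixed divisor d (1 <= d < 2^32). */
static inline uint64_t
remainder_magic(uint32_t d)
{
	return UINT64_MAX / d + 1;		/* ceil(2^64 / d); wraps to 0 for d == 1 */
}

/* a % d via one 64-bit multiply plus one 128-bit multiply, no divide. */
static inline uint32_t
fast_remainder_u32(uint32_t a, uint32_t d, uint64_t magic)
{
	uint64_t	lowbits = magic * a;	/* intentionally wraps mod 2^64 */

	return (uint32_t) (((unsigned __int128) lowbits * d) >> 64);
}

int
main(void)
{
	uint32_t	d = 131071;			/* arbitrary non-power-of-2 "NBuffers" */
	uint64_t	magic = remainder_magic(d);

	for (uint32_t a = 0; a < 1000000; a++)
		assert(fast_remainder_u32(a, d, magic) == a % d);
	return 0;
}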
@@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
 	/*
 	 * Compute strategy_delta = how many buffers have been scanned by the
-	 * clock sweep since last time. If first time through, assume none. Then
-	 * see if we are still ahead of the clock sweep, and if so, how many
+	 * clock-sweep since last time. If first time through, assume none. Then
+	 * see if we are still ahead of the clock-sweep, and if so, how many
 	 * buffers we could scan before we'd catch up with it and "lap" it. Note:
 	 * weird-looking coding of xxx_passes comparisons are to avoid bogus
 	 * behavior when the passes counts wrap around.
 	 */
 	if (saved_info_valid)
 	{
-		int32		passes_delta = strategy_passes - prev_strategy_passes;
+		int32		passes_delta;
+
+		if (unlikely(prev_strategy_passes > strategy_passes))
+		{
+			/* wrap-around case */
+			passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
+		}
+		else
+		{
+			passes_delta = (int32) (strategy_passes - prev_strategy_passes);
+		}

 		strategy_delta = strategy_buf_id - prev_strategy_buf_id;
 		strategy_delta += (long) passes_delta * NBuffers;

That seems somewhat independent of the rest of the change, or am I missing something?
That change is there to cover the possibility of someone managing to overflow and wrap a uint64, which is *highly* unlikely.
That risk existed previously too - I'm not against shoring things up, I'd just do it in a precursor commit, to make this easier to review.
If this degree of paranoia isn't required I'm happy to remove it.
That does indeed seem really unlikely. Assuming that postgres stays up for 10 years without a single restart, it'd be ~59 billion ticks a second.
Agreed, it's overkill.
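(As a sanity check on that figure, a quick back-of-the-envelope sketch; the ~59 billion ticks/second rate is the assumption from above, not a measurement:)

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const double ticks_per_second = 59e9;	/* assumed sustained tick rate */
	double		seconds = (double) UINT64_MAX / ticks_per_second;

	/* ~3.1e8 seconds, i.e. roughly 9.9 years before a uint64 hand wraps */
	printf("%.1f years\n", seconds / (365.25 * 24 * 3600));
	return 0;
}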
I don't mind a defense, but I think we'd be better off putting it into ClockSweepTick() or such, simply erroring out if we ever hit this. It's unlikely that we'd get (and keep) all the relevant untested code correct ime. Then we also can assert that prev_strategy_passes <= strategy_passes.
Added assertions and comments to explain the decision.
Greetings,
Andres Freund
Patch set is now:
1) remove freelist
2) remove buffer_strategy_lock
3) abstract clock-sweep to type and API
Rebased v10 onto 5457ea46d18 at GitHub[3]https://github.com/gburd/postgres/pull/10 and CommitFest[4]https://commitfest.postgresql.org/patch/5928/,
-greg
[1]: https://stackoverflow.com/a/26047426/366692
[2]: https://gmplib.org/~tege/divcnst-pldi94.pdf
Attachments:
v10-0001-Eliminate-the-freelist-from-the-buffer-manager-a.patch (text/x-patch)
From d73d862c7feaa346bf0c67d03396faae81fba717 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Thu, 10 Jul 2025 14:45:32 -0400
Subject: [PATCH v10 1/3] Eliminate the freelist from the buffer manager and
depend on clock-sweep
This set of changes removes the list of available buffers and instead
simply uses the clock-sweep algorithm to find and return an available
buffer. While on the surface this appears to be removing an
optimization, it is in fact eliminating code that induces overhead in the
form of synchronization that is problematic for multi-core systems.
This also removes the have_free_buffer() function and simply caps the
autoprewarm worker at prewarming at most NBuffers blocks.
---
contrib/pg_prewarm/autoprewarm.c | 31 ++++---
src/backend/storage/buffer/README | 42 +++------
src/backend/storage/buffer/buf_init.c | 9 --
src/backend/storage/buffer/bufmgr.c | 29 +------
src/backend/storage/buffer/freelist.c | 120 +++-----------------------
src/include/storage/buf_internals.h | 12 +--
6 files changed, 43 insertions(+), 200 deletions(-)
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index c01b9c7e6a4..2722b0bb443 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -370,6 +370,16 @@ apw_load_buffers(void)
apw_state->prewarm_start_idx = apw_state->prewarm_stop_idx = 0;
apw_state->prewarmed_blocks = 0;
+
+ /* Don't prewarm more than we can fit. */
+ if (num_elements > NBuffers)
+ {
+ num_elements = NBuffers;
+ ereport(LOG,
+ (errmsg("autoprewarm: capping prewarmed blocks to %d (shared_buffers size)",
+ NBuffers)));
+ }
+
/* Get the info position of the first block of the next database. */
while (apw_state->prewarm_start_idx < num_elements)
{
@@ -410,10 +420,6 @@ apw_load_buffers(void)
apw_state->database = current_db;
Assert(apw_state->prewarm_start_idx < apw_state->prewarm_stop_idx);
- /* If we've run out of free buffers, don't launch another worker. */
- if (!have_free_buffer())
- break;
-
/*
* Likewise, don't launch if we've already been told to shut down.
* (The launch would fail anyway, but we might as well skip it.)
@@ -462,12 +468,6 @@ apw_read_stream_next_block(ReadStream *stream,
{
BlockInfoRecord blk = p->block_info[p->pos];
- if (!have_free_buffer())
- {
- p->pos = apw_state->prewarm_stop_idx;
- return InvalidBlockNumber;
- }
-
if (blk.tablespace != p->tablespace)
return InvalidBlockNumber;
@@ -523,10 +523,10 @@ autoprewarm_database_main(Datum main_arg)
blk = block_info[i];
/*
- * Loop until we run out of blocks to prewarm or until we run out of free
+ * Loop until we run out of blocks to prewarm or until we run out of
* buffers.
*/
- while (i < apw_state->prewarm_stop_idx && have_free_buffer())
+ while (i < apw_state->prewarm_stop_idx)
{
Oid tablespace = blk.tablespace;
RelFileNumber filenumber = blk.filenumber;
@@ -568,14 +568,13 @@ autoprewarm_database_main(Datum main_arg)
/*
* We have a relation; now let's loop until we find a valid fork of
- * the relation or we run out of free buffers. Once we've read from
- * all valid forks or run out of options, we'll close the relation and
+ * the relation or we run out of buffers. Once we've read from all
+ * valid forks or run out of options, we'll close the relation and
* move on.
*/
while (i < apw_state->prewarm_stop_idx &&
blk.tablespace == tablespace &&
- blk.filenumber == filenumber &&
- have_free_buffer())
+ blk.filenumber == filenumber)
{
ForkNumber forknum = blk.forknum;
BlockNumber nblocks;
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a182fcd660c..cd52effd911 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -128,11 +128,11 @@ independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement. A spinlock is used here rather than a lightweight
-lock for efficiency; no other locks of any sort should be acquired while
-buffer_strategy_lock is held. This is essential to allow buffer replacement
-to happen in multiple backends with reasonable concurrency.
+exclusion for operations that select buffers for replacement. A spinlock is
+used here rather than a lightweight lock for efficiency; no other locks of any
+sort should be acquired while buffer_strategy_lock is held. This is essential
+to allow buffer replacement to happen in multiple backends with reasonable
+concurrency.
* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
@@ -158,18 +158,9 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-There is a "free list" of buffers that are prime candidates for replacement.
-In particular, buffers that are completely free (contain no valid page) are
-always in this list. We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that. The list is singly-linked using fields in the
-buffer headers; we maintain head and tail pointers in global variables.
-(Note: although the list links are in the buffer headers, they are
-considered to be protected by the buffer_strategy_lock, not the buffer-header
-spinlocks.) To choose a victim buffer to recycle when there are no free
-buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations. It works like
-this:
+To choose a victim buffer to recycle when there are no free buffers available,
+we use a simple clock-sweep algorithm, which avoids the need to take
+system-wide locks during common operations. It works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -184,20 +175,14 @@ The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain buffer_strategy_lock.
-2. If buffer free list is nonempty, remove its head buffer. Release
-buffer_strategy_lock. If the buffer is pinned or has a nonzero usage count,
-it cannot be used; ignore it go back to step 1. Otherwise, pin the buffer,
-and return it.
+2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time. Release buffer_strategy_lock.
-3. Otherwise, the buffer free list is empty. Select the buffer pointed to by
-nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
-Release buffer_strategy_lock.
-
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
+3. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used. Decrement its usage count (if nonzero), reacquire
buffer_strategy_lock, and return to step 3 to examine the next buffer.
-5. Pin the selected buffer, and return.
+4. Pin the selected buffer, and return.
(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -234,7 +219,7 @@ the ring strategy effectively degrades to the normal strategy.
VACUUM uses a ring like sequential scans, however, the size of this ring is
controlled by the vacuum_buffer_usage_limit GUC. Dirty pages are not removed
-from the ring. Instead, WAL is flushed if needed to allow reuse of the
+from the ring. Instead, the WAL is flushed if needed to allow reuse of the
buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's buffers
were sent to the freelist, which was effectively a buffer ring of 1 buffer,
resulting in excessive WAL flushing.
@@ -277,3 +262,4 @@ As of 8.4, background writer starts during recovery mode when there is
some form of potentially extended recovery to perform. It performs an
identical service to normal processing, except that checkpoints it
writes are technically restartpoints.
+
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..6fd3a6bbac5 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -128,20 +128,11 @@ BufferManagerShmemInit(void)
pgaio_wref_clear(&buf->io_wref);
- /*
- * Initially link all the buffers together as unused. Subsequent
- * management of this list is done by freelist.c.
- */
- buf->freeNext = i + 1;
-
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
-
- /* Correct last entry of linked list */
- GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
/* Init other shared buffer-management stuff */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6afdd28dba6..af5ef025229 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2099,12 +2099,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
UnpinBuffer(victim_buf_hdr);
- /*
- * The victim buffer we acquired previously is clean and unused, let
- * it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
-
/* remaining code should match code at top of routine */
existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
@@ -2163,8 +2157,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
/*
- * InvalidateBuffer -- mark a shared buffer invalid and return it to the
- * freelist.
+ * InvalidateBuffer -- mark a shared buffer invalid.
*
* The buffer header spinlock must be held at entry. We drop it before
* returning. (This is sane because the caller must have locked the
@@ -2262,11 +2255,6 @@ retry:
* Done with mapping lock.
*/
LWLockRelease(oldPartitionLock);
-
- /*
- * Insert the buffer at the head of the list of free buffers.
- */
- StrategyFreeBuffer(buf);
}
/*
@@ -2684,11 +2672,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
{
BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(buf_hdr);
UnpinBuffer(buf_hdr);
}
@@ -2763,12 +2746,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
valid = PinBuffer(existing_hdr, strategy);
LWLockRelease(partition_lock);
-
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
UnpinBuffer(victim_buf_hdr);
buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
@@ -3666,8 +3643,8 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the freelist clock sweep currently is, and how many
- * buffer allocations have happened since our last call.
+ * Find out where the clock sweep currently is, and how many buffer
+ * allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..162c140fb9d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -39,14 +39,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -164,17 +156,16 @@ ClockSweepTick(void)
}
/*
- * have_free_buffer -- a lockless check to see if there is a free buffer in
- * buffer pool.
+ * have_free_buffer -- check if we've filled the buffer pool at startup
*
- * If the result is true that will become stale once free buffers are moved out
- * by other operations, so the caller who strictly want to use a free buffer
- * should not call this.
+ * Used exclusively by autoprewarm.
*/
bool
have_free_buffer(void)
{
- if (StrategyControl->firstFreeBuffer >= 0)
+ uint64 hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
+
+ if (hand < NBuffers)
return true;
else
return false;
@@ -243,75 +234,14 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
/*
- * We count buffer allocation requests so that the bgwriter can estimate
- * the rate of buffer consumption. Note that buffers recycled by a
- * strategy object are intentionally not counted here.
+ * We keep an approximate count of buffer allocation requests so that the
+ * bgwriter can estimate the rate of buffer consumption. Note that
+ * buffers recycled by a strategy object are intentionally not counted
+ * here.
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /*
- * First check, without acquiring the lock, whether there's buffers in the
- * freelist. Since we otherwise don't require the spinlock in every
- * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
- * uselessly in most cases. That obviously leaves a race where a buffer is
- * put on the freelist but we don't see the store yet - but that's pretty
- * harmless, it'll just get used during the next buffer acquisition.
- *
- * If there's buffers on the freelist, acquire the spinlock to pop one
- * buffer of the freelist. Then check whether that buffer is usable and
- * repeat if not.
- *
- * Note that the freeNext fields are considered to be protected by the
- * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
- * manipulate them without holding the spinlock.
- */
- if (StrategyControl->firstFreeBuffer >= 0)
- {
- while (true)
- {
- /* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- if (StrategyControl->firstFreeBuffer < 0)
- {
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- break;
- }
-
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
-
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
-
- /*
- * Release the lock so someone else can access the freelist while
- * we check out this buffer.
- */
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot
- * use it; discard it and retry. (This can only happen if VACUUM
- * put a valid buffer in the freelist and then someone else used
- * it before we got to it. It's probably impossible altogether as
- * of 8.3, but we'd better check anyway.)
- */
- local_buf_state = LockBufHdr(buf);
- if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
- && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
- {
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- *buf_state = local_buf_state;
- return buf;
- }
- UnlockBufHdr(buf, local_buf_state);
- }
- }
-
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /* Use the "clock sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -356,29 +286,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
-/*
- * StrategyFreeBuffer: put a buffer on the freelist
- */
-void
-StrategyFreeBuffer(BufferDesc *buf)
-{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- /*
- * It is possible that we are told to put something in the freelist that
- * is already in it; don't screw up the list if so.
- */
- if (buf->freeNext == FREENEXT_NOT_IN_LIST)
- {
- buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
- StrategyControl->firstFreeBuffer = buf->buf_id;
- }
-
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-}
-
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -504,13 +411,6 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
-
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..00eade63971 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -217,8 +217,7 @@ BufMappingPartitionLockByIndex(uint32 index)
* single atomic variable. This layout allow us to do some operations in a
* single atomic operation, without actually acquiring and releasing spinlock;
* for instance, increase or decrease refcount. buf_id field never changes
- * after initialization, so does not need locking. freeNext is protected by
- * the buffer_strategy_lock not buffer header lock. The LWLock can take care
+ * after initialization, so does not need locking. The LWLock can take care
* of itself. The buffer header lock is *not* used to control access to the
* data in the buffer!
*
@@ -264,7 +263,6 @@ typedef struct BufferDesc
pg_atomic_uint32 state;
int wait_backend_pgprocno; /* backend of pin-count waiter */
- int freeNext; /* link in freelist chain */
PgAioWaitRef io_wref; /* set iff AIO is in progress */
LWLock content_lock; /* to lock access to buffer contents */
@@ -360,13 +358,6 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc)
return (LWLock *) (&bdesc->content_lock);
}
-/*
- * The freeNext field is either the index of the next freelist entry,
- * or one of these special values:
- */
-#define FREENEXT_END_OF_LIST (-1)
-#define FREENEXT_NOT_IN_LIST (-2)
-
/*
* Functions for acquiring/releasing a shared buffer header's spinlock. Do
* not apply these to local buffers!
@@ -453,7 +444,6 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
-extern bool have_free_buffer(void);
/* buf_table.c */
extern Size BufTableShmemSize(int size);
--
2.49.0
v10-0002-Remove-the-buffer_strategy_lock-and-make-the-clo.patch
From bfde3244542b87c369d339c859578817a2089fd8 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Fri, 11 Jul 2025 09:05:45 -0400
Subject: [PATCH v10 2/3] Remove the buffer_strategy_lock and make the clock
hand a 64 bit atomic
Change nextVictimBuffer to an atomic uint64 and simply atomically
increment it by 1 at each tick. The next victim buffer is the value
of nextVictimBuffer modulo the number of buffers (NBuffers). The number
of complete passes of the clock-sweep hand is nextVictimBuffer divided
by NBuffers. Wrap-around of nextVictimBuffer would require roughly 10
years of continuous operation at ~59 billion ticks per second without a
restart, which is so unlikely that we ignore that case entirely.
With the removal of the freelist and completePasses, none of the remaining
items in the BufferStrategyControl structure require strict coordination,
so it is possible to eliminate the buffer_strategy_lock as well.
---
src/backend/storage/buffer/README | 48 ++++-----
src/backend/storage/buffer/bufmgr.c | 20 +++-
src/backend/storage/buffer/freelist.c | 139 ++++++--------------------
src/backend/storage/buffer/localbuf.c | 2 +-
src/include/storage/buf_internals.h | 4 +-
5 files changed, 71 insertions(+), 142 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index cd52effd911..d1ab222eeb8 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -127,11 +127,10 @@ bits of the tag's hash value. The rules stated above apply to each partition
independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
-* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that select buffers for replacement. A spinlock is
-used here rather than a lightweight lock for efficiency; no other locks of any
-sort should be acquired while buffer_strategy_lock is held. This is essential
-to allow buffer replacement to happen in multiple backends with reasonable
+* Operations that select buffers for replacement don't require a lock, but
+rather use atomic operations to ensure coordination across backends when
+accessing members of the BufferStrategyControl data structure. This allows
+buffer replacement to happen in multiple backends with reasonable
concurrency.
* Each buffer header contains a spinlock that must be taken when examining
@@ -158,9 +157,9 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-To choose a victim buffer to recycle when there are no free buffers available,
-we use a simple clock-sweep algorithm, which avoids the need to take
-system-wide locks during common operations. It works like this:
+To choose a victim buffer to recycle we use a simple clock-sweep algorithm,
+which avoids the need to take system-wide locks during common operations. It
+works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -168,19 +167,17 @@ buffer header spinlock, which would have to be taken anyway to increment the
buffer reference count, so it's nearly free.)
The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
-through all the available buffers. nextVictimBuffer is protected by the
-buffer_strategy_lock.
+through all the available buffers. nextVictimBuffer and completePasses are
+atomic values.
The algorithm for a process that needs to obtain a victim buffer is:
-1. Obtain buffer_strategy_lock.
+1. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time.
-2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
-nextVictimBuffer for next time. Release buffer_strategy_lock.
-
-3. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used. Decrement its usage count (if nonzero), reacquire
-buffer_strategy_lock, and return to step 3 to examine the next buffer.
+2. If the selected buffer is pinned or has a nonzero usage count, it cannot be
+used. Decrement its usage count (if nonzero), return to step 1 to examine the
+next buffer.
4. Pin the selected buffer, and return.
@@ -196,9 +193,9 @@ Buffer Ring Replacement Strategy
When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
-again soon, so instead of running the normal clock sweep algorithm and
+again soon, so instead of running the normal clock-sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
-using the normal clock sweep algorithm and those buffers are reused for the
+using the normal clock-sweep algorithm and those buffers are reused for the
whole scan. This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.
@@ -244,13 +241,12 @@ nextVictimBuffer (which it does not change!), looking for buffers that are
dirty and not pinned nor marked with a positive usage count. It pins,
writes, and releases any such buffer.
-If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take buffer_strategy_lock in order to look
-for buffers to write; it needs only to spinlock each buffer header for long
-enough to check the dirtybit. Even without that assumption, the writer
-only needs to take the lock long enough to read the variable value, not
-while scanning the buffers. (This is a very substantial improvement in
-the contention cost of the writer compared to PG 8.0.)
+We enforce reading nextVictimBuffer within an atomic action so it needs only to
+spinlock each buffer header for long enough to check the dirtybit. Even
+without that assumption, the writer only needs to take the lock long enough to
+read the variable value, not while scanning the buffers. (This is a very
+substantial improvement in the contention cost of the writer compared to PG
+8.0.)
The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index af5ef025229..09d054a616f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3593,7 +3593,7 @@ BufferSync(int flags)
* This is called periodically by the background writer process.
*
* Returns true if it's appropriate for the bgwriter process to go into
- * low-power hibernation mode. (This happens if the strategy clock sweep
+ * low-power hibernation mode. (This happens if the strategy clock-sweep
* has been "lapped" and no buffer allocations have occurred recently,
* or if the bgwriter has been effectively disabled by setting
* bgwriter_lru_maxpages to 0.)
@@ -3643,7 +3643,7 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the clock sweep currently is, and how many buffer
+ * Find out where the clock-sweep currently is, and how many buffer
* allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
@@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
/*
* Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
* buffers we could scan before we'd catch up with it and "lap" it. Note:
* weird-looking coding of xxx_passes comparisons are to avoid bogus
* behavior when the passes counts wrap around.
*/
if (saved_info_valid)
{
- int32 passes_delta = strategy_passes - prev_strategy_passes;
+ int32 passes_delta;
+
+ /*
+ * It is highly unlikely that the uint64 hand of the clock-sweep
+ * would ever wrap: that would require roughly 10 years of
+ * continuous operation at ~59 billion ticks per second without a
+ * restart.
+ */
+ Assert(prev_strategy_passes <= strategy_passes);
+
+ passes_delta = strategy_passes - prev_strategy_passes;
strategy_delta = strategy_buf_id - prev_strategy_buf_id;
strategy_delta += (long) passes_delta * NBuffers;
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 162c140fb9d..906b35be4c1 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -27,23 +27,20 @@
/*
* The shared freelist control information.
*/
-typedef struct
-{
- /* Spinlock: protects the values below */
- slock_t buffer_strategy_lock;
-
+typedef struct {
/*
- * Clock sweep hand: index of next buffer to consider grabbing. Note that
- * this isn't a concrete buffer - we only ever increase the value. So, to
- * get an actual buffer, it needs to be used modulo NBuffers.
+ * The clock-sweep hand is atomically updated by 1 at every tick. Use the
+ * macro CLOCK_HAND_POSITION() to find the next victim's index in the
+ * BufferDescriptor array. To calculate the number of times the clock-sweep
+ * hand has made a complete pass through all available buffers in the pool,
+ * divide by NBuffers.
*/
- pg_atomic_uint32 nextVictimBuffer;
+ pg_atomic_uint64 nextVictimBuffer;
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
*/
- uint32 completePasses; /* Complete cycles of the clock sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
@@ -83,13 +80,15 @@ typedef struct BufferAccessStrategyData
Buffer buffers[FLEXIBLE_ARRAY_MEMBER];
} BufferAccessStrategyData;
-
/* Prototypes for internal functions */
static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
uint32 *buf_state);
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+#define CLOCK_HAND_POSITION(counter) \
+ ((counter) & 0xFFFFFFFF) % NBuffers
+
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
*
@@ -99,6 +98,7 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
static inline uint32
ClockSweepTick(void)
{
+ uint64 hand = UINT64_MAX;
uint32 victim;
/*
@@ -106,71 +106,14 @@ ClockSweepTick(void)
* doing this, this can lead to buffers being returned slightly out of
* apparent order.
*/
- victim =
- pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
-
- if (victim >= NBuffers)
- {
- uint32 originalVictim = victim;
-
- /* always wrap what we look up in BufferDescriptors */
- victim = victim % NBuffers;
+ hand = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);
- /*
- * If we're the one that just caused a wraparound, force
- * completePasses to be incremented while holding the spinlock. We
- * need the spinlock so StrategySyncStart() can return a consistent
- * value consisting of nextVictimBuffer and completePasses.
- */
- if (victim == 0)
- {
- uint32 expected;
- uint32 wrapped;
- bool success = false;
+ victim = CLOCK_HAND_POSITION(hand);
+ Assert(victim < NBuffers);
- expected = originalVictim + 1;
-
- while (!success)
- {
- /*
- * Acquire the spinlock while increasing completePasses. That
- * allows other readers to read nextVictimBuffer and
- * completePasses in a consistent manner which is required for
- * StrategySyncStart(). In theory delaying the increment
- * could lead to an overflow of nextVictimBuffers, but that's
- * highly unlikely and wouldn't be particularly harmful.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- wrapped = expected % NBuffers;
-
- success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
- &expected, wrapped);
- if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- }
- }
- }
return victim;
}
-/*
- * have_free_buffer -- check if we've filled the buffer pool at startup
- *
- * Used exclusively by autoprewarm.
- */
-bool
-have_free_buffer(void)
-{
- uint64 hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
-
- if (hand < NBuffers)
- return true;
- else
- return false;
-}
-
/*
* StrategyGetBuffer
*
@@ -193,10 +136,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*from_ring = false;
- /*
- * If given a strategy object, see whether it can select a buffer. We
- * assume strategy objects don't need buffer_strategy_lock.
- */
+ /* If given a strategy object, see whether it can select a buffer */
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
@@ -241,7 +181,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /* Use the "clock sweep" algorithm to find a free buffer */
+ /* Use the "clock-sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -292,37 +232,30 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* The result is the buffer index of the best buffer to sync first.
* BgBufferSync() will proceed circularly around the buffer array from there.
*
- * In addition, we return the completed-pass count (which is effectively
- * the higher-order bits of nextVictimBuffer) and the count of recent buffer
- * allocs if non-NULL pointers are passed. The alloc count is reset after
- * being read.
+ * In addition, we return the completed-pass count and the count of recent
+ * buffer allocs if non-NULL pointers are passed. The alloc count is reset
+ * after being read.
*/
-int
-StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
-{
- uint32 nextVictimBuffer;
- int result;
+uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc) {
+ uint64 counter = UINT64_MAX; uint32 result;
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
- nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
- result = nextVictimBuffer % NBuffers;
+ counter = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
+ result = CLOCK_HAND_POSITION(counter);
if (complete_passes)
{
- *complete_passes = StrategyControl->completePasses;
-
/*
- * Additionally add the number of wraparounds that happened before
- * completePasses could be incremented. C.f. ClockSweepTick().
+ * The number of complete passes is the counter divided by NBuffers
+ * because the clock hand is a 64-bit counter that only increases.
*/
- *complete_passes += nextVictimBuffer / NBuffers;
+ *complete_passes = (uint32) (counter / NBuffers);
}
if (num_buf_alloc)
{
*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
return result;
}
@@ -337,21 +270,14 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
void
StrategyNotifyBgWriter(int bgwprocno)
{
- /*
- * We acquire buffer_strategy_lock just to ensure that the store appears
- * atomic to StrategyGetBuffer. The bgwriter should call this rather
- * infrequently, so there's no performance penalty from being safe.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
StrategyControl->bgwprocno = bgwprocno;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
/*
* StrategyShmemSize
*
- * estimate the size of shared memory used by the freelist-related structures.
+ * Estimate the size of shared memory used by the freelist-related structures.
*
* Note: for somewhat historical reasons, the buffer lookup hashtable size
* is also determined here.
@@ -409,13 +335,10 @@ StrategyInitialize(bool init)
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
-
- /* Initialize the clock sweep pointer */
- pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+ /* Initialize combined clock-sweep pointer/complete passes counter */
+ pg_atomic_init_u64(&StrategyControl->nextVictimBuffer, 0);
/* Clear statistics */
- StrategyControl->completePasses = 0;
pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
/* No pending notification */
@@ -659,7 +582,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*
* If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
* since our own previous usage of the ring element would have left it
- * there, but it might've been decremented by clock sweep since then). A
+ * there, but it might've been decremented by clock-sweep since then). A
* higher usage_count indicates someone else has touched the buffer, so we
* shouldn't re-use it.
*/
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3da9c41ee1d..7a34f5e430a 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -229,7 +229,7 @@ GetLocalVictimBuffer(void)
ResourceOwnerEnlarge(CurrentResourceOwner);
/*
- * Need to get a new buffer. We use a clock sweep algorithm (essentially
+ * Need to get a new buffer. We use a clock-sweep algorithm (essentially
* the same as what freelist.c does now...)
*/
trycounter = NLocBuffer;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 00eade63971..97002acb757 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -81,7 +81,7 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
* accuracy and speed of the clock-sweep buffer management algorithm. A
* large value (comparable to NBuffers) would approximate LRU semantics.
* But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of
- * clock sweeps to find a free buffer, so in practice we don't want the
+ * clock-sweeps to find a free buffer, so in practice we don't want the
* value to be very large.
*/
#define BM_MAX_USAGE_COUNT 5
@@ -439,7 +439,7 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
-extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
--
2.49.0
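
To make the arithmetic in 0002 concrete, here is a minimal standalone sketch of the encoding, using C11 atomics and a made-up N_BUFFERS constant instead of the pg_atomic_* API and NBuffers; the function names are illustrative, not anything in the tree. The point is that both the victim index and the completed-pass count are derived from one monotonically increasing 64-bit counter, so the two values can never disagree the way separately maintained nextVictimBuffer and completePasses could:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define N_BUFFERS 16384            /* stand-in for NBuffers */

/* Single source of truth: a counter that only ever increases. */
static _Atomic uint64_t next_victim_counter = 0;

/* Advance the hand one tick and return the buffer index now under it. */
static uint32_t
clock_hand_tick(void)
{
    uint64_t ticks = atomic_fetch_add(&next_victim_counter, 1);

    return (uint32_t) (ticks % N_BUFFERS);
}

/* Number of complete passes the hand has made over the pool. */
static uint32_t
clock_hand_passes(void)
{
    return (uint32_t) (atomic_load(&next_victim_counter) / N_BUFFERS);
}

int
main(void)
{
    uint32_t    pos, passes;

    for (int i = 0; i < 3 * N_BUFFERS + 5; i++)
        (void) clock_hand_tick();

    pos = clock_hand_tick();
    passes = clock_hand_passes();
    printf("position=%u passes=%u\n", (unsigned) pos, (unsigned) passes);
    return 0;
}

Running it prints position=5 passes=3: the hand has wrapped the pool three times and is at position 5 of the fourth pass, all recovered from the single counter with a modulo and a division.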
v10-0003-Abstract-clock-sweep-buffer-replacement-algorith.patch
From 8195dd714cfa14ffd68a04a36ac5496eba421897 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Fri, 25 Jul 2025 11:53:10 -0400
Subject: [PATCH v10 3/3] Abstract clock-sweep buffer replacement algorithm
Re-author the clock-sweep algorithm such that it maintains its own state
and has a well-defined API.
---
src/backend/storage/buffer/README | 20 +++---
src/backend/storage/buffer/freelist.c | 88 ++++++++++++++-------------
src/tools/pgindent/typedefs.list | 1 +
3 files changed, 57 insertions(+), 52 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index d1ab222eeb8..3f31d04d572 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -166,14 +166,13 @@ small limit value) whenever the buffer is pinned. (This requires only the
buffer header spinlock, which would have to be taken anyway to increment the
buffer reference count, so it's nearly free.)
-The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
-through all the available buffers. nextVictimBuffer and completePasses are
-atomic values.
+The "clock hand" is a buffer index that moves circularly through all the
+available buffers.
The algorithm for a process that needs to obtain a victim buffer is:
-1. Select the buffer pointed to by nextVictimBuffer, and circularly advance
-nextVictimBuffer for next time.
+1. Select the buffer pointed to by the clock hand, and circularly advance it
+for next time.
2. If the selected buffer is pinned or has a nonzero usage count, it cannot be
used. Decrement its usage count (if nonzero), return to step 1 to examine the
@@ -235,13 +234,12 @@ Background Writer's Processing
------------------------------
The background writer is designed to write out pages that are likely to be
-recycled soon, thereby offloading the writing work from active backends.
-To do this, it scans forward circularly from the current position of
-nextVictimBuffer (which it does not change!), looking for buffers that are
-dirty and not pinned nor marked with a positive usage count. It pins,
-writes, and releases any such buffer.
+recycled soon, thereby offloading the writing work from active backends. To do
+this, it scans forward circularly from the current position of clock (which it
+does not change!), looking for buffers that are dirty and not pinned nor marked
+with a positive usage count. It pins, writes, and releases any such buffer.
-We enforce reading nextVictimBuffer within an atomic action so it needs only to
+We enforce reading the clock hand within an atomic action so it needs only to
spinlock each buffer header for long enough to check the dirtybit. Even
without that assumption, the writer only needs to take the lock long enough to
read the variable value, not while scanning the buffers. (This is a very
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 906b35be4c1..71839dfdee9 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -23,19 +23,22 @@
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
+typedef struct ClockSweep
+{
+ pg_atomic_uint64 counter; /* Only incremented by one */
+ uint32 size; /* Size of the clock */
+} ClockSweep;
/*
* The shared freelist control information.
*/
-typedef struct {
+typedef struct
+{
/*
- * The clock-sweep hand is atomically updated by 1 at every tick. Use the
- * macro CLOCK_HAND_POSITION() to find the next victim's index in the
- * BufferDescriptor array. To calculate the number of times the clock-sweep
- * hand has made a complete pass through all available buffers in the pool,
- * divide by NBuffers.
+ * The next buffer available for use is determined by the clock-sweep
+ * algorithm.
*/
- pg_atomic_uint64 nextVictimBuffer;
+ ClockSweep clock;
/*
* Statistics. These counters should be wide enough that they can't
@@ -86,32 +89,40 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
-#define CLOCK_HAND_POSITION(counter) \
- ((counter) & 0xFFFFFFFF) % NBuffers
+static void
+ClockSweepInit(ClockSweep *sweep, uint32 size)
+{
+ pg_atomic_init_u64(&sweep->counter, 0);
+ sweep->size = size;
+}
-/*
- * ClockSweepTick - Helper routine for StrategyGetBuffer()
- *
- * Move the clock hand one buffer ahead of its current position and return the
- * id of the buffer now under the hand.
- */
+/* Extract the number of complete cycles from the clock hand */
static inline uint32
-ClockSweepTick(void)
+ClockSweepCycles(ClockSweep *sweep)
{
- uint64 hand = UINT64_MAX;
- uint32 victim;
+ uint64 current = pg_atomic_read_u64(&sweep->counter);
- /*
- * Atomically move hand ahead one buffer - if there's several processes
- * doing this, this can lead to buffers being returned slightly out of
- * apparent order.
- */
- hand = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);
+ return current / sweep->size;
+}
+
+/* Return the current position of the clock's hand modulo size */
+static inline uint32
+ClockSweepPosition(ClockSweep *sweep)
+{
+ uint64 counter = pg_atomic_read_u64(&sweep->counter);
+
+ return ((counter) & 0xFFFFFFFF) % sweep->size;
+}
- victim = CLOCK_HAND_POSITION(hand);
- Assert(victim < NBuffers);
+/*
+ * Move the clock hand ahead one and return its new position.
+ */
+static inline uint32
+ClockSweepTick(ClockSweep *sweep)
+{
+ uint64 counter = pg_atomic_fetch_add_u64(&sweep->counter, 1);
- return victim;
+ return ((counter) & 0xFFFFFFFF) % sweep->size;
}
/*
@@ -181,11 +192,11 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /* Use the "clock-sweep" algorithm to find a free buffer */
+ /* Use the clock-sweep algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
- buf = GetBufferDescriptor(ClockSweepTick());
+ buf = GetBufferDescriptor(ClockSweepTick(&StrategyControl->clock));
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot use
@@ -236,19 +247,14 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* buffer allocs if non-NULL pointers are passed. The alloc count is reset
* after being read.
*/
-uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc) {
- uint64 counter = UINT64_MAX; uint32 result;
-
- counter = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
- result = CLOCK_HAND_POSITION(counter);
+uint32
+StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
+{
+ uint32 result = ClockSweepPosition(&StrategyControl->clock);
if (complete_passes)
{
- /*
- * The number of complete passes is the counter divided by NBuffers
- * because the clock hand is a 64-bit counter that only increases.
- */
- *complete_passes = (uint32) (counter / NBuffers);
+ *complete_passes = ClockSweepCycles(&StrategyControl->clock);
}
if (num_buf_alloc)
@@ -335,8 +341,8 @@ StrategyInitialize(bool init)
*/
Assert(init);
- /* Initialize combined clock-sweep pointer/complete passes counter */
- pg_atomic_init_u64(&StrategyControl->nextVictimBuffer, 0);
+ /* Initialize the clock-sweep algorithm */
+ ClockSweepInit(&StrategyControl->clock, NBuffers);
/* Clear statistics */
pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4353befab99..1cbfca592d9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -426,6 +426,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClockSweep
ClonePtrType
ClosePortalStmt
ClosePtrType
--
2.49.0
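
For readers who have not looked at the selection loop in a while: with the freelist gone (0001) and the hand wrapped behind ClockSweepTick() (0003), every victim search boils down to the sweep sketched below. This is a simplified, single-threaded illustration with toy arrays standing in for the packed buffer-header state, no header spinlocks, and a -1 return in place of the "no unpinned buffers available" error:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define N_BUFFERS 8                /* tiny pool for illustration */

/* Toy per-buffer state; the real code keeps this packed in BufferDesc.state. */
static int usage_count[N_BUFFERS] = {3, 0, 5, 1, 0, 2, 0, 4};
static int ref_count[N_BUFFERS]   = {0, 1, 0, 0, 0, 0, 1, 0};

static _Atomic uint64_t hand = 0;

/* One clock tick: advance the hand and return the buffer index under it. */
static int
tick(void)
{
    return (int) (atomic_fetch_add(&hand, 1) % N_BUFFERS);
}

/* Sweep until an unpinned buffer with a zero usage count turns up. */
static int
get_victim(void)
{
    int trycounter = N_BUFFERS;

    for (;;)
    {
        int buf = tick();

        if (ref_count[buf] == 0)
        {
            if (usage_count[buf] == 0)
                return buf;             /* found a victim */
            usage_count[buf]--;         /* give it another chance */
            trycounter = N_BUFFERS;     /* saw an unpinned buffer, keep going */
        }
        else if (--trycounter == 0)
            return -1;                  /* every buffer seen lately was pinned */
    }
}

int
main(void)
{
    printf("victim = %d\n", get_victim());  /* prints victim = 4 here */
    return 0;
}

Because trycounter resets whenever an unpinned buffer is seen, the sweep only gives up after examining N_BUFFERS pinned buffers in a row, which mirrors the trycounter logic in StrategyGetBuffer().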
On 7/25/25 15:02, Greg Burd wrote:
Patch set is now:
1) remove freelist
2) remove buffer_strategy_lock
3) abstract clock-sweep to type and API
-greg
Somehow, including the test.c file as an attachment in my last email
confused the CI and it didn't test the v10 patch set (which did pass in
GitHub CI on my fork [1]https://github.com/gburd/postgres/pull/10/checks). Here's v11, unchanged from v10 except rebased
onto 258bf0a2ea8; commitfest entry 5928 (PG 19-2) [2]https://commitfest.postgresql.org/patch/5928/.
best,
-greg
Attachments:
v11-0001-Eliminate-the-freelist-from-the-buffer-manager-a.patch
From 706dcef951f4ec33b492ace481d31aacb02fd650 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Thu, 10 Jul 2025 14:45:32 -0400
Subject: [PATCH v11 1/3] Eliminate the freelist from the buffer manager and
depend on clock-sweep
This set of changes removes the list of available buffers and instead
simply uses the clock-sweep algorithm to find and return an available
buffer. While on the surface this appears to remove an optimization, it
in fact eliminates code that induces overhead in the form of
synchronization that is problematic for multi-core systems.
This also removes the have_free_buffer() function and simply caps the
pg_autoprewarm process at prewarming at most NBuffers blocks.
---
contrib/pg_prewarm/autoprewarm.c | 31 ++++---
src/backend/storage/buffer/README | 42 +++------
src/backend/storage/buffer/buf_init.c | 9 --
src/backend/storage/buffer/bufmgr.c | 29 +------
src/backend/storage/buffer/freelist.c | 120 +++-----------------------
src/include/storage/buf_internals.h | 12 +--
6 files changed, 43 insertions(+), 200 deletions(-)
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index c01b9c7e6a4..2722b0bb443 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -370,6 +370,16 @@ apw_load_buffers(void)
apw_state->prewarm_start_idx = apw_state->prewarm_stop_idx = 0;
apw_state->prewarmed_blocks = 0;
+
+ /* Don't prewarm more than we can fit. */
+ if (num_elements > NBuffers)
+ {
+ num_elements = NBuffers;
+ ereport(LOG,
+ (errmsg("autoprewarm: capping prewarmed blocks to %d (shared_buffers size)",
+ NBuffers)));
+ }
+
/* Get the info position of the first block of the next database. */
while (apw_state->prewarm_start_idx < num_elements)
{
@@ -410,10 +420,6 @@ apw_load_buffers(void)
apw_state->database = current_db;
Assert(apw_state->prewarm_start_idx < apw_state->prewarm_stop_idx);
- /* If we've run out of free buffers, don't launch another worker. */
- if (!have_free_buffer())
- break;
-
/*
* Likewise, don't launch if we've already been told to shut down.
* (The launch would fail anyway, but we might as well skip it.)
@@ -462,12 +468,6 @@ apw_read_stream_next_block(ReadStream *stream,
{
BlockInfoRecord blk = p->block_info[p->pos];
- if (!have_free_buffer())
- {
- p->pos = apw_state->prewarm_stop_idx;
- return InvalidBlockNumber;
- }
-
if (blk.tablespace != p->tablespace)
return InvalidBlockNumber;
@@ -523,10 +523,10 @@ autoprewarm_database_main(Datum main_arg)
blk = block_info[i];
/*
- * Loop until we run out of blocks to prewarm or until we run out of free
+ * Loop until we run out of blocks to prewarm or until we run out of
* buffers.
*/
- while (i < apw_state->prewarm_stop_idx && have_free_buffer())
+ while (i < apw_state->prewarm_stop_idx)
{
Oid tablespace = blk.tablespace;
RelFileNumber filenumber = blk.filenumber;
@@ -568,14 +568,13 @@ autoprewarm_database_main(Datum main_arg)
/*
* We have a relation; now let's loop until we find a valid fork of
- * the relation or we run out of free buffers. Once we've read from
- * all valid forks or run out of options, we'll close the relation and
+ * the relation or we run out of buffers. Once we've read from all
+ * valid forks or run out of options, we'll close the relation and
* move on.
*/
while (i < apw_state->prewarm_stop_idx &&
blk.tablespace == tablespace &&
- blk.filenumber == filenumber &&
- have_free_buffer())
+ blk.filenumber == filenumber)
{
ForkNumber forknum = blk.forknum;
BlockNumber nblocks;
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a182fcd660c..cd52effd911 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -128,11 +128,11 @@ independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement. A spinlock is used here rather than a lightweight
-lock for efficiency; no other locks of any sort should be acquired while
-buffer_strategy_lock is held. This is essential to allow buffer replacement
-to happen in multiple backends with reasonable concurrency.
+exclusion for operations that select buffers for replacement. A spinlock is
+used here rather than a lightweight lock for efficiency; no other locks of any
+sort should be acquired while buffer_strategy_lock is held. This is essential
+to allow buffer replacement to happen in multiple backends with reasonable
+concurrency.
* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
@@ -158,18 +158,9 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-There is a "free list" of buffers that are prime candidates for replacement.
-In particular, buffers that are completely free (contain no valid page) are
-always in this list. We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that. The list is singly-linked using fields in the
-buffer headers; we maintain head and tail pointers in global variables.
-(Note: although the list links are in the buffer headers, they are
-considered to be protected by the buffer_strategy_lock, not the buffer-header
-spinlocks.) To choose a victim buffer to recycle when there are no free
-buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations. It works like
-this:
+To choose a victim buffer to recycle when there are no free buffers available,
+we use a simple clock-sweep algorithm, which avoids the need to take
+system-wide locks during common operations. It works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -184,20 +175,14 @@ The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain buffer_strategy_lock.
-2. If buffer free list is nonempty, remove its head buffer. Release
-buffer_strategy_lock. If the buffer is pinned or has a nonzero usage count,
-it cannot be used; ignore it go back to step 1. Otherwise, pin the buffer,
-and return it.
+2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time. Release buffer_strategy_lock.
-3. Otherwise, the buffer free list is empty. Select the buffer pointed to by
-nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
-Release buffer_strategy_lock.
-
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
+3. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used. Decrement its usage count (if nonzero), reacquire
buffer_strategy_lock, and return to step 3 to examine the next buffer.
-5. Pin the selected buffer, and return.
+4. Pin the selected buffer, and return.
(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -234,7 +219,7 @@ the ring strategy effectively degrades to the normal strategy.
VACUUM uses a ring like sequential scans, however, the size of this ring is
controlled by the vacuum_buffer_usage_limit GUC. Dirty pages are not removed
-from the ring. Instead, WAL is flushed if needed to allow reuse of the
+from the ring. Instead, the WAL is flushed if needed to allow reuse of the
buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's buffers
were sent to the freelist, which was effectively a buffer ring of 1 buffer,
resulting in excessive WAL flushing.
@@ -277,3 +262,4 @@ As of 8.4, background writer starts during recovery mode when there is
some form of potentially extended recovery to perform. It performs an
identical service to normal processing, except that checkpoints it
writes are technically restartpoints.
+
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..6fd3a6bbac5 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -128,20 +128,11 @@ BufferManagerShmemInit(void)
pgaio_wref_clear(&buf->io_wref);
- /*
- * Initially link all the buffers together as unused. Subsequent
- * management of this list is done by freelist.c.
- */
- buf->freeNext = i + 1;
-
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
-
- /* Correct last entry of linked list */
- GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
/* Init other shared buffer-management stuff */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6afdd28dba6..af5ef025229 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2099,12 +2099,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
UnpinBuffer(victim_buf_hdr);
- /*
- * The victim buffer we acquired previously is clean and unused, let
- * it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
-
/* remaining code should match code at top of routine */
existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
@@ -2163,8 +2157,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
/*
- * InvalidateBuffer -- mark a shared buffer invalid and return it to the
- * freelist.
+ * InvalidateBuffer -- mark a shared buffer invalid.
*
* The buffer header spinlock must be held at entry. We drop it before
* returning. (This is sane because the caller must have locked the
@@ -2262,11 +2255,6 @@ retry:
* Done with mapping lock.
*/
LWLockRelease(oldPartitionLock);
-
- /*
- * Insert the buffer at the head of the list of free buffers.
- */
- StrategyFreeBuffer(buf);
}
/*
@@ -2684,11 +2672,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
{
BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(buf_hdr);
UnpinBuffer(buf_hdr);
}
@@ -2763,12 +2746,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
valid = PinBuffer(existing_hdr, strategy);
LWLockRelease(partition_lock);
-
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
UnpinBuffer(victim_buf_hdr);
buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
@@ -3666,8 +3643,8 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the freelist clock sweep currently is, and how many
- * buffer allocations have happened since our last call.
+ * Find out where the clock sweep currently is, and how many buffer
+ * allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..162c140fb9d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -39,14 +39,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -164,17 +156,16 @@ ClockSweepTick(void)
}
/*
- * have_free_buffer -- a lockless check to see if there is a free buffer in
- * buffer pool.
+ * have_free_buffer -- check if we've filled the buffer pool at startup
*
- * If the result is true that will become stale once free buffers are moved out
- * by other operations, so the caller who strictly want to use a free buffer
- * should not call this.
+ * Used exclusively by autoprewarm.
*/
bool
have_free_buffer(void)
{
- if (StrategyControl->firstFreeBuffer >= 0)
+ uint64 hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
+
+ if (hand < NBuffers)
return true;
else
return false;
@@ -243,75 +234,14 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
/*
- * We count buffer allocation requests so that the bgwriter can estimate
- * the rate of buffer consumption. Note that buffers recycled by a
- * strategy object are intentionally not counted here.
+ * We keep an approximate count of buffer allocation requests so that the
+ * bgwriter can estimate the rate of buffer consumption. Note that
+ * buffers recycled by a strategy object are intentionally not counted
+ * here.
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /*
- * First check, without acquiring the lock, whether there's buffers in the
- * freelist. Since we otherwise don't require the spinlock in every
- * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
- * uselessly in most cases. That obviously leaves a race where a buffer is
- * put on the freelist but we don't see the store yet - but that's pretty
- * harmless, it'll just get used during the next buffer acquisition.
- *
- * If there's buffers on the freelist, acquire the spinlock to pop one
- * buffer of the freelist. Then check whether that buffer is usable and
- * repeat if not.
- *
- * Note that the freeNext fields are considered to be protected by the
- * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
- * manipulate them without holding the spinlock.
- */
- if (StrategyControl->firstFreeBuffer >= 0)
- {
- while (true)
- {
- /* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- if (StrategyControl->firstFreeBuffer < 0)
- {
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- break;
- }
-
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
-
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
-
- /*
- * Release the lock so someone else can access the freelist while
- * we check out this buffer.
- */
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot
- * use it; discard it and retry. (This can only happen if VACUUM
- * put a valid buffer in the freelist and then someone else used
- * it before we got to it. It's probably impossible altogether as
- * of 8.3, but we'd better check anyway.)
- */
- local_buf_state = LockBufHdr(buf);
- if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
- && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
- {
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- *buf_state = local_buf_state;
- return buf;
- }
- UnlockBufHdr(buf, local_buf_state);
- }
- }
-
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /* Use the "clock sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -356,29 +286,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
-/*
- * StrategyFreeBuffer: put a buffer on the freelist
- */
-void
-StrategyFreeBuffer(BufferDesc *buf)
-{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- /*
- * It is possible that we are told to put something in the freelist that
- * is already in it; don't screw up the list if so.
- */
- if (buf->freeNext == FREENEXT_NOT_IN_LIST)
- {
- buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
- StrategyControl->firstFreeBuffer = buf->buf_id;
- }
-
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-}
-
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -504,13 +411,6 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
-
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..00eade63971 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -217,8 +217,7 @@ BufMappingPartitionLockByIndex(uint32 index)
* single atomic variable. This layout allow us to do some operations in a
* single atomic operation, without actually acquiring and releasing spinlock;
* for instance, increase or decrease refcount. buf_id field never changes
- * after initialization, so does not need locking. freeNext is protected by
- * the buffer_strategy_lock not buffer header lock. The LWLock can take care
+ * after initialization, so does not need locking. The LWLock can take care
* of itself. The buffer header lock is *not* used to control access to the
* data in the buffer!
*
@@ -264,7 +263,6 @@ typedef struct BufferDesc
pg_atomic_uint32 state;
int wait_backend_pgprocno; /* backend of pin-count waiter */
- int freeNext; /* link in freelist chain */
PgAioWaitRef io_wref; /* set iff AIO is in progress */
LWLock content_lock; /* to lock access to buffer contents */
@@ -360,13 +358,6 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc)
return (LWLock *) (&bdesc->content_lock);
}
-/*
- * The freeNext field is either the index of the next freelist entry,
- * or one of these special values:
- */
-#define FREENEXT_END_OF_LIST (-1)
-#define FREENEXT_NOT_IN_LIST (-2)
-
/*
* Functions for acquiring/releasing a shared buffer header's spinlock. Do
* not apply these to local buffers!
@@ -453,7 +444,6 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
-extern bool have_free_buffer(void);
/* buf_table.c */
extern Size BufTableShmemSize(int size);
--
2.49.0
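
To show what BgBufferSync() does with the position/pass pair that StrategySyncStart() now derives from the single 64-bit counter (see the 0002 patch), here is a small worked example; SyncPoint and strategy_delta() are illustrative names only, not anything in the tree. With NBuffers = 1000, moving from pass 4 at position 900 to pass 5 at position 150 means the sweep covered 250 buffers:

#include <stdint.h>
#include <stdio.h>

#define N_BUFFERS 1000             /* stand-in for NBuffers */

/* What StrategySyncStart() reports: completed passes and hand position. */
typedef struct
{
    uint32_t passes;               /* complete clock-sweep passes */
    int      buf_id;               /* hand position, 0 .. N_BUFFERS-1 */
} SyncPoint;

/* Buffers scanned by the clock-sweep between two observations. */
static long
strategy_delta(SyncPoint prev, SyncPoint cur)
{
    int32_t passes_delta = (int32_t) (cur.passes - prev.passes);
    long    delta = cur.buf_id - prev.buf_id;

    return delta + (long) passes_delta * N_BUFFERS;
}

int
main(void)
{
    SyncPoint prev = {.passes = 4, .buf_id = 900};
    SyncPoint cur  = {.passes = 5, .buf_id = 150};

    /* 150 - 900 + 1 * 1000 = 250 buffers swept since the last call */
    printf("delta = %ld\n", strategy_delta(prev, cur));
    return 0;
}

This is the same computation the existing code performs with strategy_passes and strategy_buf_id; with the 64-bit hand, the Assert added in 0002 that prev_strategy_passes <= strategy_passes documents that the pass count can no longer wrap in practice.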
v11-0002-Remove-the-buffer_strategy_lock-and-make-the-clo.patch
From fdf3e231bb2bb07d29cf486a616ea8c4ae365abe Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Fri, 11 Jul 2025 09:05:45 -0400
Subject: [PATCH v11 2/3] Remove the buffer_strategy_lock and make the clock
hand a 64 bit atomic
Change nextVictimBuffer to an atomic uint64 and simply atomically
increment it by 1 at each tick. The next victim buffer is the value
of nextVictimBuffer modulo the number of buffers (NBuffers). The number
of complete passes of the clock-sweep hand is nextVictimBuffer divided
by NBuffers. Wrap-around of nextVictimBuffer would require roughly 10
years of continuous operation at ~59 billion ticks per second without a
restart, which is so unlikely that we ignore that case entirely.
With the removal of the freelist and completePasses, none of the remaining
items in the BufferStrategyControl structure require strict coordination,
so it is possible to eliminate the buffer_strategy_lock as well.
---
src/backend/storage/buffer/README | 48 ++++-----
src/backend/storage/buffer/bufmgr.c | 20 +++-
src/backend/storage/buffer/freelist.c | 139 ++++++--------------------
src/backend/storage/buffer/localbuf.c | 2 +-
src/include/storage/buf_internals.h | 4 +-
5 files changed, 71 insertions(+), 142 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index cd52effd911..d1ab222eeb8 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -127,11 +127,10 @@ bits of the tag's hash value. The rules stated above apply to each partition
independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
-* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that select buffers for replacement. A spinlock is
-used here rather than a lightweight lock for efficiency; no other locks of any
-sort should be acquired while buffer_strategy_lock is held. This is essential
-to allow buffer replacement to happen in multiple backends with reasonable
+* Operations that select buffers for replacement don't require a lock, but
+rather use atomic operations to ensure coordination across backends when
+accessing members of the BufferStrategyControl data structure. This allows
+buffer replacement to happen in multiple backends with reasonable
concurrency.
* Each buffer header contains a spinlock that must be taken when examining
@@ -158,9 +157,9 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-To choose a victim buffer to recycle when there are no free buffers available,
-we use a simple clock-sweep algorithm, which avoids the need to take
-system-wide locks during common operations. It works like this:
+To choose a victim buffer to recycle we use a simple clock-sweep algorithm,
+which avoids the need to take system-wide locks during common operations. It
+works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -168,19 +167,17 @@ buffer header spinlock, which would have to be taken anyway to increment the
buffer reference count, so it's nearly free.)
The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
-through all the available buffers. nextVictimBuffer is protected by the
-buffer_strategy_lock.
+through all the available buffers. nextVictimBuffer and completePasses are
+atomic values.
The algorithm for a process that needs to obtain a victim buffer is:
-1. Obtain buffer_strategy_lock.
+1. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time.
-2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
-nextVictimBuffer for next time. Release buffer_strategy_lock.
-
-3. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used. Decrement its usage count (if nonzero), reacquire
-buffer_strategy_lock, and return to step 3 to examine the next buffer.
+2. If the selected buffer is pinned or has a nonzero usage count, it cannot be
+used. Decrement its usage count (if nonzero), return to step 1 to examine the
+next buffer.
4. Pin the selected buffer, and return.
@@ -196,9 +193,9 @@ Buffer Ring Replacement Strategy
When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
-again soon, so instead of running the normal clock sweep algorithm and
+again soon, so instead of running the normal clock-sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
-using the normal clock sweep algorithm and those buffers are reused for the
+using the normal clock-sweep algorithm and those buffers are reused for the
whole scan. This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.
@@ -244,13 +241,12 @@ nextVictimBuffer (which it does not change!), looking for buffers that are
dirty and not pinned nor marked with a positive usage count. It pins,
writes, and releases any such buffer.
-If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take buffer_strategy_lock in order to look
-for buffers to write; it needs only to spinlock each buffer header for long
-enough to check the dirtybit. Even without that assumption, the writer
-only needs to take the lock long enough to read the variable value, not
-while scanning the buffers. (This is a very substantial improvement in
-the contention cost of the writer compared to PG 8.0.)
+We enforce reading nextVictimBuffer within an atomic action so it needs only to
+spinlock each buffer header for long enough to check the dirtybit. Even
+without that assumption, the writer only needs to take the lock long enough to
+read the variable value, not while scanning the buffers. (This is a very
+substantial improvement in the contention cost of the writer compared to PG
+8.0.)
The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index af5ef025229..09d054a616f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3593,7 +3593,7 @@ BufferSync(int flags)
* This is called periodically by the background writer process.
*
* Returns true if it's appropriate for the bgwriter process to go into
- * low-power hibernation mode. (This happens if the strategy clock sweep
+ * low-power hibernation mode. (This happens if the strategy clock-sweep
* has been "lapped" and no buffer allocations have occurred recently,
* or if the bgwriter has been effectively disabled by setting
* bgwriter_lru_maxpages to 0.)
@@ -3643,7 +3643,7 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the clock sweep currently is, and how many buffer
+ * Find out where the clock-sweep currently is, and how many buffer
* allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
@@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
/*
* Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
* buffers we could scan before we'd catch up with it and "lap" it. Note:
* weird-looking coding of xxx_passes comparisons are to avoid bogus
* behavior when the passes counts wrap around.
*/
if (saved_info_valid)
{
- int32 passes_delta = strategy_passes - prev_strategy_passes;
+ int32 passes_delta;
+
+ /*
+ * It is highly unlikely that the uint64 hand of the clock-sweep
+ * would ever wrap; that would roughly require 10 years of
+ * continuous operation at ~59 billion ticks per second without
+ * restart.
+ */
+ Assert(prev_strategy_passes <= strategy_passes);
+
+ passes_delta = strategy_passes - prev_strategy_passes;
strategy_delta = strategy_buf_id - prev_strategy_buf_id;
strategy_delta += (long) passes_delta * NBuffers;
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 162c140fb9d..906b35be4c1 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -27,23 +27,20 @@
/*
* The shared freelist control information.
*/
-typedef struct
-{
- /* Spinlock: protects the values below */
- slock_t buffer_strategy_lock;
-
+typedef struct {
/*
- * Clock sweep hand: index of next buffer to consider grabbing. Note that
- * this isn't a concrete buffer - we only ever increase the value. So, to
- * get an actual buffer, it needs to be used modulo NBuffers.
+ * The clock-sweep hand is atomically updated by 1 at every tick. Use the
+ * macro CLOCK_HAND_POSITION() to find the next victim's index in the
+ * BufferDescriptor array. To calculate the number of times the clock-sweep
+ * hand has made a complete pass through all available buffers in the pool
+ * divide NBuffers.
*/
- pg_atomic_uint32 nextVictimBuffer;
+ pg_atomic_uint64 nextVictimBuffer;
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
*/
- uint32 completePasses; /* Complete cycles of the clock sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
@@ -83,13 +80,15 @@ typedef struct BufferAccessStrategyData
Buffer buffers[FLEXIBLE_ARRAY_MEMBER];
} BufferAccessStrategyData;
-
/* Prototypes for internal functions */
static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
uint32 *buf_state);
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+#define CLOCK_HAND_POSITION(counter) \
+ ((counter) & 0xFFFFFFFF) % NBuffers
+
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
*
@@ -99,6 +98,7 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
static inline uint32
ClockSweepTick(void)
{
+ uint64 hand = UINT64_MAX;
uint32 victim;
/*
@@ -106,71 +106,14 @@ ClockSweepTick(void)
* doing this, this can lead to buffers being returned slightly out of
* apparent order.
*/
- victim =
- pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
-
- if (victim >= NBuffers)
- {
- uint32 originalVictim = victim;
-
- /* always wrap what we look up in BufferDescriptors */
- victim = victim % NBuffers;
+ hand = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);
- /*
- * If we're the one that just caused a wraparound, force
- * completePasses to be incremented while holding the spinlock. We
- * need the spinlock so StrategySyncStart() can return a consistent
- * value consisting of nextVictimBuffer and completePasses.
- */
- if (victim == 0)
- {
- uint32 expected;
- uint32 wrapped;
- bool success = false;
+ victim = CLOCK_HAND_POSITION(hand);
+ Assert(victim < NBuffers);
- expected = originalVictim + 1;
-
- while (!success)
- {
- /*
- * Acquire the spinlock while increasing completePasses. That
- * allows other readers to read nextVictimBuffer and
- * completePasses in a consistent manner which is required for
- * StrategySyncStart(). In theory delaying the increment
- * could lead to an overflow of nextVictimBuffers, but that's
- * highly unlikely and wouldn't be particularly harmful.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- wrapped = expected % NBuffers;
-
- success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
- &expected, wrapped);
- if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- }
- }
- }
return victim;
}
-/*
- * have_free_buffer -- check if we've filled the buffer pool at startup
- *
- * Used exclusively by autoprewarm.
- */
-bool
-have_free_buffer(void)
-{
- uint64 hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
-
- if (hand < NBuffers)
- return true;
- else
- return false;
-}
-
/*
* StrategyGetBuffer
*
@@ -193,10 +136,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*from_ring = false;
- /*
- * If given a strategy object, see whether it can select a buffer. We
- * assume strategy objects don't need buffer_strategy_lock.
- */
+ /* If given a strategy object, see whether it can select a buffer */
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
@@ -241,7 +181,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /* Use the "clock sweep" algorithm to find a free buffer */
+ /* Use the "clock-sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -292,37 +232,30 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* The result is the buffer index of the best buffer to sync first.
* BgBufferSync() will proceed circularly around the buffer array from there.
*
- * In addition, we return the completed-pass count (which is effectively
- * the higher-order bits of nextVictimBuffer) and the count of recent buffer
- * allocs if non-NULL pointers are passed. The alloc count is reset after
- * being read.
+ * In addition, we return the completed-pass count and the count of recent
+ * buffer allocs if non-NULL pointers are passed. The alloc count is reset
+ * after being read.
*/
-int
-StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
-{
- uint32 nextVictimBuffer;
- int result;
+uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc) {
+ uint64 counter = UINT64_MAX; uint32 result;
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
- nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
- result = nextVictimBuffer % NBuffers;
+ counter = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
+ result = CLOCK_HAND_POSITION(counter);
if (complete_passes)
{
- *complete_passes = StrategyControl->completePasses;
-
/*
- * Additionally add the number of wraparounds that happened before
- * completePasses could be incremented. C.f. ClockSweepTick().
+ * The number of complete passes is the counter divided by NBuffers
+ * because the clock hand is a 64-bit counter that only increases.
*/
- *complete_passes += nextVictimBuffer / NBuffers;
+ *complete_passes = (uint32) (counter / NBuffers);
}
if (num_buf_alloc)
{
*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
}
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
return result;
}
@@ -337,21 +270,14 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
void
StrategyNotifyBgWriter(int bgwprocno)
{
- /*
- * We acquire buffer_strategy_lock just to ensure that the store appears
- * atomic to StrategyGetBuffer. The bgwriter should call this rather
- * infrequently, so there's no performance penalty from being safe.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
StrategyControl->bgwprocno = bgwprocno;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
/*
* StrategyShmemSize
*
- * estimate the size of shared memory used by the freelist-related structures.
+ * Estimate the size of shared memory used by the freelist-related structures.
*
* Note: for somewhat historical reasons, the buffer lookup hashtable size
* is also determined here.
@@ -409,13 +335,10 @@ StrategyInitialize(bool init)
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
-
- /* Initialize the clock sweep pointer */
- pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+ /* Initialize combined clock-sweep pointer/complete passes counter */
+ pg_atomic_init_u64(&StrategyControl->nextVictimBuffer, 0);
/* Clear statistics */
- StrategyControl->completePasses = 0;
pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
/* No pending notification */
@@ -659,7 +582,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*
* If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
* since our own previous usage of the ring element would have left it
- * there, but it might've been decremented by clock sweep since then). A
+ * there, but it might've been decremented by clock-sweep since then). A
* higher usage_count indicates someone else has touched the buffer, so we
* shouldn't re-use it.
*/
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3da9c41ee1d..7a34f5e430a 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -229,7 +229,7 @@ GetLocalVictimBuffer(void)
ResourceOwnerEnlarge(CurrentResourceOwner);
/*
- * Need to get a new buffer. We use a clock sweep algorithm (essentially
+ * Need to get a new buffer. We use a clock-sweep algorithm (essentially
* the same as what freelist.c does now...)
*/
trycounter = NLocBuffer;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 00eade63971..97002acb757 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -81,7 +81,7 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
* accuracy and speed of the clock-sweep buffer management algorithm. A
* large value (comparable to NBuffers) would approximate LRU semantics.
* But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of
- * clock sweeps to find a free buffer, so in practice we don't want the
+ * clock-sweeps to find a free buffer, so in practice we don't want the
* value to be very large.
*/
#define BM_MAX_USAGE_COUNT 5
@@ -439,7 +439,7 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
-extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
--
2.49.0
v11-0003-Abstract-clock-sweep-buffer-replacement-algorith.patch
From 3c01069c5e51cb810949289cbe7833b10632c448 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Fri, 25 Jul 2025 11:53:10 -0400
Subject: [PATCH v11 3/3] Abstract clock-sweep buffer replacement algorithm
Re-author the clock-sweep algorithm such that it maintains its own state
and has a well-defined API.
---
src/backend/storage/buffer/README | 20 +++---
src/backend/storage/buffer/freelist.c | 88 ++++++++++++++-------------
src/tools/pgindent/typedefs.list | 1 +
3 files changed, 57 insertions(+), 52 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index d1ab222eeb8..3f31d04d572 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -166,14 +166,13 @@ small limit value) whenever the buffer is pinned. (This requires only the
buffer header spinlock, which would have to be taken anyway to increment the
buffer reference count, so it's nearly free.)
-The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
-through all the available buffers. nextVictimBuffer and completePasses are
-atomic values.
+The "clock hand" is a buffer index that moves circularly through all the
+available buffers.
The algorithm for a process that needs to obtain a victim buffer is:
-1. Select the buffer pointed to by nextVictimBuffer, and circularly advance
-nextVictimBuffer for next time.
+1. Select the buffer pointed to by the clock hand, and circularly advance it
+for next time.
2. If the selected buffer is pinned or has a nonzero usage count, it cannot be
used. Decrement its usage count (if nonzero), return to step 3 to examine the
@@ -235,13 +234,12 @@ Background Writer's Processing
------------------------------
The background writer is designed to write out pages that are likely to be
-recycled soon, thereby offloading the writing work from active backends.
-To do this, it scans forward circularly from the current position of
-nextVictimBuffer (which it does not change!), looking for buffers that are
-dirty and not pinned nor marked with a positive usage count. It pins,
-writes, and releases any such buffer.
+recycled soon, thereby offloading the writing work from active backends. To do
+this, it scans forward circularly from the current position of clock (which it
+does not change!), looking for buffers that are dirty and not pinned nor marked
+with a positive usage count. It pins, writes, and releases any such buffer.
-We enforce reading nextVictimBuffer within an atomic action so it needs only to
+We enforce reading the clock hand within an atomic action so it needs only to
spinlock each buffer header for long enough to check the dirtybit. Even
without that assumption, the writer only needs to take the lock long enough to
read the variable value, not while scanning the buffers. (This is a very
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 906b35be4c1..71839dfdee9 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -23,19 +23,22 @@
#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var))))
+typedef struct ClockSweep
+{
+ pg_atomic_uint64 counter; /* Only incremented by one */
+ uint32_t size; /* Size of the clock */
+} ClockSweep;
/*
* The shared freelist control information.
*/
-typedef struct {
+typedef struct
+{
/*
- * The clock-sweep hand is atomically updated by 1 at every tick. Use the
- * macro CLOCK_HAND_POSITION() to find the next victim's index in the
- * BufferDescriptor array. To calculate the number of times the clock-sweep
- * hand has made a complete pass through all available buffers in the pool
- * divide NBuffers.
+ * The next buffer available for use is determined by the clock-sweep
+ * algorithm.
*/
- pg_atomic_uint64 nextVictimBuffer;
+ ClockSweep clock;
/*
* Statistics. These counters should be wide enough that they can't
@@ -86,32 +89,40 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
-#define CLOCK_HAND_POSITION(counter) \
- ((counter) & 0xFFFFFFFF) % NBuffers
+static void
+ClockSweepInit(ClockSweep *sweep, uint32 size)
+{
+ pg_atomic_init_u64(&sweep->counter, 0);
+ sweep->size = size;
+}
-/*
- * ClockSweepTick - Helper routine for StrategyGetBuffer()
- *
- * Move the clock hand one buffer ahead of its current position and return the
- * id of the buffer now under the hand.
- */
+/* Extract the number of complete cycles from the clock hand */
static inline uint32
-ClockSweepTick(void)
+ClockSweepCycles(ClockSweep *sweep)
{
- uint64 hand = UINT64_MAX;
- uint32 victim;
+ uint64 current = pg_atomic_read_u64(&sweep->counter);
- /*
- * Atomically move hand ahead one buffer - if there's several processes
- * doing this, this can lead to buffers being returned slightly out of
- * apparent order.
- */
- hand = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);
+ return current / sweep->size;
+}
+
+/* Return the current position of the clock's hand modulo size */
+static inline uint32
+ClockSweepPosition(ClockSweep *sweep)
+{
+ uint64 counter = pg_atomic_read_u64(&sweep->counter);
+
+ return ((counter) & 0xFFFFFFFF) % sweep->size;
+}
- victim = CLOCK_HAND_POSITION(hand);
- Assert(victim < NBuffers);
+/*
+ * Move the clock hand ahead one and return its new position.
+ */
+static inline uint32
+ClockSweepTick(ClockSweep *sweep)
+{
+ uint64 counter = pg_atomic_fetch_add_u64(&sweep->counter, 1);
- return victim;
+ return ((counter) & 0xFFFFFFFF) % sweep->size;
}
/*
@@ -181,11 +192,11 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /* Use the "clock-sweep" algorithm to find a free buffer */
+ /* Use the clock-sweep algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
- buf = GetBufferDescriptor(ClockSweepTick());
+ buf = GetBufferDescriptor(ClockSweepTick(&StrategyControl->clock));
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot use
@@ -236,19 +247,14 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* buffer allocs if non-NULL pointers are passed. The alloc count is reset
* after being read.
*/
-uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc) {
- uint64 counter = UINT64_MAX; uint32 result;
-
- counter = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
- result = CLOCK_HAND_POSITION(counter);
+uint32
+StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
+{
+ uint32 result = ClockSweepPosition(&StrategyControl->clock);
if (complete_passes)
{
- /*
- * The number of complete passes is the counter divided by NBuffers
- * because the clock hand is a 64-bit counter that only increases.
- */
- *complete_passes = (uint32) (counter / NBuffers);
+ *complete_passes = ClockSweepCycles(&StrategyControl->clock);
}
if (num_buf_alloc)
@@ -335,8 +341,8 @@ StrategyInitialize(bool init)
*/
Assert(init);
- /* Initialize combined clock-sweep pointer/complete passes counter */
- pg_atomic_init_u64(&StrategyControl->nextVictimBuffer, 0);
+ /* Initialize the clock-sweep algorithm */
+ ClockSweepInit(&StrategyControl->clock, NBuffers);
/* Clear statistics */
pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3daba26b237..3ac88067e21 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -426,6 +426,7 @@ ClientCertName
ClientConnectionInfo
ClientData
ClientSocket
+ClockSweep
ClonePtrType
ClosePortalStmt
ClosePtrType
--
2.49.0
Hi,
I took a look at this thread / patches, mostly because it's somewhat
related to the NUMA patch series [1]/messages/by-id/099b9433-2855-4f1b-b421-d078a5d82017@vondra.me. Freelist is one of the things the
NUMA patches tweak, to split it into per-node partitions. So removing
the freelist would affect the NUMA patches, mostly simplifying it by
making the freelist part irrelevant.
On the whole the idea seems reasonable - I'm not against removing the
freelist, assuming Andres is right and the freelist is not very useful
anyway. So I decided to validate this assumption by running a couple of
benchmarks ...
benchmarks
----------
I did two tests:
1) run-pgbench.sh
- read-only pgbench
- scale set to different fraction of shared buffers (50%, 90%, 110%)
- tracks tps
2) run-seqscan.sh
- concurrent evict+seqscans on different tables
- shared buffers set to different fractions of total dataset sizes (50%,
90%, 110%)
- This is similar to the workload Andres suggested in [2]/messages/by-id/2avffd4n6e5lu7kbuvpjclw3dzcqsw4qctj5ch4qin5gakk3r3@euyewe6tf3z3, except that
it uses a simple SELECT instead of pg_prewarm. The patches change how
pg_prewarm decides to stop, and I didn't want to be affected by that.
- tracks tps, and average latency for the evict + select
The attached test scripts are "complete" - set up an instance, disable
checksums, set a couple GUCs, etc.
I wanted to see the effect of each patch, so tests were done on master,
and then with patches applied one by one (after fixing the compile
failures in 0001).
I did this on two EPYC machines with 96 cores (lscpu attached), and the
results are virtually the same. Attached are PDFs with results from one
of the machines.
On the whole, the impact of the patches is negligible; it's within 1% in
either direction, and I don't see any particular trend / systemic change
in the behavior (e.g. regressions for some cases). Seems like noise,
which supports the assumption the freelist is not very useful.
The one positive exception is "evict latency" which tracks latency of
pg_buffercache_evict_relation(). That got consistently a little bit
better, which makes sense as it does not need to maintain the freelist.
freelist statistics?
--------------------
There's one caveat, though. I find it tricky to "know" if the workload
actually uses a freelist. Is there a good way to determine if a buffer
was acquired from a freelist, or through the clocksweep?
I'm worried the tests might actually have empty freelists, or maybe the
freelists won't be used very much. In which case removing the freelist
has understandably no impact. But it says nothing about cases getting
buffers from freelists often ...
I can probably think about each workload and deduce if there will be
freelists. For the benchmarks described earlier, I think the situation
is this:
pgbench, shared buffers < 100% -> no freelists, uses clock-sweep
pgbench, shared buffers > 100% -> freelists, but unused (no evictions)
seqscan, shared buffers < 100% -> freelists, but clocksweep too
seqscan, shared buffers > 100% -> freelists, no clocksweep
But maybe I got it wrong for some cases? It'd be good to have a way to
collect some stats for each test, to confirm this and also to quantify
the effects. A test that gets 10% of buffers from a freelist may behave
differently from a test that gets 99% buffers from a freelist.
I propose we add a system view showing interesting information about
freelists and buffer allocation, or perhaps extend an existing one (e.g.
pg_stat_bgwriter, which already has buffers_alloc). In the NUMA patches
I simply added a new view into pg_buffercache.
I'm not suggesting this would get committed, at this point I'm more
focused on the development. Whether the view would be useful outside the
development is an open question. Also, if it adds stats about freelists
(for current master), that'd be useless once we remove them.
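Just to sketch the kind of counters I have in mind (illustrative only; the
field names and the "from_freelist" flag are hypothetical, not something any
posted patch contains):

    /* hypothetical additions to the strategy control struct */
    pg_atomic_uint64 freelistAllocs;   /* buffers handed out from the freelist */
    pg_atomic_uint64 sweepAllocs;      /* buffers found via the clock-sweep */

    /* in StrategyGetBuffer(), bump whichever path satisfied the request */
    if (from_freelist)
        pg_atomic_fetch_add_u64(&StrategyControl->freelistAllocs, 1);
    else
        pg_atomic_fetch_add_u64(&StrategyControl->sweepAllocs, 1);

A view (or a pair of columns next to buffers_alloc) reporting both numbers
would show, per benchmark, what fraction of allocations actually touched the
freelist.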
Now, a couple review comments about the individual patches:
0001
----
1) nitpick: adds a blank line at the end of buffer/README
2) gcc complains about missing have_free_buffer prototype, but it's no
longer needed and can be removed
3) freelist fails to compile, because of:
freelist.c: In function ‘have_free_buffer’:
freelist.c:166:51: error: passing argument 1 of ‘pg_atomic_read_u64’
from incompatible pointer type [-Wincompatible-pointer-types]
166 | uint64 hand =
pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
|
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| pg_atomic_uint32 *
That is, it should be accessed using pg_atomic_read_u32.
CI/cfbot does not report this, because it applies all the patches, and
0002 fixes this.
0002
----
1) I think the buffer/README needs a bit more work. It now suddenly
mentions completePasses, without explaining what it's for, and how it
works with nextVictimBuffer. And then it's not mentioned again, so why
even mention it? We don't mention various other fields ...
It's true completePasses is not a new thing, and probably should have
been already explained by the README. Still, it seems a bit confusing.
2) It'd be good if the README explained how we coordinate access to
multiple atomic values (with the spinlock removed), why the current
approach is safe, or how it's made safe. Or at least mention that we
thought about it, and where to look for details (e.g. in a comment for a
particular function).
3) I have no opinion on how "clock-sweep" should be spelled, but if we
want to unify the spelling, it should be done in a separate preparatory
patch, so that it does not mix with the actual changes. It adds quite a
bit of changes.
3) I'm not sure about: Assert(prev_strategy_passes <= strategy_passes);
Either it's unlikely but possible, and then it should be elog(ERROR), or
we consider it impossible (and then assert is fine). Maybe just rephrase
the comment to say we consider the overflow impossible.
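To spell out the alternative, the elog version would look roughly like this
(sketch only, the message text is made up):

    if (prev_strategy_passes > strategy_passes)
        elog(ERROR, "clock-sweep pass count went backwards: %u -> %u",
             prev_strategy_passes, strategy_passes);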
Also, no need to split the passes_delta assignment, I think. It's fine
to subtract before an assert - it'll be bogus/negative, but so what?
It's not an undefined / invalid memory access or anything like that.
4) nitpick: freelist.c has incorrect formatting of "{" in a couple
places, e.g. it should be on a newline in a struct definition, etc.
pgindent would eventually fix this, but it bugs me.
5) I don't like the CLOCK_HAND_POSITION() naming very much, partially
because it's not really clear what "clock hand" is. Intuitively I know
what it's supposed to mean, also because I worked with this code before.
But not everyone has that benefit, and may assume something else.
IMO if nextVictimBuffer is no longer meant to be "buffer", but a counter
that only ever increases, we should call it something different. Say,
"clockSweepCounter"?
Then CLOCK_HAND_POSITION() could be CLOCKSWEEP_NEXT_BUFFER()? And why not
have a separate CLOCKSWEEP_COMPLETE_PASSES() macro too?
6) I don't understand why the comment mentions the BufferDescriptor
array at all? Sure, we may look at the descriptor, but why is this
detail relevant in this place?
Also, shouldn't it say "divide by NBuffers" at the end?
7) CLOCK_HAND_POSITION() should probably have a comment explaining why
it does the masking, etc. And why it's better than simply doing the
modulo on the counter directly. I mean, why is this better than just
doing (counter % NBuffers)? It's going to work on uint64 anyway.
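For reference, the two variants being compared are roughly (illustrative only):

    /* the patch: mask the counter to 32 bits, then take the modulo */
    hand = (uint32) (((counter) & 0xFFFFFFFF) % NBuffers);

    /* the simpler alternative: a 64-bit modulo on the counter directly */
    hand = (uint32) (counter % NBuffers);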
8) There was a discussion about doing the modulo in a better way, but I
don't see any message explaining it clearly enough for me. And then
there were some issues with it, and I guess the idea was abandoned.
I'm asking because my NUMA patches do a modulo in a couple places in the
clock-sweep part too, so if there's a better way, I'd like to know.
9) Why initialize "hand" to UINT64_MAX? Seems unnecessary.
10) nitpick: StrategySyncStart formatting is a bit wrong
0003
----
1) I don't quite understand the purpose of this part. How is this
abstracting anything, compared to what we already have (or what the two
earlier parts did)?
And if it "abstracts the algorithm", what would be the purpose? Do we
expect some alternative algorithms? And would 0003 really make it easier
compared to just changing the code directly? I don't get it.
2) I don't understand why we need to explicitly pass the "clock" to
every ClockSweepCycles/ClockSweepPosition call.
I mean, there's only ever one instance anyway - at least without "my"
NUMA patches. But even with the NUMA patches, the partition is
determined inside those functions, otherwise every caller would have to
repeat that. I think this is unnecessary/inconvenient.
regards
[1]: /messages/by-id/099b9433-2855-4f1b-b421-d078a5d82017@vondra.me
[2]: /messages/by-id/2avffd4n6e5lu7kbuvpjclw3dzcqsw4qctj5ch4qin5gakk3r3@euyewe6tf3z3
--
Tomas Vondra
On Aug 11 2025, at 9:09 am, Tomas Vondra <tomas@vondra.me> wrote:
Hi,
I took a look at this thread / patches, mostly because it's somewhat
related to the NUMA patch series [1]. Freelist is one of the things the
NUMA patches tweak, to split it into per-node partitions. So removing
the freelist would affect the NUMA patches, mostly simplifying it by
making the freelist part irrelevant.
Thank you Tomas for taking the time to review and performance test this
patch set. I greatly appreciate your feedback.
On the whole the idea seems reasonable - I'm not against removing the
freelist, assuming Andres is right and the freelist is not very useful
anyway. So I decided to validate this assumption by running a couple of
benchmarks ...
Amazing, thank you. I'll try to replicate your tests tomorrow to see if
my optimized division and modulo functions do in fact help or not. I
realize that both you and Anders are (rightly) concerned that the
performance impact of IDIV on some CPUs can be excessive.
I'm going to post this new set of patches and then start a separate
email to analyze and respond to the performance tests.
benchmarks
----------
I did two tests:
1) run-pgbench.sh
- read-only pgbench
- scale set to different fraction of shared buffers (50%, 90%, 110%)
- tracks tps
2) run-seqscan.sh
- concurrent evict+seqscans on different tables
- shared buffers set to different fractions of total dataset sizes (50%, 90%, 110%)
- This is similar to the workload Andres suggested in [2], except that it uses
a simple SELECT instead of pg_prewarm. The patches change how pg_prewarm
decides to stop, and I didn't want to be affected by that.
- tracks tps, and average latency for the evict + select
The attached test scripts are "complete" - set up an instance, disable
checksums, set a couple GUCs, etc.
I wanted to see the effect of each patch, so tests were done on master, and
then with patches applied one by one (after fixing the compile failures in 0001).
I did this on two EPYC machines with 96 cores (lscpu attached), and the results
are virtually the same. Attached are PDFs with results from one of the machines.
On the whole, the impact of the patches is negligible; it's within 1% in either
direction, and I don't see any particular trend / systemic change in the
behavior (e.g. regressions for some cases). Seems like noise, which supports
the assumption the freelist is not very useful.
The one positive exception is "evict latency" which tracks latency of
pg_buffercache_evict_relation(). That got consistently a little bit better,
which makes sense as it does not need to maintain the freelist.
freelist statistics?
--------------------
I like this idea too, I'll try to work it into the next set of patches.
There's one caveat, though. I find it tricky to "know" if the workload
actually uses a freelist. Is there a good way to determine if a buffer
was acquired from a freelist, or through the clocksweep?
I'm worried the tests might actually have empty freelists, or maybe the
freelists won't be used very much. In which case removing the freelist has
understandably no impact. But it says nothing about cases getting buffers from
freelists often ...
I can probably think about each workload and deduce if there will be
freelists. For the benchmarks described earlier, I think the situation is this:
pgbench, shared buffers < 100% -> no freelists, uses clock-sweep
pgbench, shared buffers > 100% -> freelists, but unused (no evictions)
seqscan, shared buffers < 100% -> freelists, but clocksweep too
seqscan, shared buffers > 100% -> freelists, no clocksweep
But maybe I got it wrong for some cases? It'd be good to have a way to collect
some stats for each test, to confirm this and also to quantify the effects. A
test that gets 10% of buffers from a freelist may behave differently from a
test that gets 99% buffers from a freelist.
I propose we add a system view showing interesting information about freelists
and buffer allocation, or perhaps extend an existing one (e.g.
pg_stat_bgwriter, which already has buffers_alloc). In the NUMA patches I
simply added a new view into pg_buffercache.
I'm not suggesting this would get committed, at this point I'm more focused on
the development. Whether the view would be useful outside the development is an
open question. Also, if it adds stats about freelists (for current master),
that'd be useless once we remove them.
Now, a couple review comments about the individual patches:
Apologies for not testing each commit separately.
0001
----
1) nitpick: adds a blank line at the end of buffer/README
Removed.
2) gcc complains about missing have_free_buffer prototype, but it's no
longer needed and can be removed
Removed.
3) freelist fails to compile, because of:
freelist.c: In function ‘have_free_buffer’:
freelist.c:166:51: error: passing argument 1 of ‘pg_atomic_read_u64’
from incompatible pointer type [-Wincompatible-pointer-types]
166 | uint64 hand =
pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
|
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
|
| pg_atomic_uint32 *
That is, it should be accessed using pg_atomic_read_u32.
Removed.
CI/cfbot does not report this, because it applies all the patches, and
0002 fixes this.
That makes sense, I'll be more careful with each patch in the future.
0002
----
1) I think the buffer/README needs a bit more work. It now suddenly
mentions completePasses, without explaining what it's for, and how it
works with nextVictimBuffer. And then it's not mentioned again, so why
even mention it? We don't mention various other fields ...
I've taken a stab at expanding on the new approach, happy to continue to
refine the wording.
It's true completePasses is not a new thing, and probably should have
been already explained by the README. Still, it seems a bit confusing.
I explain the idea of "complete passes" and how it is now encoded in the
clockSweepCounter (spoiler, that's the new name for nextVictimBuffer).
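For example (numbers picked only to illustrate the encoding), with
NBuffers = 16384:

    uint64 counter = 40000;            /* clockSweepCounter */
    uint32 hand    = counter % 16384;  /* = 7232, the buffer under the hand */
    uint32 passes  = counter / 16384;  /* = 2 complete passes so far */

so a single atomic fetch-add advances both pieces of information at once.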
2) It'd be good if the README explained how we coordinate access to
multiple atomic values (with the spinlock removed), why the current
approach is safe, or how it's made safe. Or at least mention that we
thought about it, and where to look for details (e.g. in a comment for a
particular function).
I think I've covered this base as well, let me know if it is still
lacking important content.
3) I have no opinion on how "clock-sweep" should be spelled, but if we
want to unify the spelling, it should be done in a separate preparatory
patch, so that it does not mix with the actual changes. It adds quite a
bit of changes.
Absolutely, I should have made that the first patch in the set. Now it is.
3) I'm not sure about: Assert(prev_strategy_passes <= strategy_passes);
Either it's unlikely but possible, and then it should be elog(ERROR), or
we consider it impossible (and then assert is fine). Maybe just rephrase
the comment to say we consider the overflow impossible.
I've rephrased it to state "overflow is impossible" but memorialized it in
an Assert for good measure.
Also, no need to split the passes_delta assignment, I think. It's fine
to subtract before an assert - it'll be bogus/negative, but so what?
It's not an undefined / invalid memory access or anything like that.
Okay, works for me.
4) nitpick: freelist.c has incorrect formatting of "{" in a couple
places, e.g. it should be on a newline in a struct definition, etc.
pgindent would eventually fix this, but it bugs me.
Yep, guilty as charged. I forgot that last pgindent run before git
format-patch. It bugs me too, fixed.
5) I don't like the CLOCK_HAND_POSITION() naming very much, partially
because it's not really clear what "clock hand" is. Intuitively I know
what it's supposed to mean, also because I worked with this code before.
But not everyone has that benefit, and may assume something else.
Now that nextVictimBuffer is renamed to clockSweepCounter, the new
macro names in 0003 are CLOCKSWEEP_HAND() and CLOCKSWEEP_PASSES(). In
0004 those turn into functions ClockSweepHand() and ClockSweepPasses().
IMO if nextVictimBuffer is no longer meant to be "buffer", but a counter
that only ever increases, we should call it something different. Say,
"clockSweepCounter"?
Great idea and a good suggested name, renamed to clockSweepCounter.
Then CLOCK_HAND_POSITION() could be CLOCKSWEEP_NEXT_BUFFER()? And why not
have a separate CLOCKSWEEP_COMPLETE_PASSES() macro too?
Agreed, done.
6) I don't understand why the comment mentions the BufferDescriptor
array at all? Sure, we may look at the descriptor, but why is this
detail relevant in this place?
It's not, removed.
Also, shouldn't it say "divide by NBuffers" at the end?
Also cleaned up.
7) CLOCK_HAND_POSITION() should probably have a comment explaining why
it does the masking, etc. And why it's better than simply doing the
modulo on the counter directly. I mean, why is this better than just
doing (counter % NBuffers)? It's going to work on uint64 anyway.
Took a swing at comments, not thrilled but better than nothing.
8) There was a discussion about doing the modulo in a better way, but I
don't see any message explaining it clearly enough for me. And then
there were some issues with it, and I guess the idea was abandoned.
Well, I made an attempt at implementing the algorithm in the
Granlund-Montgomery paper "Division by invariant Integers using
Multiplication"[1]https://gmplib.org/~tege/divcnst-pldi94.pdf and got it to work eventually (attached as
v13-0004b.txt to avoid being recognized as part of this patch set,
rename to
v13-0004-Optimize-modulo-and-division-used-in-clock-sweep.patch). In my
approach I created logic that had two paths for division and modulo, one
when NBuffers was a power-of-2 and one when it wasn't. I didn't mind
having a branch in the inline'ed functions because it's likely that the
branch predictor will get it right 100% of the time.
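Roughly what I mean by the two paths, as a simplified sketch (this is not the
code in the attached patch; the struct and helper names are made up):

    typedef struct ClockDivisor
    {
        uint32      divisor;    /* NBuffers */
        bool        is_pow2;    /* is divisor a power of two? */
        uint32      mask;       /* divisor - 1, valid only when is_pow2 */
    } ClockDivisor;

    static inline uint32
    clock_mod(const ClockDivisor *d, uint64 counter)
    {
        if (d->is_pow2)
            return (uint32) (counter & d->mask);    /* cheap bitwise AND */
        return (uint32) (counter % d->divisor);     /* generic fallback */
    }

In the real patch the generic path is where the multiply/shift reduction from
the paper would replace the hardware modulo, which is where the cycle savings
come from.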
Then I found an implementation of the paper called "fastdiv"[2]https://github.com/jmtilli/fastdiv that
doesn't branch but has a few more instructions. It is released under
the MIT License. I'm not sure what the convention here in PostgreSQL
code is when including some code inspired by another project (although
now is a good time for me to search the wiki and find out). Both my
implementation and the fastdiv implementation work and should be
considerably faster on modern CPUs. I've attached the fastdiv version,
but I can revert to mine if that's easier or more in line with community
standards. I'd estimate the fastdiv version to run in about ~12-18
cycles (vs 26-90 for hardware division or modulo) on most CPU architectures.
I'm asking because my NUMA patches do a modulo in a couple places in the
clock-sweep part too, so if there's a better way, I'd like to know.
Tell me your thoughts on that algorithm, I think it does the job.
9) Why initialize "hand" to UINT64_MAX? Seems unnecessary.
Yep, removed.
10) nitpick: StrategySyncStart formatting is a bit wrong
Yes, fixed.
0003
----
1) I don't quite understand the purpose of this part. How is this
abstracting anything, compared to what we already have (or what the two
earlier parts did)?
And if it "abstracts the algorithm", what would be the purpose? Do we expect
some alternative algorithms? And would 0003 really make it easier compared to
just changing the code directly? I don't get it.
2) I don't understand why we need to explicitly pass the "clock" to every
ClockSweepCycles/ClockSweepPosition call.
I mean, there's only ever one instance anyway - at least without "my" NUMA
patches. But even with the NUMA patches, the partition is determined inside
those functions, otherwise every caller would have to repeat that. I think
this is unnecessary/inconvenient.
Okay, I understand that. At one point I had implementations for
ClockPro and SIEVE but you're right.
regards
[1] /messages/by-id/099b9433-2855-4f1b-b421-d078a5d82017@vondra.me
[2] /messages/by-id/2avffd4n6e5lu7kbuvpjclw3dzcqsw4qctj5ch4qin5gakk3r3@euyewe6tf3z3
--
Tomas Vondra
Thanks again for digging into the code. I took on this project to help
your NUMA work, but then once I got started it seemed like a good idea
even if that doesn't land, as this should remove contention across cores
and hopefully be generally faster.
-greg
[1]: https://gmplib.org/~tege/divcnst-pldi94.pdf
[2]: https://github.com/jmtilli/fastdiv
Attachments:
v13-0001-Use-consistent-naming-of-the-clock-sweep-algorit.patch
From be17b34e4be95174be837637041edc6fc4d55557 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Tue, 12 Aug 2025 06:57:12 -0400
Subject: [PATCH v13 1/4] Use consistent naming of the clock-sweep algorithm.
Minor edits to comments only.
---
src/backend/storage/buffer/README | 4 ++--
src/backend/storage/buffer/bufmgr.c | 8 ++++----
src/backend/storage/buffer/freelist.c | 10 +++++-----
src/backend/storage/buffer/localbuf.c | 2 +-
src/include/storage/buf_internals.h | 4 ++--
5 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a182fcd660c..4b13da5d7ad 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -211,9 +211,9 @@ Buffer Ring Replacement Strategy
When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
-again soon, so instead of running the normal clock sweep algorithm and
+again soon, so instead of running the normal clock-sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
-using the normal clock sweep algorithm and those buffers are reused for the
+using the normal clock-sweep algorithm and those buffers are reused for the
whole scan. This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fd7e21d96d3..396b053b3fa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3608,7 +3608,7 @@ BufferSync(int flags)
* This is called periodically by the background writer process.
*
* Returns true if it's appropriate for the bgwriter process to go into
- * low-power hibernation mode. (This happens if the strategy clock sweep
+ * low-power hibernation mode. (This happens if the strategy clock-sweep
* has been "lapped" and no buffer allocations have occurred recently,
* or if the bgwriter has been effectively disabled by setting
* bgwriter_lru_maxpages to 0.)
@@ -3658,7 +3658,7 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the freelist clock sweep currently is, and how many
+ * Find out where the freelist clock-sweep currently is, and how many
* buffer allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
@@ -3679,8 +3679,8 @@ BgBufferSync(WritebackContext *wb_context)
/*
* Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
* buffers we could scan before we'd catch up with it and "lap" it. Note:
* weird-looking coding of xxx_passes comparisons are to avoid bogus
* behavior when the passes counts wrap around.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..cd94a7d8a7b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -33,7 +33,7 @@ typedef struct
slock_t buffer_strategy_lock;
/*
- * Clock sweep hand: index of next buffer to consider grabbing. Note that
+ * clock-sweep hand: index of next buffer to consider grabbing. Note that
* this isn't a concrete buffer - we only ever increase the value. So, to
* get an actual buffer, it needs to be used modulo NBuffers.
*/
@@ -51,7 +51,7 @@ typedef struct
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
*/
- uint32 completePasses; /* Complete cycles of the clock sweep */
+ uint32 completePasses; /* Complete cycles of the clock-sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
@@ -311,7 +311,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /* Nothing on the freelist, so run the "clock-sweep" algorithm */
trycounter = NBuffers;
for (;;)
{
@@ -511,7 +511,7 @@ StrategyInitialize(bool init)
StrategyControl->firstFreeBuffer = 0;
StrategyControl->lastFreeBuffer = NBuffers - 1;
- /* Initialize the clock sweep pointer */
+ /* Initialize the clock-sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
/* Clear statistics */
@@ -759,7 +759,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*
* If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
* since our own previous usage of the ring element would have left it
- * there, but it might've been decremented by clock sweep since then). A
+ * there, but it might've been decremented by clock-sweep since then). A
* higher usage_count indicates someone else has touched the buffer, so we
* shouldn't re-use it.
*/
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3c0d20f4659..04fef13409b 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -229,7 +229,7 @@ GetLocalVictimBuffer(void)
ResourceOwnerEnlarge(CurrentResourceOwner);
/*
- * Need to get a new buffer. We use a clock sweep algorithm (essentially
+ * Need to get a new buffer. We use a clock-sweep algorithm (essentially
* the same as what freelist.c does now...)
*/
trycounter = NLocBuffer;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..3a210c710f6 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -80,8 +80,8 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
* The maximum allowed value of usage_count represents a tradeoff between
* accuracy and speed of the clock-sweep buffer management algorithm. A
* large value (comparable to NBuffers) would approximate LRU semantics.
- * But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of
- * clock sweeps to find a free buffer, so in practice we don't want the
+ * But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of the
+ * clock-sweep hand to find a free buffer, so in practice we don't want the
* value to be very large.
*/
#define BM_MAX_USAGE_COUNT 5
--
2.49.0
v13-0003-Remove-the-need-for-a-buffer_strategy_lock.patch
From 7d3012cbb1aaddfa112966f641aa129e88a42600 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Tue, 12 Aug 2025 09:33:38 -0400
Subject: [PATCH v13 3/4] Remove the need for a buffer_strategy_lock
Combine nextVictimBuffer and completePasses into a single uint64 called
clockSweepCounter so as to eliminate the need for the
buffer_strategy_lock that synchronizes their progression. Increment this
counter atomically by 1 at each tick. The hand's location is the
counter modulo NBuffers, and the number of complete passes is the counter
divided by NBuffers. Overflow of the clockSweepCounter would require ~10
years of continuous operation at ~59 billion ticks per second, which is so
unlikely that we consider it impossible.
---
src/backend/storage/buffer/README | 70 +++++++------
src/backend/storage/buffer/bufmgr.c | 8 ++
src/backend/storage/buffer/freelist.c | 135 ++++++++------------------
src/include/storage/buf_internals.h | 2 +-
4 files changed, 86 insertions(+), 129 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 119f31b5d65..52d5b2c4069 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -127,11 +127,10 @@ bits of the tag's hash value. The rules stated above apply to each partition
independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
-* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that select buffers for replacement. A spinlock is
-used here rather than a lightweight lock for efficiency; no other locks of any
-sort should be acquired while buffer_strategy_lock is held. This is essential
-to allow buffer replacement to happen in multiple backends with reasonable
+* Operations that select buffers for replacement don't require a lock, but
+rather use atomic operations to ensure coordination across backends when
+accessing members of the BufferStrategyControl datastructure. This allows
+buffer replacement to happen in multiple backends with reasonable
concurrency.
* Each buffer header contains a spinlock that must be taken when examining
@@ -158,30 +157,39 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-To choose a victim buffer to recycle we use a simple clock-sweep algorithm. It
-works like this:
+To choose a victim buffer to recycle we use a simple clock-sweep algorithm;
+done carefully, this can be accomplished without the need to take a system-wide
+lock. The strategy works like this:
+
+The "clock hand" is a buffer index that conceptually moves circularly through
+all the available buffers in the range of 0 to NBuffers-1. Each time the hand
+returns to 0 is a "complete pass" of the buffers managed by the clock. The
+hand progresses one "tick" at a time around the clock identifying a potential
+"victim". These two values, the hand's position and the number of complete
+passes, must be consistent across backends.
+
+In this implementation the hand's position on the clock is determined by the
+value of clockSweepCounter modulo the value of NBuffers. The value of
+clockSweepCounter, a uint64, is atomically incremented by 1 at each tick of the
+clock. The number of complete passes is therefor the clockSweepCounter divided
+by NBuffers. This reduces the coordination across backends to either an atomic
+read or an atomic fetch add.
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
buffer header spinlock, which would have to be taken anyway to increment the
buffer reference count, so it's nearly free.)
-The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
-through all the available buffers. nextVictimBuffer is protected by the
-buffer_strategy_lock.
-
The algorithm for a process that needs to obtain a victim buffer is:
-1. Obtain buffer_strategy_lock.
-
-2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
-nextVictimBuffer for next time. Release buffer_strategy_lock.
+1. Move around the clock one tick. Atomically read and advance
+clockSweepCounter by 1 and return its previous value modulo NBuffers.
-3. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used. Decrement its usage count (if nonzero), reacquire
-buffer_strategy_lock, and return to step 3 to examine the next buffer.
+2. If the selected buffer is pinned or has a nonzero usage count, it cannot be
+used. Decrement its usage count (if nonzero), return to step 1 to examine the
+next buffer.
-4. Pin the selected buffer, and return.
+3. Pin the selected buffer, and return.
(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -237,19 +245,17 @@ Background Writer's Processing
------------------------------
The background writer is designed to write out pages that are likely to be
-recycled soon, thereby offloading the writing work from active backends.
-To do this, it scans forward circularly from the current position of
-nextVictimBuffer (which it does not change!), looking for buffers that are
-dirty and not pinned nor marked with a positive usage count. It pins,
-writes, and releases any such buffer.
-
-If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take buffer_strategy_lock in order to look
-for buffers to write; it needs only to spinlock each buffer header for long
-enough to check the dirtybit. Even without that assumption, the writer
-only needs to take the lock long enough to read the variable value, not
-while scanning the buffers. (This is a very substantial improvement in
-the contention cost of the writer compared to PG 8.0.)
+recycled soon, thereby offloading the writing work from active backends. To do
+this, it scans forward circularly from the current position of the clock-sweep
+hand (read atomically and not modified), looking for buffers that are dirty and
+not pinned nor marked with a positive usage count. It pins, writes, and
+releases any such buffer.
+
+The writer only needs to spinlock each buffer header for long enough to check
+the dirtybit; it does not need to hold any lock while scanning the buffers.
+(This is a very substantial improvement in the contention cost of the writer
+compared to PG 8.0.)
The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 719a5bb6f97..d0c14158115 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3666,6 +3666,14 @@ BgBufferSync(WritebackContext *wb_context)
{
int32 passes_delta = strategy_passes - prev_strategy_passes;
+ /*
+ * It would take ~10 years of continuous operation at ~59 billion
+ * clock ticks per second to overflow the uint64 value of
+ * clockSweepCounter. We consider this impossible and memorialize that
+ * decision with this assert.
+ */
+ Assert(prev_strategy_passes <= strategy_passes);
+
strategy_delta = strategy_buf_id - prev_strategy_buf_id;
strategy_delta += (long) passes_delta * NBuffers;
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d59a92bd1a..7d68f2227b3 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,21 +29,18 @@
*/
typedef struct
{
- /* Spinlock: protects the values below */
- slock_t buffer_strategy_lock;
-
/*
- * clock-sweep hand: index of next buffer to consider grabbing. Note that
- * this isn't a concrete buffer - we only ever increase the value. So, to
- * get an actual buffer, it needs to be used modulo NBuffers.
+ * The clock-sweep counter is atomically updated by 1 at every tick. Use
+ * the macro CLOCKSWEEP_HAND() to find the location of the hand on the
+ * clock. Use CLOCKSWEEP_PASSES() to calculate the number of times the
+ * clock-sweep hand has made a complete pass around the clock.
*/
- pg_atomic_uint32 nextVictimBuffer;
+ pg_atomic_uint64 clockSweepCounter;
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
*/
- uint32 completePasses; /* Complete cycles of the clock-sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
@@ -83,76 +80,47 @@ typedef struct BufferAccessStrategyData
Buffer buffers[FLEXIBLE_ARRAY_MEMBER];
} BufferAccessStrategyData;
-
/* Prototypes for internal functions */
static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
uint32 *buf_state);
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+/*
+ * The clock-sweep counter is a uint64 but the clock hand can never be larger
+ * than a uint32. Enforce that contract uniformly using this macro.
+ */
+#define CLOCKSWEEP_HAND(counter) \
+	((uint32) (counter)) % NBuffers
+
+/*
+ * The number of times the clock hand has made a complete pass around the clock
+ * visiting all the available buffers is the counter divided by NBuffers.
+ */
+#define CLOCKSWEEP_PASSES(counter) \
+	(uint32) ((counter) / NBuffers)
+
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
*
* Move the clock hand one buffer ahead of its current position and return the
- * id of the buffer now under the hand.
+ * index of the buffer previously under the hand.
*/
static inline uint32
ClockSweepTick(void)
{
- uint32 victim;
+ uint64 counter;
+ uint32 hand;
/*
* Atomically move hand ahead one buffer - if there's several processes
* doing this, this can lead to buffers being returned slightly out of
* apparent order.
*/
- victim =
- pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
-
- if (victim >= NBuffers)
- {
- uint32 originalVictim = victim;
+ counter = pg_atomic_fetch_add_u64(&StrategyControl->clockSweepCounter, 1);
- /* always wrap what we look up in BufferDescriptors */
- victim = victim % NBuffers;
-
- /*
- * If we're the one that just caused a wraparound, force
- * completePasses to be incremented while holding the spinlock. We
- * need the spinlock so StrategySyncStart() can return a consistent
- * value consisting of nextVictimBuffer and completePasses.
- */
- if (victim == 0)
- {
- uint32 expected;
- uint32 wrapped;
- bool success = false;
-
- expected = originalVictim + 1;
-
- while (!success)
- {
- /*
- * Acquire the spinlock while increasing completePasses. That
- * allows other readers to read nextVictimBuffer and
- * completePasses in a consistent manner which is required for
- * StrategySyncStart(). In theory delaying the increment
- * could lead to an overflow of nextVictimBuffers, but that's
- * highly unlikely and wouldn't be particularly harmful.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ hand = CLOCKSWEEP_HAND(counter);
+ Assert(hand < NBuffers);
- wrapped = expected % NBuffers;
-
- success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
- &expected, wrapped);
- if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- }
- }
- }
- return victim;
+ return hand;
}
/*
@@ -177,10 +145,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*from_ring = false;
- /*
- * If given a strategy object, see whether it can select a buffer. We
- * assume strategy objects don't need buffer_strategy_lock.
- */
+ /* If given a strategy object, see whether it can select a buffer */
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
@@ -275,37 +240,25 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
* The result is the buffer index of the best buffer to sync first.
* BgBufferSync() will proceed circularly around the buffer array from there.
*
- * In addition, we return the completed-pass count (which is effectively
- * the higher-order bits of nextVictimBuffer) and the count of recent buffer
- * allocs if non-NULL pointers are passed. The alloc count is reset after
- * being read.
+ * In addition, we return the completed-pass count and the count of recent
+ * buffer allocs if non-NULL pointers are passed. The alloc count is reset
+ * after being read.
*/
-int
+uint32
StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
{
- uint32 nextVictimBuffer;
- int result;
+ uint64 counter;
+ uint32 result;
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
- nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
- result = nextVictimBuffer % NBuffers;
+ counter = pg_atomic_read_u64(&StrategyControl->clockSweepCounter);
+ result = CLOCKSWEEP_HAND(counter);
if (complete_passes)
- {
- *complete_passes = StrategyControl->completePasses;
-
- /*
- * Additionally add the number of wraparounds that happened before
- * completePasses could be incremented. C.f. ClockSweepTick().
- */
- *complete_passes += nextVictimBuffer / NBuffers;
- }
+ *complete_passes = CLOCKSWEEP_PASSES(counter);
if (num_buf_alloc)
- {
*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
- }
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
return result;
}
@@ -320,14 +273,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
void
StrategyNotifyBgWriter(int bgwprocno)
{
- /*
- * We acquire buffer_strategy_lock just to ensure that the store appears
- * atomic to StrategyGetBuffer. The bgwriter should call this rather
- * infrequently, so there's no performance penalty from being safe.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
StrategyControl->bgwprocno = bgwprocno;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
@@ -392,13 +338,10 @@ StrategyInitialize(bool init)
*/
Assert(init);
- SpinLockInit(&StrategyControl->buffer_strategy_lock);
-
- /* Initialize the clock-sweep pointer */
- pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+ /* Initialize combined clock-sweep pointer/complete passes counter */
+ pg_atomic_init_u64(&StrategyControl->clockSweepCounter, 0);
/* Clear statistics */
- StrategyControl->completePasses = 0;
pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
/* No pending notification */
@@ -714,9 +657,9 @@ IOContextForStrategy(BufferAccessStrategy strategy)
* When a nondefault strategy is used, the buffer manager calls this function
* when it turns out that the buffer selected by StrategyGetBuffer needs to
* be written out and doing so would require flushing WAL too. This gives us
- * a chance to choose a different victim.
+ * a chance to choose a different buffer.
*
- * Returns true if buffer manager should ask for a new victim, and false
+ * Returns true if buffer manager should ask for a new buffer, and false
* if this buffer should be written and re-used.
*/
bool
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9fcc94ef02d..b6ff361f2dd 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -439,7 +439,7 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
BufferDesc *buf, bool from_ring);
-extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
--
2.49.0
v13-0004-Optimize-modulo-and-division-used-in-clock-sweep.patch (application/octet-stream)
From d5fed9a73fac8579c87952d4551adf5bdfeba8c4 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Tue, 12 Aug 2025 10:58:52 -0400
Subject: [PATCH v13 4/4] Optimize modulo and division used in clock-sweep
algorithm
Improve the performance of the buffer manager by replacing the modulo
and division operations with a technique described in the paper
"Division by Invariant Integers using Multiplication" [1]. Our
implementation is inspired by the MIT Licensed "fastdiv" [2]. This
algorithm provides accurate division and modulo in constant time that is
pipeline and ALU friendly and estimated to take ~12-18 cycles (vs
26-90 for hardware division). Because our divisor (NBuffers) is fixed
at startup, we need only calculate the constant used by it once.
[1] https://gmplib.org/~tege/divcnst-pldi94.pdf
[2] https://github.com/jmtilli/fastdiv
---
src/backend/storage/buffer/freelist.c | 106 ++++++++++++++++++++++++--
1 file changed, 98 insertions(+), 8 deletions(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d68f2227b3..96ae21fb152 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -31,12 +31,28 @@ typedef struct
{
/*
* The clock-sweep counter is atomically updated by 1 at every tick. Use
- * the macro CLOCKSWEEP_HAND() to find the location of the hand on the
- * clock. Use CLOCKSWEEP_PASSES() to calculate the number of times the
+ * the function ClockSweepHand() to find the location of the hand on the
+ * clock. Use ClockSweepPasses() to calculate the number of times the
* clock-sweep hand has made a complete pass around the clock.
*/
pg_atomic_uint64 clockSweepCounter;
+ /*
+ * Division and modulo can be expensive to calculate repeatedly. Given
+	 * that the buffer manager is a very hot code path, we implement a more
+	 * efficient method based on "Division by Invariant Integers using
+ * Multiplication" (https://gmplib.org/~tege/divcnst-pldi94.pdf) by
+ * Granlund-Montgomery. Our implementation below was inspired by the MIT
+ * Licensed "fastdiv" (https://github.com/jmtilli/fastdiv).
+ */
+ struct
+ {
+ uint32 mul;
+ uint32 mod;
+ uint8 shift1:1;
+ uint8 shift2:7;
+ } md;
+
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -86,17 +102,75 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+static inline uint32
+InvariantDivision(uint64 n)
+{
+ /* Compute quotient using multiplication */
+ uint64 product = n * StrategyControl->md.mul;
+ uint32 quotient = (uint32) (product >> 32);
+
+ /*
+ * The invariant multiplication gives us an approximation that may be off
+ * by 1.
+ */
+ n -= quotient;
+ n >>= StrategyControl->md.shift1;
+ n += quotient;
+ n >>= StrategyControl->md.shift2;
+
+ return n;
+}
+
+static inline uint32
+InvariantModulo(uint64 n)
+{
+ /* Compute quotient using multiplication */
+ uint64 product = n * StrategyControl->md.mul;
+ uint32 quotient = (uint32) (product >> 32);
+ uint32 on = n;
+
+ /*
+ * The invariant multiplication gives us an approximation that may be off
+ * by 1.
+ */
+ n -= quotient;
+ n >>= StrategyControl->md.shift1;
+ n += quotient;
+ n >>= StrategyControl->md.shift2;
+
+ quotient = StrategyControl->md.mod * n;
+ return on - quotient;
+}
+
/*
* The clock-sweep counter is a uint64 but the clock hand can never be larger
- * than a uint32. Enforce that contract uniformly using this macro.
+ * than a uint32.
*/
-#define CLOCKSWEEP_HAND(counter) \
-	((uint32) (counter)) % NBuffers
+static inline uint32
+ClockSweepHand(uint64 counter)
+{
+ uint32 result = InvariantModulo(counter);
+
+ Assert(result < NBuffers);
+ Assert(result == (uint32) counter % NBuffers);
+
+ return result;
+}
/*
* The number of times the clock hand has made a complete pass around the clock
* visiting all the available buffers is the counter divided by NBuffers.
*/
-#define CLOCKSWEEP_PASSES(counter) \
-	(uint32) ((counter) / NBuffers)
+static inline uint32
+ClockSweepPasses(uint64 counter)
+{
+ uint32 result = InvariantDivision(counter);
+
+ /* Verify our result matches standard division */
+ Assert(result == (uint32) (counter / NBuffers));
+
+ return result;
+}
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -117,7 +191,7 @@ ClockSweepTick(void)
*/
counter = pg_atomic_fetch_add_u64(&StrategyControl->clockSweepCounter, 1);
- hand = CLOCKSWEEP_HAND(counter);
+ hand = ClockSweepHand(counter);
Assert(hand < NBuffers);
return hand;
@@ -251,10 +325,10 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
uint32 result;
counter = pg_atomic_read_u64(&StrategyControl->clockSweepCounter);
- result = CLOCKSWEEP_HAND(counter);
+ result = ClockSweepHand(counter);
if (complete_passes)
- *complete_passes = CLOCKSWEEP_PASSES(counter);
+ *complete_passes = ClockSweepPasses(counter);
if (num_buf_alloc)
*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
@@ -333,11 +407,27 @@ StrategyInitialize(bool init)
if (!found)
{
+ uint8 shift2 = 0;
+ uint32 divisor = NBuffers;
+ uint8 is_pow2 = (divisor & (divisor - 1)) == 0 ? 0 : 1;
+
/*
* Only done once, usually in postmaster
*/
Assert(init);
+ /* Calculate the constants used for speeding up division and modulo */
+ Assert(NBuffers > 0 && NBuffers < (1U << 31));
+
+ /* shift2 = ilog(NBuffers) */
+ for (uint32 n = divisor; n >>= 1;)
+ shift2++;
+
+ StrategyControl->md.shift1 = is_pow2;
+ StrategyControl->md.shift2 = shift2;
+ StrategyControl->md.mod = NBuffers;
+ StrategyControl->md.mul = (1ULL << (32 + is_pow2 + shift2)) / NBuffers + 1;
+
/* Initialize combined clock-sweep pointer/complete passes counter */
pg_atomic_init_u64(&StrategyControl->clockSweepCounter, 0);
--
2.49.0
v13-0004b.txt (text/plain)
From 4fb02b26be463b9686bcef4f7bc2caeba2d48220 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Tue, 12 Aug 2025 17:20:55 -0400
Subject: [PATCH v13 4/4] Optimize modulo and division used in clock-sweep
algorithm
Improve the performance of the buffer manager by replacing the modulo
and division operations with more pipeline friendly implementations.
When the size of the clock (NBuffers) is a power-of-two we can simply
bitwise AND with a pre-computed mask (NBuffers - 1) to get the modulo (4
instructions, ~3-4 cycles) or bitshift by log2(NBuffers) for division.
When it isn't we can replace modulo and division using a 64-bit
multiplication by the inverse of NBuffers and a right shift as described
in the paper "Division by Invariant Integers using Multiplication" (4
instructions, ~8-12 cycles) and then do a bit of accounting for the
remainder. In either case the branch prediction should be nearly 100%
given that NBuffers never changes at runtime. In comparison a modulo
operation translates into IDIV and the code would require ~26-90 cycles.
Switching to these methods should use common bitshift and ALU operations
that don't block the pipeline and have better instruction level
parallelism.
[1] https://gmplib.org/~tege/divcnst-pldi94.pdf
---
src/backend/storage/buffer/freelist.c | 114 ++++++++++++++++++++++++--
1 file changed, 107 insertions(+), 7 deletions(-)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d68f2227b3..65f3a6eb5a5 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,8 @@
*/
#include "postgres.h"
+#include <math.h>
+
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/buf_internals.h"
@@ -31,12 +33,24 @@ typedef struct
{
/*
* The clock-sweep counter is atomically updated by 1 at every tick. Use
- * the macro CLOCKSWEEP_HAND() to find the location of the hand on the
- * clock. Use CLOCKSWEEP_PASSES() to calculate the number of times the
+ * the function ClockSweepHand() to find the location of the hand on the
+ * clock. Use ClockSweepPasses() to calculate the number of times the
* clock-sweep hand has made a complete pass around the clock.
*/
pg_atomic_uint64 clockSweepCounter;
+ /*
+ * Modulo can be expensive to calculate repeatedly, so we implement two
+	 * strategies to avoid it.  When NBuffers is a power-of-2 we can replace
+	 * modulo with a bitmask (and division with a shift); when it is not, we
+	 * use a more pipeline friendly modulo based on "Division by Invariant Integers using
+ * Multiplication" (https://gmplib.org/~tege/divcnst-pldi94.pdf).
+ */
+ uint64 mask;
+ uint64 multiplier;
+ uint32 shift;
+ bool pow2;
+
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -86,17 +100,81 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
+static inline uint32
+InvariantDivision(uint64 n)
+{
+ uint32 divisor = NBuffers;
+
+ /* Compute quotient using multiplication */
+ uint64 product = n * StrategyControl->multiplier;
+ uint32 quotient = (uint32) (product >> 32);
+
+ /*
+ * The invariant multiplication gives us an approximation that may be off
+ * by 1. Check if we need to adjust upward.
+ */
+ uint32 remainder = n - quotient * divisor;
+
+ if (remainder >= NBuffers)
+ quotient++;
+
+ return quotient;
+}
+
+static inline uint32
+InvariantModulo(uint64 n)
+{
+ uint32 quotient = (uint32) ((n * StrategyControl->multiplier) >> 32);
+ uint32 remainder = (uint32) (n - (uint64) quotient * NBuffers);
+
+ if (remainder >= NBuffers)
+ remainder -= NBuffers;
+
+ return remainder;
+}
+
/*
* The clock-sweep counter is a uint64 but the clock hand can never be larger
* than a uint32. Enforce that contract uniformly using this macro.
*/
-#define CLOCKSWEEP_HAND(counter) \
-	((uint32) (counter)) % NBuffers
+static inline uint32
+ClockSweepHand(uint64 counter)
+{
+ uint32 result;
+
+ if (StrategyControl->pow2)
+ /* Power of 2: use mask */
+ result = counter & StrategyControl->mask;
+ else
+ /* Non-power of 2: use magic modulo */
+ result = InvariantModulo(counter);
+
+ Assert(result < NBuffers);
+ Assert(result == (uint32) (counter % NBuffers));
+
+ return result;
+}
/*
* The number of times the clock hand has made a complete pass around the clock
* visiting all the available buffers is the counter divided by NBuffers.
*/
-#define CLOCKSWEEP_PASSES(counter) \
-	(uint32) ((counter) / NBuffers)
+static inline uint32
+ClockSweepPasses(uint64 counter)
+{
+ uint32 result;
+
+ if (StrategyControl->pow2)
+ /* Power of 2: use shift */
+ result = counter >> StrategyControl->shift;
+ else
+ /* Non-power of 2: use magic modulo */
+ result = InvariantDivision(counter);
+
+ Assert(result == (uint32) (counter / NBuffers));
+
+ return result;
+}
/*
* ClockSweepTick - Helper routine for StrategyGetBuffer()
@@ -117,7 +195,7 @@ ClockSweepTick(void)
*/
counter = pg_atomic_fetch_add_u64(&StrategyControl->clockSweepCounter, 1);
- hand = CLOCKSWEEP_HAND(counter);
+ hand = ClockSweepHand(counter);
Assert(hand < NBuffers);
return hand;
@@ -251,10 +329,10 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
uint32 result;
counter = pg_atomic_read_u64(&StrategyControl->clockSweepCounter);
- result = CLOCKSWEEP_HAND(counter);
+ result = ClockSweepHand(counter);
if (complete_passes)
- *complete_passes = CLOCKSWEEP_PASSES(counter);
+ *complete_passes = ClockSweepPasses(counter);
if (num_buf_alloc)
*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
@@ -341,6 +419,27 @@ StrategyInitialize(bool init)
/* Initialize combined clock-sweep pointer/complete passes counter */
pg_atomic_init_u64(&StrategyControl->clockSweepCounter, 0);
+ if ((NBuffers & (NBuffers - 1)) == 0)
+ {
+ /* NBuffers is a power-of-2. */
+ StrategyControl->pow2 = true;
+ StrategyControl->mask = NBuffers - 1;
+ StrategyControl->shift = log2(NBuffers);
+ StrategyControl->multiplier = 0;
+ }
+ else
+ {
+ StrategyControl->pow2 = false;
+ StrategyControl->mask = 0;
+ StrategyControl->shift = 0;
+
+ /*
+ * Calculate the invariant constant for later using ceil()
+ * division.
+ */
+ StrategyControl->multiplier = ((1ULL << 32) + NBuffers - 1) / NBuffers;
+ }
+
/* Clear statistics */
pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
--
2.49.0
v13-0002-Eliminate-the-freelist-from-the-buffer-manager-a.patch (application/octet-stream)
From 8c0733e8f0ddddd97e461057fabb254a11d32dd8 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Thu, 10 Jul 2025 14:45:32 -0400
Subject: [PATCH v13 2/4] Eliminate the freelist from the buffer manager and
depend on clock-sweep
This set of changes removes the list of available buffers and instead
simply uses the clock-sweep algorithm to find and return an available
buffer. While on the surface this appears to be removing an
optimization, it is in fact eliminating code that induces overhead in the
form of synchronization that is problematic for multi-core systems.
This also removes the have_free_buffer() function and simply caps the
number of blocks the autoprewarm worker will load at NBuffers.
---
contrib/pg_prewarm/autoprewarm.c | 31 ++++---
src/backend/storage/buffer/README | 40 +++------
src/backend/storage/buffer/buf_init.c | 9 --
src/backend/storage/buffer/bufmgr.c | 29 +------
src/backend/storage/buffer/freelist.c | 119 +-------------------------
src/include/storage/buf_internals.h | 12 +--
6 files changed, 32 insertions(+), 208 deletions(-)
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index c01b9c7e6a4..2722b0bb443 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -370,6 +370,16 @@ apw_load_buffers(void)
apw_state->prewarm_start_idx = apw_state->prewarm_stop_idx = 0;
apw_state->prewarmed_blocks = 0;
+
+ /* Don't prewarm more than we can fit. */
+ if (num_elements > NBuffers)
+ {
+ num_elements = NBuffers;
+ ereport(LOG,
+ (errmsg("autoprewarm: capping prewarmed blocks to %d (shared_buffers size)",
+ NBuffers)));
+ }
+
/* Get the info position of the first block of the next database. */
while (apw_state->prewarm_start_idx < num_elements)
{
@@ -410,10 +420,6 @@ apw_load_buffers(void)
apw_state->database = current_db;
Assert(apw_state->prewarm_start_idx < apw_state->prewarm_stop_idx);
- /* If we've run out of free buffers, don't launch another worker. */
- if (!have_free_buffer())
- break;
-
/*
* Likewise, don't launch if we've already been told to shut down.
* (The launch would fail anyway, but we might as well skip it.)
@@ -462,12 +468,6 @@ apw_read_stream_next_block(ReadStream *stream,
{
BlockInfoRecord blk = p->block_info[p->pos];
- if (!have_free_buffer())
- {
- p->pos = apw_state->prewarm_stop_idx;
- return InvalidBlockNumber;
- }
-
if (blk.tablespace != p->tablespace)
return InvalidBlockNumber;
@@ -523,10 +523,10 @@ autoprewarm_database_main(Datum main_arg)
blk = block_info[i];
/*
- * Loop until we run out of blocks to prewarm or until we run out of free
+ * Loop until we run out of blocks to prewarm or until we run out of
* buffers.
*/
- while (i < apw_state->prewarm_stop_idx && have_free_buffer())
+ while (i < apw_state->prewarm_stop_idx)
{
Oid tablespace = blk.tablespace;
RelFileNumber filenumber = blk.filenumber;
@@ -568,14 +568,13 @@ autoprewarm_database_main(Datum main_arg)
/*
* We have a relation; now let's loop until we find a valid fork of
- * the relation or we run out of free buffers. Once we've read from
- * all valid forks or run out of options, we'll close the relation and
+ * the relation or we run out of buffers. Once we've read from all
+ * valid forks or run out of options, we'll close the relation and
* move on.
*/
while (i < apw_state->prewarm_stop_idx &&
blk.tablespace == tablespace &&
- blk.filenumber == filenumber &&
- have_free_buffer())
+ blk.filenumber == filenumber)
{
ForkNumber forknum = blk.forknum;
BlockNumber nblocks;
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 4b13da5d7ad..119f31b5d65 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -128,11 +128,11 @@ independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement. A spinlock is used here rather than a lightweight
-lock for efficiency; no other locks of any sort should be acquired while
-buffer_strategy_lock is held. This is essential to allow buffer replacement
-to happen in multiple backends with reasonable concurrency.
+exclusion for operations that select buffers for replacement. A spinlock is
+used here rather than a lightweight lock for efficiency; no other locks of any
+sort should be acquired while buffer_strategy_lock is held. This is essential
+to allow buffer replacement to happen in multiple backends with reasonable
+concurrency.
* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
@@ -158,18 +158,8 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-There is a "free list" of buffers that are prime candidates for replacement.
-In particular, buffers that are completely free (contain no valid page) are
-always in this list. We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that. The list is singly-linked using fields in the
-buffer headers; we maintain head and tail pointers in global variables.
-(Note: although the list links are in the buffer headers, they are
-considered to be protected by the buffer_strategy_lock, not the buffer-header
-spinlocks.) To choose a victim buffer to recycle when there are no free
-buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations. It works like
-this:
+To choose a victim buffer to recycle we use a simple clock-sweep algorithm. It
+works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -184,20 +174,14 @@ The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain buffer_strategy_lock.
-2. If buffer free list is nonempty, remove its head buffer. Release
-buffer_strategy_lock. If the buffer is pinned or has a nonzero usage count,
-it cannot be used; ignore it go back to step 1. Otherwise, pin the buffer,
-and return it.
+2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time. Release buffer_strategy_lock.
-3. Otherwise, the buffer free list is empty. Select the buffer pointed to by
-nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
-Release buffer_strategy_lock.
-
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
+3. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used. Decrement its usage count (if nonzero), reacquire
buffer_strategy_lock, and return to step 3 to examine the next buffer.
-5. Pin the selected buffer, and return.
+4. Pin the selected buffer, and return.
(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -234,7 +218,7 @@ the ring strategy effectively degrades to the normal strategy.
VACUUM uses a ring like sequential scans, however, the size of this ring is
controlled by the vacuum_buffer_usage_limit GUC. Dirty pages are not removed
-from the ring. Instead, WAL is flushed if needed to allow reuse of the
+from the ring. Instead, the WAL is flushed if needed to allow reuse of the
buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's buffers
were sent to the freelist, which was effectively a buffer ring of 1 buffer,
resulting in excessive WAL flushing.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..6fd3a6bbac5 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -128,20 +128,11 @@ BufferManagerShmemInit(void)
pgaio_wref_clear(&buf->io_wref);
- /*
- * Initially link all the buffers together as unused. Subsequent
- * management of this list is done by freelist.c.
- */
- buf->freeNext = i + 1;
-
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
-
- /* Correct last entry of linked list */
- GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
/* Init other shared buffer-management stuff */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 396b053b3fa..719a5bb6f97 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2094,12 +2094,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
UnpinBuffer(victim_buf_hdr);
- /*
- * The victim buffer we acquired previously is clean and unused, let
- * it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
-
/* remaining code should match code at top of routine */
existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
@@ -2158,8 +2152,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
/*
- * InvalidateBuffer -- mark a shared buffer invalid and return it to the
- * freelist.
+ * InvalidateBuffer -- mark a shared buffer invalid.
*
* The buffer header spinlock must be held at entry. We drop it before
* returning. (This is sane because the caller must have locked the
@@ -2257,11 +2250,6 @@ retry:
* Done with mapping lock.
*/
LWLockRelease(oldPartitionLock);
-
- /*
- * Insert the buffer at the head of the list of free buffers.
- */
- StrategyFreeBuffer(buf);
}
/*
@@ -2679,11 +2667,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
{
BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(buf_hdr);
UnpinBuffer(buf_hdr);
}
@@ -2756,12 +2739,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
valid = PinBuffer(existing_hdr, strategy);
LWLockRelease(partition_lock);
-
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
UnpinBuffer(victim_buf_hdr);
buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
@@ -3658,8 +3635,8 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the freelist clock-sweep currently is, and how many
- * buffer allocations have happened since our last call.
+ * Find out where the clock-sweep currently is, and how many buffer
+ * allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index cd94a7d8a7b..7d59a92bd1a 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -39,14 +39,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -163,23 +155,6 @@ ClockSweepTick(void)
return victim;
}
-/*
- * have_free_buffer -- a lockless check to see if there is a free buffer in
- * buffer pool.
- *
- * If the result is true that will become stale once free buffers are moved out
- * by other operations, so the caller who strictly want to use a free buffer
- * should not call this.
- */
-bool
-have_free_buffer(void)
-{
- if (StrategyControl->firstFreeBuffer >= 0)
- return true;
- else
- return false;
-}
-
/*
* StrategyGetBuffer
*
@@ -249,69 +224,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /*
- * First check, without acquiring the lock, whether there's buffers in the
- * freelist. Since we otherwise don't require the spinlock in every
- * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
- * uselessly in most cases. That obviously leaves a race where a buffer is
- * put on the freelist but we don't see the store yet - but that's pretty
- * harmless, it'll just get used during the next buffer acquisition.
- *
- * If there's buffers on the freelist, acquire the spinlock to pop one
- * buffer of the freelist. Then check whether that buffer is usable and
- * repeat if not.
- *
- * Note that the freeNext fields are considered to be protected by the
- * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
- * manipulate them without holding the spinlock.
- */
- if (StrategyControl->firstFreeBuffer >= 0)
- {
- while (true)
- {
- /* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- if (StrategyControl->firstFreeBuffer < 0)
- {
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- break;
- }
-
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
-
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
-
- /*
- * Release the lock so someone else can access the freelist while
- * we check out this buffer.
- */
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot
- * use it; discard it and retry. (This can only happen if VACUUM
- * put a valid buffer in the freelist and then someone else used
- * it before we got to it. It's probably impossible altogether as
- * of 8.3, but we'd better check anyway.)
- */
- local_buf_state = LockBufHdr(buf);
- if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
- && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
- {
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- *buf_state = local_buf_state;
- return buf;
- }
- UnlockBufHdr(buf, local_buf_state);
- }
- }
-
- /* Nothing on the freelist, so run the "clock-sweep" algorithm */
+ /* Use the "clock sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -356,29 +269,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
-/*
- * StrategyFreeBuffer: put a buffer on the freelist
- */
-void
-StrategyFreeBuffer(BufferDesc *buf)
-{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- /*
- * It is possible that we are told to put something in the freelist that
- * is already in it; don't screw up the list if so.
- */
- if (buf->freeNext == FREENEXT_NOT_IN_LIST)
- {
- buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
- StrategyControl->firstFreeBuffer = buf->buf_id;
- }
-
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-}
-
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -504,13 +394,6 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
-
/* Initialize the clock-sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 3a210c710f6..9fcc94ef02d 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -217,8 +217,7 @@ BufMappingPartitionLockByIndex(uint32 index)
* single atomic variable. This layout allow us to do some operations in a
* single atomic operation, without actually acquiring and releasing spinlock;
* for instance, increase or decrease refcount. buf_id field never changes
- * after initialization, so does not need locking. freeNext is protected by
- * the buffer_strategy_lock not buffer header lock. The LWLock can take care
+ * after initialization, so does not need locking. The LWLock can take care
* of itself. The buffer header lock is *not* used to control access to the
* data in the buffer!
*
@@ -264,7 +263,6 @@ typedef struct BufferDesc
pg_atomic_uint32 state;
int wait_backend_pgprocno; /* backend of pin-count waiter */
- int freeNext; /* link in freelist chain */
PgAioWaitRef io_wref; /* set iff AIO is in progress */
LWLock content_lock; /* to lock access to buffer contents */
@@ -360,13 +358,6 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc)
return (LWLock *) (&bdesc->content_lock);
}
-/*
- * The freeNext field is either the index of the next freelist entry,
- * or one of these special values:
- */
-#define FREENEXT_END_OF_LIST (-1)
-#define FREENEXT_NOT_IN_LIST (-2)
-
/*
* Functions for acquiring/releasing a shared buffer header's spinlock. Do
* not apply these to local buffers!
@@ -453,7 +444,6 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
-extern bool have_free_buffer(void);
/* buf_table.c */
extern Size BufTableShmemSize(int size);
--
2.49.0
On Wed, Aug 13, 2025 at 9:42 AM Greg Burd <greg@burd.me> wrote:
Amazing, thank you. I'll try to replicate your tests tomorrow to see if
my optimized division and modulo functions do in fact help or not. I
realize that both you and Andres are (rightly) concerned that the
performance impact of IDIV on some CPUs can be excessive.
At the risk of posting untested crackpot theories on the internet, I
wonder if there is a way to use a simple boundary condition and
subtraction for this. If you correct overshoot compared to an
advancing-in-strides base value, then I wonder how often you'd finish
up having to actually do that under concurrency. Obviously in
general, implementing modulo with subtraction is a terrible idea, but
can you make it so that the actual cost works out as mostly 0, rarely
1 and exceedingly rarely more than 1 subtraction loops? If that's
true, do the branches somehow kill you?
Assume for now that we're OK with keeping % and / for the infrequent
calls to StrategySyncStart(), or we can redefine the bgwriter's
logic so that it doesn't even need those (perhaps what it really wants
to know is its total distance behind the allocator, so perhaps we can
define that problem away? haven't thought about that yet...). What
I'm wondering out loud is whether the hot ClockSweepTick() code might
be able to use something nearly as dumb as this...
/* untested pseudocode */
ticks_base = pg_atomic_read_u64(&x->ticks_base);
ticks = pg_atomic_fetch_add_u64(&x->ticks, 1);
hand = ticks - ticks_base;
/*
* Compensate for overshoot. Expected number of loops: none most of the
* time, one when we overshoot, and maybe more if the system gets
* around the whole clock before we see the base value advance.
*/
while (hand >= NBuffers)
{
/* Base value advanced by backend that overshoots by one tick. */
if (hand == NBuffers)
pg_atomic_fetch_add_u64(&StrategyControl->ticks_base, NBuffers);
hand -= NBuffers;
}
On Sat, Aug 16, 2025 at 3:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
while (hand >= NBuffers)
{
/* Base value advanced by backend that overshoots by one tick. */
if (hand == NBuffers)
pg_atomic_fetch_add_u64(&StrategyControl->ticks_base, NBuffers);
hand -= NBuffers;
}
Or if you don't like those odds, maybe it'd be OK to keep % but use it
rarely and without the CAS that can fail. I assume it would still
happen occasionally in more than one backend due to the race against
the base advancing a few instructions later, but maybe that'd work out
OK? I dunno. The point would be to make it rare. And with a
per-NUMA-node CLOCK, hopefully quite rare indeed. I guess this way
you don't need to convince yourself that ticks_base is always <= ticks
for all cores, since it would self-correct (if it appears to one core
that ticks_base > ticks then hand will be a very large number and take
this branch). IDK, again untested, just throwing ideas out there...
if (hand >= NBuffers)
{
hand %= NBuffers;
/* Base value advanced by backend that overshoots by one tick. */
if (hand == 0)
pg_atomic_fetch_add_u64(&StrategyControl->ticks_base, NBuffers);
}
On Sun, Aug 17, 2025 at 4:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Or if you don't like those odds, maybe it'd be OK to keep % but use it
rarely and without the CAS that can fail.
... or if we wanted to try harder to avoid %, could we relegate it to
the unlikely CLOCK-went-all-the-way-around-again-due-to-unlucky-scheduling
case, but use subtraction for the expected periodic overshoot?
if (hand >= NBuffers)
{
hand = hand < NBuffers * 2 ? hand - NBuffers : hand % NBuffers;
/* Base value advanced by backend that overshoots by one tick. */
if (hand == 0)
pg_atomic_fetch_add_u64(&StrategyControl->ticks_base, NBuffers);
}
On Aug 17 2025, at 12:57 am, Thomas Munro <thomas.munro@gmail.com> wrote:
On Sun, Aug 17, 2025 at 4:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Or if you don't like those odds, maybe it'd be OK to keep % but use it
rarely and without the CAS that can fail.
... or if we wanted to try harder to avoid %, could we relegate it to
the unlikely CLOCK-went-all-the-way-around-again-due-to-unlucky-scheduling
case, but use subtraction for the expected periodic overshoot?
if (hand >= NBuffers)
{
hand = hand < NBuffers * 2 ? hand - NBuffers : hand % NBuffers;
/* Base value advanced by backend that overshoots by one tick. */
if (hand == 0)
pg_atomic_fetch_add_u64(&StrategyControl->ticks_base, NBuffers);
}
Hi Thomas,
Thanks for all the ideas. I tried out a few of them, along with a number
of other ideas of my own. After a lot of measurement and a few off-channel
discussions, I think the best way to move forward is to focus on removing
the freelist and not bother with the lock or with changing clock-sweep too
much right now. So the attached patch set keeps the first two patches from
the last set but drops the rest.
But wait, there's more...
As a *bonus* I've added a new third patch with some proposed changes to
spark discussion. As I researched experiences in the field at scale, a
few other buffer-management issues came to light. The one I try to
address in this new patch 0003 has to do with very large shared_buffers
(NBuffers) and very large active datasets causing most buffer usage
counts to sit at or near the max value (5). In those cases the
clock-sweep algorithm can need on the order of NBuffers * 5 "ticks"
before identifying a buffer to evict. This also pollutes the
completePasses value used to inform the bgwriter where to start working.
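To put rough numbers on that (back-of-the-envelope only, my own figures
rather than anything measured in this thread), here is a tiny standalone C
sketch assuming the default 8KB block size and shared_buffers = 16GB:

/*
 * Illustration only: worst-case number of clock-sweep ticks before the
 * first eviction when every buffer's usage_count sits at the max (5).
 */
#include <stdio.h>

int
main(void)
{
	long long	nbuffers = 16LL * 1024 * 1024 * 1024 / 8192;	/* 16GB / 8KB = 2097152 */
	int			max_usage = 5;									/* BM_MAX_USAGE_COUNT */

	printf("worst-case ticks: %lld\n", nbuffers * max_usage);	/* ~10.5 million */
	return 0;
}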
So, in this patch I add per-backend buffer usage tracking and proactive
pressure management. Each tick of the hand can now decrement usage by a
calculated amount, not just 1, based on /hand-wavy-first-attempt at magic/.
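To make that a little more concrete, here is a minimal sketch of the kind
of pressure-scaled decrement I mean. Every name in it
(ClockSweepDecrementAmount, recent_allocs, recent_evictions) is invented
for illustration, written against PostgreSQL's uint64/BM_MAX_USAGE_COUNT;
it is not the code in the attached patch 0003:

/*
 * Hypothetical sketch only, not the attached patch: let each tick of the
 * hand decrement usage_count by more than 1 when the pool looks saturated,
 * so we don't need ~NBuffers * BM_MAX_USAGE_COUNT ticks to find a victim.
 */
static inline int
ClockSweepDecrementAmount(uint64 recent_allocs, uint64 recent_evictions)
{
	if (recent_evictions == 0)
		return BM_MAX_USAGE_COUNT;	/* everything looks hot; be aggressive */
	if (recent_allocs > 4 * recent_evictions)
		return 2;					/* allocations far outpace evictions */
	return 1;						/* normal clock-sweep behaviour */
}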
The thing I'm sure this doesn't help with, and may in fact hurt, is
keeping frequently accessed buffers in the buffer pool. I imagine a
two-tier approach to this where some small subset of buffers that are reused
frequently enough are not even considered by the clock-sweep algorithm.
Regardless, I feel the first two patches in this set address the
intention of this thread. I added patch 0003 just to start a
conversation; please chime in if any of this interests you. Maybe this
new patch should take on a life of its own in a new thread? If anyone
thinks this approach has some merit, I'll do that.
I look forward to thoughts on these ideas, and hopefully to finding
someone willing to help me get the first two over the line.
best.
-greg
Attachments:
v14-0001-Use-consistent-naming-of-the-clock-sweep-algorit.patch (application/octet-stream)
From 9d7359801a7ad428a3afd46c83b9dcbb520fa911 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Tue, 12 Aug 2025 06:57:12 -0400
Subject: [PATCH v14 1/3] Use consistent naming of the clock-sweep algorithm.
Minor edits to comments only.
---
src/backend/storage/buffer/README | 4 ++--
src/backend/storage/buffer/bufmgr.c | 8 ++++----
src/backend/storage/buffer/freelist.c | 10 +++++-----
src/backend/storage/buffer/localbuf.c | 2 +-
src/include/storage/buf_internals.h | 4 ++--
5 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a182fcd660c..4b13da5d7ad 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -211,9 +211,9 @@ Buffer Ring Replacement Strategy
When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
-again soon, so instead of running the normal clock sweep algorithm and
+again soon, so instead of running the normal clock-sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
-using the normal clock sweep algorithm and those buffers are reused for the
+using the normal clock-sweep algorithm and those buffers are reused for the
whole scan. This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fd7e21d96d3..396b053b3fa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3608,7 +3608,7 @@ BufferSync(int flags)
* This is called periodically by the background writer process.
*
* Returns true if it's appropriate for the bgwriter process to go into
- * low-power hibernation mode. (This happens if the strategy clock sweep
+ * low-power hibernation mode. (This happens if the strategy clock-sweep
* has been "lapped" and no buffer allocations have occurred recently,
* or if the bgwriter has been effectively disabled by setting
* bgwriter_lru_maxpages to 0.)
@@ -3658,7 +3658,7 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the freelist clock sweep currently is, and how many
+ * Find out where the freelist clock-sweep currently is, and how many
* buffer allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
@@ -3679,8 +3679,8 @@ BgBufferSync(WritebackContext *wb_context)
/*
* Compute strategy_delta = how many buffers have been scanned by the
- * clock sweep since last time. If first time through, assume none. Then
- * see if we are still ahead of the clock sweep, and if so, how many
+ * clock-sweep since last time. If first time through, assume none. Then
+ * see if we are still ahead of the clock-sweep, and if so, how many
* buffers we could scan before we'd catch up with it and "lap" it. Note:
* weird-looking coding of xxx_passes comparisons are to avoid bogus
* behavior when the passes counts wrap around.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..cd94a7d8a7b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -33,7 +33,7 @@ typedef struct
slock_t buffer_strategy_lock;
/*
- * Clock sweep hand: index of next buffer to consider grabbing. Note that
+ * clock-sweep hand: index of next buffer to consider grabbing. Note that
* this isn't a concrete buffer - we only ever increase the value. So, to
* get an actual buffer, it needs to be used modulo NBuffers.
*/
@@ -51,7 +51,7 @@ typedef struct
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
*/
- uint32 completePasses; /* Complete cycles of the clock sweep */
+ uint32 completePasses; /* Complete cycles of the clock-sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
@@ -311,7 +311,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /* Nothing on the freelist, so run the "clock-sweep" algorithm */
trycounter = NBuffers;
for (;;)
{
@@ -511,7 +511,7 @@ StrategyInitialize(bool init)
StrategyControl->firstFreeBuffer = 0;
StrategyControl->lastFreeBuffer = NBuffers - 1;
- /* Initialize the clock sweep pointer */
+ /* Initialize the clock-sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
/* Clear statistics */
@@ -759,7 +759,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
*
* If usage_count is 0 or 1 then the buffer is fair game (we expect 1,
* since our own previous usage of the ring element would have left it
- * there, but it might've been decremented by clock sweep since then). A
+ * there, but it might've been decremented by clock-sweep since then). A
* higher usage_count indicates someone else has touched the buffer, so we
* shouldn't re-use it.
*/
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3c0d20f4659..04fef13409b 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -229,7 +229,7 @@ GetLocalVictimBuffer(void)
ResourceOwnerEnlarge(CurrentResourceOwner);
/*
- * Need to get a new buffer. We use a clock sweep algorithm (essentially
+ * Need to get a new buffer. We use a clock-sweep algorithm (essentially
* the same as what freelist.c does now...)
*/
trycounter = NLocBuffer;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..3a210c710f6 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -80,8 +80,8 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
* The maximum allowed value of usage_count represents a tradeoff between
* accuracy and speed of the clock-sweep buffer management algorithm. A
* large value (comparable to NBuffers) would approximate LRU semantics.
- * But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of
- * clock sweeps to find a free buffer, so in practice we don't want the
+ * But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of the
+ * clock-sweep hand to find a free buffer, so in practice we don't want the
* value to be very large.
*/
#define BM_MAX_USAGE_COUNT 5
--
2.49.0
v14-0002-Eliminate-the-freelist-from-the-buffer-manager-a.patch (application/octet-stream)
From 66667982302c4ac8eb70ab067f3101cea4c02c1b Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Thu, 10 Jul 2025 14:45:32 -0400
Subject: [PATCH v14 2/3] Eliminate the freelist from the buffer manager and
depend on clock-sweep
This set of changes removes the list of available buffers and instead
simply uses the clock-sweep algorithm to find and return an available
buffer. While on the surface this appears to be removing an
optimization, it is in fact eliminating code that induces overhead in the
form of synchronization that is problematic for multi-core systems.
This also removes the have_free_buffer() function and simply caps the
number of blocks the autoprewarm worker will load at NBuffers.
---
contrib/pg_prewarm/autoprewarm.c | 31 ++++---
src/backend/storage/buffer/README | 40 +++------
src/backend/storage/buffer/buf_init.c | 9 --
src/backend/storage/buffer/bufmgr.c | 29 +------
src/backend/storage/buffer/freelist.c | 119 +-------------------------
src/include/storage/buf_internals.h | 12 +--
6 files changed, 32 insertions(+), 208 deletions(-)
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index c01b9c7e6a4..2722b0bb443 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -370,6 +370,16 @@ apw_load_buffers(void)
apw_state->prewarm_start_idx = apw_state->prewarm_stop_idx = 0;
apw_state->prewarmed_blocks = 0;
+
+ /* Don't prewarm more than we can fit. */
+ if (num_elements > NBuffers)
+ {
+ num_elements = NBuffers;
+ ereport(LOG,
+ (errmsg("autoprewarm: capping prewarmed blocks to %d (shared_buffers size)",
+ NBuffers)));
+ }
+
/* Get the info position of the first block of the next database. */
while (apw_state->prewarm_start_idx < num_elements)
{
@@ -410,10 +420,6 @@ apw_load_buffers(void)
apw_state->database = current_db;
Assert(apw_state->prewarm_start_idx < apw_state->prewarm_stop_idx);
- /* If we've run out of free buffers, don't launch another worker. */
- if (!have_free_buffer())
- break;
-
/*
* Likewise, don't launch if we've already been told to shut down.
* (The launch would fail anyway, but we might as well skip it.)
@@ -462,12 +468,6 @@ apw_read_stream_next_block(ReadStream *stream,
{
BlockInfoRecord blk = p->block_info[p->pos];
- if (!have_free_buffer())
- {
- p->pos = apw_state->prewarm_stop_idx;
- return InvalidBlockNumber;
- }
-
if (blk.tablespace != p->tablespace)
return InvalidBlockNumber;
@@ -523,10 +523,10 @@ autoprewarm_database_main(Datum main_arg)
blk = block_info[i];
/*
- * Loop until we run out of blocks to prewarm or until we run out of free
+ * Loop until we run out of blocks to prewarm or until we run out of
* buffers.
*/
- while (i < apw_state->prewarm_stop_idx && have_free_buffer())
+ while (i < apw_state->prewarm_stop_idx)
{
Oid tablespace = blk.tablespace;
RelFileNumber filenumber = blk.filenumber;
@@ -568,14 +568,13 @@ autoprewarm_database_main(Datum main_arg)
/*
* We have a relation; now let's loop until we find a valid fork of
- * the relation or we run out of free buffers. Once we've read from
- * all valid forks or run out of options, we'll close the relation and
+ * the relation or we run out of buffers. Once we've read from all
+ * valid forks or run out of options, we'll close the relation and
* move on.
*/
while (i < apw_state->prewarm_stop_idx &&
blk.tablespace == tablespace &&
- blk.filenumber == filenumber &&
- have_free_buffer())
+ blk.filenumber == filenumber)
{
ForkNumber forknum = blk.forknum;
BlockNumber nblocks;
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 4b13da5d7ad..119f31b5d65 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -128,11 +128,11 @@ independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement. A spinlock is used here rather than a lightweight
-lock for efficiency; no other locks of any sort should be acquired while
-buffer_strategy_lock is held. This is essential to allow buffer replacement
-to happen in multiple backends with reasonable concurrency.
+exclusion for operations that select buffers for replacement. A spinlock is
+used here rather than a lightweight lock for efficiency; no other locks of any
+sort should be acquired while buffer_strategy_lock is held. This is essential
+to allow buffer replacement to happen in multiple backends with reasonable
+concurrency.
* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
@@ -158,18 +158,8 @@ unset by sleeping on the buffer's condition variable.
Normal Buffer Replacement Strategy
----------------------------------
-There is a "free list" of buffers that are prime candidates for replacement.
-In particular, buffers that are completely free (contain no valid page) are
-always in this list. We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that. The list is singly-linked using fields in the
-buffer headers; we maintain head and tail pointers in global variables.
-(Note: although the list links are in the buffer headers, they are
-considered to be protected by the buffer_strategy_lock, not the buffer-header
-spinlocks.) To choose a victim buffer to recycle when there are no free
-buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations. It works like
-this:
+To choose a victim buffer to recycle we use a simple clock-sweep algorithm. It
+works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -184,20 +174,14 @@ The algorithm for a process that needs to obtain a victim buffer is:
1. Obtain buffer_strategy_lock.
-2. If buffer free list is nonempty, remove its head buffer. Release
-buffer_strategy_lock. If the buffer is pinned or has a nonzero usage count,
-it cannot be used; ignore it go back to step 1. Otherwise, pin the buffer,
-and return it.
+2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time. Release buffer_strategy_lock.
-3. Otherwise, the buffer free list is empty. Select the buffer pointed to by
-nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
-Release buffer_strategy_lock.
-
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
+3. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used. Decrement its usage count (if nonzero), reacquire
buffer_strategy_lock, and return to step 3 to examine the next buffer.
-5. Pin the selected buffer, and return.
+4. Pin the selected buffer, and return.
(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -234,7 +218,7 @@ the ring strategy effectively degrades to the normal strategy.
VACUUM uses a ring like sequential scans, however, the size of this ring is
controlled by the vacuum_buffer_usage_limit GUC. Dirty pages are not removed
-from the ring. Instead, WAL is flushed if needed to allow reuse of the
+from the ring. Instead, the WAL is flushed if needed to allow reuse of the
buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's buffers
were sent to the freelist, which was effectively a buffer ring of 1 buffer,
resulting in excessive WAL flushing.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..6fd3a6bbac5 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -128,20 +128,11 @@ BufferManagerShmemInit(void)
pgaio_wref_clear(&buf->io_wref);
- /*
- * Initially link all the buffers together as unused. Subsequent
- * management of this list is done by freelist.c.
- */
- buf->freeNext = i + 1;
-
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
-
- /* Correct last entry of linked list */
- GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
}
/* Init other shared buffer-management stuff */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 396b053b3fa..719a5bb6f97 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2094,12 +2094,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
UnpinBuffer(victim_buf_hdr);
- /*
- * The victim buffer we acquired previously is clean and unused, let
- * it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
-
/* remaining code should match code at top of routine */
existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
@@ -2158,8 +2152,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
/*
- * InvalidateBuffer -- mark a shared buffer invalid and return it to the
- * freelist.
+ * InvalidateBuffer -- mark a shared buffer invalid.
*
* The buffer header spinlock must be held at entry. We drop it before
* returning. (This is sane because the caller must have locked the
@@ -2257,11 +2250,6 @@ retry:
* Done with mapping lock.
*/
LWLockRelease(oldPartitionLock);
-
- /*
- * Insert the buffer at the head of the list of free buffers.
- */
- StrategyFreeBuffer(buf);
}
/*
@@ -2679,11 +2667,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
{
BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(buf_hdr);
UnpinBuffer(buf_hdr);
}
@@ -2756,12 +2739,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
valid = PinBuffer(existing_hdr, strategy);
LWLockRelease(partition_lock);
-
- /*
- * The victim buffer we acquired previously is clean and unused,
- * let it be found again quickly
- */
- StrategyFreeBuffer(victim_buf_hdr);
UnpinBuffer(victim_buf_hdr);
buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
@@ -3658,8 +3635,8 @@ BgBufferSync(WritebackContext *wb_context)
uint32 new_recent_alloc;
/*
- * Find out where the freelist clock-sweep currently is, and how many
- * buffer allocations have happened since our last call.
+ * Find out where the clock-sweep currently is, and how many buffer
+ * allocations have happened since our last call.
*/
strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index cd94a7d8a7b..7d59a92bd1a 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -39,14 +39,6 @@ typedef struct
*/
pg_atomic_uint32 nextVictimBuffer;
- int firstFreeBuffer; /* Head of list of unused buffers */
- int lastFreeBuffer; /* Tail of list of unused buffers */
-
- /*
- * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
- * when the list is empty)
- */
-
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -163,23 +155,6 @@ ClockSweepTick(void)
return victim;
}
-/*
- * have_free_buffer -- a lockless check to see if there is a free buffer in
- * buffer pool.
- *
- * If the result is true that will become stale once free buffers are moved out
- * by other operations, so the caller who strictly want to use a free buffer
- * should not call this.
- */
-bool
-have_free_buffer(void)
-{
- if (StrategyControl->firstFreeBuffer >= 0)
- return true;
- else
- return false;
-}
-
/*
* StrategyGetBuffer
*
@@ -249,69 +224,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
- /*
- * First check, without acquiring the lock, whether there's buffers in the
- * freelist. Since we otherwise don't require the spinlock in every
- * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
- * uselessly in most cases. That obviously leaves a race where a buffer is
- * put on the freelist but we don't see the store yet - but that's pretty
- * harmless, it'll just get used during the next buffer acquisition.
- *
- * If there's buffers on the freelist, acquire the spinlock to pop one
- * buffer of the freelist. Then check whether that buffer is usable and
- * repeat if not.
- *
- * Note that the freeNext fields are considered to be protected by the
- * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
- * manipulate them without holding the spinlock.
- */
- if (StrategyControl->firstFreeBuffer >= 0)
- {
- while (true)
- {
- /* Acquire the spinlock to remove element from the freelist */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- if (StrategyControl->firstFreeBuffer < 0)
- {
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- break;
- }
-
- buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
-
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
-
- /*
- * Release the lock so someone else can access the freelist while
- * we check out this buffer.
- */
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot
- * use it; discard it and retry. (This can only happen if VACUUM
- * put a valid buffer in the freelist and then someone else used
- * it before we got to it. It's probably impossible altogether as
- * of 8.3, but we'd better check anyway.)
- */
- local_buf_state = LockBufHdr(buf);
- if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
- && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
- {
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- *buf_state = local_buf_state;
- return buf;
- }
- UnlockBufHdr(buf, local_buf_state);
- }
- }
-
- /* Nothing on the freelist, so run the "clock-sweep" algorithm */
+ /* Use the "clock sweep" algorithm to find a free buffer */
trycounter = NBuffers;
for (;;)
{
@@ -356,29 +269,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
-/*
- * StrategyFreeBuffer: put a buffer on the freelist
- */
-void
-StrategyFreeBuffer(BufferDesc *buf)
-{
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
- /*
- * It is possible that we are told to put something in the freelist that
- * is already in it; don't screw up the list if so.
- */
- if (buf->freeNext == FREENEXT_NOT_IN_LIST)
- {
- buf->freeNext = StrategyControl->firstFreeBuffer;
- if (buf->freeNext < 0)
- StrategyControl->lastFreeBuffer = buf->buf_id;
- StrategyControl->firstFreeBuffer = buf->buf_id;
- }
-
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-}
-
/*
* StrategySyncStart -- tell BgBufferSync where to start syncing
*
@@ -504,13 +394,6 @@ StrategyInitialize(bool init)
SpinLockInit(&StrategyControl->buffer_strategy_lock);
- /*
- * Grab the whole linked list of free buffers for our strategy. We
- * assume it was previously set up by BufferManagerShmemInit().
- */
- StrategyControl->firstFreeBuffer = 0;
- StrategyControl->lastFreeBuffer = NBuffers - 1;
-
/* Initialize the clock-sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 3a210c710f6..9fcc94ef02d 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -217,8 +217,7 @@ BufMappingPartitionLockByIndex(uint32 index)
* single atomic variable. This layout allow us to do some operations in a
* single atomic operation, without actually acquiring and releasing spinlock;
* for instance, increase or decrease refcount. buf_id field never changes
- * after initialization, so does not need locking. freeNext is protected by
- * the buffer_strategy_lock not buffer header lock. The LWLock can take care
+ * after initialization, so does not need locking. The LWLock can take care
* of itself. The buffer header lock is *not* used to control access to the
* data in the buffer!
*
@@ -264,7 +263,6 @@ typedef struct BufferDesc
pg_atomic_uint32 state;
int wait_backend_pgprocno; /* backend of pin-count waiter */
- int freeNext; /* link in freelist chain */
PgAioWaitRef io_wref; /* set iff AIO is in progress */
LWLock content_lock; /* to lock access to buffer contents */
@@ -360,13 +358,6 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc)
return (LWLock *) (&bdesc->content_lock);
}
-/*
- * The freeNext field is either the index of the next freelist entry,
- * or one of these special values:
- */
-#define FREENEXT_END_OF_LIST (-1)
-#define FREENEXT_NOT_IN_LIST (-2)
-
/*
* Functions for acquiring/releasing a shared buffer header's spinlock. Do
* not apply these to local buffers!
@@ -453,7 +444,6 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
-extern bool have_free_buffer(void);
/* buf_table.c */
extern Size BufTableShmemSize(int size);
--
2.49.0
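As a quick illustration of what remains once the freelist is gone: victim selection in StrategyGetBuffer() is just the clock-sweep loop described in the README hunk above. The following is a toy, single-threaded model of that loop; none of these names (ToyBuffer, NBUFFERS, toy_clock_sweep) are PostgreSQL identifiers, and the real code works on buffer headers under LockBufHdr() with an atomic clock hand, so treat this strictly as a sketch of the control flow:

#include <stdio.h>

#define NBUFFERS  16
#define MAX_USAGE 5

typedef struct
{
	int		refcount;		/* pins currently held on the buffer */
	int		usage_count;	/* decremented as the clock hand passes */
} ToyBuffer;

static ToyBuffer buffers[NBUFFERS];
static int	next_victim = 0;	/* the "clock hand" */

/*
 * Return a victim buffer id, or -1 if every buffer stayed pinned for a
 * full sweep (the real code raises "no unpinned buffers available").
 */
static int
toy_clock_sweep(void)
{
	int		trycounter = NBUFFERS;

	for (;;)
	{
		int			id = next_victim;
		ToyBuffer  *buf = &buffers[id];

		next_victim = (next_victim + 1) % NBUFFERS;

		if (buf->refcount == 0)
		{
			if (buf->usage_count > 0)
			{
				buf->usage_count--;		/* not cold yet, give it another lap */
				trycounter = NBUFFERS;	/* progress was possible, reset guard */
			}
			else
				return id;				/* unpinned and cold: our victim */
		}
		else if (--trycounter == 0)
			return -1;					/* everything is pinned */
	}
}

int
main(void)
{
	for (int i = 0; i < NBUFFERS; i++)
	{
		buffers[i].refcount = (i % 4 == 0) ? 1 : 0;
		buffers[i].usage_count = i % MAX_USAGE;
	}

	printf("victim = %d\n", toy_clock_sweep());
	return 0;
}

The trycounter reset on any usage-count decrement mirrors the real loop: as long as some buffer is still being aged toward zero, a sweep can never falsely conclude that no unpinned buffers are available.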
Attachment: v14-0003-Track-buffer-usage-per-backend-and-use-that-to-i.patch
From 922b056074f24e7eb6df971f69b166fc476e38d3 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Tue, 26 Aug 2025 13:40:19 -0400
Subject: [PATCH v14 3/3] Track buffer usage per-backend and use that to inform
buffer management
Implement a comprehensive buffer pressure monitoring and management system
to improve PostgreSQL's buffer replacement efficiency under high load.
Problems:
- Buffer pressure was only detectable reactively during allocation failures
- No visibility into which backends contribute most to buffer contention
- High usage counts force multiple clock-sweep passes under heavy load
- bgwriter had limited intelligence about system-wide buffer usage patterns
Solution:
- Add per-backend buffer usage counters
- Implement proactive buffer pressure calculation in bgwriter
- Add targeted buffer writing for high-usage backends (90th percentile)
- Adjust rate of usage count reduction when pressure exceeds 75% threshold
Benefits:
- Reduces buffer allocation stalls by proactively managing pressure
- Provides fair resource management by targeting high-usage backends
- Improves system responsiveness under memory pressure
- Maintains backward compatibility with existing buffer management
- Enables better observability of buffer usage patterns per backend
---
src/backend/postmaster/bgwriter.c | 320 ++++++++++++++++++++++++++
src/backend/storage/buffer/buf_init.c | 3 +
src/backend/storage/buffer/bufmgr.c | 9 +
src/backend/storage/buffer/freelist.c | 70 +++++-
src/backend/storage/lmgr/proc.c | 8 +
src/include/storage/buf_internals.h | 2 +
src/include/storage/proc.h | 10 +
src/tools/pgindent/typedefs.list | 2 +
8 files changed, 420 insertions(+), 4 deletions(-)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 72f5acceec7..3d44374f5ab 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -77,6 +77,283 @@ int BgWriterDelay = 200;
static TimestampTz last_snapshot_ts;
static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+/*
+ * Collected buffer usage information.
+ */
+typedef struct BackendBufferStats
+{
+ int backend_id;
+ uint64 usage_sum;
+ double usage_ratio;
+} BackendBufferStats;
+
+static int
+compare_backend_usage(const void *a, const void *b)
+{
+ const BackendBufferStats *stat_a = (const BackendBufferStats *) a;
+ const BackendBufferStats *stat_b = (const BackendBufferStats *) b;
+
+ if (stat_a->usage_ratio < stat_b->usage_ratio)
+ return -1;
+ if (stat_a->usage_ratio > stat_b->usage_ratio)
+ return 1;
+ return 0;
+}
+
+static uint64
+CalculateSystemBufferPressure(BackendBufferStats *backend_stats[], int *num_backends)
+{
+ uint64 total_usage = 0;
+ int active_backends = 0;
+ BackendBufferStats *stats;
+
+ /* Count active backends first */
+ for (int i = 0; i < ProcGlobal->allProcCount; i++)
+ {
+ PGPROC *proc = &ProcGlobal->allProcs[i];
+
+ if (proc->pid != 0 && proc->databaseId != InvalidOid)
+ active_backends++;
+ }
+
+ if (active_backends == 0)
+ {
+ *backend_stats = NULL;
+ *num_backends = 0;
+ return 0;
+ }
+
+ /* Allocate stats array */
+ stats = palloc(sizeof(BackendBufferStats) * active_backends);
+ *backend_stats = stats;
+ *num_backends = active_backends;
+
+ /* Collect stats from all active backends */
+ for (int i = 0, j = 0; i < ProcGlobal->allProcCount; i++)
+ {
+ PGPROC *proc = &ProcGlobal->allProcs[i];
+
+ if (proc->pid != 0 && proc->databaseId != InvalidOid)
+ {
+ uint64 usage_sum = pg_atomic_read_u32(&proc->bufferUsageSum);
+
+ stats[j].backend_id = i;
+ stats[j].usage_sum = usage_sum;
+ stats[j].usage_ratio = (double) usage_sum / NBuffers;
+ total_usage += usage_sum;
+ j++;
+ }
+ }
+
+ /* Sort by usage ratio for percentile calculation */
+ qsort(stats, active_backends, sizeof(BackendBufferStats),
+ compare_backend_usage);
+
+ return total_usage;
+}
+
+static void
+GetHighUsageBackends(BackendBufferStats *stats, int num_backends,
+ int **high_usage_backends, int *num_high_usage)
+{
+ int percentile_90_idx = (int) (num_backends * 0.9);
+
+ *num_high_usage = num_backends - percentile_90_idx;
+
+ if (*num_high_usage > 0)
+ {
+ *high_usage_backends = palloc(sizeof(int) * (*num_high_usage));
+ for (int i = 0; i < *num_high_usage; i++)
+ (*high_usage_backends)[i] = stats[percentile_90_idx + i].backend_id;
+ }
+ else
+ {
+ *high_usage_backends = NULL;
+ *num_high_usage = 0;
+ }
+}
+
+/*
+ * Shared buffer sync function used by both main loop and aggressive writing
+ */
+static int
+SyncTargetedBuffers(WritebackContext *wb_context, int *target_backends,
+ int num_targets, int max_buffers)
+{
+ int buffers_written = 0;
+ int buffer_id;
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ /* If no specific targets, sync any dirty buffers */
+ if (target_backends == NULL || num_targets == 0)
+ return BgBufferSync(wb_context);
+
+ /* Scan through all buffers looking for dirty ones from target backends */
+ for (buffer_id = 0; buffer_id < NBuffers && buffers_written < max_buffers; buffer_id++)
+ {
+ uint32 dirty_backend;
+ bool is_target;
+
+ bufHdr = GetBufferDescriptor(buffer_id);
+
+ /* Quick check if buffer is dirty */
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ if (!(buf_state & BM_DIRTY))
+ continue;
+
+ /* Check if this buffer is from one of our target backends */
+ dirty_backend = pg_atomic_read_u32(&bufHdr->dirty_backend_id);
+ is_target = false;
+
+ for (int i = 0; i < num_targets; i++)
+ if (dirty_backend == target_backends[i])
+ {
+ is_target = true;
+ break;
+ }
+
+ if (!is_target)
+ continue;
+
+ /* Skip if buffer is pinned */
+ if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+ continue;
+
+ /* Try to write this buffer using the writeback context */
+ ScheduleBufferTagForWriteback(wb_context,
+ IOContextForStrategy(NULL),
+ &bufHdr->tag);
+ buffers_written++;
+ }
+
+ /* Issue the actual writes */
+ if (buffers_written > 0)
+ IssuePendingWritebacks(wb_context, IOContextForStrategy(NULL));
+
+ return buffers_written;
+}
+
+static void
+AggressiveBufferWrite(WritebackContext *wb_context, int *high_usage_backends,
+ int num_high_usage, bool critical)
+{
+ int write_target = critical ? bgwriter_lru_maxpages * 3 : bgwriter_lru_maxpages * 2;
+ int buffers_written = 0;
+
+ /* Focus on buffers from high-usage backends first */
+ buffers_written = SyncTargetedBuffers(wb_context, high_usage_backends,
+ num_high_usage, write_target);
+
+ /* If still under target, write additional dirty buffers */
+ if (buffers_written < write_target)
+ BgBufferSync(wb_context);
+}
+
+/* In src/backend/postmaster/bgwriter.c - Enhanced UpdateBackendDecayRates */
+static void
+UpdateBackendDecayRates(BackendBufferStats *backend_stats, int num_backends,
+ double pressure_ratio, int *high_usage_backends, int num_high_usage)
+{
+ uint32 base_decay_rate;
+ uint64 total_usage = 0;
+ uint64 avg_usage;
+ int i,
+ j;
+
+ /* Calculate base decay rate from system pressure */
+ if (pressure_ratio > 0.90)
+ /* Critical pressure - aggressive decay */
+ base_decay_rate = 3;
+ else if (pressure_ratio > 0.75)
+ /* High pressure */
+ base_decay_rate = 2;
+ else
+ /* Normal decay rate */
+ base_decay_rate = 1;
+
+ /* Calculate total usage for relative comparisons */
+ for (i = 0; i < num_backends; i++)
+ total_usage += backend_stats[i].usage_sum;
+ avg_usage = num_backends > 0 ? total_usage / num_backends : 0;
+
+ if (base_decay_rate > 1)
+ elog(DEBUG2, "Buffer pressure: %.2f%%, base decay rate: %u, avg usage: %lu",
+ pressure_ratio * 100, base_decay_rate, avg_usage);
+
+ /* Update each backend's personalized decay rate */
+ for (i = 0; i < ProcGlobal->allProcCount; i++)
+ {
+ PGPROC *proc = &ProcGlobal->allProcs[i];
+
+ /* Only update active user backends */
+ if (proc->pid != 0 && proc->databaseId != InvalidOid)
+ {
+ uint32 backend_usage = pg_atomic_read_u32(&proc->bufferUsageSum);
+ uint32 personalized_rate = base_decay_rate;
+
+ /* Find this backend in the stats array */
+ BackendBufferStats *backend_stat = NULL;
+
+ for (j = 0; j < num_backends; j++)
+ {
+ if (backend_stats[j].backend_id == i)
+ {
+ backend_stat = &backend_stats[j];
+ break;
+ }
+ }
+
+ /*
+ * Calculate personalized decay rate based on usage and
+ * clock-sweep performance.
+ */
+ if (backend_stat != NULL && avg_usage > 0)
+ {
+ double usage_ratio = (double) backend_usage / avg_usage;
+
+ /* Get clock-sweep performance metrics */
+ uint32 search_count = pg_atomic_read_u32(&proc->bufferSearchCount);
+ uint64 total_distance = pg_atomic_read_u64(&proc->clockSweepDistance);
+ uint32 total_passes = pg_atomic_read_u32(&proc->clockSweepPasses);
+ uint64 total_time = pg_atomic_read_u64(&proc->clockSweepTimeMicros);
+
+ /* Calculate average search metrics */
+ double avg_distance = search_count > 0 ? (double) total_distance / search_count : 0;
+ double avg_passes = search_count > 0 ? (double) total_passes / search_count : 0;
+ double avg_time = search_count > 0 ? (double) total_time / search_count : 0;
+
+ /* Adjust decay rate based on usage relative to average */
+ if (usage_ratio > 2.0)
+ /* High usage backends get more aggressive decay */
+ personalized_rate = Min(4, base_decay_rate + 2);
+ else if (usage_ratio > 1.5)
+ personalized_rate = Min(4, base_decay_rate + 1);
+ else if (usage_ratio < 0.5)
+ /* Low usage backends get less aggressive decay */
+ personalized_rate = Max(1, base_decay_rate > 1 ? base_decay_rate - 1 : 1);
+
+ /* Further adjust based on clock-sweep performance */
+ if (avg_distance > NBuffers * 0.5)
+ /* Searching more than half the buffer pool */
+ personalized_rate = Min(4, personalized_rate + 1);
+ if (avg_passes > 1.0)
+ /* Making multiple complete passes */
+ personalized_rate = Min(4, personalized_rate + 1);
+ if (avg_time > 1000.0)
+ /* Taking more than 1ms per search */
+ personalized_rate = Min(4, personalized_rate + 1);
+
+ elog(DEBUG2, "Backend %d: usage_ratio=%.2f, avg_distance=%.1f, avg_passes=%.2f, "
+ "avg_time=%.1fμs, decay_rate=%u",
+ i, usage_ratio, avg_distance, avg_passes, avg_time, personalized_rate);
+ }
+
+ /* Update the backend's decay rate */
+ pg_atomic_write_u32(&proc->bufferDecayRate, personalized_rate);
+ }
+ }
+}
/*
* Main entry point for bgwriter process
@@ -222,6 +499,15 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
*/
for (;;)
{
+ BackendBufferStats *backend_stats = NULL;
+ int num_backends;
+ int *high_usage_backends = NULL;
+ int num_high_usage;
+ uint64 max_possible;
+ uint64 total_usage;
+ double pressure_ratio;
+ bool high_pressure;
+ bool critical_pressure;
bool can_hibernate;
int rc;
@@ -230,6 +516,35 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
ProcessMainLoopInterrupts();
+ /* Calculate current buffer pressure */
+ total_usage = CalculateSystemBufferPressure(&backend_stats, &num_backends);
+ max_possible = (uint64) NBuffers * BM_MAX_USAGE_COUNT;
+ total_usage = total_usage > max_possible ? max_possible : total_usage;
+ pressure_ratio = (double) total_usage / max_possible;
+
+ /* Get high-usage backends (90th percentile) */
+ if (backend_stats != NULL)
+ GetHighUsageBackends(backend_stats, num_backends,
+ &high_usage_backends, &num_high_usage);
+
+ /* Update global decay rate based on current pressure */
+ UpdateBackendDecayRates(backend_stats, num_backends, pressure_ratio,
+ high_usage_backends, num_high_usage);
+
+ /* Determine if proactive action is needed */
+ high_pressure = pressure_ratio > 0.75; /* 75% threshold */
+ critical_pressure = pressure_ratio > 0.90; /* 90% threshold */
+
+ if (high_pressure)
+ {
+ elog(LOG, "%s buffer pressure detected: %.2f%% (%d high-usage backends)",
+ critical_pressure ? "Critical" : "High",
+ pressure_ratio * 100, num_high_usage);
+
+ /* Aggressive writing of dirty buffers */
+ AggressiveBufferWrite(&wb_context, high_usage_backends, num_high_usage, critical_pressure);
+ }
+
/*
* Do one cycle of dirty-buffer writing.
*/
@@ -294,6 +609,11 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
}
}
+ if (backend_stats != NULL)
+ pfree(backend_stats);
+ if (high_usage_backends != NULL)
+ pfree(high_usage_backends);
+
/*
* Sleep until we are signaled or BgWriterDelay has elapsed.
*
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 6fd3a6bbac5..dfc5e1f5696 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -124,6 +124,9 @@ BufferManagerShmemInit(void)
pg_atomic_init_u32(&buf->state, 0);
buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+ /* Initialize dirty backend tracking */
+ pg_atomic_init_u32(&buf->dirty_backend_id, INVALID_PROC_NUMBER);
+
buf->buf_id = i;
pgaio_wref_clear(&buf->io_wref);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 719a5bb6f97..ffa52acfbeb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2136,6 +2136,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* just like permanent relations.
*/
victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ if (MyProc != NULL)
+ pg_atomic_add_fetch_u32(&MyProc->bufferUsageSum, 1);
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
victim_buf_state |= BM_PERMANENT;
@@ -2781,6 +2783,8 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
victim_buf_hdr->tag = tag;
buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ if (MyProc != NULL)
+ pg_atomic_add_fetch_u32(&MyProc->bufferUsageSum, 1);
if (bmr.relpersistence == RELPERSISTENCE_PERMANENT || fork == INIT_FORKNUM)
buf_state |= BM_PERMANENT;
@@ -2950,6 +2954,11 @@ MarkBufferDirty(Buffer buffer)
Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
buf_state |= BM_DIRTY | BM_JUST_DIRTIED;
+ /* Track which backend dirtied this buffer */
+ if (MyProc != NULL)
+ pg_atomic_write_u32(&bufHdr->dirty_backend_id,
+ MyProc - ProcGlobal->allProcs);
+
if (pg_atomic_compare_exchange_u32(&bufHdr->state, &old_buf_state,
buf_state))
break;
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d59a92bd1a..7a7b8b1ab4e 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -81,7 +81,7 @@ typedef struct BufferAccessStrategyData
* struct.
*/
Buffer buffers[FLEXIBLE_ARRAY_MEMBER];
-} BufferAccessStrategyData;
+} BufferAccessStrategyData;
/* Prototypes for internal functions */
@@ -174,6 +174,14 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
+ uint32 backend_decay_rate;
+
+ /* Clock-sweep performance tracking */
+ instr_time start_time,
+ end_time;
+ uint64 buffers_examined = 0;
+ uint32 complete_passes = 0;
+ uint32 initial_clock_hand;
*from_ring = false;
@@ -191,6 +199,18 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
}
}
+ initial_clock_hand = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
+ if (initial_clock_hand >= NBuffers)
+ initial_clock_hand %= NBuffers;
+
+ /* Start timing the buffer search */
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ /* Get this backend's personalized decay rate */
+ backend_decay_rate = pg_atomic_read_u32(&MyProc->bufferDecayRate);
+ if (backend_decay_rate == 0)
+ backend_decay_rate = 1;
+
/*
* If asked, we need to waken the bgwriter. Since we don't want to rely on
* a spinlock for this we force a read from shared memory once, and then
@@ -228,7 +248,10 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
trycounter = NBuffers;
for (;;)
{
- buf = GetBufferDescriptor(ClockSweepTick());
+ uint32 hand = ClockSweepTick();
+
+ buf = GetBufferDescriptor(hand);
+ buffers_examined++;
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot use
@@ -238,18 +261,53 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
{
- if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
+ uint32 current_usage = BUF_STATE_GET_USAGECOUNT(local_buf_state);
+
+ if (current_usage != 0)
{
- local_buf_state -= BUF_USAGECOUNT_ONE;
+ uint32 current_sum;
+ uint32 new_sum;
+ uint32 decay_amount = Min(current_usage, backend_decay_rate);
+
+ local_buf_state -= decay_amount * BUF_USAGECOUNT_ONE;
+
+ do
+ {
+ current_sum = pg_atomic_read_u32(&MyProc->bufferUsageSum);
+ if (current_sum < decay_amount)
+ new_sum = 0;
+ else
+ new_sum = current_sum - decay_amount;
+ } while (!pg_atomic_compare_exchange_u32(&MyProc->bufferUsageSum,
+ &current_sum, new_sum));
trycounter = NBuffers;
}
else
{
+ uint64 search_time_micros;
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_SUBTRACT(end_time, start_time);
+
+ search_time_micros = INSTR_TIME_GET_MICROSEC(end_time);
+
+ /* Update this backend's clock-sweep performance metrics */
+ pg_atomic_add_fetch_u64(&MyProc->clockSweepDistance, buffers_examined);
+ pg_atomic_add_fetch_u32(&MyProc->clockSweepPasses, complete_passes);
+ pg_atomic_add_fetch_u64(&MyProc->clockSweepTimeMicros, search_time_micros);
+ pg_atomic_add_fetch_u32(&MyProc->bufferSearchCount, 1);
+
+ elog(DEBUG2, "Buffer search completed: examined=%lu, passes=%u, time=%luμs, decay_rate=%u",
+ buffers_examined, complete_passes, search_time_micros, backend_decay_rate);
+
/* Found a usable buffer */
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
+
+ pg_atomic_add_fetch_u32(&MyProc->bufferUsageSum, 1);
+
return buf;
}
}
@@ -266,6 +324,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
elog(ERROR, "no unpinned buffers available");
}
UnlockBufHdr(buf, local_buf_state);
+
+ if (buffers_examined > 1 && hand == initial_clock_hand)
+ complete_passes++;
}
}
@@ -305,6 +366,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
{
*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
}
+
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
return result;
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..fdb2554e3f5 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -528,6 +528,14 @@ InitProcess(void)
MyProc->clogGroupMemberLsn = InvalidXLogRecPtr;
Assert(pg_atomic_read_u32(&MyProc->clogGroupNext) == INVALID_PROC_NUMBER);
+ /* Initialize buffer usage tracking */
+ pg_atomic_init_u32(&MyProc->bufferUsageSum, 0);
+ pg_atomic_init_u32(&MyProc->bufferDecayRate, 1);
+ pg_atomic_init_u64(&MyProc->clockSweepDistance, 0);
+ pg_atomic_init_u32(&MyProc->clockSweepPasses, 0);
+ pg_atomic_init_u64(&MyProc->clockSweepTimeMicros, 0);
+ pg_atomic_init_u32(&MyProc->bufferSearchCount, 0);
+
/*
* Acquire ownership of the PGPROC's latch, so that we can use WaitLatch
* on it. That allows us to repoint the process latch, which so far
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9fcc94ef02d..ac87bd90afd 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -266,6 +266,8 @@ typedef struct BufferDesc
PgAioWaitRef io_wref; /* set iff AIO is in progress */
LWLock content_lock; /* to lock access to buffer contents */
+
+ pg_atomic_uint32 dirty_backend_id; /* backend ID that last dirtied this buffer */
} BufferDesc;
/*
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c6f5ebceefd..e5daaf99276 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -247,6 +247,16 @@ struct PGPROC
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
+ /* Per-backend buffer usage tracking */
+ pg_atomic_uint32 bufferUsageSum; /* Running total of buffer usage */
+ pg_atomic_uint32 bufferDecayRate; /* Per-tick usage decay rate */
+
+ /* Clock-sweep performance metrics */
+ pg_atomic_uint64 clockSweepDistance; /* Total buffers examined */
+ pg_atomic_uint32 clockSweepPasses; /* Complete clock passes */
+ pg_atomic_uint64 clockSweepTimeMicros; /* Total time in microseconds */
+ pg_atomic_uint32 bufferSearchCount; /* Number of buffer searches */
+
/* Support for condition variables. */
proclist_node cvWaitLink; /* position in CV wait list */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..518f7aa3a92 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -234,6 +234,7 @@ BTWriteState
BUF_MEM
BYTE
BY_HANDLE_FILE_INFORMATION
+BackendBufferStats
BackendParameters
BackendStartupData
BackendState
@@ -336,6 +337,7 @@ Bucket
BufFile
Buffer
BufferAccessStrategy
+BufferAccessStrategyData
BufferAccessStrategyType
BufferCacheNumaContext
BufferCacheNumaRec
--
2.49.0
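To make the thresholds in the commit message concrete, here is a back-of-the-envelope model of the pressure calculation this patch performs in BackgroundWriterMain(): the per-backend bufferUsageSum values are summed, clamped to NBuffers * BM_MAX_USAGE_COUNT (5 in PostgreSQL), and the resulting ratio selects the base decay rate via the 75% and 90% thresholds. Everything below is a simplified standalone sketch rather than the patched bgwriter code, and the backend totals are invented purely for illustration:

#include <stdio.h>
#include <stdint.h>

#define NBUFFERS            1024	/* shared_buffers, in buffers (made up) */
#define BM_MAX_USAGE_COUNT  5		/* cap on a buffer's usage count */

/*
 * Thresholds and rates as in the patch: 1 normally, 2 above 75% pressure,
 * 3 above 90% "critical" pressure.
 */
static uint32_t
base_decay_rate(double pressure_ratio)
{
	if (pressure_ratio > 0.90)
		return 3;
	if (pressure_ratio > 0.75)
		return 2;
	return 1;
}

int
main(void)
{
	/* Invented bufferUsageSum totals for three hypothetical backends. */
	uint64_t	per_backend[] = {1200, 3000, 800};
	uint64_t	total = 0;
	uint64_t	max_possible = (uint64_t) NBUFFERS * BM_MAX_USAGE_COUNT;
	double		pressure;

	for (size_t i = 0; i < sizeof(per_backend) / sizeof(per_backend[0]); i++)
		total += per_backend[i];

	if (total > max_possible)
		total = max_possible;	/* the patch clamps the same way */

	pressure = (double) total / max_possible;

	printf("pressure = %.1f%%, base decay rate = %u\n",
		   pressure * 100.0, (unsigned) base_decay_rate(pressure));
	return 0;
}

With these made-up numbers the ratio lands near 98%, so the sketch picks the critical rate of 3; the patch then further personalizes the rate per backend in UpdateBackendDecayRates() using the clock-sweep metrics it tracks in PGPROC.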
Hi,
On 2025-08-27 15:42:48 -0400, Greg Burd wrote:
Regardless, I feel the first two patches on this set address the
intention of this thread.
I'm planning to commit the first two patches after making a pass through
them. I have some work that I'm cleaning up to post that'd conflict and Tomas'
NUMA patches are affected too.
Greetings,
Andres Freund
On Thu, Sep 4, 2025, at 12:59 PM, Andres Freund wrote:
Hi,
On 2025-08-27 15:42:48 -0400, Greg Burd wrote:
Regardless, I feel the first two patches on this set address the
intention of this thread.
I'm planning to commit the first two patches after making a pass through
them. I have some work that I'm cleaning up to post that'd conflict and Tomas'
NUMA patches are affected too.
Fantastic, and thank you. If you have any concerns I'm here to work through them with you. I'll carry over the idea from the 3rd patch into a new thread.
Greetings,
Andres Freund
best,
-greg
Hi,
On 2025-09-04 13:24:00 -0400, Greg Burd wrote:
On Thu, Sep 4, 2025, at 12:59 PM, Andres Freund wrote:
Hi,
On 2025-08-27 15:42:48 -0400, Greg Burd wrote:
Regardless, I feel the first two patches on this set address the
intention of this thread.
I'm planning to commit the first two patches after making a pass through
them. I have some work that I'm cleaning up to post that'd conflict and Tomas'
NUMA patches are affected too.
Fantastic, and thank you. If you have any concerns I'm here to work through
them with you. I'll carry over the idea from the 3rd patch into a new
thread.
Committed with very minor changes:
- removal of StrategyFreeBuffer() prototype
- the new message in autoprewarm used "autoprewarm: " in the log message,
which none of the other messages do - not the prettiest, but it's consistent
with what's already there
- some editing of the commit message
- some whitespace changes
Greetings,
Andres Freund
On Fri, Sep 5, 2025, at 12:27 PM, Andres Freund wrote:
Hi,
On 2025-09-04 13:24:00 -0400, Greg Burd wrote:
On Thu, Sep 4, 2025, at 12:59 PM, Andres Freund wrote:
Hi,
On 2025-08-27 15:42:48 -0400, Greg Burd wrote:
Regardless, I feel the first two patches on this set address the
intention of this thread.
I'm planning to commit the first two patches after making a pass through
them. I have some work that I'm cleaning up to post that'd conflict and Tomas'
NUMA patches are affected too.
Fantastic, and thank you. If you have any concerns I'm here to work through
them with you. I'll carry over the idea from the 3rd patch into a new
thread.
Committed with very minor changes:
- removal of StrategyFreeBuffer() prototype
- the new message in autoprewarm used "autoprewarm: " in the log message,
which none of the other messages do - not the prettiest, but it's consistent
with what's already there
- some editing of the commit message
- some whitespace changes
All reasonable and thoughtful changes.
Greetings,
Andres Freund
thank you.
-greg
On Fri, Jul 11, 2025 at 3:48 PM Greg Burd <greg@burd.me> wrote:
I briefly considered how one might use what was left after surgery to produce some similar boolean signal to no avail. I think that autoprewarm was simply trying to at most warm NBuffers then stop. The freelist at startup was just a convenient thing to drain and get that done. Maybe I'll try adapting autoprewarm to consider that global instead.
My concern had been that while autoprewarm was running, other system
activity could already have started and been filling up
shared_buffers. By considering whether there were actually free
buffers remaining, it would prewarm less in that case.
I'm not saying that was the perfect idea, I'm just telling you what I
was thinking at the time.
--
Robert Haas
EDB: http://www.enterprisedb.com