[PATCH] Batched clock sweep to reduce cross-socket atomic contention

Started by Greg Burd · 3 days ago · 4 messages · hackers
#1 Greg Burd
greg@burd.me

Hello hackers,

A colleague of mine, Jim Mlodgenski, has been poking at NUMA behavior on some of the newer AWS bare-metal instance types (r8i in particular, which exposes 6 NUMA nodes via SNC3 on a 2-socket box), and in the process landed on a very small change to freelist.c that I think is worth showing around. His patch is attached with some tweaks of my own.

Full disclosure: the exploration that led Jim to this patch idea was done with help from an AI assistant (Kiro); the idea, the benchmarking, and the final shape of the patch are human-driven, but I wanted to be up front about how his investigation started. Happy to discuss that separately if people want to.

The one-line summary: instead of advancing nextVictimBuffer one buffer at a time via pg_atomic_fetch_add_u32, each backend claims a batch of 64 consecutive buffer IDs from the shared hand and then iterates them privately. Global sweep order is preserved -- every buffer is still
visited exactly once per complete pass -- but the atomic contention on that one cache line drops by roughly the batch size.

Why this matters
----------------

On multi-socket boxes under eviction pressure, every backend that needs a victim buffer ends up CAS'ing the same cache line. On a single socket, a locked RMW on that cache line stays warm in L1/L2 and completes in ~20ns. On 2+ sockets, the line bounces over QPI/UPI at ~100-200ns per op, and with hundreds of backends running StrategyGetBuffer() concurrently, the line ping-pongs constantly. It's a textbook NUMA scalability bottleneck, and once shared_buffers is smaller than the working set and the sweep is running continuously, that single atomic is what you hit in a perf profile (elevated bus-cycles, cache-misses on the cache line holding nextVictimBuffer).

Andres pointed at the same spot in his pgconf.eu 2024 talk, and Tomas called it out in the "Adding basic NUMA awareness" thread [1] -- so this isn't news to anyone who's been looking at this area. What I think is new is a fix that's just this, without any of the surrounding architectural change.

The framing (credit to Jim): the clock hand is doing two jobs. It *coordinates* backends so they don't redundantly decrement usage_count on the same buffers and so they eventually visit every buffer in the pool exactly once per pass. It also *serializes* access to the counter. Coordination is the part we want. Serialization is the part that's killing us on bigger NUMA boxes. Batching keeps the coordination and thins out the serialization.

How it works
------------

Two per-backend statics, MyBatchPos and MyBatchEnd. When a backend calls ClockSweepTick() and its local batch is exhausted, it does a single fetch-add of CLOCK_SWEEP_BATCH_SIZE (64) against nextVictimBuffer and now owns that range. Subsequent ticks just bump the local counter.

Wraparound got a small rewrite. The original code had the backend that crossed NBuffers drive completePasses++ under the spinlock via a CAS loop. With batching, multiple backends can each land a fetch-add that returns a value >= NBuffers in the same pass, so the logic now is: whoever sees a start >= NBuffers takes the spinlock, re-reads the counter, and if it's still out of range does a single CAS to wrap it and bumps completePasses. If somebody else already wrapped, we just release and move on. StrategySyncStart() still sees a consistent (nextVictimBuffer, completePasses) pair.
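
For concreteness, here is roughly what the batched tick looks like, reconstructed from the description above rather than lifted from the patch (the real code differs in details like error paths and the exact wrap protocol; MyBatchPos/MyBatchEnd/ClockSweepBatchSize are the names used in this mail, the StrategyControl fields are the existing ones in freelist.c):

    /*
     * Sketch of the batched ClockSweepTick().  MyBatchPos/MyBatchEnd are
     * backend-local; ClockSweepBatchSize is set once at startup by the
     * guard shown below.
     */
    static uint32 MyBatchPos = 0;
    static uint32 MyBatchEnd = 0;   /* one past the last buffer ID we own */

    static inline uint32
    ClockSweepTick(void)
    {
        if (MyBatchPos >= MyBatchEnd)
        {
            /* Local batch exhausted: claim the next range in one atomic op. */
            MyBatchPos = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer,
                                                 ClockSweepBatchSize);
            MyBatchEnd = MyBatchPos + ClockSweepBatchSize;

            if (MyBatchPos >= (uint32) NBuffers)
            {
                uint32      expected;

                /*
                 * Several backends can fetch out-of-range starts in the
                 * same pass.  Under the spinlock, re-read the counter; only
                 * a backend that still sees it out of range wraps it and
                 * credits the completed pass.
                 */
                SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
                expected = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
                if (expected >= (uint32) NBuffers &&
                    pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
                                                   &expected,
                                                   expected % NBuffers))
                    StrategyControl->completePasses++;
                SpinLockRelease(&StrategyControl->buffer_strategy_lock);
            }
        }

        /* A batch may straddle NBuffers, so each tick reduces modulo the pool. */
        return MyBatchPos++ % NBuffers;
    }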

The batch size is gated on whether we actually have multiple NUMA nodes. On a single-socket box the atomic is already socket-local; batching just makes backends skip further ahead than they need to, so we fall back to batch size 1 -- which is bit-for-bit the original behavior. The guard:

    if (pg_numa_init() != -1 && pg_numa_get_max_node() >= 1)
        ClockSweepBatchSize = Min(CLOCK_SWEEP_BATCH_SIZE, (uint32) NBuffers);
    else
        ClockSweepBatchSize = 1;

Min() against NBuffers covers the small-shared_buffers corner so a batch never wraps the pool multiple times in one claim.

Does batching mess up the meaning of usage_count?
--------------------------------------------------

Short answer: no. I want to walk through this because it was my first concern too, and I think it's the question that will come up most on review.

The clock sweep's usage_count is an access-frequency approximation measured in units of *complete passes*. A buffer with usage_count = N survives N passes without a re-pin. The semantic meaning lives at pass granularity, not at individual-buffer granularity.

What batching changes: intra-pass temporal ordering. Without batching, with N backends sweeping, decrements are interleaved -- backend A hits B[0], backend B hits B[1], backend C hits B[2]. With batching, backend A hits B[0..63] in a tight local burst, then backend B hits B[64..127], etc. The 64-buffer chunks are decremented in bursts rather than individually.

Why it doesn't matter:

1. Every buffer still gets decremented exactly once per complete
pass. The invariant the algorithm actually depends on is
untouched.

2. A buffer's survival window is the time between consecutive
passes. That's milliseconds to seconds under load. Whether
B[0] gets decremented 50us before or 50us after B[63] within
the same pass is below the resolution of anything usage_count
is trying to measure.

3. The bgwriter's feedback loop reads (nextVictimBuffer,
completePasses, numBufferAllocs) via StrategySyncStart() every
~200ms. nextVictimBuffer still advances at the same *total*
rate (64 per atomic op, but atomic ops happen 1/64 as often).
The position it reports can jitter by up to 64 buffers relative
to the one-at-a-time case, but BgBufferSync()'s smoothed
estimates operate over thousands of buffers per cycle, so the
jitter disappears into the averaging. numBufferAllocs still
increments once per allocation. strategy_delta,
smoothed_alloc, smoothed_density, reusable_buffers_est -- all
unaffected in any way I can see.

Table form, because it's easier to argue with:

Property                           | Unpatched      | Batched
-----------------------------------+----------------+----------------
Buffers visited per pass           | NBuffers       | NBuffers
Decrements per buffer per pass     | 1              | 1
Eviction threshold                 | usage_count==0 | usage_count==0
Max survival (passes)              | 6              | 6
Decrement ordering within a pass   | interleaved    | chunked
bgwriter allocation rate signal    | accurate       | accurate
Cross-socket atomic traffic        | 1 per buffer   | 1 per 64

There is one subtle difference worth naming. When a backend finds a victim at B[5] of its batch, it returns with MyBatchEnd still sitting at B[63]. The next time that backend needs a victim it resumes at B[6], not at wherever the global hand now points. So the backend drains its batch over multiple StrategyGetBuffer() calls rather than all at once. Under heavy load, where batches are consumed in microseconds, this is invisible. Under light load, the implication is that some buffers can sit with slightly stale usage_count for longer than they would have before. But "light load" means "the sweep is barely moving and nothing wants to evict anyway" -- so the effect
doesn't show up where it would hurt.

There's also a small positive side-effect: cache locality. The backend that just touched BufferDescriptor[B[0]] has the adjacent descriptors warm in L1/L2. Walking B[0..63] locally is cheaper than walking a striped interleaving where each descriptor was last touched by a different core. I haven't tried to isolate this in perf, but it falls out naturally.

Benchmarks
----------

Jim ran these; I'm still working on reproducing them locally and will post independent numbers in a follow-up. All bare metal, Linux, huge pages enabled throughout (more on that below), postmaster pinned to node 0 with `numactl --cpunodebind=0` because otherwise stock TPS varied from 31K to 40K depending on which node the postmaster happened to land on at launch -- worth flagging for anyone trying to reproduce.

Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the working set always spills and the sweep is hot.

r8i.metal-96xl (384 vCPUs, 2 sockets, 6 NUMA nodes via SNC3):

pgbench RO:
Clients     Stock   Patched   Delta
     64    31,457    36,353    +16%
    128    31,678    37,864    +20%
    256    31,510    37,558    +19%
    384    31,431    37,464    +19%
    512    31,329    37,040    +18%

pgbench RW:
Clients     Stock   Patched   Delta
     64     7,685     7,713      0%
    128    10,420    10,541     +1%
    256    12,393    12,463     +1%
    384    15,317    15,197     -1%
    512    17,930    17,978      0%

m6i.metal (128 vCPUs, 2 sockets, Ice Lake):
RO +19-20%, RW within noise.

c8i.metal-48xl (192 vCPUs, 1 socket):
Single-socket -> batch_size=1 -> original code path. No
behavioral change. (I double-checked this one specifically
because it's the sanity test for the gate.)

HammerDB TPC-C on m6i.metal (1000 warehouses):
VUs         Stock   Patched   Delta
128       358,518   349,787     -2%
256       332,098   330,272     -1%
384       365,782   377,519     +3%
512       370,663   386,526     +4%

No TPC-C regression, which was the thing we were most worried about. An earlier attempt (per-socket partitioned sweep, see below) was -13% on this same workload.

The general shape is: the scaling curve flattens later. Unpatched, TPS tops out around 128 clients and stays flat up to 512 because backends are spending cycles waiting on the cache line rather than
doing work. Patched, the curve keeps rising past the point where unpatched plateaus.

Huge pages caveat: all of the above was run with huge pages on, on large-memory instances (the r8i.96xl has 3TB, so Jim never considered running without them). We have not characterized the non-huge-pages case. That's on my list; I don't expect it to change the conclusion, but I shouldn't speak for data I haven't collected.

Relationship to Tomas's NUMA series
-----------------------------------

Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer interleaving across nodes, partitioned freelists, partitioned clock sweep, PGPROC interleaving, and related pieces. I want to be careful here because I don't think we should frame this patch as competing with that work.

One thing I found striking as I re-read the thread: in the benchmarks Tomas posted later in the series, *most of the benefit comes from partitioning the clock sweep*, and the NUMA memory-placement layer on top sometimes runs slower than partitioning alone. His own conclusion, quoted roughly: the benefit mostly comes from just partitioning the clock sweep, and it's largely independent of the NUMA stuff; the NUMA partitioning is often slower.

That observation is the thing that makes me think batching is worth considering on its own. It's going after the same bottleneck Tomas's partitioning addresses, but:

- without splitting global eviction visibility (which is where
cross-partition stealing gets complicated),
- without requiring NUMA-aware buffer placement (which has huge
page alignment, descriptor-partition-mid-page, and resize
complications that are still being worked out in that thread),
- without touching PGPROC or bgwriter.

What this patch does *not* do:
- place buffers on specific NUMA nodes
- partition the freelist
- touch PGPROC
- add new GUCs
- change bgwriter

What this patch *does* do:
- target exactly the clock-sweep contention that Tomas's
partitioning targets, and reduce it by ~64x, in ~30 lines.

If Tomas's series lands in full, this patch becomes redundant for its primary use case (though even within a partitioned sweep, the per-partition atomic still benefits from batching, so it's arguably a useful primitive either way). If Tomas's series lands incrementally over several cycles -- which the open items in that thread suggest is the realistic path -- this gets us a real chunk of the multi-socket win now.

This patch is also orthogonal to my earlier thread about removing the freelist entirely [2], but given the proximity to that code Jim agreed that I could propose/steward it here on the list for consideration.

Open questions / things I'd like feedback on
--------------------------------------------

- Batch size. 64 is a round number that worked well in testing, but
Nathan raised the reasonable point that on small shared_buffers
with high concurrency, a fixed 64 could be unfortunate. Options:
scale with shared_buffers (Min(64, NBuffers / N) for some N), scale
with max_connections, keep it fixed but let operators tune it, or
make it a function of NUMA node count. I don't have a strong
opinion yet; the Min(batch, NBuffers) cap covers the "obviously
wrong" corner but doesn't speak to the "several hundred backends
on a few-MB shared_buffers" shape. Numbers/ideas/proposals welcome.

- NUMA detection. The gate uses pg_numa_init() /
pg_numa_get_max_node(). On systems where libnuma isn't available,
or where get_mempolicy is blocked (some container configurations),
we fall back to batch size 1. That's safe but it misses the
"single socket, many cores, still benefits from fewer atomics"
case. Might be worth a way to force-enable, or batching on all
systems with a smaller batch size when single-socket. I'd like to
measure before deciding.

- Eviction pattern on reads. Nathan also flagged that with batching,
the buffers a backend ends up pinning in one StrategyGetBuffer()
call will tend to be contiguous in buffer-id space rather than
scattered, which is a different allocation pattern than today.
The usage_count analysis above says this is benign, but if anyone
has an intuition for a workload where this would be observable
(e.g., something that cares about the mapping between buffer-id
and relation locality), I'd like to hear it.

- nextVictimBuffer wraparound. The current code has a mild overflow
concern papered over with "highly unlikely and wouldn't be
particularly harmful". With batching this is no worse than before,
but if we're already touching this function, it might be worth
thinking about whether to tighten it up in the same patch or a
follow-up.

- Should the non-NUMA value for this be derived from core counts that
imply L1/L2 cache layouts or simply default to 8 rather than 1 to
realize some benefit?

- Should there be a postgresql.conf setting for this that takes
precedence?

I'll run the non-huge-pages variant, reproduce the r8i numbers, poke at the small-shared_buffers corner, and post perf stat output showing the atomic/cache-miss deltas over the next few days. In the meantime, eyeballs and skepticism welcome -- I would especially welcome comments from Andres, who's been in this code recently, and from Tomas, whose series has the most overlap.

I realize that we're past feature freeze and working on release notes for v19, so the chances of merging this are slim to none. I think this could be considered a "performance bug fix for NUMA systems" in this release, but that is stretching it a bit. It is a big ask at this stage to land a change like this.

best.

-greg

[1]: /messages/by-id/099b9433-2855-4f1b-b421-d078a5d82017@vondra.me
[2]: /messages/by-id/f0e3c02e-e217-4f04-8dab-1e7e80a228c0@burd.me

Attachments:

* v1-0001-Reduce-clock-sweep-atomic-contention-by-claiming-.patch
#2 Greg Burd
greg@burd.me
In reply to: Greg Burd (#1)
Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention

On Sat, Apr 25, 2026, at 4:08 PM, Greg Burd wrote:

Hello hackers,

Hi again, attached is v2:

0001 - unchanged, batches clock-sweep to reduce contention
0002 - changed ComputeClockBatchSize() such that non-NUMA multi-core systems use batches as well and no longer default to batch size 1

Details below...

The batch size is gated on whether we actually have multiple NUMA
nodes. On a single-socket box the atomic is already socket-local;
batching just makes backends skip further ahead than they need to, so
we fall back to batch size 1 -- which is bit-for-bit the original
behavior. The guard:

    if (pg_numa_init() != -1 && pg_numa_get_max_node() >= 1)
        ClockSweepBatchSize = Min(CLOCK_SWEEP_BATCH_SIZE, (uint32) NBuffers);
    else
        ClockSweepBatchSize = 1;

Min() against NBuffers covers the small-shared_buffers corner so a
batch never wraps the pool multiple times in one claim.

Thinking more about this approach led me to believe that this non-NUMA default is wrong and induces overhead for a very common case.

- Should the non-NUMA value for this be derived from core counts that
imply L1/L2 cache layouts or simply default to 8 rather than 1 to
realize some benefit?

So, I'm answering my own question here. Yes, it should. Ideas below.

ComputeClockBatchSize() has two phases: select a base batch from hardware topology, then cap it to prevent over-claiming.

Phase 1: Base batch from topology

    int     ncpus = pg_get_online_cpus();
    int     numa_nodes = (pg_numa_init() != -1) ? pg_numa_get_max_node() + 1 : 1;

    if (numa_nodes > 1)
        base_batch = 64;        /* multi-socket: cross-node atomics dominate */
    else if (ncpus > 16)
        base_batch = 32;        /* big single socket: L3 contention */
    else if (ncpus > 8)
        base_batch = 16;
    else if (ncpus > 4)
        base_batch = 8;
    else
        base_batch = 1;         /* small systems: original behavior */

The reasoning at each tier:

- NUMA (multi-socket): Atomic ops cross the interconnect (QPI/UPI/Infinity
Fabric). Round-trip latency is ~100-300ns vs ~10-40ns intra-socket.
Batch=64 amortizes that heavily.

- >16 cores, single socket: Still significant L3 contention, many cores
competing for the same cache line. Batch=32 cuts atomic ops by 32x.

- 9-16 cores: Moderate contention. Batch=16.

- 5-8 cores: Light contention. Batch=8.

- <=4 cores: Almost no contention. Batch=1 (no batching). The overhead of
batching logic isn't worth it, and there's a fairness tradeoff - batching
means one backend "owns" a range of buffers temporarily, which matters
more when there are few buffers per backend.

Phase 2: Cap to prevent over-claiming

    /* Cap so that all backends together claim at most half the pool. */
    max_batch = (MaxBackends > 0)
        ? pool_nbuffers / (2 * MaxBackends)
        : pool_nbuffers / 200;
    if (max_batch < 1)
        max_batch = 1;

    return Min(base_batch, Min(max_batch, pool_nbuffers));

The cap ensures that if every backend simultaneously claims a batch, the total claimed doesn't exceed half the pool:

batch_size * MaxBackends <= pool_nbuffers / 2

Why half? If all backends claimed the entire pool simultaneously, they'd each be sweeping overlapping ranges, thus wasting work and defeating the purpose. Keeping total claims under 50% of the pool means at any instant, at most half the buffers are "in flight" being evaluated by backends, and the other half are available for normal operation.

For a small dynamic pool (say 4096 buffers with MaxBackends=200), the cap computes to 4096 / 400 = 10, which overrides any larger base_batch. For the default pool with shared_buffers = 8GB (1M buffers) and MaxBackends=200, the cap is 1000000 / 400 = 2500 which is well above the max base_batch of 64, so the base_batch wins.

The final Min() against pool_nbuffers handles the degenerate case of a pool smaller than the batch size.

The Tradeoff

Larger batches reduce atomic contention but increase sweep unevenness: one backend might sweep through "cold" buffers while another's batch happens to land on "hot" ones. The tiered approach balances this: batch aggressively only when the hardware topology makes contention the dominant cost (NUMA, many-core), and stay conservative on small systems where fairness matters more.

I think this is better because:

1. The original patch only batched on multi-socket NUMA systems. The new algorithm also provides atomic contention benefits on large single-socket systems (>16 cores) where L3 cache contention matters.

2. Conservative on small systems: Systems with <=4 cores get batch_size=1 (original behavior) since batching overhead outweighs contention benefits and fairness matters more.

3. Prevents pathological over-claiming: The cap mechanism prevents scenarios where many backends claim huge batches relative to a small buffer pool.

Based on the algorithm, here's what different systems would get:

System             CPUs NUMA     Total RAM   Shr Buf  Batch Size Atomic Reduction
================== ==== ======== =========== ======== ========== ================
r8i.metal-96xl     384  multi    3072GB      2457.6GB 64         64x
m6i.metal          128  multi    512GB       409.6GB  64         64x
c8i.metal-48xl     192  1 socket 192GB       153.6GB  32         32x
Large server       64   multi    256GB       204.8GB  64         64x
Medium server      32   1 socket 64GB        51.2GB   32         32x
Small server       16   1 socket 32GB        25.6GB   16         16x
Developer machine  8    1 socket 16GB        12.8GB   8          8x
Small VM           4    1 socket 4GB         3.2GB    1          no change
Overloaded VM      8    1 socket 4GB         3.2GB    8          8x

best.

-greg

Attachments:

* v2-0001-Reduce-clock-sweep-atomic-contention-by-claiming-.patch
* v2-0002-Improve-clock-sweep-batch-sizing-with-CPU-aware-a.patch
#3 Andres Freund
andres@anarazel.de
In reply to: Greg Burd (#1)
Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention

Hi,

Thanks for looking into this.

On 2026-04-25 16:08:02 -0400, Greg Burd wrote:

Does batching mess up the meaning of usage_count?
--------------------------------------------------

Short answer: no. I want to walk through this because it was my first
concern too, and I think it's the question that will come up most on review.

The clock sweep's usage_count is an access-frequency approximation measured
in units of *complete passes*. A buffer with usage_count = N survives N
passes without a re-pin. The semantic meaning lives at pass granularity,
not at individual-buffer granularity.

What batching changes: intra-pass temporal ordering. Without batching, with
N backends sweeping, decrements are interleaved -- backend A hits B[0],
backend B hits B[1], backend C hits B[2]. With batching, backend A hits
B[0..63] in a tight local burst, then backend B hits B[64..127], etc. The
64-buffer chunks are decremented in bursts rather than individually.

Why it doesn't matter:

1. Every buffer still gets decremented exactly once per complete
pass. The invariant the algorithm actually depends on is
untouched.

2. A buffer's survival window is the time between consecutive
passes. That's milliseconds to seconds under load. Whether
B[0] gets decremented 50us before or 50us after B[63] within
the same pass is below the resolution of anything usage_count
is trying to measure.

I don't think this is true, unfortunately. Sure, if you have a completely
uniform, IO intensive, workload it is, but that's not all there is. If you
have a bunch of connections that replace buffers at a low rate and a bunch of
connections that do so at a high rate, the batches "checked out" by the "low
rate" connections won't be processed soon. Thus the buffers in that batch
won't have their usagecount decremented and thus have a stronger "protection"
against replacement.

You can argue that may be OK, because it'd be unlikely that the next sweep
would assign the same buffers to a "low rate" backend again. But that'd be an
argument you'd have to make and validate.

I'm somewhat doubtful that batching that's independent of contention and
independent of the usage rate will work out all that well. If you instead
went for a partitioned sweep architecture, with balancing between the
different partitions, you don't have that issue. And you have a building block
for more numa awareness etc.

There is one subtle difference worth naming. When a backend finds a victim at B[5] of its batch, it returns with MyBatchEnd still sitting at B[63]. The next time that backend needs a victim it resumes at B[6], not at wherever the global hand now points. So the backend drains its batch over multiple StrategyGetBuffer() calls rather than all at once. Under heavy load, where batches are consumed in microseconds, this is invisible. Under light load, the implication is that some buffers can sit with slightly stale usage_count for longer than they would have before. But "light load" means "the sweep is barely moving and nothing wants to evict anyway" -- so the effect
doesn't show up where it would hurt.

As mentioned above, this assumes that the replacement rate is uniform between
backends, which I think is not uniformly true outside of benchmarks.

There's also a small positive side-effect: cache locality. The backend that
just touched BufferDescriptor[B[0]] has the adjacent descriptors warm in
L1/L2.

A BufferDesc is 64 bytes. With common cacheline sizes and stuff like adjacent
cacheline prefetching you'll have *maybe* 2 consecutive BufferDescs in L1/L2.
Where it might help more is the TLB.

Benchmarks
----------

Jim ran these; I'm still working on reproducing them locally and will post
independent numbers in a follow-up. All bare metal, Linux, huge pages
enabled throughout (more on that below), postmaster pinned to node 0 with
`numactl --cpunodebind=0` because otherwise stock TPS varied from 31K to 40K
depending on which node the postmaster happened to land on at launch --
worth flagging for anyone trying to reproduce.

That's an odd one that I think you need to investigate separately.

Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the
working set always spills and the sweep is hot.

Uhm, is this something worth optimizing substantially for? What you're
measuring here is basically the worst possible way of configuring a database,
with full double buffering and a lot of memory bandwidth dedicated to copying
buffers from one place to another. That's maybe a sane setup if you have a lot
of small databases that you can't configure individually, but that's not the
case when you run a reasonably large workload on a 384vCPU setup.

I think to be really convincing you'd have to do this with actual IO involved
somewhere.

Relationship to Tomas's NUMA series
-----------------------------------

Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer interleaving across nodes, partitioned freelists, partitioned clock sweep, PGPROC interleaving, and related pieces. I want to be careful here because I don't think we should frame this patch as competing with that work.

One thing I found striking as I re-read the thread: in the benchmarks Tomas
posted later in the series, *most of the benefit comes from partitioning the
clock sweep*, and the NUMA memory-placement layer on top sometimes runs
slower than partitioning alone. His own conclusion, quoted roughly: the
benefit mostly comes from just partitioning the clock sweep, and it's
largely independent of the NUMA stuff; the NUMA partitioning is often
slower.

That was partially because he measured on something that didn't really have
significant NUMA effects though...

That observation is the thing that makes me think batching is worth
considering on its own. It's going after the same bottleneck Tomas's
partitioning addresses, but:

- without splitting global eviction visibility (which is where
cross-partition stealing gets complicated),

You *are* doing that tho.

- without requiring NUMA-aware buffer placement (which has huge
page alignment, descriptor-partition-mid-page, and resize
complications that are still being worked out in that thread),

You can do the partitioned clock sweep without *any* of that.

Greetings,

Andres Freund

#4 Greg Burd
greg@burd.me
In reply to: Andres Freund (#3)
Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention

On Mon, Apr 27, 2026, at 10:15 AM, Andres Freund wrote:

Hi,

Thanks for looking into this.

On 2026-04-25 16:08:02 -0400, Greg Burd wrote:

Does batching mess up the meaning of usage_count?
--------------------------------------------------

Short answer: no. I want to walk through this because it was my first
concern too, and I think it's the question that will come up most on review.

The clock sweep's usage_count is an access-frequency approximation measured
in units of *complete passes*. A buffer with usage_count = N survives N
passes without a re-pin. The semantic meaning lives at pass granularity,
not at individual-buffer granularity.

What batching changes: intra-pass temporal ordering. Without batching, with
N backends sweeping, decrements are interleaved -- backend A hits B[0],
backend B hits B[1], backend C hits B[2]. With batching, backend A hits
B[0..63] in a tight local burst, then backend B hits B[64..127], etc. The
64-buffer chunks are decremented in bursts rather than individually.

Why it doesn't matter:

1. Every buffer still gets decremented exactly once per complete
pass. The invariant the algorithm actually depends on is
untouched.

2. A buffer's survival window is the time between consecutive
passes. That's milliseconds to seconds under load. Whether
B[0] gets decremented 50us before or 50us after B[63] within
the same pass is below the resolution of anything usage_count
is trying to measure.

I don't think this is true, unfortunately. Sure, if you have a completely
uniform, IO intensive, workload it is, but that's not all there is. If you
have a bunch of connections that replace buffers at a low rate and a bunch of
connections that do so at a high rate, the batches "checked out" by the "low
rate" connections won't be processed soon. Thus the buffers in that batch
won't have their usagecount decremented and thus have a stronger "protection"
against replacement.

You can argue that may be OK, because it'd be unlikely that the next sweep
would assign the same buffers to a "low rate" backend again. But that'd be an
argument you'd have to make and validate.

You're right, and my "why it doesn't matter" section overstated things. The uniform-workload assumption was sloppy. Let me try again with the mixed case in mind.

The scenario I think you're describing: a low-rate backend claims B[0..63], finds a victim at B[5], and then doesn't call StrategyGetBuffer() again for a while -- maybe seconds. During that
time B[6..63] sit with their current usage_count, undecremented, while high-rate backends are sweeping the rest of the pool at full speed. Those 58 buffers get a free pass they wouldn't have gotten in the interleaved case.

I can bound the effect but not dismiss it. Each slow backend can hold at most 64 undecremented buffers. With 32GB shared_buffers (~4.2M buffers), 100 slow backends each holding a full batch means ~6,400 buffers delayed -- 0.15% of the pool. The delay lasts until the backend's next StrategyGetBuffer() call. So the question is whether 0.15% of buffers with temporarily stale usage_count produces a measurable eviction quality difference.

Two observations that bound it further:

1. In the current code, a backend only calls ClockSweepTick() when
it needs a victim. A low-rate backend barely moves the global
hand at all. Without batching, buffers at positions beyond the
current hand are also "undecremented" -- they just haven't been
reached yet. Batching changes *which* specific 64 buffers are
pending, not the total count of undecremented buffers in the
pool at any instant.

2. The buffers in a held batch are contiguous in buffer-ID space.
Since buffer-ID assignment to relation blocks is effectively
random (driven by eviction order), those 64 buffers are
scattered across relations. There's no systematic bias toward
protecting hot or cold data -- it's a random sample.

That said, "bounded and random" isn't the same as "zero." One mitigation that's simple to implement: abandon the remaining batch at transaction boundaries. Something like resetting MyBatchEnd = MyBatchPos in AtEOXact or equivalent, so a backend that goes idle between transactions doesn't hold a stale batch across an idle period. That limits the staleness window to the duration of a single transaction, which is when the backend is actively doing work and likely to consume the batch quickly anyway.

I'd like to measure the mixed-workload case directly. A benchmark with e.g. 50 backends doing heavy sequential scans and 50 doing single-row OLTP, and compare eviction hit rates with and without the patch. Would that be the kind of validation you'd want to see?

I'm somewhat doubtful that batching that's independent of contention and
independent of the usage rate will work out all that well. If you instead
went for a partitioned sweep architecture, with balancing between the
different partitions, you don't have that issue. And you have a building block
for more numa awareness etc.

I agree that partitioned sweep is architecturally more principled and gives you a foundation for deeper NUMA work. I'm not arguing that batching is a better long-term architecture.

The pragmatic case for batching is: it's ~30 lines, it addresses the identified bottleneck, and it doesn't foreclose on partitioned sweep later. If partitioned sweep lands, batching becomes redundant for its primary use case. If partitioned sweep lands incrementally -- which the open items in that thread suggest is the realistic path -- this gets a chunk of the multi-socket win into users' hands sooner.

One concrete difference: partitioned sweep needs a stealing mechanism for correctness when partitions are unevenly loaded. Batching avoids that because the "partitions" are ephemeral (one batch cycle) and sequential (global order preserved), so there's no long-lived imbalance to steal from. Whether that simplicity is worth the tradeoff you identified above is a judgment call, and I take your point that the building-block argument favors partitioning.

I'm also not attached to "batching instead of partitioning." If you think the right move is to focus effort on partitioned sweep, I'm happy to help with that. But if there's appetite for a smaller change that ships sooner, this is what I've got.

There is one subtle difference worth naming. When a backend finds a victim at B[5] of its batch, it returns with MyBatchEnd still sitting at B[63]. The next time that backend needs a victim it resumes at B[6], not at wherever the global hand now points. So the backend drains its batch over multiple StrategyGetBuffer() calls rather than all at once. Under heavy load, where batches are consumed in microseconds, this is invisible. Under light load, the implication is that some buffers can sit with slightly stale usage_count for longer than they would have before. But "light load" means "the sweep is barely moving and nothing wants to evict anyway" -- so the effect
doesn't show up where it would hurt.

As mentioned above, this assumes that the replacement rate is uniform between
backends, which I think is not uniformly true outside of benchmarks.

There's also a small positive side-effect: cache locality. The backend that
just touched BufferDescriptor[B[0]] has the adjacent descriptors warm in
L1/L2.

A BufferDesc is 64 bytes. With common cacheline sizes and stuff like adjacent
cacheline prefetching you'll have *maybe* 2 consecutive BufferDescs in L1/L2.
Where it might help more is the TLB.

You're right, I overstated the L1/L2 argument. At 64 bytes per descriptor, adjacent cacheline prefetch gets you at most 2 consecutive descriptors, not 64. TLB is the more plausible benefit -- the batch walks a contiguous virtual address range, which should reduce TLB misses when the descriptor array spans multiple pages. I haven't tried to isolate this in perf and won't claim it until I have numbers.

Benchmarks
----------

Jim ran these; I'm still working on reproducing them locally and will post
independent numbers in a follow-up. All bare metal, Linux, huge pages
enabled throughout (more on that below), postmaster pinned to node 0 with
`numactl --cpunodebind=0` because otherwise stock TPS varied from 31K to 40K
depending on which node the postmaster happened to land on at launch --
worth flagging for anyone trying to reproduce.

That's an odd one that I think you need to investigate separately.

Agreed. I'll investigate and report separately. My working hypothesis is that it's related to where shared memory gets physically allocated relative to the postmaster's NUMA node, which then affects all child backends. That's interesting regardless of this patch.

Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the
working set always spills and the sweep is hot.

Uhm, is this something worth optimizing substantially for? What you're
measuring here is basically the worst possible way of configuring a database,
with full double buffering and a lot of memory bandwidth dedicated to copying
buffers from one place to another. That's maybe a sane setup if you have a lot
of small databases that you can't configure individually, but that's not the
case when you run a reasonably large workload on a 384vCPU setup.

I think to be really convincing you'd have to do this with actual IO involved
somewhere.

Fair criticism. The pgbench setup was designed to isolate the clock sweep bottleneck by keeping everything in the OS page cache, but you're right that it doesn't represent how someone would actually run a database on a 384-vCPU box. In production you'd either size shared_buffers to hold the working set (no sweep pressure) or have real storage I/O (where I/O latency dilutes sweep contention).

The HammerDB TPC-C numbers (which involve I/O and realistic contention patterns) show flat-to-slightly-positive -- no regression, small win at higher concurrency. I think that's the more honest picture of what production looks like. And perhaps a "flat to slightly-positive" delta might not be enough juice for the squeeze, especially this late in a cycle.

For the follow-up I'll run:

- Working set 2-3x shared_buffers on NVMe, so StrategyGetBuffer()
calls actually hit storage on some fraction of evictions.
- A mixed OLTP workload (not just pgbench -S) with varied access
patterns, to address the uniform-workload concern above.
- perf stat showing bus-cycles, cache-misses, and L3 contention
deltas, so the mechanism is visible independent of TPS.

I should have led with the TPC-C results and framed the pgbench numbers as "here's where the ceiling is under maximum sweep pressure" rather than presenting them as the headline result.

Relationship to Tomas's NUMA series
-----------------------------------

Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer interleaving across nodes, partitioned freelists, partitioned clock sweep, PGPROC interleaving, and related pieces. I want to be careful here because I don't think we should frame this patch as competing with that work.

One thing I found striking as I re-read the thread: in the benchmarks Tomas
posted later in the series, *most of the benefit comes from partitioning the
clock sweep*, and the NUMA memory-placement layer on top sometimes runs
slower than partitioning alone. His own conclusion, quoted roughly: the
benefit mostly comes from just partitioning the clock sweep, and it's
largely independent of the NUMA stuff; the NUMA partitioning is often
slower.

That was partially because he measured on something that didn't really have
significant NUMA effects though...

Fair point. I shouldn't over-generalize from benchmarks run on hardware that wasn't exercising the NUMA dimension. Retracted.

That observation is the thing that makes me think batching is worth
considering on its own. It's going after the same bottleneck Tomas's
partitioning addresses, but:

- without splitting global eviction visibility (which is where
cross-partition stealing gets complicated),

You *are* doing that tho.

You're right. When a backend holds B[0..63], those buffers are effectively invisible to other backends for eviction consideration until the batch is consumed. That is a form of split visibility.

The difference from a permanent partition is that the split is short-lived (one batch consumption cycle, microseconds under load) and sequential (the next batch picks up where this one left off in the global order). There's no long-lived assignment of buffer ranges to backends, so the kind of structural imbalance that drives the need for cross-partition stealing doesn't arise. But I shouldn't have claimed "without splitting." The honest framing is: "with much more limited
and transient splitting."

- without requiring NUMA-aware buffer placement (which has huge
page alignment, descriptor-partition-mid-page, and resize
complications that are still being worked out in that thread),

You can do the partitioned clock sweep without *any* of that.

Also correct. The complications I listed are from Tomas's patches 0001 and 0006 (memory interleaving and NUMA-aware buffer-to-node mapping), not from the clock-sweep partitioning patches 0002-0005. Partitioned clock sweep alone doesn't require NUMA-aware buffer placement. I
conflated the two; apologies.

Greetings,

Andres Freund

To summarize the open items I'm taking away:

1. Mixed-workload benchmark (high-rate + low-rate backends) to
measure eviction quality impact of held batches.
2. I/O-inclusive benchmarks on NVMe with working set > shared_buffers.
3. Investigate the postmaster NUMA placement variance separately.
4. Consider batch-abandonment at transaction boundaries as a
mitigation for the staleness concern.
5. perf stat data showing the mechanism (bus-cycles, cache-misses).

I'll post results as I have them. I greatly appreciate your time and thoughtful review.

best.

-greg